FACING THE DATABASE CONUNDRUM
SQL, HBASE, HIVE, OR SPARK: WITH THE SATURATION OF THE ENVIRONMENT WHICH DO YOU CHOOSE
SAMANTHA MOHR
UNIVERSITY OF MARYLAND UNIVERSITY COLLEGE
SPRING 2015

ABSTRACT
There is currently a conundrum facing experts in the field of Big Data: large-scale data analysis must be performed, yet relational database processing languages are impractical for handling the information that is collected and processed. Specifically, the sheer volume of data that must be stored in databases, processed by cloud analytics, and queried by applications has led to growth in the data capacity that needs to be handled. Unfortunately, this exponential growth has exceeded the hardware and …
If you require support for dynamic columns, you should choose Cassandra. If you do batch analytic modeling on your data, Hadoop may be the choice for you. For live streaming analytic modeling, Apache Spark is a much better choice. If you want to work with your data as if it were in SQL, you should try Hive. This paper provides detailed knowledge of how choosing the correct database processing and query language can mitigate the processing capacity problems that accompany the recent vast growth of data. It will show that while there may be no one-size-fits-all answer, there is a fit for the problem at hand based on the storage, processing, and query needs to be met.

INTRODUCTION

BACKGROUND
As a result of the appearance of big data in our world, conventional data warehousing and data analysis methods no longer have the processing power needed. What is Big Data, you may ask, and why is it such a big deal? NIST defines big data as anywhere “[…] data volume, acquisition velocity, or data representation limits the ability to perform effective analysis using traditional relational approaches […]” (Mell & Cooper, n.d.). 1 (Gong, 2012, p.15) Today’s analyst is inundated by an ever-growing amount of data being created by social media, mobile phones, climate sensors, digital pictures, etc. The volume being generated is staggering (2.7 zettabytes of data in the digital universe). While
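The selection guidance given in the abstract above can be restated as a simple lookup. The following is only a minimal sketch: the need labels (`dynamic_columns`, `batch_analytics`, and so on) and the function name `recommend` are hypothetical names introduced for illustration, and the mapping does nothing more than repeat the abstract's four recommendations.

```python
# Illustrative only: the keys are hypothetical labels for the needs named
# in the abstract; the values restate the abstract's recommendations.
RECOMMENDATIONS = {
    "dynamic_columns": "Cassandra",
    "batch_analytics": "Hadoop",
    "streaming_analytics": "Apache Spark",
    "sql_like_queries": "Hive",
}


def recommend(needs):
    """Return the suggested tools for the given needs, in order, without duplicates."""
    unknown = [need for need in needs if need not in RECOMMENDATIONS]
    if unknown:
        raise ValueError(f"unrecognized needs: {unknown}")
    tools = []
    for need in needs:
        tool = RECOMMENDATIONS[need]
        if tool not in tools:
            tools.append(tool)
    return tools


if __name__ == "__main__":
    print(recommend(["batch_analytics", "sql_like_queries"]))
```

A project with several needs simply receives several suggestions, which matches the abstract's point that there is no one-size-fits-all answer.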
1) Before using any DBMS, the creators should have created a data model from the users' requirements.
Since the 1970s, databases and report generators have been used to aid business decisions. In the 1990s, technology in this area improved. Now technology such as Hadoop has gone another step with the ability to store and process the data within the same system, which sparked new buzz about “big data”. Big Data is roughly the collection of large amounts of data, sourced internally or externally, applied as a tool (stored, managed, and analyzed) for an organization to set or meet certain goals.
Since the 1960s, efficient data management and retrieval has been an issue because of the growing needs of business and academia. To resolve these issues, a number of database models have been created. Relational databases allow data storage, retrieval, and manipulation using a standard Structured Query Language (SQL). Until now, relational databases were an optimal enterprise storage choice. However, as the amount of stored and analyzed data has grown, relational databases have displayed a variety of limitations, including limits on scalability, storage, and query efficiency caused by the large volumes of data [1] [2].
The ever-widening realm of big data has created an expanding frontier for new methods of data analysis that produce actionable knowledge for the benefit of organizations everywhere. Companies amass enormous troves of data every day. Keeping this data housed in a fashion that maximizes storage efficiency and in a format optimized for query and analysis is paramount for effective data warehousing. Many database structures exist for the storage, arrangement, and accessing of data, but large databases and online analytical processing (OLAP) benefit from specific qualities. In these databases, compression and rapid querying are the main enabling qualities sought for analytical data stores and data warehouses. Columnar (or column-oriented) relational databases (RDBMS) offer these and other benefits, which is why they are a popular database scheme for analytical systems. Specifically, the vertical arrangement of records is optimal for computing the sum, average, or count of a record attribute because one sequential read yields all values of that attribute. Otherwise, a physical disk must seek over and past unwanted attributes of the records to provide the same
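The difference between the row and column arrangements described above can be sketched in a few lines of Python. This is only an in-memory illustration, not how any particular columnar database lays out data on disk; the sample records, the field names, and the helper `to_columnar` are hypothetical.

```python
# Hypothetical sample records; field names are illustrative only.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 75.5},
    {"order_id": 3, "region": "east", "amount": 30.0},
]


def to_columnar(records):
    """Pivot a list of row dicts into one list of values per attribute."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columnar(rows)

# Row layout: every record is visited, and the unwanted fields
# (order_id, region) are stepped over to reach each amount.
row_total = sum(record["amount"] for record in rows)

# Column layout: the aggregate reads one list and nothing else.
col_total = sum(columns["amount"])
col_count = len(columns["amount"])
col_avg = col_total / col_count
```

Both layouts give the same total; the point of the column layout is that an aggregate over one attribute touches only that attribute's values.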
George Feuerlicht
Database research and associated standardization activities have successfully guided the development of database technology over the last four decades, and SQL relational databases remain the dominant database technology today. The effort to innovate relational databases to address the needs of new applications is continuing. Recent examples of database innovation include the development of streaming SQL technology that is used to process rapidly flowing data (“data in flight”), minimizing latency in Web 2.0 applications, and database appliances that simplify DBMS deployment on cloud computing platforms. It is also evident from the above discussion that the relational
The invention of relational databases has brought a number of changes to the business world, especially for businesses whose prime focus is on their customers and their likes and dislikes in order to win more market share. There is no “one size fits all” in using this technology; it varies from industry to industry. What works for some businesses may not work for others, so it is advisable to shop around before investing in any of these technologies, because it is vital to find an industry-specific solution. One technique for narrowing the search for industry-specific solutions is to find out what competitors are using to grow their customer base.
Data has always been analyzed within companies and used to benefit the future of businesses. However, how data is stored, combined, analyzed, and used to predict the patterns and tendencies of consumers has evolved as technology has seen numerous advancements over the past century. In the 1900s databases began as “computer hard disks,” and in 1965, after many other discoveries including voice recognition, “the US Government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape.” The evolution of data into large databases continued in 1991, when the internet began to pop up and “digital storage became more cost effective than paper.” With the constant increase in data supplied digitally, Hadoop was created in 2005, and from that point forward “14.7 Exabytes of new information are produced this year,” a number that is rapidly increasing with the many mobile devices people have today (Marr). The evolution of the internet and the expansion of the number of mobile devices led data to evolve, and companies now need large central database management systems in order to run an efficient and successful business.
This study focuses on the HBase database, a column-oriented NoSQL database. HBase is Apache’s open source database, modeled after Google’s BigTable technology. It exposes a Java API and is built on top of the Hadoop Distributed File System (HDFS) to store and process large quantities of data while maintaining reliability and fault tolerance. The database is used by many big enterprises, including Facebook, Twitter, and Yahoo, to store and process large quantities of data in an efficient and cost-effective manner.
Scalability and performance go hand in hand, with companies needing to accommodate new users and data volumes in their line-of-business applications. Traditional architectures have failed to predict correctly. Ooyala chose Cassandra over
This paper will discuss and compare the market’s top Database Management Systems (DBMS) currently available. The paper includes a table for side-by-side comparison of feature sets and other factors required when deciding which DBMS to purchase and implement in a business. While this may not be a complete list of all available DBMSs, it will include important discussions of the aspects required when evaluating any major application or system choice.
RDBMSs are ill-suited to large data volumes. NoSQL distributed databases allow data to be spread across thousands of servers and can store large volumes of data.
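As a rough illustration of how data can be spread across servers, the sketch below assigns each record key to a node by hashing the key. This is a deliberately simple modulo placement scheme under stated assumptions, not the mechanism of any specific product; distributed stores such as Cassandra use more elaborate schemes (for example, consistent hashing) so that adding a server moves less data. The node names, keys, and the helper `node_for` are hypothetical.

```python
import hashlib


def node_for(key, nodes):
    """Pick a node for a key using a stable hash and modulo placement."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical server names
keys = ["user:1", "user:2", "user:3", "user:4", "user:5", "user:6"]

# Group the keys by the node that would store them.
placement = {}
for key in keys:
    placement.setdefault(node_for(key, nodes), []).append(key)
```

Because the hash is stable, the same key always maps to the same node, so a read can be routed directly to the server that holds the record.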
We also studied and compared emerging NoSQL databases such as Cassandra, Accumulo, CouchDB, HBase, and MongoDB to find the best solution for organizations in accordance with their requirements.
Big data is an element that allows companies to leverage high-volume data effectively and not in isolation. Big data needs to be quickly accessible and analyzable. Data stores or warehouses are one way of managing data so that it is persistent, protected, and available for as long as it is needed. The forefather of data stores is the relational database; relational databases put in place decades ago are still in use today
NoSQL databases, including MongoDB, Redis, Cassandra, and the graph database Neo4j, have also emerged. Some of these tools run the entire database
Five years ago, few people had heard the phrase ‘Big Data.’ Today, it’s hard to go an hour without seeing it implemented practically in our daily life. The promise of a highly accurate data-driven decision-making tool is an attractive lure for any organization in any industry. However, big data is not without its own problems.