This chapter presents an industrial and technical review of the Hadoop framework and of the other technologies used alongside Hadoop to process big data. The Hadoop project was originally built and is supervised by the Apache community. In addition to Apache, many other companies whose businesses run on Hadoop are adding interesting features to it, and some of them have announced their own Hadoop distributions relying on the original core distribution released by Apache.
2.1 Industry Feedback
At the last Hadoop Summit [19], Mike Gualtieri, principal analyst at Forrester, gave a keynote about the market's current expectations of big data analysis solutions and presented these interesting results:
“Data-related projects are at the forefront of the minds of ...”
Chris Twogood from Teradata [50] and Timothy Mallalieu from Microsoft talked about how important it is to integrate Hadoop with traditional data technologies in order to get the most out of the new opportunities.
Arun Murthy, a co-founder of Hortonworks [51], gave an overview of the key recent advances in Hadoop 2.0, including the benefits of the YARN ResourceManager [18] and of Apache Storm [73] for processing streaming data with Hadoop.
The list of companies and institutions that are using or planning to switch to Hadoop grows longer every day. Adobe, Amazon, IBM, Pivotal, Google, Facebook, Twitter, Cloudera, MapR, Hortonworks, Dell, Intel, HSBC and Deutsche Telekom are just a few examples of the many organizations that have switched to Hadoop or have announced that they are switching to it.
2.2 Previous Work and Critique
2.2.1 Data Warehouse Systems (DWHs)
Data Warehouse Systems (DWHs) are platforms used to report on and analyse data by integrating data from disparate sources into a central repository, the "data warehouse". This is achieved by storing both current and historical data and producing management or technical reports based on pre-defined configurations. Nearly every discussion of big data analysis today begins with a debate over the best way to perform the analysis and over whether DWHs will be replaced by newer frameworks such as Hadoop. Some have gone as far as claiming that relational databases are going to be replaced entirely.
Hadoop \cite{white2012hadoop} is an open-source framework for distributed storage and data-intensive processing, first developed at Yahoo!. It has two core projects: the Hadoop Distributed File System (HDFS) and the MapReduce programming model \cite{dean2008mapreduce}. HDFS is a distributed file system that splits data and stores it on nodes throughout the cluster, keeping a number of replicas of each block. It provides an extremely reliable, fault-tolerant, consistent, efficient and cost-effective way to store large amounts of data. The MapReduce model consists of two key functions: the Mapper and the Reducer. The Mapper processes input splits in parallel through different map tasks and sends its sorted, shuffled output to the Reducers, which in turn group the records and process each group with a reduce task.
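To make the Mapper and Reducer roles concrete, the following is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. It is illustrative only; the class names are ours, not taken from any of the cited works.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits an intermediate (word, 1) pair for every word in its input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: after the sort/shuffle phase it receives all counts for one word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final (word, total) pair
    }
}
```

Each map task runs this Mapper over one input split in parallel with the others, and each distinct word is handed to exactly one reduce invocation.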
In a Hadoop cluster there are two kinds of nodes: master nodes and slave nodes. The master node runs the NameNode and JobTracker processes, while the slave nodes run the DataNode and TaskTracker processes. The NameNode manages the partitioning of the input data set into blocks and decides on which nodes each block is stored. There are thus two core layers in Hadoop: the HDFS layer and the MapReduce layer; the MapReduce layer reads from and writes to HDFS storage and processes the data in parallel.
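As a small illustration of the HDFS layer and of the NameNode's role in block placement, the sketch below uses the standard org.apache.hadoop.fs client API to list the DataNodes that hold each block of a file. The NameNode address and file path are placeholders, not values from this work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS would normally come from core-site.xml; this address is a placeholder.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input/example.txt");   // hypothetical file

        // Ask the NameNode for the file's metadata and for the placement of its blocks.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

The hosts reported for each block are the DataNodes holding its replicas, which is exactly the information the MapReduce layer uses to schedule map tasks close to the data.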
Hadoop employs the MapReduce computing paradigm, which targets batch-job processing. It does not directly support real-time query execution, i.e. OLTP. Hadoop can be integrated with Apache Hive, which provides the HiveQL query language for firing queries, but Hive still does not support OLTP tasks (such as row-level updates and deletions) and has a high response time (on the order of minutes) due to the absence of pipelined execution.
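As a sketch of what this integration looks like in practice, the snippet below submits a read-only analytical HiveQL query to a HiveServer2 endpoint over JDBC; note that it is an aggregation, not a row-level update or delete, reflecting the OLTP limitation just described. The host, user and the web_logs table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (newer driver jars also self-register).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC endpoint; host, port and database are placeholders.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {

            // A typical batch-style HiveQL aggregation; Hive compiles it into
            // one or more MapReduce jobs, hence the minutes-level latency.
            ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits "
                    + "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```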
An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth simply by adding commodity servers. Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest single cluster comprising about 4,000 servers.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel and distributed computing environment. Because it makes use of commodity hardware, Hadoop is highly scalable and fault tolerant; it runs on a cluster and eliminates the need for a supercomputer. Hadoop is the most widely used big data processing engine, with a simple master-slave setup: in most companies big data is processed by submitting jobs to the master, which distributes each job across its cluster and processes the map tasks followed by the reduce tasks. Nowadays, however, growing data needs and the competition between service providers lead to an increasing number of jobs being submitted to the master. This concurrent job submission forces us to perform scheduling on the Hadoop cluster so that the response time remains acceptable for each job.
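To show how a job is handed to the master, here is a minimal driver sketch that submits the word-count job from the earlier example. The "analytics" queue name is a hypothetical scheduler queue used to illustrate how concurrent jobs can be separated for scheduling; input and output paths are taken from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Target a scheduler queue so concurrent jobs share the cluster;
        // the queue name "analytics" is a placeholder.
        conf.set("mapreduce.job.queuename", "analytics");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);   // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the cluster and block until the map and reduce phases finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```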
In an attempt to manage their data correctly, organizations are realizing the importance of Hadoop for the expansion and growth of their business. According to a study by Gartner, an organization loses approximately 8.2 million USD annually through poor data quality. This happens even though 99 percent of organizations have data strategies in place. The reason is simple: organizations are unable to trace the bad data that exists within their data sets. This is one problem that can be solved by adopting Hadoop testing methods, which allow you to validate all of your data at higher testing speeds and with broader data coverage, resulting in better data quality.
In big data implementations, this vast ocean of data can become a source of strength for the organization.
Cost reduction: big data technologies such as Hadoop and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data, and they can help identify and implement more efficient ways of doing business.
In your business, you have your own big data challenges. You have to turn heaps of data about various entities into actionable information. The reporting needs of institutions have evolved from simple single-subject queries to data discovery and enterprise-wide analysis that tells a complete story across the institution. While the volume, variety and velocity of big data seem overwhelming, big data technology solutions hold great promise. The way I see it, we can use this as one of the biggest assets of the company: we have the capacity to see patterns unfolding in real time across complex systems. Huron is marshalling its resources to bring smarter computing to big data. With the Huron big data platform, we are enabling our clients to manage data in ways that were never thought possible before.
Every organization has many projects and relies heavily on technology to meet its business needs. You have probably heard the advice to focus on the project that gives you the biggest bang for your buck. For Hadoop, identify a small project with an immediate need for it. Too many times I have witnessed projects fail in companies because of their level of complexity, lack of resources and high expenses. Selecting a small project to get started allows the IT and business staff to become familiar with the inner workings of this emerging technology. The beauty of Hadoop is that it allows you to start small and add nodes as you go.
Today, data is a growing asset that various businesses are having difficulty converting into a powerful strategic tool. Companies need help turning this data into valuable insight, which can diminish risk and enhance returns on investment. Companies are struggling to make sense of and obtain value from their big data. Superior and reliable
• An excellent understanding of the Hadoop architecture and its various components, such as HDFS, the JobTracker, TaskTracker, NameNode and DataNode, and of the MapReduce programming paradigm.
IBM, Dell and HP lead the list of big data companies with the most revenue from big data hardware, software and IT services. They all produce and develop their own servers for big data hardware and software, and they expect those servers to provide the highest speed when processing data. There are still three missions they need to take into consideration: enhancing security, increasing performance and improving usability.
Over the years it has become essential to process large amounts of data with high precision and speed. Data that can no longer be processed using traditional systems is called big data. Hadoop, a Linux-based framework of tools, addresses three main problems faced when processing big data that traditional systems cannot: the speed of the data flow, the size of the data and the format of the data. Hadoop divides the data and the computation into smaller pieces, sends them to different computers, then gathers and combines the results and returns them to the application. This is done using MapReduce and HDFS, the Hadoop Distributed File System. The DataNode and NameNode parts of the architecture belong to HDFS.
Data has always been analyzed within companies and used to help benefit the future of businesses. However, the way data is stored, combined, analyzed and used to predict the patterns and tendencies of consumers has evolved as technology has seen numerous advancements throughout the past century. In the 1900s databases began as "computer hard disks", and in 1965, after many other discoveries including voice recognition, "the US Government plans the world's first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape." The evolution of data into large databases continued in 1991, when the internet began to pop up and "digital storage became more cost effective than paper." With the constant increase of digitally supplied data, Hadoop was created in 2005, and from that point forward "14.7 Exabytes of new information are produced this year", a figure that keeps rising rapidly with the many mobile devices people in our society have today (Marr). The evolution of the internet, and then the expansion of the number of mobile devices society has access to, led data to evolve, and companies now need large central database management systems in order to run an efficient and successful business.