1.1 Data Lake
A data lake is a massive, easily accessible, centralized repository of large volumes of structured and unstructured data. The data lake approach is a ‘store-everything’ approach to big data: data is not classified when it is stored in the repository, so the value of the data is not yet unlocked. Compared to a data warehouse, a data lake is unstructured. (‘Data Lake’, 2015)
1.2 Hadoop
Hadoop is an open-source framework used for processing and analyzing big data. It consists of the Hadoop Distributed File System (HDFS) and MapReduce. (‘Data Lake’, 2015)
1.3 Hadoop Data Lake
A Hadoop data lake is a data management platform. Currently, Cummins has a data lake environment in Hadoop which stores data from Supply Chain and
ETL processing runs in parallel across the entire cluster, resulting in much faster operations than can be achieved by pulling data from a Storage Area Network into a collection of ETL servers. (‘CITO Research’, 2014)
MapReduce refers to the application modules written by a programmer that run in two phases: first mapping the data (extract), then reducing it (transform). Hadoop scales out to large clusters of servers and storage using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers. Hadoop also has YARN (Yet Another Resource Negotiator), which manages the cluster, while MapReduce performs the data processing, which helps achieve faster processing. (‘CITO Research’, 2014)
MapReduce is the framework on which programs written for Hadoop are executed. (Such programs are often written in Pig or Hive.) These programs are executed and the analysis is done. If statistical modeling is further needed, a tool such as ‘R’ or SAS can be used.
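To make the two phases concrete, here is a minimal sketch of the classic word-count job using Hadoop's standard Java MapReduce API. The class names are illustrative, not Cummins code; a driver that submits the job is sketched further below.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase ("extract"): emit (word, 1) for every word in this task's input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase ("transform"): the framework groups the sorted, shuffled
  // map outputs by key; each reduce call sums the counts for one word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```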
Example: sensor data is stored in HDFS. After running analytics, it may be found that the system has been giving warnings about failure. R can then be used to make predictions about future failures.
1.4 Hadoop Optimization
Cummins stores data from the Distribution Business Unit, Power Generation Business Unit, Components Business Unit, and Corporate in the data warehouse. Storing such large volumes of data in the data warehouse would be expensive. The performance of the data warehouse would also suffer.
Hadoop \cite{white2012hadoop} is an open-source framework for distributed storage and data-intensive processing, first developed at Yahoo!. It has two core projects: the Hadoop Distributed File System (HDFS) and the MapReduce programming model \cite{dean2008mapreduce}. HDFS is a distributed file system that splits data and stores it on nodes throughout a cluster, with a number of replicas of each block. It provides an extremely reliable, fault-tolerant, consistent, efficient and cost-effective way to store a large amount of data. The MapReduce model consists of two key functions: Mapper and Reducer. The Mapper processes input data splits in parallel through different map tasks and sends sorted, shuffled outputs to the Reducers, which in turn group and process them using a reduce task for each group.
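As a brief illustration of the storage side, the sketch below writes a file to HDFS with an explicit replication factor using Hadoop's Java FileSystem API. The path, file contents, and replication value are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    // Cluster settings are read from core-site.xml / hdfs-site.xml on the
    // classpath; fs.defaultFS points at the NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path; HDFS splits the file into blocks and stores each
    // block on several DataNodes.
    Path path = new Path("/data/sensors/readings.txt");

    // Ask for 3 replicas of each block (a common HDFS default).
    short replication = 3;
    try (FSDataOutputStream out = fs.create(path, replication)) {
      out.write("sensor-42,2015-06-01,ok\n".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```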
This paper proposes a backup-task mechanism to mitigate straggler tasks, the final set of MapReduce tasks that take an unusually long time to complete. The simplified programming model proposed in the paper opened up the field of parallel computation to general-purpose programmers. The paper served as the foundation for the open-source distributed-computing software Hadoop; it also tackles various common error scenarios encountered in a compute cluster and provides a fault-tolerance solution within the framework.
It takes care of all cluster-maintenance tasks and job-scheduling operations, allowing the programmer to focus on the logic of the application. Submitting a MapReduce job to the master node results in the input file being split into several block-sized chunks that are processed by Map and Reduce tasks in parallel. Because of HDFS block replication, tasks are scheduled to run on nodes where the required chunks of data already exist, minimizing unnecessary transfer of that data.
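A minimal driver for submitting such a job might look like the following sketch, which reuses the hypothetical WordCount mapper and reducer from earlier; the job name and HDFS paths are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // The mapper and reducer sketched earlier; the reducer doubles as a
    // combiner to shrink map output before the shuffle.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Hypothetical HDFS paths; the framework splits the input into
    // block-sized chunks and schedules map tasks close to their data.
    FileInputFormat.addInputPath(job, new Path("/data/sensors"));
    FileOutputFormat.setOutputPath(job, new Path("/out/wordcount"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```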
An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers. Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest cluster
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel and distributed computing environment. It makes use of commodity hardware and is highly scalable and fault tolerant. Hadoop runs on a cluster and eliminates the need for a supercomputer. Hadoop is the most widely used big-data processing engine, with a simple master-slave setup. In most companies, big data is processed with Hadoop by submitting jobs to the master; the master distributes the job across its cluster and processes the map and reduce tasks sequentially. Nowadays, however, growing data needs and competition between service providers lead to ever more jobs being submitted to the master. This concurrent job submission forces us to do scheduling on the Hadoop cluster so that the response time is acceptable for each job.
During my curriculum practical training I learned the Hadoop technology. For the initial three weeks I was taught the concepts I needed in order to understand Hadoop as a whole; in this regard, I was first taught the collection framework in Java.
Faster, better decision making: With the speed of Hadoop and in-memory analytics, combined with the capability to analyze new sources of data, businesses are able to analyze data immediately and make decisions based on what they’ve learned.
Hadoop, one of the open-source frameworks, is used as an extension to the big data analytics frameworks used by a large group of vendors. This type of framework makes it easier for companies to work out how they're going to store and use the data within digital as well as physical products (James, M. et al. 2011). We can analyze data using Hadoop, which is emerging as a solution to
The paper “A Comparison of Approaches to Large-Scale Data Analysis” by Pavlo et al. compares and analyzes the MapReduce framework against parallel DBMSs for large-scale data analysis. It benchmarks the open-source Hadoop, built over MapReduce, against two parallel SQL databases, Vertica and a second system from a major relational vendor (DBMS-X), and concludes that the parallel databases clearly outperform Hadoop on the same hardware over 100 nodes. Averaged across five tasks on 100 nodes, Vertica was 2.3 times faster than DBMS-X, which in turn was 3.2 times faster than MapReduce. In general, the parallel SQL DBMSs were significantly faster and required less code to implement each task, but took longer to tune and load the data. Finally, the paper talks about
The rise of Big Data and its attendant complexities has spawned a whole ecosystem to support the ever-growing requirements of a 24x7 world. One of the key technologies to come out of the initial stages of Big Data has been Hadoop. Conceived in response to the rapidly growing needs of Yahoo!’s search engine, Hadoop provides a mechanism to store and collect vast amounts of data across a highly distributed environment using commodity hardware.
Altiscale was established to give organizations access to the first cloud purpose-built for Apache Hadoop, as well as the operational expertise needed to execute complex Hadoop projects. The Altiscale team has been on the leading edge of Apache Hadoop, from its incubation at Yahoo to operating more than 40,000 Hadoop nodes. As a company that understands both the transformative power of this technology and its challenges, no other organization is better positioned to deliver dependable and scalable Apache Hadoop.
Over the years it has become essential to process large amounts of data with high precision and speed. Large amounts of data that can no longer be processed using traditional systems are called Big Data. Hadoop, a Linux-based tool framework, addresses three main problems faced when processing Big Data that traditional systems cannot handle. The first problem is the speed of the data flow, the second is the size of the data, and the last is the format of the data. Hadoop divides the data and computation into smaller pieces, sends them to different computers, then gathers the results, combines them, and sends the combined result to the application. This is done using MapReduce and HDFS, i.e., the Hadoop Distributed File System. The DataNode and the NameNode parts of the architecture fall under HDFS.
Data has always been analyzed within companies and used to help benefit the future of businesses. However, how data is stored, combined, analyzed and used to predict the patterns and tendencies of consumers has evolved as technology has advanced over the past century. In the 1900s databases began as “computer hard disks”, and in 1965, after many other developments including voice recognition, “the US Government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape.” The evolution of data into large databases continued in 1991, when the internet began to take off and “digital storage became more cost effective than paper.” With the constant increase of digitally supplied data, Hadoop was created in 2005, and from that point forward “14.7 Exabytes of new information are produced this year”, a number that is rapidly increasing with the many mobile devices people in our society have today (Marr). The evolution of the internet, and then the expansion of the number of mobile devices society has access to, led data to evolve; companies now need large central database management systems in order to run an efficient and successful business.
Investigation into Deriving an Efficient Hybrid Model of a MapReduce + Parallel-Platform Data Warehouse Architecture
Apache Spark is a general-purpose, lightning-fast cluster computing system. It provides high-level APIs in, for example, Java, Scala, Python and R, and it is a tool for running Spark applications. Spark is claimed to run up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster when accessing data from disk. Hadoop, on the other hand, is an open-source, scalable, and fault-tolerant framework written in Java that efficiently processes large volumes of data.
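For contrast with the MapReduce driver above, here is a sketch of the same word count written against Spark's Java API. The local master URL and paths are illustrative assumptions; on a real cluster the job would typically run under YARN.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // "local[*]" runs Spark in-process for testing; a cluster deployment
    // would use a YARN or standalone master instead.
    SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Hypothetical input path (a local file or an hdfs:// URI).
      JavaRDD<String> lines = sc.textFile("/data/sensors");

      // The whole pipeline stays in memory between stages, which is where
      // Spark's speed advantage over disk-based MapReduce comes from.
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);

      counts.saveAsTextFile("/out/spark-wordcount");
    }
  }
}
```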