The Data Lake Is A Data System Of Large Volumes Of Unstructured And Structured Data

1 Data Lake

A data lake is a massive, easily accessible, centralized repository for large volumes of structured and unstructured data. The data lake approach is a ‘store-everything’ approach to big data. Data is not classified when it is stored in the repository, so its value is not yet unlocked at that point. Compared with a data warehouse, a data lake is less structured. (‘Data Lake’, 2015)

1.2 Hadoop

Hadoop is an open-source framework used for processing and analyzing big data. It consists of the Hadoop Distributed File System and MapReduce. (‘Data Lake’, 2015)

1.3 Hadoop Data Lake

A Hadoop data lake is a data management platform. Currently, Cummins has a data lake environment in Hadoop which stores data from Supply Chain and…
ETL processing runs in parallel across the entire cluster, resulting in much faster operations than can be achieved by pulling data from a Storage Area Network into a collection of ETL servers (‘CITO Research’, 2014). MapReduce refers to the application modules written by a programmer that run in two phases: first mapping the data (extract), then reducing it (transform). Hadoop scales out to large clusters of servers and storage, using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers. Hadoop also includes YARN (Yet Another Resource Negotiator), which manages the cluster's resources, while MapReduce performs the data processing in parallel, which speeds up processing (‘CITO Research’, 2014).

MapReduce is the framework on which programs written for Hadoop are executed. These programs can be written in Pig or Hive. Once they have run and the analysis is done, further statistical modeling can be carried out with a tool such as R or SAS. Example: sensor data is stored in HDFS. Analytics may reveal that a system has been issuing failure warnings; R can then be used to predict future failures.

1.4 Hadoop Optimization

Cummins stores data from the Distribution Business Unit, Power Generation Business Unit, Components Business Unit, and Corporate in the data warehouse. Storing such large volumes of data in the data warehouse would be expensive. The performance of the data warehouse would
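
As a rough illustration of the two MapReduce phases described in section 1.3, the following Python sketch counts failure warnings per sensor, echoing the sensor example there. It is not Hadoop code: it runs in a single process, and the sample records, their field layout, and the function names (map_phase, shuffle, reduce_phase, run_job) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sensor records: (sensor_id, status). Invented for illustration only.
READINGS = [
    ("engine-1", "OK"),
    ("engine-1", "WARNING"),
    ("engine-2", "OK"),
    ("engine-1", "WARNING"),
    ("engine-2", "WARNING"),
]


def map_phase(record):
    """Map step: emit a (sensor_id, 1) pair for every warning reading."""
    sensor_id, status = record
    if status == "WARNING":
        yield sensor_id, 1


def shuffle(pairs):
    """Group mapped values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


warning_counts = run_job(READINGS)
print(warning_counts)  # {'engine-1': 2, 'engine-2': 1}
```

On a real cluster the map and reduce functions would run on many servers at once over blocks of data held in HDFS; here a plain list stands in for that data so the flow of the two phases is easy to follow.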