Chapter 7
IMPLEMENTATION
The implementation phase of the project is where the detailed design is transformed into working code. The aim of this phase is to translate the design into the best possible solution in a suitable programming language. This chapter covers the implementation aspects of the project, giving details of the programming language and development environment used, and provides an overview of the core modules of the project along with their step-by-step flow.
The implementation stage requires the following tasks:
• Careful planning.
• Examination of the system and its constraints.
• Design of methods.
• Evaluation of the methods.
• Correct decisions regarding selection of the platform.
• Selection of an appropriate language.
A file system that manages storage across a network of machines is called a distributed file system. Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System).
HDFS Design: HDFS is designed for storing very large files, meaning files that are hundreds of megabytes, gigabytes or terabytes in size, with streaming data access patterns, running on clusters of commodity hardware. HDFS is built around a write-once, read-many-times pattern: a dataset is typically generated or copied from a source once, and various analyses are then performed on that dataset. Hadoop does not need expensive hardware; it is designed to run on commodity hardware.
7.1.1 Basic Architecture of HDFS
Figure 7.1.1 shows the basic architecture of NS2. NS2 provides users with an executable command, ns, which takes one input argument: the name of a Tcl simulation script file. Users feed the name of a Tcl simulation script (which sets up a simulation) as the input argument of the ns command. In most cases, a simulation trace file is created, which is used to plot graphs and/or to create animation.
Fig 7.1.1: Basic Architecture of NS2
When a file is written to HDFS, it is divided into fixed-size blocks. The client first contacts the NameNode, which returns a list of DataNodes on which the blocks can be stored, and the blocks are then distributed across the Hadoop cluster. Each Hadoop cluster node is used for both computation and storage. The MapReduce engine (running inside a Java virtual machine) executes the user application. When the application reads or writes data, requests are passed through the Hadoop org.apache.hadoop.fs.FileSystem class, which provides a standard interface for distributed file systems, including the default HDFS. An HDFS client is then responsible for retrieving data from the distributed file system by contacting a DataNode that holds the desired block. In the common case, that DataNode is running on the same node, so no external network traffic is necessary. The DataNode, also running inside a Java virtual machine, accesses the data stored on local disk using normal file I/O.
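To make the role of the FileSystem interface concrete, the following is a minimal sketch (not part of the project code) of how a client might write and then read a small file through org.apache.hadoop.fs.FileSystem. The file path is a hypothetical example, and the configuration is assumed to point at the cluster's default HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);            // default file system; HDFS on a typical cluster

        Path file = new Path("/user/demo/sample.txt");   // hypothetical path, for illustration only

        // Write path: the client asks the NameNode where to place blocks,
        // then streams the data directly to the chosen DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read path: the client asks the NameNode for block locations,
        // then reads each block from a (preferably local) DataNode.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}

The create and open calls exchange only metadata with the NameNode; the file bytes themselves flow between the client and the DataNodes.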
HDFS uses NameNode operations to maintain data consistency. The NameNode uses a transactional log file (the edit log) to record all changes to the file system metadata.
GFS: The Google File System is a distributed file system developed by Google to provide efficient, reliable access to data. It was designed and implemented to meet the requirements of Google's data processing workloads. The file system consists of hundreds of storage machines built from inexpensive parts, and it is accessed by many different client machines. Google's search engine generates huge amounts of data that must be stored; a GFS cluster has 1,000 nodes with 300 TB of disk storage.
HDFS is Hadoop's distributed file system and provides high-throughput access to data, high availability and fault tolerance. Data are stored as large blocks, making HDFS suitable for applications that process large data sets.
Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce [DG04] paradigm. While the interface to HDFS is patterned after the Unix file system, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.
HFS+ is a file system developed by Apple to replace the Hierarchical File System (HFS) as the primary file system used in Mac computers. It is also used by the iPod and is referred to as Mac OS Extended.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel and distributed computing environment. It makes use of commodity hardware, and it is highly scalable and fault tolerant. Hadoop runs on a cluster and eliminates the need for a supercomputer. It is the most widely used big data processing engine, with a simple master-slave setup. In most companies, big data is processed with Hadoop by submitting jobs to the master; the master distributes each job across its cluster and processes the map and reduce tasks sequentially. Nowadays, however, growing data needs and competition between service providers lead to an increased number of jobs being submitted to the master. This concurrent job submission forces us to do scheduling on the Hadoop cluster so that the response time remains acceptable for each job.
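As a rough sketch of how a client submits a job to the master, the following driver uses the standard org.apache.hadoop.mapreduce.Job API. It runs the built-in identity mapper and reducer as placeholders (a real application would plug in its own classes, such as the word-count pair sketched later in this chapter), and the input/output paths are assumed to be passed on the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads the cluster settings on the classpath
        Job job = Job.getInstance(conf, "demo-job");       // job name is arbitrary
        job.setJarByClass(SubmitJobSketch.class);

        // Identity mapper/reducer used as placeholders for an application-specific pair.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // existing HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory; must not exist yet

        // Submits the job to the master, which schedules map and reduce tasks
        // across the cluster, then blocks until the job finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}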
Larger blocks offer several advantages over smaller blocks. They reduce the number of interactions with the NameNode and the amount of metadata that needs to be stored on the NameNode, and they reduce extra network overhead by keeping a persistent TCP connection to the DataNode. In our measurements, the execution time for writing and deleting larger blocks is almost the same as for smaller blocks. HDFS blocks are large in comparison to disk blocks in order to minimize the cost of seeks. Thus, in our case, all deletion interactions go through a checkerNode. The checkerNode sends the delete command, and the rest is handled by the overwriting daemon on each node. Because the overwrite daemon reads the metadata from inodes, it does not need to make a connection with the NameNode or the checkerNode. This reduces the extra network overhead and boosts the performance of deletion.
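To see the effect of block size from a client's point of view, the following hedged sketch (not the project's checkerNode code) lists the blocks of an HDFS file and the DataNodes holding them using the standard FileSystem API; the file path is a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/dataset.csv");        // hypothetical file, for illustration only
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size: " + status.getBlockSize() + " bytes");

        // One entry per block: with larger blocks there are fewer entries,
        // i.e. less block metadata on the NameNode and fewer lookups per file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }

        fs.close();
    }
}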
Listed below are the concepts that should be understood in order to properly grasp Hadoop technology.
The MapReduce programming model uses two tasks common in functional programming: Map and Reduce. MapReduce is a parallel processing framework, and Hadoop is its open-source implementation on clusters.
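A minimal word-count sketch, written against the standard org.apache.hadoop.mapreduce API, illustrates the two tasks: the mapper emits (word, 1) pairs for each input line and the reducer sums the counts per word. This is an illustrative example rather than the project's own job.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Map task: for each line, emit (word, 1) for every whitespace-separated token.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: sum the counts received for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

These classes would be plugged into a driver like the job-submission sketch shown earlier in this chapter.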
Hadoop is one of the most popular technologies for handling big data because it is entirely open source. One of the reasons Hadoop is used is that it is flexible enough to work with multiple data sources; these sources can be combined to scale up processing, and Hadoop can run processor-intensive machine learning jobs that read data from a database, says Rodrigues in his article on big data. He states that Hadoop has many different applications, but one thing it excels at is handling large volumes of constantly changing data. This makes it well suited to receiving location-based data from traffic devices and weather satellites, as well as working with social media data and web-based data.
Abstract - The Hadoop Distributed File System, a Java-based file system, provides reliable and scalable storage for data. It is the key component for understanding how a Hadoop cluster can be scaled over hundreds or thousands of nodes. The large amount of data in a Hadoop cluster is broken down into smaller blocks and distributed across small, inexpensive servers using HDFS. MapReduce functions are then executed on these smaller blocks of data, providing the scalability needed for big data processing. In this paper I discuss Hadoop, the architecture of HDFS, how it functions, and its advantages in detail.
Abstract - The amount of data being generated nowadays has increased tremendously. With these large volumes of data, there is a need for efficient storage and processing. The data generated from a variety of sources, such as social networking sites, emails, audio and video files, text files and various other files, is unstructured. Traditional databases cannot handle this complex, high-volume data efficiently. The Hadoop framework provides an effective solution for storing large volumes of data and processing them for analysis. This paper deals with traditional databases, big data, data analysis, and how effectively Hadoop stores these large volumes of data and processes them for analysis. Furthermore, it also focuses on the various technologies used in Hadoop for storing and processing data.
3. Presents the software description: it explains the implementation of the project using the PIC C Compiler software.
Imagine what Google's world of databases looks like. Nothing is small, because Google provides everything a user needs to find through its databases. GFS was implemented to meet the rapidly growing demands of Google's data processing requirements, as Google faced difficulties when it came to managing large amounts of data. Built on large numbers of comparatively small servers, GFS is designed as a distributed file system that can run on clusters of more than a thousand machines. To ease GFS application development, the file system includes a programming interface that abstracts the management and distribution aspects. Because it runs on commodity hardware, GFS is challenged not only by managing distribution but also by coping with the increased danger of hardware problems. The developers of GFS made an assumption during its design: disk faults, machine faults and network faults are treated as the norm rather than the exception. The key challenge faced by GFS is keeping data safe while scaling up to more than a thousand computers and managing multiple terabytes of data.