Analyzing The Data Processing On A Large Cluster

1. Hive Introduction: Business analysts, data scientists, data analysts, and statisticians all want to analyze data to extract the important characteristics of a data set. In 2004 Google introduced MapReduce, which simplified data processing on large clusters. But many organizations have only a few developers who can write good MapReduce code, which is typically written in Java (although MapReduce programs can be written in other languages). Hive was originally developed at Facebook. In 2007, Facebook's data-processing infrastructure was built on a commercial RDBMS, but the data Facebook was generating grew very fast. Today Facebook generates nearly 2 PB of data every day, so daily processing jobs were taking more than a day to run on that data set.
1.1. Hive vs. Relational RDBMS: Hive is schema on read. In a traditional database, the table's schema must be defined before data is loaded, and if the data being loaded does not match that schema, it is rejected; a traditional database is therefore schema on write. Hive, on the other hand, does not verify the data while loading it into tables. Only when a query is issued does Hive check the data against the table schema, so the load operation is just a file copy or file move. When reading log files, we can specify RegexSerDe at table-creation time. It uses a regular expression to serialize and deserialize records: the deserializer converts the string or binary representation of a record into a Java object that Hive can manipulate, and the serializer converts the Java object Hive has been working with into something Hive can write to HDFS or another supported system.

2. Hive Architecture:

Fig 2.1 Hive Architecture

Fig 2.1 shows the main components of Hive. The major components are:

External Interfaces: Hive provides both user interfaces and application programming interfaces. The user interfaces include the command line (CLI) and the web UI, and the application programming interfaces (APIs) include JDBC and ODBC.

Hive Thrift server: Apache Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code-generation engine to build services that work efficiently between languages.
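The schema-on-read behavior and the RegexSerDe-style deserializer described above can be sketched in Python. This is a minimal illustration, not Hive's actual Java implementation: the log format and regular expression are assumptions chosen for the example. The key point it demonstrates is that raw lines are accepted as-is at load time, and a line that fails to match the schema is not rejected; it simply yields NULL columns when read, which is exactly how Hive's RegexSerDe treats non-matching rows.

```python
import re

# Hypothetical log format, assumed for illustration:
#   <ip> - [<timestamp>] "<request>" <status>
# The regex plays the role of RegexSerDe's input.regex property: it maps a
# raw stored line onto table columns only at query (read) time.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) - \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)

def deserialize(line):
    """Map one raw log line onto the (ip, ts, request, status) schema.

    Schema on read: a non-matching line was never rejected at load time,
    so at read time it simply produces NULL (None) for every column.
    """
    m = LOG_PATTERN.match(line)
    if m is None:
        return {"ip": None, "ts": None, "request": None, "status": None}
    row = m.groupdict()
    row["status"] = int(row["status"])  # cast the status column to an integer
    return row

# "Loading" is just storing the raw lines -- a plain file copy in Hive.
raw = [
    '10.0.0.1 - [01/Jan/2024:00:00:01] "GET /index.html" 200',
    'this line does not match the schema at all',
]

# The schema is applied only now, when the data is read.
rows = [deserialize(line) for line in raw]
```

A schema-on-write system would have rejected the second line during the load; here it survives the load and surfaces as an all-NULL row at query time.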