Renu P
Brampton, Ontario
E-mail: renuhp17@gmail.com
Phone No: 647-331-5195
SUMMARY
• 5+ years of extensive IT experience in software development and implementation, including 1+ years of experience with the big data and Hadoop technology stack.
• Solid experience in installing, configuring, and using Hadoop ecosystem components such as MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive, Cassandra, Sqoop, Pig, Flume, Avro, Chukwa, Whirr, and Cloudera.
• Excellent understanding of and hands-on experience with Hadoop-stack technologies such as MapReduce, HDFS, Hive, Pig, Sqoop, Impala, Oozie, Scala, and Spark.
• Extensive experience working with SQL, Core Java, and Linux.
• Experience in importing and exporting data from different systems such as RDBMS.
• Effective in handling multiple concurrent, prioritized tasks, with critical-thinking, decision-making, and problem-solving abilities.
• Excellent analytical, problem-solving, communication, and interpersonal skills, with the ability to interact with individuals at all levels and to work independently or collaboratively.
EXPERIENCE
Hadoop Developer April 2016 to Present
TELUS - Scarborough, ON
Project:
The project aims to move data from different sources, including log data from individual servers, into HDFS according to the management system, and then to analyze these HDFS data sets. Once a data set is inside HDFS, Pig, Hive, and MapReduce are used to perform various analyses.
• Used Pig and Hive as ETL tools to perform transformations, joins, and aggregations before storing the data in HDFS.
• Created Hive tables and loaded data into them from a relational database management system using Sqoop.
• Extensively worked on creating Hive tables, partitions, and buckets for analyzing large volumes of data.
• Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet tables from Avro-backed Hive tables (see the sketch after this list).
• Scheduled and managed the Hive, Pig, and MapReduce jobs on the Hadoop cluster using Oozie and Falcon process files.
• Implemented business logic to transform the incoming data.
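The Avro-to-Parquet load described above can be sketched as follows. This is a hedged illustration, not the project's actual code: it assumes HiveServer2 is reachable over JDBC at localhost:10000, and the table and column names (calls_avro, calls_parquet, caller, duration, call_date) are hypothetical placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ParquetLoadSketch {
    public static void main(String[] args) throws Exception {
        // Older Hive JDBC drivers need explicit registration.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Partitioned, bucketed target table stored as Parquet with Snappy.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS calls_parquet (caller STRING, duration INT) "
              + "PARTITIONED BY (call_date STRING) "
              + "CLUSTERED BY (caller) INTO 8 BUCKETS "
              + "STORED AS PARQUET "
              + "TBLPROPERTIES ('parquet.compression'='SNAPPY')");

            // Allow a dynamic-partition insert from the Avro staging table.
            stmt.execute("SET hive.exec.dynamic.partition.mode=nonstrict");
            stmt.execute(
                "INSERT OVERWRITE TABLE calls_parquet PARTITION (call_date) "
              + "SELECT caller, duration, call_date FROM calls_avro");
        }
    }
}
```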
Hadoop \cite{white2012hadoop} is an open-source framework for distributed storage and data-intensive processing, first developed at Yahoo!. It has two core projects: the Hadoop Distributed File System (HDFS) and the MapReduce programming model \cite{dean2008mapreduce}. HDFS is a distributed file system that splits data into blocks and stores them on nodes throughout a cluster, with a number of replicas. It provides an extremely reliable, fault-tolerant, consistent, efficient, and cost-effective way to store a large amount of data. The MapReduce model consists of two key functions: Mapper and Reducer. The Mapper processes input data splits in parallel through different map tasks and sends sorted, shuffled outputs to the Reducers, which in turn group and process them using a reduce task for each group.
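As a concrete illustration of the Mapper and Reducer functions just described, here is the canonical WordCount job in Java (a standard textbook example, not code from the cited papers):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits (word, 1) for every token in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives each word with its grouped counts and sums them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```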
Hadoop implements the MapReduce parallel programming model. In a Hadoop cluster there are two kinds of nodes: master nodes and slave nodes. The master node runs the NameNode and JobTracker processes, while the slave nodes run the DataNode and TaskTracker processes. The NameNode manages the partitioning of the input dataset into blocks and decides on which nodes those blocks are stored. Hadoop thus has two core layers: the HDFS storage layer and the MapReduce layer; the MapReduce layer reads from and writes to HDFS storage and processes data in parallel.
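From a client's point of view, the HDFS layer looks like an ordinary file system. The following minimal Java sketch writes a file and reads it back through Hadoop's FileSystem API; the NameNode address and path are hypothetical placeholders:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/demo.txt");
        // The NameNode decides how this file is split into blocks and which
        // DataNodes store the replicas; the client just streams bytes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}
```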
Hadoop employs the MapReduce paradigm of computing, which targets batch-job processing. It does not directly support real-time query execution, i.e., OLTP. Hadoop can be integrated with Apache Hive, whose HiveQL query language supports query firing, but Hive still does not provide OLTP operations (such as updates and deletions at the row level) and has a long response time (in minutes) due to the absence of pipelined execution.
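A short hedged sketch of this distinction, reusing the hypothetical HiveServer2 endpoint and calls_parquet table from the earlier example: a batch aggregation works (slowly), while row-level OLTP statements are not available on classic, non-ACID Hive tables.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveBatchQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Fine: a batch aggregation, compiled into MapReduce jobs;
            // expect a response time in minutes, not milliseconds.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT call_date, COUNT(*) FROM calls_parquet GROUP BY call_date")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
            // Not supported on classic (non-ACID) Hive tables: a row-level
            // UPDATE or DELETE like the one below would fail, which is why
            // Hive is not an OLTP system.
            // stmt.executeUpdate("UPDATE calls_parquet SET duration = 0 WHERE caller = 'x'");
        }
    }
}
```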
An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth simply by adding commodity servers. Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest cluster being 4,000 servers.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel, distributed computing environment. It makes use of commodity hardware, and it is highly scalable and fault tolerant. Hadoop runs on a cluster and eliminates the need for a supercomputer. Hadoop is the widely used big data processing engine, with a simple master-slave setup. In most companies, big data is processed with Hadoop by submitting jobs to the master; the master distributes each job across its cluster and processes the map and reduce tasks sequentially. But nowadays, growing data needs and competition between service providers lead to an increasing number of jobs being submitted to the master. This concurrent job submission forces us to schedule jobs on the Hadoop cluster so that the response time remains acceptable for each job (a small submission sketch follows).
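One common way to keep response times acceptable under concurrent submission is to route jobs to named scheduler queues. The sketch below assumes a YARN Capacity or Fair Scheduler with a hypothetical queue named "analytics"; it is an illustration, not a prescribed setup.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class QueuedSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the job to a specific scheduler queue instead of the default
        // queue, so many concurrent jobs can share the cluster fairly.
        conf.set("mapreduce.job.queuename", "analytics");

        // Identity map and reduce are used by default; only paths are set.
        Job job = Job.getInstance(conf, "queued job");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.submit();  // returns immediately (asynchronous submission)
        System.out.println("Submitted with id: " + job.getJobID());
    }
}
```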
The research topic was derived from an understanding of query processing in MySQL and Hadoop, database performance issues, performance tuning, and the importance of database performance. Thus, it was decided to develop a comparative analysis to observe the performance of MySQL (non-clustered) and Hadoop on structured and unstructured datasets (Rosalia, 2015). Furthermore, the analysis compared the two platforms at two different data sizes.
In an attempt to manage their data correctly, organizations are realizing the importance of Hadoop for the expansion and growth of their business. According to a study done by Gartner, an organization loses approximately 8.2 million USD annually through poor data quality, and this happens even though 99 percent of organizations have their data strategies in place. The reason behind this is simple: organizations are unable to trace the bad data that exists within their data. This is one problem that can be solved by adopting Hadoop testing methods, which allow you to validate all of your data at increased testing speeds and boost your data coverage, resulting in better data quality (a small validation sketch follows).
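One simple form such Hadoop-based validation can take is a mapper that uses MapReduce counters to tally malformed records across the whole dataset. This is a hedged sketch; the comma-separated format and expected field count of 5 are hypothetical schema assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordValidator
        extends Mapper<LongWritable, Text, Text, NullWritable> {
    enum Quality { GOOD, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical schema: 5 comma-separated fields per record.
        String[] fields = value.toString().split(",", -1);
        if (fields.length == 5) {
            context.getCounter(Quality.GOOD).increment(1);
            context.write(value, NullWritable.get());  // pass valid rows through
        } else {
            // Bad data is counted, not silently dropped, so it can be traced.
            context.getCounter(Quality.MALFORMED).increment(1);
        }
    }
}
```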
Hadoop, an open-source framework, is used as an extension to the big data analytics frameworks employed by a large group of vendors. This type of framework makes it easier for companies to decide how they're going to store and use the data within digital products as well as physical products (James, M. et al. 2011). We can analyze data using Hadoop, which is emerging as a solution to
The paper "A Comparison of Approaches to Large-Scale Data Analysis" by Pavlo et al. compares and analyzes the MapReduce framework against parallel DBMSs for large-scale data analysis. It benchmarks the open-source Hadoop, built on MapReduce, against two parallel SQL databases, Vertica and a second system from a major relational vendor (DBMS-X), and concludes that parallel databases clearly outperform Hadoop on the same hardware over 100 nodes. Averaged across five tasks on 100 nodes, Vertica was 2.3 times faster than DBMS-X, which in turn was 3.2 times faster than MapReduce. In general, the parallel SQL DBMSs were significantly faster and required less code to implement each task, but took longer to tune and to load the data. Finally, the paper talks about
Within a Streamlined Data Refinery, storage, data transformations, and query serving can be called into action using products that match existing skills and infrastructure. Since PDI jobs and transformations are flexible, IT developers can run workloads in Hadoop in a
Today, most financial services organizations try to solve all their big data challenges using either grid or cluster technologies. These data analytics technologies have solved several problems around the world.
Spark is an open-source cluster computing framework. It was originally developed in Berkeley's AMP Lab and later donated to the Apache Software Foundation. Apache Spark is built on the concept of the resilient distributed dataset (RDD), a read-only multiset of data items distributed across the cluster. Spark Core, Spark SQL, Spark MLlib, and Spark Streaming are the main modules of Spark. Spark provides an API over data held in memory, which makes it much faster to run, and it enables information specialists to
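A minimal sketch of the RDD idea using Spark's Java API (chosen to match the Java examples above; Spark code is often written in Scala as well). The local master setting is just for illustration.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
            // RDDs are immutable: each transformation yields a new dataset,
            // kept in memory across the cluster when possible.
            JavaRDD<Integer> numbers = sc.parallelize(data);
            JavaRDD<Integer> squares = numbers.map(x -> x * x);
            System.out.println(squares.collect());  // [1, 4, 9, 16, 25]
        }
    }
}
```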
Abstract - The Hadoop Distributed File System (HDFS), a Java-based file system, provides reliable and scalable storage for data. It is the key component for understanding how a Hadoop cluster can be scaled over hundreds or thousands of nodes. The large amount of data in a Hadoop cluster is broken down into smaller blocks and distributed across small, inexpensive servers using HDFS. MapReduce functions are then executed on these smaller blocks of data, providing the scalability needed for big data processing. In this paper I discuss in detail Hadoop, the architecture of HDFS, how it functions, and its advantages.
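To make the block distribution concrete, the following hedged Java sketch asks HDFS where the blocks of a (hypothetical) file live, using the standard FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/large-input.log"));
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // Each block lists the DataNodes holding one of its replicas.
            System.out.println("offset=" + b.getOffset()
                + " length=" + b.getLength()
                + " hosts=" + String.join(",", b.getHosts()));
        }
    }
}
```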
Data has always been analyzed within companies and used to benefit the future of businesses. However, how data is stored, combined, analyzed, and used to predict the patterns and tendencies of consumers has evolved as technology has seen numerous advancements throughout the past century. In the 1900s, databases began as "computer hard disks," and in 1965, after many other discoveries including voice recognition, "the US Government plans the world's first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape." The evolution of data into large databases continued in 1991, when the internet began to take off and "digital storage became more cost effective than paper." With the constant increase of digitally supplied data, Hadoop was created in 2005, and from that point forward "14.7 exabytes of new information are produced this year," a number that is rapidly increasing with the many mobile devices people in our society have today (Marr). The evolution of the internet, and then the expansion of the number of mobile devices society has access to, led data to evolve, and companies now need large central database management systems in order to run an efficient and successful business.
MapReduce is a simple yet powerful programming model that enables the development of scalable parallel applications to process large amounts of data scattered across a cluster of machines. The original implementations of the MapReduce framework had some limitations, which much follow-up research has addressed since its introduction. It is gaining a lot of traction in both the research and industrial communities because of its capacity for processing large data. The MapReduce framework is used in different applications and for different purposes.