Budde Tn
buddet1989@gmail.com
• Overall 4 years of IT experience in analysis, design, and development using Hadoop, Java, and J2EE.
• 3+ years' experience in Big Data technologies and Hadoop ecosystem projects such as MapReduce, YARN, HDFS, Apache Cassandra, Spark, NoSQL, HBase, Oozie, Hive, Tableau, Sqoop, Pig, Storm, Kafka, HCatalog, ZooKeeper, and Flume.
• Excellent understanding of Hadoop architecture and its various components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
• Knowledge of Data Analytics and Business Analytics processes.
• Hands-on experience with Spark Streaming receiving real-time data from Kafka (a minimal sketch follows this list).
• Created Spark SQL queries for faster data retrieval.
• Good understanding of RDBMS through database design and writing queries against Oracle, SQL Server, DB2, and MySQL.
• Worked extensively with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses.
• A team player and self-motivator with excellent analytical, communication, problem-solving, decision-making, and organizational skills.
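As a minimal, hedged sketch of the Spark Streaming plus Kafka pattern claimed above: the broker address, consumer group, and `events` topic below are illustrative assumptions, not details from an actual project.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaStreamSketch"), Seconds(5))

    // Hypothetical broker, group id, and topic; replace with real values.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "demo-consumer-group",
      "auto.offset.reset"  -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Count records per 5-second micro-batch as a stand-in for real business logic.
    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```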
WORK EXPERIENCE
Hadoop Developer
VINMATICS - ON - January 2017 to Present
Responsibilities:
• Involved in creating Hive tables, loading them with data, and writing Hive queries that run MapReduce jobs in the backend (see the sketch after this list).
• Wrote MapReduce jobs to parse web logs stored in HDFS.
• Imported and exported data between HDFS and Hive using Sqoop.
• Worked with Impala for the data retrieval process.
• Partitioned Big Data according to business requirements using Hive indexing, partitioning, and bucketing.
• Responsible for the design and development of Spark SQL scripts based on functional specifications.
• Responsible for Spark Streaming configuration based on the type of input source.
• Developed services to run MapReduce jobs on an as-required basis.
• Responsible for loading data from UNIX file systems into HDFS; installed and configured Hive and wrote Pig/Hive UDFs.
• Responsible for managing data coming from different sources.
• Developed business logic using Scala.
• Wrote MapReduce (Hadoop) jobs.
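The following Scala sketch illustrates the Hive-table workflow from the first bullet. The table schema, paths, and partition value are hypothetical; run under the Hive CLI, the same HiveQL executes as MapReduce jobs, while via SparkSession it runs on Spark's engine instead.

```scala
import org.apache.spark.sql.SparkSession

object HiveTableSketch {
  def main(args: Array[String]): Unit = {
    // Hive support lets Spark create and query tables in the Hive metastore.
    val spark = SparkSession.builder()
      .appName("HiveTableSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical web-log table; columns and storage format are assumptions.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS web_logs (
        |  ip STRING, ts STRING, url STRING, status INT)
        |PARTITIONED BY (dt STRING)
        |STORED AS PARQUET""".stripMargin)

    // Load pre-staged files into one partition; the path is hypothetical.
    spark.sql(
      """LOAD DATA INPATH '/data/raw/logs/2017-01-01'
        |INTO TABLE web_logs PARTITION (dt = '2017-01-01')""".stripMargin)

    // A simple aggregate query over the partition.
    spark.sql(
      "SELECT status, COUNT(*) AS hits FROM web_logs WHERE dt = '2017-01-01' GROUP BY status"
    ).show()

    spark.stop()
  }
}
```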
Hadoop \cite{white2012hadoop} is an open-source framework for distributed storage and data-intensive processing, first developed at Yahoo!. It has two core projects: the Hadoop Distributed File System (HDFS) and the MapReduce programming model \cite{dean2008mapreduce}. HDFS is a distributed file system that splits data into blocks and stores them, with a number of replicas, on nodes throughout a cluster. It provides an extremely reliable, fault-tolerant, consistent, efficient, and cost-effective way to store large amounts of data. The MapReduce model consists of two key functions: Mapper and Reducer. Mappers process input splits in parallel through separate map tasks and send their sorted, shuffled outputs to the Reducers, which group and process them, one reduce task per key group.
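To make the Mapper/Reducer division concrete, here is a minimal word-count sketch against the Hadoop MapReduce API; the class names and whitespace tokenizer are illustrative choices, not details from the cited papers.

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.jdk.CollectionConverters._

// Mapper: emits (word, 1) for every token in its input split.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)
    }
}

// Reducer: receives each word with all of its counts, already grouped and
// sorted by the shuffle phase, and sums them.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    ctx.write(key, new IntWritable(values.asScala.map(_.get).sum))
}
```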
Expertise in Amazon AWS cloud services, including EC2, S3, VPC, ELB, IAM, CloudFront, CloudWatch, Elastic Beanstalk, Security Groups, CodeCommit, CodePipeline, and CodeDeploy.
Each map task produces an intermediate data set that the reduce tasks use to combine the map results. The paper also proposes extensions to the framework that allow users to customize the data partitioning as well as the combiner, and it provides a mechanism for user programs to track relevant metrics and publish them. Finally, the paper describes the error scenarios that occur in a large cluster of commodity hardware machines and the fault-tolerance mechanisms, such as re-executing failed map tasks, used to handle them.
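A hedged sketch of the partitioning hook described above, using the Hadoop API; the first-letter routing rule is a made-up example. A combiner, by contrast, is usually just the reducer class registered to run map-side (see the driver sketch at the end of this section).

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Custom partitioner: routes keys to reducers by their first letter rather
// than the default hash; the routing rule is an illustrative assumption.
class FirstLetterPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    val first = key.toString.headOption.getOrElse('a').toLower
    ((first - 'a').max(0)) % numPartitions
  }
}
```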
In a Hadoop cluster running the MapReduce parallel programming model, there are two kinds of nodes: master and slave. The master node runs the NameNode and JobTracker processes (in small clusters it may also run DataNode and TaskTracker processes), while slave nodes run the DataNode and TaskTracker processes. The NameNode manages the partitioning of the input dataset into blocks and decides which nodes store them. Hadoop thus has two core layers: the HDFS layer and the MapReduce layer; the MapReduce layer reads from and writes to HDFS storage and processes data in parallel.
Hadoop provides a distributed filesystem and a framework for the analysis and transformation of very large data sets using the MapReduce [DG04] paradigm. While the interface to HDFS is patterned after the Unix filesystem, faithfulness to standards was sacrificed in favor of improved performance for the applications at hand.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel, distributed computing environment. It makes use of commodity hardware, and it is highly scalable and fault tolerant. Hadoop runs on a cluster and eliminates the need for a supercomputer. It is the most widely used big data processing engine, with a simple master-slave setup: in most companies, big data is processed by submitting jobs to the master, which distributes each job across its cluster and processes the map and reduce tasks. But nowadays, growing data needs and competition between service providers lead to ever more jobs being submitted to the master. This concurrent job submission forces scheduling on the Hadoop cluster so that the response time remains acceptable for each job (a hedged sketch of queue-based submission follows).
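As a small illustration of steering concurrent jobs, the snippet below tags a job with a scheduler queue; the "analytics" queue name is hypothetical and the sketch assumes the Fair or Capacity scheduler is enabled on the cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object QueuedJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical queue name; with the Fair or Capacity scheduler enabled,
    // jobs in different queues share the cluster under the scheduler's policy.
    conf.set("mapreduce.job.queuename", "analytics")

    val job = Job.getInstance(conf, "queued-job")
    // ...set mapper, reducer, and input/output paths as for any other job...
  }
}
```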
During my curriculum practical training I learned Hadoop technology. For the initial three weeks I was taught the concepts needed to understand Hadoop as a whole; in this regard, I was first taught the collections framework in Java.
The research topic was derived from an understanding of query processing in MySQL and Hadoop, database performance issues, performance tuning, and the importance of database performance. Thus, it was decided to develop a comparative analysis to observe the performance of MySQL (non-clustered) and Hadoop on structured and unstructured datasets (Rosalia, 2015). Furthermore, the analysis compared the two platforms at two different data sizes.
Faster, better decision making: With the speed of Hadoop and in-memory analytics, combined with the capability to analyze new sources of data, businesses are able to analyze data immediately and make decisions based on what they’ve learned.
Hadoop is an open-source framework used by a large group of vendors as an extension to big data analytics frameworks. Frameworks of this type make it easier for companies to decide how they are going to store and use the data within their digital and physical products (James, M. et al. 2011). We can analyze data using Hadoop, which is emerging as a solution to big data problems.
Across industry verticals there is increasing adoption of Hadoop for information management and analytics. Many organizations have realized that, in addition to new business capabilities, Hadoop also offers a host of options for IT simplification and cost reduction; initiatives such as offloads are at the heart of such optimization. That said, capacity planning is the first step that must be carried out successfully for either an IT-driven or a business-driven use case. This paper looks at why big data processing frameworks such as Hadoop clusters require careful capacity planning for the timely launch of big data capabilities. Additionally, it discusses how capacity planning can facilitate appropriate service level agreement (SLA) guarantees and ensure delivery within defined budgets. Such guarantees, backed by standard hardware configurations, are the key to effective capacity planning. The key constituent of an overall capacity management strategy for the Hadoop ecosystem is cluster capacity planning; it is this part of the strategy that caters to the troublesome and unavoidable task of sizing the cluster.
Spark is an open-source cluster computing framework. It was originally developed in Berkeley's AMPLab and later donated to the Apache Software Foundation. Apache Spark is built on the concept of the resilient distributed dataset (RDD), a read-only multiset of data items distributed across the cluster. Spark Core, Spark SQL, Spark MLlib, and Spark Streaming are the main modules of Spark. Spark keeps working data in memory, which enables much faster processing (a small RDD sketch follows).
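A small sketch of the RDD idea, showing that transformations return new, immutable RDDs while actions trigger the distributed computation; the numbers and operations are arbitrary examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddSketch"))

    // RDDs are read-only: each transformation returns a new RDD rather than
    // modifying the original one.
    val numbers = sc.parallelize(1 to 1000)
    val evens   = numbers.filter(_ % 2 == 0)  // new RDD; `numbers` is unchanged
    val squares = evens.map(n => n * n)       // another new RDD

    // Actions such as reduce trigger the actual computation across the cluster.
    println(squares.reduce(_ + _))

    sc.stop()
  }
}
```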
Abstract - The Hadoop Distributed File System, a Java-based file system, provides reliable and scalable storage for data. It is the key component for understanding how a Hadoop cluster can be scaled over hundreds or thousands of nodes. Large amounts of data in a Hadoop cluster are broken down into smaller blocks and distributed across small, inexpensive servers using HDFS. MapReduce functions are then executed on these smaller blocks of data, providing the scalability needed for big data processing. In this paper I will discuss Hadoop in detail: the architecture of HDFS, how it functions, and its advantages.
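A hedged sketch of inspecting that block structure through the HDFS client API; the file path is hypothetical, and the code assumes a reachable cluster configuration on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsBlockSketch {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS etc. from the Hadoop configuration on the classpath.
    val fs = FileSystem.get(new Configuration())
    val status = fs.getFileStatus(new Path("/data/raw/logs/part-00000")) // hypothetical path

    println(s"block size:  ${status.getBlockSize} bytes")
    println(s"replication: ${status.getReplication}")

    // Each block reports the DataNodes holding one of its replicas.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { loc =>
      println(s"offset ${loc.getOffset}: hosts ${loc.getHosts.mkString(", ")}")
    }
  }
}
```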
Data has always been analyzed within companies and used to benefit the future of businesses. However, how data is stored, combined, analyzed, and used to predict the patterns and tendencies of consumers has evolved as technology advanced throughout the past century. In the 1900s databases began as "computer hard disks," and in 1965, after many other discoveries including voice recognition, "the US Government plans the world's first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape." The evolution of data into large databases continued in 1991, when the internet began to appear and "digital storage became more cost effective than paper." With the constant increase in digitally supplied data, Hadoop was created in 2005, and from that point forward "14.7 exabytes of new information are produced this year"; this number is rapidly increasing with the many mobile devices people in our society have today (Marr). The evolution of the internet, and then the expansion of the number of mobile devices society has access to, led data to evolve, and companies now need large central database management systems in order to run efficient and successful businesses.
MapReduce is a simple and powerful programming model that enables the development of scalable parallel applications to process large amounts of data scattered across a cluster of machines. The original implementation of the MapReduce framework had some limitations, which much of the research following its introduction has addressed. It is gaining a lot of traction in both the research and industrial communities because of its capacity to process large data, and the MapReduce framework is used in different applications and for different purposes (the driver sketch below shows how the pieces from the earlier sketches fit together).
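To close the loop, here is a driver that wires up the word-count mapper and reducer from the earlier sketches, registering the reducer as a map-side combiner and installing the custom partitioner; input and output paths come from the command line.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word-count")
    job.setJarByClass(classOf[TokenMapper])

    job.setMapperClass(classOf[TokenMapper])
    job.setCombinerClass(classOf[SumReducer])                 // map-side pre-aggregation
    job.setPartitionerClass(classOf[FirstLetterPartitioner])  // custom key routing
    job.setReducerClass(classOf[SumReducer])

    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])

    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))

    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```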