Apache Spark Resilient Distributed Datasets

Table of Contents

Abstract
1 Introduction
2 Spark Core
 2.1 Transformations
 2.2 Actions
3 Spark SQL
4 Spark Streaming
5 GraphX
6 MLlib Machine Learning library
7 How to Interact with Spark
8 Shared Variables
 8.1 Broadcast Variables
 8.2 Accumulators
9 Sample Word Count Application
10 Summary
References

Abstract

Cluster computing frameworks like MapReduce have been widely successful in solving numerous Big Data problems. However, they tend to apply one well-known map-and-reduce pattern to all of these problems. There are many other classes of problems that do not fit into this closed box and may be handled using other programming models. This is where Apache Spark comes in to help solve these problems.
1 Introduction

Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. [3] The availability of RDDs facilitates the implementation of both iterative algorithms, which visit their dataset multiple times in a loop, and exploratory data analysis, i.e., the repeated querying of data. The latency of applications built with Spark, compared to Hadoop, a MapReduce platform, may be reduced by several orders of magnitude. [3]

Another key aspect of Apache Spark is that it makes code quick and easy to write. This is a result of the more than 80 high-level operators included in the Spark library, and it is evident in the REPL, the interactive shell that ships with Spark. The REPL can be used to test the outcome of each line of code without coding the entire job. As a result, ad-hoc data analysis is possible and code is made much shorter (see the sketch at the end of this section). Apache Spark is complemented by a set of high-level libraries that can easily be used in the same application. These include Spark SQL, Spark Streaming, MLlib and GraphX, which we explain in detail in this paper.

2 Spark Core

Spark Core is at the base of the Apache Spark stack. It provides distributed task dispatching, scheduling, and basic I/O capabilities, exposed through a common API (Java, Scala, Python and R) centered on the RDD concept. [1] As the core, it provides the following:
• Memory management and fault recovery.
• Scheduling,
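To make the RDD-centred API and the REPL workflow concrete, here is a minimal sketch of an interactive spark-shell (Scala) session. The input file name pages.txt is a hypothetical placeholder, and `sc` is the SparkContext that the shell predefines; each transformation only describes a new RDD, while the actions at the end trigger the actual distributed computation.

```scala
// Minimal sketch, assuming a spark-shell session (where `sc` is predefined)
// and a hypothetical local text file "pages.txt".
val lines = sc.textFile("pages.txt")          // RDD[String], evaluated lazily

// Transformations: each returns a new RDD; nothing runs yet.
val words  = lines.flatMap(_.split("\\s+"))   // split each line into words
val pairs  = words.map(w => (w, 1))           // pair each word with a count of 1
val counts = pairs.reduceByKey(_ + _)         // sum the counts per word

// Caching keeps the RDD in memory, so repeated ad-hoc queries over the same
// working set avoid re-reading the file: the behaviour described above.
counts.cache()

// Actions: these trigger the job and return results to the driver.
println(counts.count())                       // number of distinct words
val top10 = counts.sortBy(_._2, ascending = false).take(10)
top10.foreach(println)                        // ten most frequent words
```

Each line can be entered in the shell on its own and its result inspected immediately, which is what makes the REPL convenient for exploratory analysis.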