How Can Spark Give Better Performance Than the Hadoop Distributed File System?

6. What is Spark?

Spark is an in-memory cluster computing framework. It is open source and part of the Hadoop ecosystem, but it does not follow the two-stage MapReduce paradigm that Hadoop uses to process data stored in the Hadoop Distributed File System (HDFS), and it is designed to be faster. Instead, Spark extends the MapReduce model to support a wide range of computations and applications, including interactive queries, stream processing, batch processing, and iterative algorithms. Execution time is the most important factor for any process that handles large amounts of data. With large amounts of data, the time it usually takes to explore the data and execute queries can be thought of in terms of…
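The contrast between a disk-bound, stage-by-stage approach and in-memory computing can be illustrated with a small sketch. This is plain Python, not Spark or Hadoop code, and every name in it (`DiskSource`, `iterate_from_disk`, `iterate_in_memory`, `run_demo`) is hypothetical and chosen only for illustration. It assumes an iterative job that makes several passes over the same dataset and counts how often the data is read from storage in each style.

```python
import json
import os
import tempfile


class DiskSource:
    """Hypothetical data source that counts how often it is read from disk."""

    def __init__(self, path):
        self.path = path
        self.reads = 0

    def load(self):
        self.reads += 1
        with open(self.path) as f:
            return json.load(f)


def iterate_from_disk(source, iterations):
    # Disk-bound style: every pass re-reads its input from storage.
    total = 0
    for _ in range(iterations):
        data = source.load()
        total += sum(x * x for x in data)
    return total


def iterate_in_memory(source, iterations):
    # In-memory style: load once, then reuse the cached working set.
    cached = source.load()
    total = 0
    for _ in range(iterations):
        total += sum(x * x for x in cached)
    return total


def run_demo(iterations=5):
    # Write a tiny dataset to a temporary file and run both styles on it.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.json")
        with open(path, "w") as f:
            json.dump(list(range(10)), f)
        disk, mem = DiskSource(path), DiskSource(path)
        disk_result = iterate_from_disk(disk, iterations)
        mem_result = iterate_in_memory(mem, iterations)
        return disk_result, mem_result, disk.reads, mem.reads
```

Both functions compute the same result, but the disk-bound version reads the file once per iteration while the in-memory version reads it only once. On a real cluster the difference would also depend on network transfer, serialization, and available memory, none of which this sketch models.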
Spark also reduces the overhead of maintaining separate tools. It provides flexible access through APIs in Python, Java, Scala, and SQL, and it ships with rich built-in libraries that offer different functionality. It can also be integrated with other big data tools; for example, it can run on Hadoop clusters.

6.1 A Unified Stack

Figure 1-1. The Spark Stack

Spark is a collection of closely integrated components. These components can be combined and used as simply as including multiple libraries in a project. Spark has multiple components, each important in its own way, and they depend on each other. At its core, Spark is a computational engine responsible for scheduling, distributing, and monitoring applications that consist of many computational tasks across a computing cluster. On top of this engine, higher-level components handle specialized workloads such as machine learning. The components are closely coupled, which has several advantages. First, any improvement in the lower layers makes the higher-level libraries and components perform better; for example, when an optimization is added to the core engine, the SQL and machine learning libraries also perform better. Second, it reduces the cost of running the stack, because different software systems do not have to be run independently. These costs are mostly related to