Apache Spark started in 2009 as a research project at UC Berkeley's AMPLab, a collaboration involving students, researchers, and faculty focused on data-intensive application domains. The goal of Spark was to create a new framework, optimized for fast iterative processing such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce. The first paper, “Spark: Cluster Computing with Working Sets,” was published in June 2010, and Spark was open sourced under a BSD license. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and it was established as an Apache Top-Level Project in February 2014. Spark can run standalone, on Apache Mesos, or, most frequently, on Apache Hadoop. Today, Spark has become one of the most active projects in the Hadoop ecosystem, with many organizations adopting it alongside Hadoop to process big data. In 2017, Spark had 365,000 meetup members, which represents a 5x growth over two years. It has received contributions from more than 1,000 developers from over 200 organizations since 2009.

Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm. Developers can write massively parallelized operators without having to worry about work distribution and fault tolerance. However, a challenge with MapReduce is the sequential multi-step process required to run a job. With each step, MapReduce reads data from the cluster, performs its operations, and writes the results back to HDFS. Because each step requires a disk read and write, MapReduce jobs are slower due to the latency of disk I/O.

Spark was created to address the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data reuse is accomplished through the creation of DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), which is a collection of objects cached in memory and reused across multiple Spark operations.
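To make the caching idea concrete, here is a minimal toy sketch in plain Python (not Spark's actual API or implementation; the `ToyDataset` class, its `cache`/`collect` methods, and the `loads` counter are all hypothetical). It shows why an iterative algorithm that repeatedly reads the same dataset benefits from an in-memory cache, much as a Spark DataFrame avoids re-reading its source once cached.

```python
# Hypothetical illustration of in-memory caching for iterative workloads.
# A lazily evaluated dataset re-reads its slow source on every action,
# unless cache() has materialized it in memory first.

class ToyDataset:
    def __init__(self, load_fn):
        self._load_fn = load_fn   # simulates a slow read from disk/HDFS
        self._cached = None       # in-memory copy, empty until cached
        self._use_cache = False
        self.loads = 0            # counts how often the source is read

    def cache(self):
        # Mark the dataset for in-memory reuse (materialized on first action).
        self._use_cache = True
        return self

    def collect(self):
        # An "action": return the data, reading the source only if needed.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._load_fn()
        if self._use_cache:
            self._cached = data
        return data


def mean(xs):
    return sum(xs) / len(xs)


# Without caching: every iteration of the loop re-reads the source.
uncached = ToyDataset(lambda: [1.0, 2.0, 3.0, 4.0])
for _ in range(5):
    mean(uncached.collect())
print(uncached.loads)  # 5 reads

# With caching: the first action materializes the data; later ones reuse it.
cached = ToyDataset(lambda: [1.0, 2.0, 3.0, 4.0]).cache()
for _ in range(5):
    mean(cached.collect())
print(cached.loads)  # 1 read
```

In real PySpark code the analogous call is `df.cache()` (or `df.persist()`) before an iterative loop of actions; the sketch above only mimics the bookkeeping, not Spark's distributed execution.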