Introducing Apache Spark
This is the beginning of a series of articles looking at practical aspects of using Apache Spark. Apache Spark is a general purpose distributed compute platform, providing a robust and resilient foundation for processing fast and big data.
Spark Overview
Spark started out as a research project at UC Berkeley in 2009. It is currently hosted as an ASF top level project (http://spark.apache.org) under the Apache License 2.0, and is one of the most active ASF projects, as well as one of the most active open source projects overall (https://github.com/apache/spark/graphs/commit-activity).
It has been built by more than 800 developers from around 200 companies, and currently has 600,000 lines of code, around 70-80% of which is written in Scala.
Getting Spark
The latest stable release is 1.5.1 as of September 2015 and can be downloaded from https://spark.apache.org/downloads.html. It comes with pre-packaged builds for different versions of Hadoop, both for accessing HDFS (Hadoop Distributed File System) and for running Spark on YARN, the Hadoop cluster manager. You may also download the source and build it with Apache Maven. To embed Spark into your application through a build system like Maven or SBT, use the group id org.apache.spark and the artefact id spark-core_<SCALA_VERSION>. If you are using SBT, it will resolve the Scala version suffix automatically, as shown in the sketch below.
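As a quick illustration, here is what the dependency might look like in an SBT build definition; the project name and the Scala version shown are assumptions, chosen to match the 1.5.1 release mentioned above.

```scala
// build.sbt — a minimal sketch, not a complete build definition
name := "spark-intro"
scalaVersion := "2.10.5"

// Using %% lets SBT append the Scala binary version to the artefact id (spark-core_2.10)
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
```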
Spark Architecture
Apache Spark is a general purpose distributed computing platform, primarily focused on parallel processing of big and fast data. Those who have worked on applications built on traditional map-reduce infrastructure will have come across the frequent need to read and write intermediate data to disk at each stage of a map-reduce processing pipeline. Spark instead treats map-reduce pipelines as directed acyclic graphs of transformations and actions against data, lazily executed in parallel across machine boundaries with a model geared towards in-memory processing. It also has the ability to checkpoint the pipeline to eliminate expensive recomputation of data.
Apache Spark is made up of a number of distinct modules each addressing a specific facet of distributed data processing.
Spark Core
This provides the key Spark infrastructure for cluster computing: abstractions for sourcing and sinking data from a variety of datasources such as local file systems, Hadoop HDFS, RDBMS and NoSQL stores, and the core abstraction of Spark’s RDD (Resilient Distributed Dataset), which is essentially a Scala collection on steroids. RDDs provide an abstraction very similar to Scala collections, while adding the ability to perform transformations and actions against the data across multiple machine boundaries.
The RDD API provides a unified access model over heterogeneous datasources in varying formats. The adapters provided by Spark allow data to be read efficiently in parallel and processed in parallel by applying transformations and actions.
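A minimal sketch of the RDD API is shown below; the application name and input path are placeholders, and the example assumes a single-node development setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local development setup: a single node using all available cores
val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
val sc   = new SparkContext(conf)

// Transformations (filter) are lazy; the action (count) triggers the distributed execution
val lines  = sc.textFile("/path/to/server.log")   // hypothetical input file
val errors = lines.filter(_.contains("ERROR"))    // transformation: nothing runs yet
println(s"Error lines: ${errors.count()}")        // action: pipeline executes in parallel

sc.stop()
```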
Spark SQL
Spark SQL provides an abstraction over RDDs for working with structured data through a SQL interface. Spark SQL also includes the DataFrame API, a higher level abstraction over RDDs for working with columnar data. It enables a rich abstraction across heterogeneous datasources, allowing data from different datasources to be merged, and provides a high degree of optimisation by delegating the aggregation and filtering facets of data processing to the source wherever possible.
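The sketch below shows both the DataFrame API and the SQL interface as they look in Spark 1.5; it assumes an existing SparkContext named sc and a hypothetical people.json input file.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people     = sqlContext.read.json("/path/to/people.json")   // hypothetical datasource

// DataFrame API: filtering and aggregation that can be pushed towards the source
people.filter(people("age") > 21)
      .groupBy("country")
      .count()
      .show()

// Or register the DataFrame as a temporary table and use plain SQL
people.registerTempTable("people")
sqlContext.sql("SELECT country, COUNT(*) FROM people WHERE age > 21 GROUP BY country").show()
```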
Spark Streaming
Whereas Spark Core mainly deals with processing batch oriented data, Spark Streaming is all about fast processing of live streams of data. Spark Streaming applies the RDD API to live streams of data, which are split into micro-batches over time windows and processed in parallel.
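As a sketch of that model, the example below counts words arriving on a socket in 10-second micro-batches; it assumes an existing SparkContext sc, and the host and port are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(10))        // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)      // hypothetical text stream

// The familiar RDD-style operations, applied to each batch of the stream
lines.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .print()

ssc.start()
ssc.awaitTermination()
```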
MLlib
MLlib is Spark’s machine learning library, providing the infrastructure to apply common ML algorithms on a resilient, clustered, distributed computing infrastructure. It supports the usual machine learning facets, such as classification, regression and prediction, applied on a distributed compute platform.
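As a sketch, the example below clusters a hypothetical set of numeric feature vectors with k-means; it assumes an existing SparkContext sc, and the CSV file path is a placeholder.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each CSV line into a dense feature vector; the file path is a placeholder
val features = sc.textFile("/path/to/features.csv")
                 .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
                 .cache()

// Train a k-means model with 3 clusters over 20 iterations, distributed across the cluster
val model = KMeans.train(features, 3, 20)
model.clusterCenters.foreach(println)
```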
GraphX
The Spark GraphX library provides an API for processing graph oriented data in a distributed, clustered environment. Again, the core graph processing abstractions are built on top of the RDD API.
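A minimal sketch of the GraphX API is shown below; it assumes an existing SparkContext sc, and the vertices and edges are made up for illustration.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices and edges are plain RDDs underneath
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
println(s"Vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

// PageRank runs as a series of distributed RDD transformations
graph.pageRank(0.001).vertices.collect().foreach(println)
```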
Language Bindings
Spark offers API bindings in the following languages:
- Java
- Scala
- Python
- R
Of all of these, Scala is the best choice for a number of reasons:
- 80% of Spark is written in Scala, so it is “native” to Spark and, unlike with R and Python, application logic runs in-process
- Most of the API enhancements will be available first in Scala before other bindings
- Scala being a functional language, its abstractions and core semantics lend themselves well to the map-reduce world and to shipping self-contained logic to multiple nodes in a cluster
- The API, with its type inference and support for higher order functions, is more succinct and less verbose than the Java API (see the sketch after this list)
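To illustrate that succinctness, here is a hypothetical word count against the Scala RDD API; it assumes an existing SparkContext sc, and the input path is a placeholder.

```scala
// Type inference and higher-order functions keep the whole pipeline to a few lines
val counts = sc.textFile("/path/to/input.txt")
               .flatMap(_.split("\\s+"))
               .map(word => (word, 1))
               .reduceByKey(_ + _)

counts.take(10).foreach(println)
```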
Clustering Infrastructure
Spark can either run as a single node with multiple threads (for development) or across multiple physical nodes. In the distributed mode, it can either use its built-in clustering infrastructure or leverage Hadoop YARN or the emerging data centre O/S Apache Mesos. The core abstraction is the driver, which starts the user application, and one or more executors, to which the driver ships the application code to read, process and write data in parallel.
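The sketch below shows the master URLs that select between these modes when constructing a SparkConf; the application name, host names and ports are placeholders.

```scala
import org.apache.spark.SparkConf

// Single node with 4 worker threads — handy for development
val local      = new SparkConf().setAppName("my-app").setMaster("local[4]")

// Spark's built-in (standalone) cluster manager
val standalone = new SparkConf().setAppName("my-app").setMaster("spark://master-host:7077")

// Hadoop YARN, client mode (Spark 1.5 style master URL)
val yarn       = new SparkConf().setAppName("my-app").setMaster("yarn-client")

// Apache Mesos
val mesos      = new SparkConf().setAppName("my-app").setMaster("mesos://mesos-host:5050")
```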
Applications of Apache Spark
Spark is a general purpose distributed compute platform focused on processing large amounts of data efficiently and rapidly, and as such lends itself very well to numerous data science and data processing applications. With its interactive shell and choice of language bindings, it is well suited both to data scientists performing ad hoc analysis and to data engineers building robust data processing applications.
What Next?
We want to make this series as interesting as possible, without getting bogged down in theory, taking readers through real world, practical applications of Spark. In the next edition we will look at building a simple Spark based application using Scala and SBT.