
RDD – Scala Collections on Steroids


In this article we will look at one of the core underpinning constructs of the Spark architecture and programming model, the RDD. RDD is an abbreviation for Resilient Distributed Dataset, which is essentially an immutable, resilient, distributed collection of data that can be loaded, transformed and aggregated across the boundaries of multiple process spaces. Coming from a Scala background, I consider RDDs to offer a natural programming model for working with data, much like Scala collections, while at the same time having the ability to parallelise and distribute the underlying data and its processing across multiple compute nodes. The RDD API offers most of the higher order functions and combinators available in Scala collections.

Note: For the purpose of the examples in this article, we will be using the Spark Scala shell, which is very similar to the Scala REPL familiar to Scala developers, and for data we will be using the same sold house price data we used in the previous article. We are using version 1.5.1 of Spark, and it is assumed that you have downloaded and extracted the Spark distribution and made the contents of the bin directory available on the path. For convenience, create a working directory, copy the data file into it and start the Spark shell from that directory.

Starting with RDDs

In the Scala API, RDDs are represented by the parameterised abstract class org.apache.spark.rdd.RDD[T]. This class defines the core abstractions for the operations that can be performed against a distributed collection of objects. The class also has a number of typed variants, based on the type of the objects contained within the RDD, which offer additional operations pertinent to the contained type through implicit conversions. For example, an RDD of double values offers operations to apply statistical functions against the contained data, as illustrated in the sketch below.
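As a quick illustration in the Spark shell (sc.parallelize, used here to build a small in-memory RDD, is covered in the next section), the statistical operations become available through the implicit conversion for RDDs of doubles:

val doubles = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))
doubles.mean()    // arithmetic mean, available through the implicit conversion
doubles.stats()   // count, mean, stdev, max and min computed in a single pass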

There are primarily two ways to construct an RDD:

  1. By parallelising an existing collection in memory
  2. Loading data from an external dataset

Parallelising a Collection

The easiest way to create an RDD instance is by parallelising a collection that is available in memory, as shown below:

val rdd = sc.parallelize(List(1, 2, 3, 4))

The reference sc refers to an instance of SparkContext available within the Spark shell. The reference rdd is now a collection that can be operated upon in parallel. In reality, one would use this mechanism to create an RDD only for the purpose of testing. In real world use cases, RDD instances are created by loading data from external sources.
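As a quick illustration of operating on the parallelised collection, a transformation followed by an action can be run directly in the shell:

rdd.map(_ * 2).collect()   // returns Array(2, 4, 6, 8), computed partition by partition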

In the above operation one could also pass in the number of partitions required, using an overloaded version of the function. If not specified, Spark will try to guess the number of partitions required based on the number of CPUs available in the cluster topology.
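A minimal sketch of requesting an explicit number of partitions, and confirming the partition count, is shown below:

val rdd = sc.parallelize(List(1, 2, 3, 4), 4)   // explicitly request four partitions
rdd.partitions.size                             // returns 4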

Loading from an External Source

In this section we will reuse the house price data we downloaded in the last article. Spark can load data from a variety of external data sources such as local text files, HDFS files, relational databases, Amazon S3 buckets and so on. Additionally, one could write a data adaptor that loads data from a data source of choice.

The simplest way to load the data is as shown below:

scala> val rawData = sc.textFile("pp-2014.csv")
rawData: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at textFile at <console>:21
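
As an aside, the same textFile call works against other storage back ends as well; the paths below are hypothetical and assume the relevant HDFS and S3 client configuration is in place:

val fromHdfs = sc.textFile("hdfs://namenode:8020/data/pp-2014.csv")   // hypothetical HDFS path
val fromS3   = sc.textFile("s3n://some-bucket/pp-2014.csv")           // hypothetical S3 bucket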

Please note that at this point the reference rawData is only a pointer to the data and none of the data has been loaded. On the RDD instance one can invoke a series of operations, which are classified as either transformations or actions. Generally, the underlying data is loaded the first time an action is invoked on the RDD.
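
As a small illustration of this laziness, the map below merely defines a new RDD without reading anything, whereas count is an action and triggers the actual read of the file:

val lengths = rawData.map(line => line.length)   // transformation only, no data is read yet
lengths.count()                                  // action, triggers the file to be read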

Each element of the text RDD is a line in the underlying file. If one wants to print the contents of the file, one could do the following:

rawData.foreach(line => println(line))

Running in the local shell, this will print the contents of the file to the console. Note that on a real cluster the function passed to foreach is executed on the worker nodes, so the output would appear there rather than on the driver. Obviously, with large datasets one would seldom do this. Rather, the data will be operated upon in parallel using classic map-reduce techniques, and the condensed data will be brought back to the driver for further analysis.
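
To pull just a small sample back to the driver for inspection, one could instead use take, as in the sketch below:

rawData.take(5).foreach(println)   // brings only the first five lines back to the driver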

If we want to just look at the first element of the RDD, we could do the following.

scala> rawData.first
res9: String = "{BDBD7612-68D3-47DB-8F02-01CB6C1F4F28}","370000","2014-02-28 00:00","HA9 8QF","S","N","F","13","","SECOND AVENUE","","WEMBLEY","BRENT","GREATER LONDON","A","A"

In our example, each line corresponds to a comma separated list of values pertinent to sold house prices, where the second field corresponds to the actual price. The example below processes the file in parallel, extracting the second field and totalling the values across all the records.

scala> val prices = rawData.map(line => line.split(",")(1).replace("\"", "").toDouble)
prices: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[7] at map at <console>:23

scala> val total = prices.reduce((a, b) => a + b)
total: Double = 2.74804468583E11                                                

The first line runs a map operation on every element of the RDD, splitting each comma separated line into fields, extracting the second field, stripping the double quotes and converting the value to a double. At this point no data is loaded into the memory of the worker instances; the reference prices simply refers to a new RDD created by transforming the rawData RDD. On the next line we invoke an action, a classic reduce, which walks through the list of prices and adds them up. This is the point at which Spark loads the data in parallel across the cluster and applies the operations.
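
Since prices is an RDD of doubles, the statistical operations mentioned earlier are also available through the implicit conversion, so the same total, as well as an average sold price, could be obtained as sketched below:

prices.sum()    // equivalent to the reduce above
prices.mean()   // average sold price across all the records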

What Next

In the next edition we will look at RDD transformations in detail.
