How to light your 'Spark on a stick'

Testing the Spark Scala shell

In this section we'll use the Spark Scala shell to first create a Scala collection named data. Then we'll convert the collection to a Spark RDD with the parallelize command. Next, we'll run the filter transformation on the RDD. Finally, we'll call an action named collect on the RDD to bring the transformed data back to the Spark shell.

First, run the following command to create a Scala collection of the numbers from 1 to 10,000:

scala> val data = 1 to 10000
data: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8,
9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88,
89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,
123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154,
155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170...
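
At this point data is still an ordinary Scala collection that lives entirely in the shell; Spark is not involved yet. If you want to sanity-check it before handing it to Spark, you can ask for its size (a plain Scala operation; the res variable number shown here is illustrative and will differ in your own session):

scala> data.size
res0: Int = 10000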


Next, create an RDD named distData by parallelizing the Scala collection data:

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:14
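
The sc object used here is the SparkContext that the shell creates for you at startup. By default, parallelize splits the collection into a number of partitions chosen from your master setting (typically one per local core). As an optional variation, not needed for the rest of this example, you can pass the partition count explicitly and check it afterwards; the RDD id and res numbers in your own output will likely differ:

scala> val distData4 = sc.parallelize(data, 4)
distData4: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:14

scala> distData4.partitions.size
res1: Int = 4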


Finally, apply a filter transformation to keep only the numbers under 10, and then run a collect action to bring the results back to the driver (the Spark shell in this case):

scala> distData.filter(_ < 10).collect()
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
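
Note that collect pulls the full result set back to the driver, which is fine for nine numbers but can overwhelm the driver's memory on a large RDD. If you only need to know how many elements match, a count action keeps the data on the executors and returns just the total. A minimal sketch using the same RDD (again, your res number will differ):

scala> distData.filter(_ < 10).count()
res2: Long = 9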


Now, if you refresh the Stages page of the Spark web UI at http://localhost:4040, you should see one completed stage: