How to light your 'Spark on a stick'

Your USB 'Spark on a stick' contains a copy of the open source, Apache 2.0 licensed Spark 1.1 project, including higher-level libraries such as MLlib, Spark SQL, Spark Streaming, Tachyon, BlinkDB, and SparkR.

This gitbook will get you started learning Spark from a DevOps perspective: a little development and a little operations.

You'll see how to:

run Spark on your local Windows, OS X, or Linux laptop
launch the Spark web UIs
launch the Spark Scala & Python shells
read data from a file and transform it with Spark
cache the transformed data into memory
take an action to write the final RDD back to local disk
locate the Spark log files

The accompanying slides for the 1-day "Intro to Apache Spark" workshop can be found here: http://training.databricks.com/workshop/itas_workshop.pdf

There is also a 4-part recording of the "Intro to Apache Spark" training workshop from the San Francisco Spark Summit 2014 available here: https://www.youtube.com/watch?v=VWeWViFCzzg&list=PLTPXxbhUt-YWSgAUhrnkyphnh0oKIT8-j

Note: this is the Windows version of the lab document. It includes comments for running on OS X or Linux, but you may need to adjust the instructions for other operating systems.

This document is licensed under Creative Commons, so feel free to print it, share it and especially add to it. This section covers how to make GitHub pull requests to help grow this document.