Introduction to Spark
Eric Eijkelenboom, UserReport (userreport.com)
Jan 26, 2015
• What is Spark and why should I care?
• Architecture and programming model
• Examples
• Mini demo
• Related projects
RTFM
• A general-purpose computation framework that leverages distributed memory
• More flexible than MapReduce (it supports general execution graphs)
• Linear scalability and fault tolerance
• Supports a rich set of higher-level tools, including:
  • Shark (Hive on Spark) and Spark SQL
  • MLlib for machine learning
  • GraphX for graph processing
  • Spark Streaming
Who cares?
Limitations of MapReduce
• Slow due to serialisation & replication
• Inefficient for iterative computing & interactive querying
[Diagram: each iteration reads its input from HDFS, runs Map and Reduce stages, and writes its output back to HDFS before the next iteration can start]
Leveraging memory
[Diagram: the input is read from HDFS once; intermediate results between iterations stay in memory instead of being written to and re-read from HDFS]
• Not tied to the two-stage MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
So, Spark is…
• In-memory analytics, many times faster than Hadoop/Hive
• Designed for running iterative algorithms & interactive querying
• Highly compatible with Hadoop's storage APIs
• Can run on your existing Hadoop cluster setup
• Programming in Scala, Python or Java
Spark stack
Architecture
[Diagram: the Spark Driver (Master) coordinates work through a Cluster Manager; each Spark Worker holds a cache and runs alongside an HDFS DataNode storing data blocks]
The cluster manager can be:
• YARN
• Mesos
• Standalone
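In code, the choice shows up in the master URL handed to Spark. A minimal sketch (hostnames and ports are placeholders, not from the slides):

    import org.apache.spark.{SparkConf, SparkContext}

    // The master URL selects the cluster manager; these are the standard formats
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://master:7077")    // Standalone
      // .setMaster("yarn-cluster")        // YARN
      // .setMaster("mesos://master:5050") // Mesos
      // .setMaster("local[*]")            // local testing
    val sc = new SparkContext(conf)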
Programming model
• Resilient Distributed Datasets (RDDs) are the basic building blocks
  • A distributed collection of objects, cached in memory across cluster nodes
  • Automatically rebuilt on failure
• RDD operations (see the sketch below):
  • Transformations: create new RDDs from existing ones
  • Actions: return a value to the driver (master) after running a computation on the dataset
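A minimal sketch of the distinction, assuming an existing SparkContext sc:

    val nums = sc.parallelize(1 to 1000000)   // build an RDD from a local collection
    val evens = nums.filter(_ % 2 == 0)       // transformation: lazy, returns a new RDD
    val squares = evens.map(n => n * n)       // transformation: still nothing has executed
    println(squares.count())                  // action: runs the job, returns a value to the driver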
As you know…
• … Hadoop is a distributed system for counting words
• Here is how it's done in Spark
Blue code: Spark operations
Red code: functions (closures) that get passed to the cluster automatically
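The code itself was an image in the original deck; below is a sketch of the canonical Spark word count it presumably showed, with comments standing in for the colour coding (the HDFS paths are placeholders):

    val file = sc.textFile("hdfs://...")               // Spark operation
    val counts = file.flatMap(line => line.split(" ")) // closure shipped to the cluster
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")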
Text search
In-memory text search: calling cache() on the RDD keeps it in memory for faster reuse
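The original code slides are not preserved; a sketch of the classic log-search example, with cache() marking the in-memory variant (the file path and search terms are made up):

    val lines = sc.textFile("hdfs://...")                // load a log file from HDFS
    val errors = lines.filter(_.contains("ERROR"))       // transformation: nothing runs yet
    errors.cache()                                       // keep the working set in memory

    errors.count()                                       // first action: reads HDFS, fills the cache
    errors.filter(_.contains("timeout")).count()         // later queries hit the in-memory copy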
Logistic regression
• 100 GB of data on a 100 node cluster
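The slide's code is not preserved; below is a sketch in the spirit of the classic Spark logistic-regression example (file path, input format and iteration count are assumptions). The point is that points is cached once and re-read from memory on every iteration:

    import scala.math.exp

    case class Point(x: Array[Double], y: Double)        // y is the label: +1 or -1

    // Assumed input format: label followed by space-separated features
    val points = sc.textFile("hdfs://...").map { line =>
      val t = line.split(' ').map(_.toDouble)
      Point(t.tail, t.head)
    }.cache()                                            // working set stays in memory

    val d = points.first().x.length
    var w = Array.fill(d)(0.0)                           // initial weight vector

    for (_ <- 1 to 10) {                                 // each pass reuses the cached RDD
      val gradient = points.map { p =>
        val dot = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val s = (1.0 / (1.0 + exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * s)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }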
Easy unit testing
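A local-mode SparkContext runs the whole pipeline in-process, so RDD code can be tested without a cluster. A sketch, assuming ScalaTest (the test body is a made-up example):

    import org.apache.spark.SparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite {
      test("counts words") {
        val sc = new SparkContext("local", "test")       // in-process cluster, no setup needed
        try {
          val counts = sc.parallelize(Seq("a b", "b"))
            .flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("b") == 2)
        } finally {
          sc.stop()                                      // always release the context
        }
      }
    }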
Spark shell
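An interactive Scala REPL with a ready-made SparkContext. A hypothetical session (the file name is an assumption):

    // Started with ./bin/spark-shell; the shell pre-creates `sc`
    scala> val lines = sc.textFile("README.md")
    scala> lines.filter(_.contains("Spark")).count()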
Mini demo
Hive on Spark = Shark
• A large-scale data warehouse system, just like Hive
• Highly compatible with Hive (HQL, metastore, serialisation formats, and UDFs)
• Built on top of Spark (and thus a faster execution engine)
• Can create in-memory materialised tables (cached tables)
• Cached tables use columnar storage instead of raw storage
Shark
Shark uses the existing Hive client and metastore
MLlib
• Machine learning library based on Spark
• Supports a range of machine learning algorithms, including classification, regression, clustering, collaborative filtering, and dimensionality reduction (see the sketch below)
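As an illustration (not from the slides), a k-means clustering sketch against the Spark 1.x MLlib API; the input path and format are assumptions:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumed input: one line per point, space-separated numeric features
    val data = sc.textFile("hdfs://...")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    val model = KMeans.train(data, 10, 20)   // k = 10 clusters, 20 iterations
    model.clusterCenters.foreach(println)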
Spark Streaming
• Write streaming applications in the same way as batch applications
• Reuse code between batch processing and streaming
• Write more than analytics applications:
  • Join streams against historical data
  • Run ad-hoc queries on stream state
Spark Streaming
• Count tweets on a sliding window
• Find words with higher frequency than historic data
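The code slides are not preserved; a sketch of the well-known Twitter hashtag example for the sliding-window count (window and batch sizes are assumptions):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    val ssc = new StreamingContext(sc, Seconds(1))       // 1-second batches
    val tweets = TwitterUtils.createStream(ssc, None)    // None: default OAuth credentials
    val hashTags = tweets.flatMap(s => s.getText.split(" ").filter(_.startsWith("#")))

    // Count each hashtag over a 60-second window that slides every second
    val counts = hashTags.map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(1))
    counts.print()

    ssc.start()
    ssc.awaitTermination()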
GraphX: graph computing