Introduction to Apache Spark
Upcoming Spark Meetups
• 4 meetups to put basics in place
– Intro (today)
– SQL & Streaming
– Machine Learning & Graphs
– DevOps & Troubleshooting
• Followed by deep-dives in machine learning
– Classification & Regression (Random Forests, Support Vector Machines)
– Clustering
– Feature Extraction and Transformation
– Collaborative Filtering
– Dimensionality Reduction
Agenda for Meetup 1
• What is Apache Spark?
• The Stack
• Resilient Distributed Datasets
• Transformations & Actions
The Book
The Stack
[Diagram: Your Applications run on Spark SQL, Spark Streaming, MLlib, and GraphX; these are built on Spark Core, which runs on Mesos, YARN, or the Standalone cluster manager]
sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
  .map(t => "Name: " + t(0))
  .collect()
  .foreach(println)
Distributed Execution
[Diagram: the Driver's Spark Context sends Tasks to Executors running on Worker Nodes]
Resilient Distributed Datasets (RDD)
• Immutable: never modified – just transformed to new RDDs
• Distributed: split into multiple partitions and spread across multiple servers in a cluster
• Resilient: can be re-computed if they get destroyed
• Created by (see the sketch below):
– Loading external data
– Distributing a collection of objects in the driver program
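A minimal sketch of both creation paths, assuming an existing SparkContext named sc and a hypothetical file path:

  // Distribute a local collection from the driver program
  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Load external data: one RDD element per line of the file
  val lines = sc.textFile("input.txt")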
RDD Implementation
• Array of partitions
• List of dependencies on parent RDDs
• Function to compute a partition given its parents
– Returns an Iterator over a partition
• Preferred locations: list of strings for each partition (Nil by default)
• Partitioner (None by default)
Persistence / Caching
• By default RDDs (and all of their dependencies) are recomputed every time an action is called on them!
• Need to explicitly tell Spark when to persist
• Options:
– Default: stored in the heap as unserialized objects (pickled objects for Python)
– Memory only: serialized or not
– Memory and disk: spills to disk, option to serialize in memory
– Disk only
• Tachyon: off-heap caching (experimental)
– Aims to make Spark more resilient
– Avoid GC overheads
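A minimal caching sketch, assuming a SparkContext named sc and a hypothetical "words.txt":

  import org.apache.spark.storage.StorageLevel

  val words = sc.textFile("words.txt").flatMap(_.split(" "))

  // Persist across actions; MEMORY_AND_DISK spills partitions that
  // don't fit in memory to disk instead of dropping them
  words.persist(StorageLevel.MEMORY_AND_DISK)

  words.count()  // first action computes and caches the RDD
  words.first()  // later actions reuse the cached partitions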
Dependency Types: Narrow
Each partition of the parent is used by at most one partition of the child
• E.g. map, filter
• E.g. union
• E.g. join with co-partitioned inputs
Dependency Types: Wide
Each partition of the parent is used by more than one partition of the child
• E.g. groupByKey
• E.g. join with inputs not co-partitioned
Transformations
• Return a new RDD
• Lazy evaluation
• Single RDD transformations: map, flatMap, filter, distinct
• Pair RDDs: keyBy, reduceByKey, groupByKey, combineByKey, mapValues, flatMapValues, sortByKey
• Two RDD transformations: union, intersection, subtract, cartesian
• Two pair RDDs: join, rightOuterJoin, leftOuterJoin, cogroup
Actions
• Force evaluation of the transformations and return a value to the driver program or write to external storage
• Actions on RDDs:
– reduce, fold, aggregate
– foreach(func), collect
– count, countByValue
– top(num)
– take(num), takeOrdered(num)(ordering)
• Actions on pair RDDs:
– countByKey
– collectAsMap
– lookup(key)
Single RDD Transformations
map and flatMap
• map takes a function that transforms each element of a collection: map(f: T => U)
• RDD[T] => RDD[U]
• flatMap takes a function that transforms a single element of a collection into a sequence of elements: flatMap(f: T => Seq[U])
• Flattens out the output into a single sequence
• RDD[T] => RDD[U]
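A minimal sketch contrasting the two, assuming a SparkContext named sc:

  val lines = sc.parallelize(Seq("hello world", "hi"))

  // map: exactly one output element per input element
  val lengths = lines.map(_.length)        // RDD[Int]: 11, 2

  // flatMap: each element becomes a sequence, then flattened
  val words = lines.flatMap(_.split(" "))  // RDD[String]: "hello", "world", "hi"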
filter, distinct
• filter takes a (predicate) function that returns true if an element should be in the output collection: filter(f: T => Boolean)
• distinct removes duplicates from the RDD
• Both filter and distinct transform from RDD[T] => RDD[T]
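A minimal sketch, assuming a SparkContext named sc:

  val nums = sc.parallelize(Seq(1, 2, 2, 3, 4, 4))

  val evens = nums.filter(_ % 2 == 0)  // RDD[Int]: 2, 2, 4, 4
  val unique = nums.distinct()         // RDD[Int]: 1, 2, 3, 4 (order not guaranteed)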
Actions
reduce, fold & aggregate
• reduce takes a function that combines elements of a collection pairwise: reduce(f: (T, T) => T)
• fold is like reduce except it takes a zero value i.e. fold(zero: T)(f: (T, T) => T)
• reduce and fold: RDD[T] => T
• aggregate is the most general form
• aggregate(zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U)
• aggregate: RDD[T] => U
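A minimal sketch, assuming a SparkContext named sc; aggregate computes a sum and a count in one pass, from which an average can be derived:

  val nums = sc.parallelize(Seq(1, 2, 3, 4))

  val sum = nums.reduce(_ + _)    // 10
  val sum2 = nums.fold(0)(_ + _)  // 10, with an explicit zero value

  val (total, count) = nums.aggregate((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one value into the accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge per-partition accumulators
  )
  val mean = total.toDouble / count        // 2.5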
Pair RDD Transformations
keyBy, reduceByKey
• keyBy creates tuples of the elements in an RDD by applying a function: keyBy(f: T => K)
• RDD[ T ] => RDD[ (K, T) ]
• reduceByKey takes a function that takes two values and returns a single value: reduceByKey(f: (V, V) => V)
• RDD[ (K, V) ] => RDD[ (K, V) ]
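A minimal sketch, assuming a SparkContext named sc:

  val words = sc.parallelize(Seq("spark", "scala", "storm"))

  // keyBy: key each word by its first character
  val byInitial = words.keyBy(_.head)  // RDD[(Char, String)]

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val sums = pairs.reduceByKey(_ + _)  // ("a", 4), ("b", 2)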
groupByKey
• Takes a collection of key-value pairs and no parameters
• Returns a sequence of values associated with each key
• RDD[ ( K, V ) ] => RDD[ ( K, Iterable[V] ) ]
• The values for each key must fit in memory
• Can be slow – use aggregateByKey or reduceByKey where possible
• Ordering of values not guaranteed and can vary on every evaluation
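A minimal sketch, assuming a SparkContext named sc:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  val grouped = pairs.groupByKey()
  // ("a", Iterable(1, 3)), ("b", Iterable(2)) -- value order not guaranteed

  // Prefer reduceByKey when a per-key aggregate is all you need:
  // it combines map-side and shuffles far less data
  val sums = pairs.reduceByKey(_ + _)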
combineByKey
• def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)
• RDD[ (K, V) ] => RDD[ (K, C) ]
• createCombiner is called per partition when a new key is found
• mergeValue combines a new value into an existing accumulator
• mergeCombiners merges results from different partitions
• Sometimes map-side combine is not useful, e.g. groupByKey
• groupByKey, aggregateByKey and reduceByKey are all implemented using combineByKey
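A minimal sketch of the three callbacks, assuming a SparkContext named sc; here the combiner type C is a (sum, count) pair used to compute per-key averages:

  val scores = sc.parallelize(Seq(("a", 10.0), ("b", 4.0), ("a", 20.0)))

  val sumCounts = scores.combineByKey(
    (v: Double) => (v, 1),                                        // createCombiner: first value for a key
    (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),  // mergeValue
    (a: (Double, Int), b: (Double, Int)) =>
      (a._1 + b._1, a._2 + b._2)                                  // mergeCombiners
  )
  val averages = sumCounts.mapValues { case (sum, count) => sum / count }
  // ("a", 15.0), ("b", 4.0)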
map vs mapValues
• map takes a function that transforms each element of a collection: map(f: T => U)
• RDD[T] => RDD[U]
• When T is a tuple we may want to only act on the values – not the keys
• mapValues takes a function that maps the values in the inputs to the values in the output: mapValues(f: V => W)
• Where RDD[ (K, V) ] => RDD[ (K, W) ]
• NB: use mapValues when you can: avoids reshuffle when data is partitioned by key
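A minimal sketch, assuming a SparkContext named sc:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

  // mapValues leaves keys (and any key-based partitioning) intact
  val doubled = pairs.mapValues(_ * 2)  // ("a", 2), ("b", 4)

  // map can change the keys, so Spark must assume any
  // key-based partitioning is lost
  val swapped = pairs.map { case (k, v) => (v, k) }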
Two RDD Transformations
Pseudo-set: union, intersection, subtract, cartesian
• rdd.union(otherRdd): RDD containing elements from both
• rdd.intersection(otherRdd): RDD containing only elements found in both
• rdd.subtract(otherRdd): remove content of one from the other e.g. removing training data
• rdd.cartesian(otherRdd): Cartesian product of two RDDs e.g. similarity of pairs: RDD[T] and RDD[U] => RDD[ (T, U) ]
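A minimal sketch, assuming a SparkContext named sc:

  val a = sc.parallelize(Seq(1, 2, 3))
  val b = sc.parallelize(Seq(3, 4))

  a.union(b)         // 1, 2, 3, 3, 4 (duplicates kept)
  a.intersection(b)  // 3
  a.subtract(b)      // 1, 2
  a.cartesian(b)     // (1,3), (1,4), (2,3), (2,4), (3,3), (3,4)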
Two Pair RDD Transformations
join, rightOuterJoin, leftOuterJoin, cogroup
• Join: RDD[ ( K, V) ] and RDD[ (K, W) ] => RDD[ ( K, (V,W) ) ]
• Cogroup: RDD[ ( K, V) ] and RDD[ (K, W) ] => RDD[ ( K, ( Seq[V], Seq[W] ) ) ]
• rightOuterJoin and leftOuterJoin: use when keys must be present in the right / left RDD respectively
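A minimal sketch, assuming a SparkContext named sc:

  val ages  = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
  val towns = sc.parallelize(Seq(("alice", "Oslo"), ("carol", "Bergen")))

  ages.join(towns)           // ("alice", (30, "Oslo"))
  ages.leftOuterJoin(towns)  // ("alice", (30, Some("Oslo"))), ("bob", (25, None))
  ages.cogroup(towns)        // all keys, with the values from each side grouped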
Partition-specific Transformations and Actions
mapPartitions, mapPartitionsWithIndex, and foreachPartition
• Same as map and foreach except they operate on a per partition basis
• Useful when you have setup code (DB connection, RNG etc.) that you don't want to run for each element (see the sketch below)
• You can set preservesPartitioning when you are not altering the keys used for partitioning, to avoid unnecessary shuffling
– As with mapValues on the previous slide
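A minimal sketch, assuming a SparkContext named sc; the MessageDigest stands in for any expensive setup:

  val ids = sc.parallelize(1 to 1000)

  val hashes = ids.mapPartitions { iter =>
    // Setup runs once per partition, not once per element
    val digest = java.security.MessageDigest.getInstance("MD5")
    iter.map(i => digest.digest(i.toString.getBytes))
  }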
Miscellaneous Transformations
glom
• Treat a partition as an array (it must fit in memory though)
• RDD[ T ] => RDD[ Array[T] ]
• val maxValue = dataRDD.glom().map((value:Array[Double]) => value.max).reduce(_ max _)
• Useful when you want to implement RDD operations using matrix libraries that are optimized to operate on arrays
Q & A