Introduction to Apache Spark
Upcoming Spark Meetups
• 4 meetups to put basics in place
– Intro (today)
– SQL & Streaming
– Machine Learning & Graphs
– DevOps & Troubleshooting
• Followed by deep-dives in machine learning
– Classification & Regression (Random Forests, Support Vector Machines)
– Clustering
– Feature Extraction and Transformation
– Collaborative Filtering
– Dimensionality Reduction
Agenda for Meetup 1
• What is Apache Spark?
• The Stack
• Resilient Distributed Datasets
• Transformations & Actions
The Book
The Stack
[Diagram: Your Applications run on Spark SQL, Spark Streaming, MLlib, and GraphX; these are built on Spark Core, which runs on Mesos, YARN, or the Standalone cluster manager]
sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
  .map(t => "Name: " + t(0))
  .collect()
  .foreach(println)
Distributed Execution
[Diagram: the Driver's Spark Context sends Tasks to Executors running on Worker Nodes]
Resilient Distributed Datasets (RDD)
• Immutable: never modified – just transformed to new RDDs
• Distributed: split into multiple partitions and spread across multiple servers in a cluster
• Resilient: can be re-computed if they get destroyed
• Created by (see the sketch below):
– Loading external data
– Distributing a collection of objects in the driver program
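A minimal sketch of both creation paths, assuming an existing SparkContext named sc and a hypothetical file path:

  // Distribute a local collection from the driver program
  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Load external data: one RDD element per line of the file
  val lines = sc.textFile("input.txt")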
RDD Implementation
• Array of partitions
• List of dependencies on parent RDDs
• Function to compute a partition given its parents
– Returns an Iterator over a partition
• Preferred locations: list of strings for each partition (Nil by default)
• Partitioner (None by default)
Persistence / Caching
• By default RDDs (and all of their dependencies) are recomputed every time an action is called on them!
• Need to explicitly tell Spark when to persist
• Options:
– Default: stored in the heap as unserialized objects (pickled objects for Python)
– Memory only: serialized or not
– Memory and disk: spills to disk, option to serialize in memory
– Disk only
• Tachyon: off-heap caching (experimental)
– Aims to make Spark more resilient
– Avoid GC overheads
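A minimal caching sketch, assuming a SparkContext named sc and a hypothetical "words.txt":

  import org.apache.spark.storage.StorageLevel

  val words = sc.textFile("words.txt").flatMap(_.split(" "))

  // Persist across actions; MEMORY_AND_DISK spills partitions that
  // don't fit in memory to disk instead of dropping them
  words.persist(StorageLevel.MEMORY_AND_DISK)

  words.count()  // first action computes and caches the RDD
  words.first()  // later actions reuse the cached partitions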
Dependency Types: Narrow
Each partition of the parent is used by at most one partition of the child
• E.g. map, filter
• E.g. union
• E.g. join with co-partitioned inputs
Dependency Types: Wide
Each partition of the parent is used by more than one partition of the child
• E.g. groupByKey
• E.g. join with inputs not co-partitioned
Transformations
• Return a new RDD
• Lazy evaluation
• Single RDD transformations: map, flatMap, filter, distinct
• Pair RDDs: keyBy, reduceByKey, groupByKey, combineByKey, mapValues, flatMapValues, sortByKey
• Two RDD transformations: union, intersection, subtract, cartesian
• Two pair RDDs: join, rightOuterJoin, leftOuterJoin, cogroup
Actions
• Force evaluation of the transformations and return a value to the driver program or write to external storage
• Actions on RDDs:
– reduce, fold, aggregate
– foreach(func), collect
– count, countByValue
– top(num)
– take(num), takeOrdered(num)(ordering)
• Actions on pair RDDs:
– countByKey
– collectAsMap
– lookup(key)
Single RDD Transformations
map and flatMap
• map takes a function that transforms each element of a collection: map(f: T => U)
• RDD[T] => RDD[U]
• flatMap takes a function that transforms a single element of a collection into a sequence of elements: flatMap(f: T => Seq[U])
• Flattens out the output into a single sequence
• RDD[T] => RDD[U]
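A minimal sketch contrasting the two, assuming a SparkContext named sc:

  val lines = sc.parallelize(Seq("hello world", "hi"))

  // map: exactly one output element per input element
  val lengths = lines.map(_.length)        // RDD[Int]: 11, 2

  // flatMap: each element becomes a sequence, then flattened
  val words = lines.flatMap(_.split(" "))  // RDD[String]: "hello", "world", "hi"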
filter, distinct
• filter takes a (predicate) function that returns true if an element should be in the output collection: filter(f: T => Boolean)
• distinct removes duplicates from the RDD
• Both filter and distinct transform from RDD[T] => RDD[T]
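A minimal sketch, assuming a SparkContext named sc:

  val nums = sc.parallelize(Seq(1, 2, 2, 3, 4, 4))

  val evens = nums.filter(_ % 2 == 0)  // RDD[Int]: 2, 2, 4, 4
  val unique = nums.distinct()         // RDD[Int]: 1, 2, 3, 4 (order not guaranteed)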
Actions
reduce, fold & aggregate
• reduce takes a function that combines elements of a collection pairwise: reduce(f: (T, T) => T)
• fold is like reduce except it takes a zero value i.e. fold(zero: T)(f: (T, T) => T)
• reduce and fold: RDD[T] => T
• aggregate is the most general form
• aggregate(zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U)
• aggregate: RDD[T] => U
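A minimal sketch, assuming a SparkContext named sc; aggregate computes a sum and a count in one pass, from which an average can be derived:

  val nums = sc.parallelize(Seq(1, 2, 3, 4))

  val sum = nums.reduce(_ + _)    // 10
  val sum2 = nums.fold(0)(_ + _)  // 10, with an explicit zero value

  val (total, count) = nums.aggregate((0, 0))(
    (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one value into the accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2)   // combOp: merge per-partition accumulators
  )
  val mean = total.toDouble / count        // 2.5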
Pair RDD Transformations
keyBy, reduceByKey
• keyBy creates tuples of the elements in an RDD by applying a function: keyBy(f: T => K)
• RDD[ T ] => RDD[ (K, T) ]
• reduceByKey takes a function that takes two values and returns a single value: reduceByKey(f: (V, V) => V)
• RDD[ (K, V) ] => RDD[ (K, V) ]
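A minimal sketch, assuming a SparkContext named sc:

  val words = sc.parallelize(Seq("spark", "scala", "storm"))

  // keyBy: key each word by its first character
  val byInitial = words.keyBy(_.head)  // RDD[(Char, String)]

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val sums = pairs.reduceByKey(_ + _)  // ("a", 4), ("b", 2)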
groupByKey
• Takes a collection of key-value pairs and no parameters
• Returns a sequence of values associated with each key
• RDD[ ( K, V ) ] => RDD[ ( K, Iterable[V] ) ]
• The values for each key must fit in memory
• Can be slow – use aggregateByKey or reduceByKey where possible
• Ordering of values not guaranteed and can vary on every evaluation
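A minimal sketch, assuming a SparkContext named sc:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  val grouped = pairs.groupByKey()
  // ("a", Iterable(1, 3)), ("b", Iterable(2)) -- value order not guaranteed

  // Prefer reduceByKey when a per-key aggregate is all you need:
  // it combines map-side and shuffles far less data
  val sums = pairs.reduceByKey(_ + _)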
combineByKey
• def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)
• RDD[ (K, V) ] => RDD[ (K, C) ]
• createCombiner is called per partition when a new key is found
• mergeValue combines a new value into an existing accumulator
• mergeCombiners merges results from different partitions
• Sometimes map-side combine is not useful, e.g. groupByKey
• groupByKey, aggregateByKey and reduceByKey are all implemented using combineByKey
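A minimal sketch of the three callbacks, assuming a SparkContext named sc; here the combiner type C is a (sum, count) pair used to compute per-key averages:

  val scores = sc.parallelize(Seq(("a", 10.0), ("b", 4.0), ("a", 20.0)))

  val sumCounts = scores.combineByKey(
    (v: Double) => (v, 1),                                        // createCombiner: first value for a key
    (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),  // mergeValue
    (a: (Double, Int), b: (Double, Int)) =>
      (a._1 + b._1, a._2 + b._2)                                  // mergeCombiners
  )
  val averages = sumCounts.mapValues { case (sum, count) => sum / count }
  // ("a", 15.0), ("b", 4.0)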
map vs mapValues
• map takes a function that transforms each element of a collection: map(f: T => U)
• RDD[T] => RDD[U]
• When T is a tuple we may want to only act on the values – not the keys
• mapValues takes a function that maps the values in the inputs to the values in the output: mapValues(f: V => W)
• Where RDD[ (K, V) ] => RDD[ (K, W) ]
• NB: use mapValues when you can: avoids reshuffle when data is partitioned by key
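A minimal sketch, assuming a SparkContext named sc:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))

  // mapValues leaves keys (and any key-based partitioning) intact
  val doubled = pairs.mapValues(_ * 2)  // ("a", 2), ("b", 4)

  // map can change the keys, so Spark must assume any
  // key-based partitioning is lost
  val swapped = pairs.map { case (k, v) => (v, k) }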
Two RDD Transformations
Pseudo-set: union, intersection, subtract, cartesian
• rdd.union(otherRdd): RDD containing elements from both
• rdd.intersection(otherRdd): RDD containing only elements found in both
• rdd.subtract(otherRdd): remove content of one from the other e.g. removing training data
• rdd.cartesian(otherRdd): Cartesian product of two RDDs e.g. similarity of pairs: RDD[T] and RDD[U] => RDD[ (T, U) ]
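A minimal sketch, assuming a SparkContext named sc:

  val a = sc.parallelize(Seq(1, 2, 3))
  val b = sc.parallelize(Seq(3, 4))

  a.union(b)         // 1, 2, 3, 3, 4 (duplicates kept)
  a.intersection(b)  // 3
  a.subtract(b)      // 1, 2
  a.cartesian(b)     // (1,3), (1,4), (2,3), (2,4), (3,3), (3,4)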
Two Pair RDD Transformations
join, rightOuterJoin, leftOuterJoin, cogroup
• Join: RDD[ ( K, V) ] and RDD[ (K, W) ] => RDD[ ( K, (V,W) ) ]
• Cogroup: RDD[ ( K, V) ] and RDD[ (K, W) ] => RDD[ ( K, ( Seq[V], Seq[W] ) ) ]
• rightOuterJoin and leftOuterJoin: use when keys must be present in the right / left RDD respectively
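A minimal sketch, assuming a SparkContext named sc:

  val ages  = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
  val towns = sc.parallelize(Seq(("alice", "Oslo"), ("carol", "Bergen")))

  ages.join(towns)           // ("alice", (30, "Oslo"))
  ages.leftOuterJoin(towns)  // ("alice", (30, Some("Oslo"))), ("bob", (25, None))
  ages.cogroup(towns)        // all keys, with the values from each side grouped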
Partition-specific Transformations and Actions
mapPartitions, mapPartitionsWithIndex, and foreachPartition
• Same as map and foreach except they operate on a per partition basis
• Useful when you have setup code (DB connection, RNG etc.) that you don't want to run for each element (see the sketch below)
• You can set preservesPartitioning when you are not altering the keys used for partitioning, to avoid unnecessary shuffling
– As with mapValues on the previous slide
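A minimal sketch, assuming a SparkContext named sc; the MessageDigest stands in for any expensive setup:

  val ids = sc.parallelize(1 to 1000)

  val hashes = ids.mapPartitions { iter =>
    // Setup runs once per partition, not once per element
    val digest = java.security.MessageDigest.getInstance("MD5")
    iter.map(i => digest.digest(i.toString.getBytes))
  }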
Miscellaneous Transformations
glom
• Treat a partition as an array (it must fit in memory though)
• RDD[ T ] => RDD[ Array[T] ]
• val maxValue = dataRDD.glom().map((value:Array[Double]) => value.max).reduce(_ max _)
• Useful when you want to implement RDD operations using matrix libraries that are optimized to operate on arrays
Q & A