By @przemur

Spark!

Jan 26, 2015



Overview of Spark - a new paradigm for processing Big Data - with astounding performance and conciseness.

Slides from DataKRK meetup.
Transcript
Page 1: Spark!

By @przemur

Page 2: Spark!

HTTP://ABOUT.ME/PRZEMEK.MACIOLEK/

• Data Scientist, Hadoop user since 2009

• Did research in academia, mined data for the oil & gas exploration industry, co-founded a Data Science startup, built the Big Data team at Base CRM, …

• Used a lot of different tools along the way (Mahout, HBase, Cassandra, Redis, Pig, Storm, …)

• Dreaming about something powerful and concise for Big Data…

• AD 2014: Head of Analytics & Data @ Toptal - researching new ways of doing Big Data Analytics, rediscovered Storm.

P.S. Ever considered doing Analytics & Data Science for a very cool startup? Drop me a note at: [email protected]

Page 3: Spark!

HADOOP IS COOL…

Page 4: Spark!

HADOOP IS COOL (BUT SOMETIMES IT’S NOT)

• High latency (interactive, anyone?)

• Expressing business logic is challenging

• Iterative algorithms? (think: PageRank)

Page 5: Spark!

SOLUTION?

[Diagram: MapReduce as the general batch processing engine, surrounded by specialized systems: Pregel, Giraph, Impala, Drill, Hive, Pig, S4, Storm]

Page 6: Spark!

[Diagram: data blocks flowing through a Map Reduce job]

Page 7: Spark!

[Diagram: the same picture with many more data blocks]

Page 8: Spark!

MAYBE MAP REDUCE IS NOT ALWAYS THE BEST SOLUTION?

Page 9: Spark!

GENERALIZE FTW!

[Diagram: MapReduce (general batch processing) and the specialized systems generalized into a single engine built around task DAGs and data sharing: Spark]

Page 10: Spark!

RESILIENT DISTRIBUTED DATASET (RDD)

• A collection of elements that can be operated on in parallel

• Parallel Collection, e.g. sc.parallelize(Array(1,2,3))

• Hadoop Dataset

• Lazily evaluated, able to rebuild lost data any time

• Can be stored in memory without replication
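A minimal sketch of both ways to create an RDD, assuming the interactive Spark shell (where sc is predefined); the file path is only an example:

scala> val nums  = sc.parallelize(Array(1, 2, 3))         // parallelized local collection
scala> val lines = sc.textFile("hdfs:///data/events.log") // Hadoop dataset (local file or HDFS path)
scala> nums.reduce(_ + _)                                  // nothing is computed until an action like this runs
res0: Int = 6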

Page 11: Spark!

TRANSFORMATIONS

• Creates a new dataset from an existing one

• Lazily evaluated

• Recomputed each time an action runs on it, but might be persisted (in memory or disk)

• Broadcast Variables and Accumulators for cluster-level sharing

ACTIONS

• Return the value to the driver after computation finishes

• Run all required transformations
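A minimal sketch of the lazy flow, again assuming the Spark shell; the log file name and its contents are illustrative:

scala> val lines  = sc.textFile("access.log")                 // source RDD: nothing read yet
scala> val errors = lines.filter(_.contains("ERROR"))         // transformation: still lazy
scala> errors.persist()                                       // keep the filtered data in memory once computed
scala> errors.count()                                         // action: runs the pipeline, returns a value to the driver
scala> errors.first()                                         // action: served from the persisted copy, no re-read
scala> val codes = sc.broadcast(Map("500" -> "Server Error")) // broadcast variable: read-only value shared with all nodes
scala> val seen  = sc.accumulator(0)                          // accumulator: tasks add to it, only the driver reads the sum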

Page 12: Spark!

Scala, Java, Python!

Page 13: Spark!

HOW TO USE IT?

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 74

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) // How many words are in the longest line
res2: Int = 16

scala> textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
res3: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...

Page 14: Spark!

WHAT HAPPENS UNDERNEATH?

Page 15: Spark!

RDD Objects -> DAG Scheduler -> Task Scheduler -> Worker

• RDD Objects: rdd.filter().map(…).groupBy(…).filter(…) builds the operator graph

• DAG Scheduler: splits the graph into stages of tasks and submits each stage (as a TaskSet) when ready

• Task Scheduler: launches tasks via the cluster manager and retries failed tasks

• Worker: executes tasks, stores and serves blocks
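One way to peek at what the schedulers get, assuming the Spark shell: toDebugString prints an RDD's lineage, and the shuffle introduced by reduceByKey is where the DAG scheduler cuts the job into stages:

scala> val counts = sc.textFile("README.md").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> println(counts.toDebugString) // lineage ending in a ShuffledRDD; the shuffle marks the stage boundary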

Page 16: Spark!

NARROW DEPENDENCIES: map, filter, union

WIDE (SHUFFLE) DEPENDENCIES: groupByKey, join (inputs not co-partitioned)
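A small illustration of the difference, assuming the Spark shell:

scala> val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
scala> val doubled = pairs.mapValues(_ * 2) // narrow: each output partition depends on exactly one parent partition
scala> val grouped = pairs.groupByKey()     // wide: values for a key may sit in any partition, so a shuffle is needed
scala> grouped.collect()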

Page 17: Spark!

* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 18: Spark!

How much code is needed to implement Big Data Page Rank?
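The papers referenced on the following slides compare full listings; for a rough sense of scale, the core of a PageRank in Spark looks more or less like this sketch (the input path, 10 iterations, and the 0.15/0.85 damping constants are illustrative):

// each input line: "sourcePageId destPageId"
val links = sc.textFile("links.txt")
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .distinct()
  .groupByKey()
  .cache()                              // reused every iteration, so keep it in memory

var ranks = links.mapValues(_ => 1.0)   // start every page with rank 1.0

for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap { case (dests, rank) =>
    dests.map(dest => (dest, rank / dests.size))  // spread a page's rank over its outgoing links
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("ranks")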

Page 19: Spark!

* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 20: Spark!

* http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

Page 21: Spark!

* http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 22: Spark!

* http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 23: Spark!

BERKELEY DATA ANALYTICS STACK

* https://amplab.cs.berkeley.edu/software/

Page 24: Spark!

SPARK LIVE

Page 25: Spark!

REFERENCES

• http://spark.incubator.apache.org/

• https://amplab.cs.berkeley.edu/software/

• http://ampcamp.berkeley.edu/3/exercises/index.html

• http://www.mlbase.org/

• https://amplab.cs.berkeley.edu/benchmark/

• http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx

• http://spark-summit.org/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf

• http://spark-summit.org/wp-content/uploads/2013/10/Kay_Sparrow_Spark_Summit.pdf

• http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

• http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf

• http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

• http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf