By @przemur

Spark!

Jan 26, 2015



Overview of Spark - a new paradigm for processing Big Data - with astounding performance and conciseness.

Slides from DataKRK meetup.
Transcript
Page 1: Spark!

By @przemur

Page 2: Spark!

HTTP://ABOUT.ME/PRZEMEK.MACIOLEK/

• Data Scientist, Hadoop user since 2009

• Did research in academia, mined data for the oil & gas exploration industry, co-founded a Data Science startup, built the Big Data team at Base CRM, …

• Used a lot of different tools along the way (Mahout, HBase, Cassandra, Redis, Pig, Storm, …)

• Dreaming about something powerful and concise for Big Data…

• AD 2014: Head of Analytics & Data @ Toptal - researching new ways of doing Big Data Analytics, rediscovered Storm.

P.S. Ever considered doing Analytics & Data Science for a very cool startup? Drop me a note at: [email protected]

Page 3: Spark!

HADOOP IS COOL…

Page 4: Spark!

HADOOP IS COOL (BUT SOMETIMES IT’S NOT)

• High latency (interactive, anyone?)

• Expressing business logic is challenging

• Iterative algorithms? (think: PageRank)

Page 5: Spark!

SOLUTION?

[Diagram: MapReduce as the general batch processing engine, surrounded by specialized systems: Pregel, Giraph, Impala, Drill, Hive, Pig, S4, Storm]

Page 6: Spark!

[Diagram: data blocks flowing through a Map Reduce job]

Page 7: Spark!

[Diagram: the same picture with many more data blocks]

Page 8: Spark!

MAYBE MAP REDUCE IS NOT ALWAYS THE BEST SOLUTION?

Page 9: Spark!

GENERALIZE FTW!

[Diagram: MapReduce (general batch processing) and the specialized systems generalized into a single engine built around task DAGs and data sharing: Spark]

Page 10: Spark!

RESILIENT DISTRIBUTED DATASET (RDD)

• A collection of elements that can be operated on in parallel

• Parallel Collection, e.g. sc.parallelize(Array(1,2,3))

• Hadoop Dataset

• Lazily evaluated, able to rebuild lost data any time

• Can be stored in memory without replication
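A minimal sketch of both ways to create an RDD, assuming the interactive Spark shell (where sc is predefined); the file path is only an example:

scala> val nums  = sc.parallelize(Array(1, 2, 3))         // parallelized local collection
scala> val lines = sc.textFile("hdfs:///data/events.log") // Hadoop dataset (local file or HDFS path)
scala> nums.reduce(_ + _)                                  // nothing is computed until an action like this runs
res0: Int = 6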

Page 11: Spark!

TRANSFORMATIONS

• Creates a new dataset from an existing one

• Lazily evaluated

• Recomputed each time an action runs on it, but might be persisted (in memory or disk)

• Broadcast Variables and Accumulators for cluster-level sharing

ACTIONS

• Return the value to the driver after computation finishes

• Run all required transformations
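A minimal sketch of the lazy flow, again assuming the Spark shell; the log file name and its contents are illustrative:

scala> val lines  = sc.textFile("access.log")                 // source RDD: nothing read yet
scala> val errors = lines.filter(_.contains("ERROR"))         // transformation: still lazy
scala> errors.persist()                                       // keep the filtered data in memory once computed
scala> errors.count()                                         // action: runs the pipeline, returns a value to the driver
scala> errors.first()                                         // action: served from the persisted copy, no re-read
scala> val codes = sc.broadcast(Map("500" -> "Server Error")) // broadcast variable: read-only value shared with all nodes
scala> val seen  = sc.accumulator(0)                          // accumulator: tasks add to it, only the driver reads the sum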

Page 12: Spark!

Scala, Java, Python!

Page 13: Spark!

HOW TO USE IT?

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 74

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) // How many words are in the longest line
res2: Int = 16

scala> textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
res3: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...

Page 14: Spark!

WHAT HAPPENS UNDERNEATH?

Page 15: Spark!

RDD Objects -> DAG Scheduler -> Task Scheduler -> Worker

• RDD Objects: rdd.filter().map(…).groupBy(…).filter(…) builds the operator graph

• DAG Scheduler: splits the graph into stages of tasks and submits each stage (as a TaskSet) when ready

• Task Scheduler: launches tasks via the cluster manager and retries failed tasks

• Worker: executes tasks, stores and serves blocks
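One way to peek at what the schedulers get, assuming the Spark shell: toDebugString prints an RDD's lineage, and the shuffle introduced by reduceByKey is where the DAG scheduler cuts the job into stages:

scala> val counts = sc.textFile("README.md").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> println(counts.toDebugString) // lineage ending in a ShuffledRDD; the shuffle marks the stage boundary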

Page 16: Spark!

NARROW DEPENDENCIES: map, filter, union

WIDE (SHUFFLE) DEPENDENCIES: groupByKey, join (inputs not co-partitioned)
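A small illustration of the difference, assuming the Spark shell:

scala> val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
scala> val doubled = pairs.mapValues(_ * 2) // narrow: each output partition depends on exactly one parent partition
scala> val grouped = pairs.groupByKey()     // wide: values for a key may sit in any partition, so a shuffle is needed
scala> grouped.collect()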

Page 17: Spark!

* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 18: Spark!

How much code is needed to implement Big Data Page Rank?
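The papers referenced on the following slides compare full listings; for a rough sense of scale, the core of a PageRank in Spark looks more or less like this sketch (the input path, 10 iterations, and the 0.15/0.85 damping constants are illustrative):

// each input line: "sourcePageId destPageId"
val links = sc.textFile("links.txt")
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .distinct()
  .groupByKey()
  .cache()                              // reused every iteration, so keep it in memory

var ranks = links.mapValues(_ => 1.0)   // start every page with rank 1.0

for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap { case (dests, rank) =>
    dests.map(dest => (dest, rank / dests.size))  // spread a page's rank over its outgoing links
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("ranks")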

Page 19: Spark!

* http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Page 20: Spark!

* http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

Page 21: Spark!

* http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 22: Spark!

* http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 23: Spark!

BERKELEY DATA ANALYTICS STACK

* https://amplab.cs.berkeley.edu/software/

Page 24: Spark!

SPARK LIVE

Page 25: Spark!

REFERENCES

• http://spark.incubator.apache.org/

• https://amplab.cs.berkeley.edu/software/

• http://ampcamp.berkeley.edu/3/exercises/index.html

• http://www.mlbase.org/

• https://amplab.cs.berkeley.edu/benchmark/

• http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx

• http://spark-summit.org/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf

• http://spark-summit.org/wp-content/uploads/2013/10/Kay_Sparrow_Spark_Summit.pdf

• http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

• http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf

• http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

• http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf