What’s New in Spark 0.6 and Shark 0.2 November 5, 2012 UC BERKELEY www.spark-project.org
Feb 22, 2016
Agenda
Intro & Spark 0.6 tour (Matei Zaharia)
Standalone deploy mode (Denny Britz)
Shark 0.2 (Reynold Xin)
Q & A
What Are Spark & Shark?
Spark: fast cluster computing engine based on general operators & in-memory computing
Shark: Hive-compatible data warehouse system built on Spark
Both are open source projects from the UC Berkeley AMP Lab
What is the AMP Lab?
60-person lab focusing on big data
Funded by NSF, DARPA, 18 companies
Goal: build an open-source, next-generation analytics stack
[Figure: AMP Lab software stack, with components labeled Spark, Mesos, Shark, Streaming, Graph, Learning, and Hadoop, MPI, ...]
Some Exciting News
Recently, three full-time developers joined AMP to work on these projects
We also encourage outside contributions!
»This release: Shark server (Yahoo!), improved accumulators (Quantifind)
Spark 0.6 Release
Biggest release so far in terms of features
Biggest in terms of developers (18 total, 12 new)
Focus areas: ease-of-use and performance
Ease-of-Use
Spark already had good traction despite two fairly researchy aspects
»Scala language
»Requirement to run on Mesos
A big goal was to improve these:
»Java API (and upcoming API in Python)
»Simpler deployment (standalone mode, YARN)
Java API
// Scala
lines.filter(_.contains("error")).count()
// Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("error");
  }
}).count();
Java API Features
Supports all existing Spark features
»RDDs, accumulators, broadcast variables
Retains type safety through specific classes for RDDs of special types
»E.g. JavaPairRDD<K, V> for key-value pairs
Using Key-Value Pairs
import scala.Tuple2;
JavaRDD<String> words = ...;
JavaPairRDD<String, Integer> ones = words.map(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  });
// Can now call ones.reduceByKey(), groupByKey(), etc.
More info: spark-project.org/docs/0.6.0/
Coming Next: PySpark
lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
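To make the three steps in the PySpark snippet concrete, here is a minimal plain-Python sketch of what flatMap, map, and reduceByKey compute on a small local list. It does not use Spark at all; the helper names `flat_map`, `reduce_by_key`, and `word_count` are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from functools import reduce


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Group (key, value) pairs by key, then fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


def word_count(lines):
    # Same three steps as the PySpark snippet, but on a local list.
    words = flat_map(lambda x: x.split(' '), lines)   # flatMap: line -> words
    ones = [(w, 1) for w in words]                    # map: word -> (word, 1)
    return reduce_by_key(lambda x, y: x + y, ones)    # reduceByKey: sum the ones


if __name__ == "__main__":
    print(word_count(["to be or", "not to be"]))
```

On a cluster the same operators run per partition and the grouping step involves a shuffle; this local version only shows the data flow.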
Simpler Deployment
Refactored Spark’s scheduler to allow running on different cluster managers
Denny will talk about the standalone mode…
Other Ease-of-Use Work
Documentation
»Big effort to improve Spark’s help and Scaladoc
Debugging hints (pointers to user code in logs)
Maven Central artifacts
spark-project.org/documentation.html
Performance
New ConnectionManager and BlockManager
»Replace simple HTTP shuffle with faster, async NIO
Faster control-plane (task scheduling & launch)
Per-RDD control of storage level
Some Graphs
[Figure: running time (minutes) for a large user app (2000 maps / 1000 reduces); Spark 0.5 label shown; values not recoverable]
[Figure: running time (ms) for the Wikipedia search demo; Spark 0.5 label shown; values not recoverable]
Per-RDD Storage Level
import spark.storage.StorageLevel
val data = file.map(...)
// Keep in memory, recompute when out of space
// (default behavior with cache())
data.persist(StorageLevel.MEMORY_ONLY)
// Drop to disk instead of recomputing
data.persist(StorageLevel.MEMORY_AND_DISK)
// Serialize in-memory data
data.persist(StorageLevel.MEMORY_ONLY_SER)
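The difference between the first two levels can be illustrated with a toy model. The sketch below is not Spark's BlockManager: `ToyBlockStore`, `count_computations`, the fixed number of memory slots, and the oldest-first eviction are all assumptions made for illustration. It only shows that, once memory is full, MEMORY_ONLY recomputes an evicted partition while MEMORY_AND_DISK reads it back. MEMORY_ONLY_SER is left out because serialization changes the memory footprint, not this behavior.

```python
from collections import OrderedDict


class ToyBlockStore:
    """Toy model of per-dataset storage levels (hypothetical, not Spark code)."""

    def __init__(self, level, memory_slots):
        if level not in ("MEMORY_ONLY", "MEMORY_AND_DISK"):
            raise ValueError("unsupported level: " + level)
        self.level = level
        self.memory_slots = memory_slots
        self.memory = OrderedDict()  # partition id -> data, oldest first
        self.disk = {}               # stands in for blocks written to disk
        self.computations = 0        # how many times compute() was called

    def get(self, partition_id, compute):
        if partition_id in self.memory:
            return self.memory[partition_id]
        if partition_id in self.disk:
            data = self.disk[partition_id]  # read back instead of recomputing
        else:
            data = compute(partition_id)
            self.computations += 1
        self._put(partition_id, data)
        return data

    def _put(self, partition_id, data):
        self.memory[partition_id] = data
        while len(self.memory) > self.memory_slots:
            old_id, old_data = self.memory.popitem(last=False)  # evict oldest
            if self.level == "MEMORY_AND_DISK":
                self.disk[old_id] = old_data
            # MEMORY_ONLY: the evicted partition is simply dropped


def count_computations(level, accesses, memory_slots=2):
    """Replay a sequence of partition accesses and count compute() calls."""
    store = ToyBlockStore(level, memory_slots)
    for pid in accesses:
        store.get(pid, lambda p: [p] * 3)
    return store.computations


if __name__ == "__main__":
    accesses = [0, 1, 2, 0]  # partition 0 is evicted before it is reused
    print(count_computations("MEMORY_ONLY", accesses))
    print(count_computations("MEMORY_AND_DISK", accesses))
```

With two memory slots and the access pattern 0, 1, 2, 0, the MEMORY_ONLY run calls compute one more time than the MEMORY_AND_DISK run, because partition 0 has to be rebuilt.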
Compatibility
We’ve always strived to stay source-compatible!
Only change in this release is in configuration: spark.cache.class replaced with per-RDD levels
Shark 0.2
Hive compatibility improvements
Thrift server mode
Performance improvements
Simpler deployment (comes with Spark 0.6)
Hive Compatibility
Hive 0.9 support
Full UDF/UDAF support
ADD FILE support for running scripts
User-supplied jars using ADD JAR
Thrift Server
Contributed by Yahoo!, compatible with Hive Thrift server
Enables multiple clients to share cached tables
BI tool integration (e.g. Tableau)
Performance
[Figure: running time (secs) for Group By (1B items, 150M distinct); Shark 0.1 label shown; values not recoverable]
[Figure: running time (secs) for Join (1B join 150M); Shark 0.1 label shown; values not recoverable]
Shark 0.3 Preview
In-memory columnar compression (dictionary encoding, run length encoding, etc)
Map pruning
JVM bytecode generation for expression evals
Persist cached table metadata across sessions
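For readers unfamiliar with the two compression schemes named above, here is a minimal plain-Python sketch of dictionary encoding and run-length encoding applied to a single column of values. It illustrates the general techniques only and says nothing about how Shark implements them; all function names are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = []   # code -> value
    code_of = {}      # value -> code
    codes = []
    for value in column:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Recover the original column from the lookup table and codes."""
    return [dictionary[c] for c in codes]


def run_length_encode(column):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, run_length] pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


if __name__ == "__main__":
    column = ["US", "US", "US", "DE", "DE", "US"]
    print(dictionary_encode(column))
    print(run_length_encode(column))
```

Both schemes work well on columnar layouts because values of one column are stored together: low-cardinality columns give small dictionaries, and sorted or clustered columns give long runs.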
Spark 0.7+
Spark Streaming
PySpark: Python API for Spark
Memory monitoring dashboard