What’s New in Spark 0.6 and Shark 0.2

November 5, 2012

UC Berkeley
www.spark-project.org

Agenda
Intro & Spark 0.6 tour (Matei Zaharia)
Standalone deploy mode (Denny Britz)
Shark 0.2 (Reynold Xin)
Q & A

What Are Spark & Shark?
Spark: fast cluster computing engine based on general operators & in-memory computing
Shark: Hive-compatible data warehouse system built on Spark

Both are open source projects from the UC Berkeley AMP Lab

What is the AMP Lab?
60-person lab focusing on big data
Funded by NSF, DARPA, 18 companies
Goal: build an open-source, next-generation analytics stack

[Figure: AMP Lab software stack diagram. Recoverable labels: Spark, Mesos, Shark, Streaming, Graph, Learning, Hadoop, MPI, ...]

Some Exciting News
Recently, three full-time developers joined AMP to work on these projects
We also encourage outside contributions!
» This release: Shark server (Yahoo!), improved accumulators (Quantifind)

Spark 0.6 Release
Biggest release so far in terms of features
Biggest in terms of developers (18 total, 12 new)
Focus areas: ease-of-use and performance

Ease-of-Use
Spark already had good traction despite two fairly researchy aspects
» Scala language
» Requirement to run on Mesos

A big goal was to improve these:
» Java API (and upcoming API in Python)
» Simpler deployment (standalone mode, YARN)

Java API
Scala:

    lines.filter(_.contains("error")).count()

Java:

    JavaRDD<String> lines = sc.textFile(...);

    lines.filter(new Function<String, Boolean>() {
      // call() must be public to override the method declared in Function
      public Boolean call(String s) {
        return s.contains("error");
      }
    }).count();

Java API Features
Supports all existing Spark features
» RDDs, accumulators, broadcast variables

Retains type safety through specific classes for RDDs of special types
» E.g. JavaPairRDD<K, V> for key-value pairs

Using Key-Value Pairs

    import scala.Tuple2;

    JavaRDD<String> words = ...;

    // Map each word to a (word, 1) pair; the result is a JavaPairRDD
    JavaPairRDD<String, Integer> ones = words.map(
      new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      });

    // Can now call ones.reduceByKey(), groupByKey(), etc.

More info: spark-project.org/docs/0.6.0/


Coming Next: PySpark

    lines = sc.textFile(sys.argv[1])

    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
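To make the data flow of the word count above concrete, here is a minimal sketch in plain Python that runs the same three steps on a local list. It does not use PySpark: `flat_map` and `reduce_by_key` are hypothetical helpers written for illustration, and the input lines are made up.

```python
def flat_map(func, items):
    """Apply func to each item and concatenate the resulting lists."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using a binary function."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


# Illustrative input standing in for the lines of a text file.
lines = ["to be or", "not to be"]

words = flat_map(lambda x: x.split(' '), lines)     # one word per element
pairs = [(w, 1) for w in words]                     # (word, 1) pairs
counts = reduce_by_key(lambda x, y: x + y, pairs)   # sum the ones per word

print(counts)
```

On a cluster the same steps would run per partition, with a shuffle before the per-key reduction; this local version only shows what each operator computes.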

Simpler Deployment
Refactored Spark’s scheduler to allow running on different cluster managers
Denny will talk about the standalone mode…

Other Ease-of-Use Work
Documentation
» Big effort to improve Spark’s help and Scaladoc

Debugging hints (pointers to user code in logs)
Maven Central artifacts

spark-project.org/documentation.html


Performance
New ConnectionManager and BlockManager
» Replace simple HTTP shuffle with faster, async NIO

Faster control-plane (task scheduling & launch)
Per-RDD control of storage level

Some Graphs
[Figure: two bar charts; the only series label recovered is "Spark 0.5". Left panel: Large User App (2000 maps / 1000 reduces), running time in minutes, axis 0 to 120. Right panel: Wikipedia Search Demo, running time in ms, axis 0 to 1000.]

Per-RDD Storage Level

    import spark.storage.StorageLevel
    val data = file.map(...)

    // Keep in memory, recompute when out of space
    // (default behavior with cache())
    data.persist(StorageLevel.MEMORY_ONLY)

    // Drop to disk instead of recomputing
    data.persist(StorageLevel.MEMORY_AND_DISK)

    // Serialize in-memory data
    data.persist(StorageLevel.MEMORY_ONLY_SER)
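The difference between recomputing and dropping to disk can be sketched with a toy model in plain Python. This is not Spark's BlockManager: `ToyBlockStore`, its FIFO eviction, the dict standing in for disk, and the `expensive` function are all assumptions made for illustration.

```python
import pickle


class ToyBlockStore:
    """Hypothetical toy model of two storage behaviours; not Spark code."""

    def __init__(self, capacity, spill_to_disk):
        self.capacity = capacity            # max blocks kept in memory
        self.spill_to_disk = spill_to_disk  # False: recompute, True: spill
        self.memory = {}                    # block id -> value (insertion order)
        self.disk = {}                      # block id -> pickled bytes (simulated disk)
        self.computes = 0                   # how often the block was computed

    def _evict_oldest(self):
        # Simple FIFO eviction, chosen for brevity.
        victim = next(iter(self.memory))
        value = self.memory.pop(victim)
        if self.spill_to_disk:
            self.disk[victim] = pickle.dumps(value)

    def get(self, block_id, compute):
        if block_id in self.memory:
            return self.memory[block_id]
        if block_id in self.disk:
            value = pickle.loads(self.disk[block_id])  # read back, no recompute
        else:
            value = compute(block_id)
            self.computes += 1
        if len(self.memory) >= self.capacity:
            self._evict_oldest()
        self.memory[block_id] = value
        return value


def expensive(block_id):
    # Stand-in for recomputing a partition from its lineage.
    return [block_id] * 3


def run(spill_to_disk):
    store = ToyBlockStore(capacity=2, spill_to_disk=spill_to_disk)
    for block_id in [0, 1, 2, 0, 1, 2]:
        store.get(block_id, expensive)
    return store.computes


print("memory only:", run(False), "computes")
print("memory and disk:", run(True), "computes")
```

With three blocks cycling through a two-block memory, the recompute-only store computes every access, while the spilling store computes each block once and pays for serialization and reads instead. Which is cheaper depends on how expensive the recomputation is, which is why the choice is exposed per RDD.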

Compatibility
We’ve always strived to stay source-compatible!
The only change in this release is in configuration: spark.cache.class is replaced with per-RDD storage levels

Shark 0.2
Hive compatibility improvements
Thrift server mode
Performance improvements
Simpler deployment (comes with Spark 0.6)

Hive Compatibility
Hive 0.9 support
Full UDF/UDAF support
ADD FILE support for running scripts
User-supplied jars using ADD JAR

Thrift Server
Contributed by Yahoo!, compatible with Hive Thrift server
Enables multiple clients to share cached tables
BI tool integration (e.g. Tableau)

Performance
[Figure: two bar charts; the only series label recovered is "Shark 0.1". Left panel: Group By (1B items, 150M distinct), running time in seconds, axis 0 to 70. Right panel: Join (1B join 150M), running time in seconds, axis 0 to 250.]

Shark 0.3 Preview
In-memory columnar compression (dictionary encoding, run-length encoding, etc.)
Map pruning
JVM bytecode generation for expression evals
Persist cached table metadata across sessions
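The two compression schemes named above can be sketched on a single column in plain Python. This is a generic illustration of the techniques, not Shark's implementation; the function names and the sample column are made up.

```python
def run_length_encode(column):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, count] pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def dictionary_encode(column):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for value in column:
        # Assign the next free code the first time a value is seen.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Map integer codes back to their original values."""
    values = sorted(dictionary, key=dictionary.get)
    return [values[code] for code in codes]


# Illustrative column with few distinct values and long runs.
column = ["US", "US", "US", "DE", "DE", "US"]

runs = run_length_encode(column)
dictionary, codes = dictionary_encode(column)

print(runs)
print(dictionary, codes)
```

Both schemes work best on columns with few distinct values, which is one reason a columnar layout helps: values of the same column sit next to each other, so repeats are common.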

Spark 0.7+
Spark Streaming
PySpark: Python API for Spark
Memory monitoring dashboard
