Page 1: Debugging & Tuning in Spark

Debugging & Tuning in Spark

Shiao-An Yuan @sayuan

2016-08-11

Page 2: Debugging & Tuning in Spark

Spark Overview

● Cluster Manager (aka Master)
● Worker (aka Slave)

● Driver
● Executor

http://spark.apache.org/docs/latest/cluster-overview.html

Page 3: Debugging & Tuning in Spark

RDD (Resilient Distributed Dataset)

A fault-tolerant collection of elements that can be operated on in parallel

Page 4: Debugging & Tuning in Spark

Word Count

val sc: SparkContext = ...

val result = sc.textFile(file)      // RDD[String]
  .flatMap(_.split(" "))            // RDD[String]
  .map(_ -> 1)                      // RDD[(String, Int)]
  .groupByKey()                     // RDD[(String, Iterable[Int])]
  .map(x => (x._1, x._2.sum))       // RDD[(String, Int)]
  .collect()                        // Array[(String, Int)]

Page 5: Debugging & Tuning in Spark

Lazy, Transformation, Action, Job

(DAG diagram: flatMap → map → groupByKey → map → collect)

Page 6: Debugging & Tuning in Spark

Partition, Shuffle

(DAG diagram: flatMap → map → groupByKey → map → collect)

Page 7: Debugging & Tuning in Spark

Stage, Task

(DAG diagram: flatMap → map → groupByKey → map → collect)

Page 8: Debugging & Tuning in Spark

DAG (Directed Acyclic Graph)

● RDD operations
  ○ Transformation
  ○ Action

● Lazy
● Job
● Shuffle
● Stage
● Partition
● Task

Page 9: Debugging & Tuning in Spark

Objective

1. A correct and parallelizable algorithm
2. Parallelism
3. Reduce the overhead from parallelization

Page 10: Debugging & Tuning in Spark

Correctness and Parallelizability

● Use small input
● Run locally (sketch below)
  ○ --master local
  ○ --master local[4]
  ○ --master local[*]
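The same master setting can also be hard-coded while debugging; a minimal sketch (the app name is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Run locally on a small input before submitting to the cluster.
// "local[*]" uses every core of the machine; "local[4]" pins it to 4 threads.
val conf = new SparkConf()
  .setAppName("debug-local")   // placeholder name
  .setMaster("local[*]")
val sc = new SparkContext(conf)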

Page 11: Debugging & Tuning in Spark
Page 12: Debugging & Tuning in Spark

Non-RDD Operations

● Avoid long blocking operations on the driver

Page 13: Debugging & Tuning in Spark
Page 14: Debugging & Tuning in Spark

Data Skew

● Does repartition() come to the rescue?
● Hotspots
  ○ Choose another partition key
  ○ Filter out unreasonable data (sketch below)

● Trace the skew back to its source
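A hedged sketch of the "filter unreasonable data" idea, assuming a skewed pair RDD called `pairs` (the types and partition count are illustrative):

import org.apache.spark.rdd.RDD

// Drop keys that obviously should not exist (they often pile up in one partition),
// then spread the remaining records over more partitions.
def dropHotspots(pairs: RDD[(String, String)]): RDD[(String, String)] =
  pairs
    .filter { case (key, _) => key != null && key.nonEmpty }
    .repartition(400)   // illustrative count; tune for your data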

Page 17: Debugging & Tuning in Spark

Prefer reduceByKey() over groupByKey()

● reduceByKey() combines output before shuffling the data

● Also consider aggregateByKey()
● Use groupByKey() if you really know what you are doing
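Revisiting the word count from Page 4, a sketch of the same job with reduceByKey(), which combines per-partition counts before the shuffle:

import org.apache.spark.SparkContext

// Same result as groupByKey() + map(sum), but the per-key sums are
// combined map-side, so much less data crosses the network.
def wordCount(sc: SparkContext, file: String): Array[(String, Int)] =
  sc.textFile(file)
    .flatMap(_.split(" "))
    .map(_ -> 1)
    .reduceByKey(_ + _)
    .collect()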

Page 18: Debugging & Tuning in Spark
Page 19: Debugging & Tuning in Spark

Shuffle Spill

● Increase partition count
● spark.shuffle.spill=false (ignored since Spark 1.6; spilling can no longer be disabled)
● spark.shuffle.memoryFraction
● spark.executor.memory
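A sketch of where such settings would go; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4")  // legacy (pre-unified) memory manager setting
  .set("spark.executor.memory", "4g")          // illustrative value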

Page 21: Debugging & Tuning in Spark

Join

● partitionBy() (sketch below)
● repartitionAndSortWithinPartitions()
● spark.sql.autoBroadcastJoinThreshold (default 10 MB)
● Join it manually with mapPartitions()
  ○ Broadcast the small RDD
    ■ http://stackoverflow.com/a/17690254/406803
  ○ Query data from a database
    ■ https://groups.google.com/a/lists.datastax.com/d/topic/spark-connector-user/63ILfPqPRYI/discussion
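A sketch of partitionBy() used to co-partition both sides before a join, so the join itself does not trigger another full shuffle (types and partition count are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def coPartitionedJoin(left: RDD[(String, Int)],
                      right: RDD[(String, String)]): RDD[(String, (Int, String))] = {
  val partitioner = new HashPartitioner(200)   // same partitioner on both sides
  val leftP  = left.partitionBy(partitioner)
  val rightP = right.partitionBy(partitioner)
  leftP.join(rightP)                           // co-partitioned, so no extra shuffle here
}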

Page 22: Debugging & Tuning in Spark

Broadcast Small RDD

val smallRdd = ...   // small enough to fit in memory
val largeRdd = ...

// Collect the small side as a Map and broadcast it to every executor once.
val smallBroadcast = sc.broadcast(smallRdd.collectAsMap())

// Manual inner join: look up each key of the large RDD in the broadcast map,
// so the large RDD never has to be shuffled.
val joined = largeRdd.mapPartitions(iter => {
  val m = smallBroadcast.value
  for {
    (k, v) <- iter
    if m.contains(k)
  } yield (k, (v, m(k)))
}, preservesPartitioning = true)

Page 23: Debugging & Tuning in Spark

Query Data from Cassandra

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")

val connector = CassandraConnector(conf)

val joined = rdd.mapPartitions(iter => {
  connector.withSessionDo(session => {
    // Prepare the statement once per partition, then issue one lookup per element.
    val stmt = session.prepare("SELECT value FROM table WHERE key=?")
    iter.map {
      case (k, v) => (k, (v, session.execute(stmt.bind(k)).one()))
    }
  })
})

Page 24: Debugging & Tuning in Spark
Page 25: Debugging & Tuning in Spark

Persist

● Storage level
  ○ MEMORY_ONLY
  ○ MEMORY_AND_DISK
  ○ MEMORY_ONLY_SER
  ○ MEMORY_AND_DISK_SER
  ○ DISK_ONLY
  ○ …

● Kryo serialization
  ○ Much faster
  ○ Registration needed (sketch below)

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
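A sketch of picking a serialized storage level together with Kryo registration; the record classes are hypothetical, shown only to illustrate the registration call:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical record types.
case class MyKey(id: Long)
case class MyRecord(key: MyKey, value: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))

// A serialized level trades CPU for a smaller memory footprint.
def cache(records: RDD[MyRecord]): RDD[MyRecord] =
  records.persist(StorageLevel.MEMORY_ONLY_SER)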

Page 26: Debugging & Tuning in Spark

Common Failures

● Large shuffle blocks
  ○ java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    ■ Increase partition count
  ○ MetadataFetchFailedException, FetchFailedException
    ■ Increase partition count
    ■ Increase `spark.executor.memory`
    ■ …
  ○ java.lang.OutOfMemoryError: GC overhead limit exceeded
    ■ May be caused by shuffle spill

Page 27: Debugging & Tuning in Spark

java.lang.OutOfMemoryError: Java heap space

● Driver
  ○ Increase `spark.driver.memory`
  ○ Replace collect() where possible (sketch below)
    ■ take()
    ■ saveAsTextFile()

● Executor
  ○ Increase `spark.executor.memory`
  ○ Add more nodes
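A sketch of keeping a large result out of driver memory, assuming the word-count result from earlier (the output path is a placeholder):

import org.apache.spark.rdd.RDD

def inspectAndStore(result: RDD[(String, Int)], outputPath: String): Unit = {
  result.take(20).foreach(println)   // bring back only a small sample for inspection
  result.saveAsTextFile(outputPath)  // let the executors write the full result
}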

Page 28: Debugging & Tuning in Spark

java.io.IOException: No space left on device

● SPARK_WORKER_DIR
● SPARK_LOCAL_DIRS, spark.local.dir
● Shuffle files
  ○ Only deleted after the RDD object has been garbage-collected

Page 29: Debugging & Tuning in Spark

Other Tips

● Event logs
  ○ spark.eventLog.enabled=true
  ○ ${SPARK_HOME}/sbin/start-history-server.sh
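Event logging can also be set from code; a minimal sketch, where the log directory is an illustrative value that must exist and be readable by the History Server:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-event-logs")  // illustrative directory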

Page 30: Debugging & Tuning in Spark

Partitions

● Rule of thumb: ~128 MB per partition
● If #partitions <= 2000, but close, bump to just > 2000
● Increase #partitions by repartition()
● Decrease #partitions by coalesce()
● spark.sql.shuffle.partitions (default 200)

http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
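A sketch of the two resizing operations mentioned above; the counts are illustrative:

import org.apache.spark.rdd.RDD

def resize(rdd: RDD[String]): Unit = {
  val wider = rdd.repartition(2050)  // full shuffle; e.g. just above the 2000 threshold mentioned above
  val fewer = rdd.coalesce(50)       // merges partitions, avoiding a full shuffle
  println(s"${wider.getNumPartitions} partitions / ${fewer.getNumPartitions} partitions")
}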

Page 31: Debugging & Tuning in Spark

Executors, Cores, Memory!?

● 32 nodes
● 16 cores each
● 64 GB of RAM each
● If you have an application that needs 32 cores, what is the correct setting?

http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
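One way such a request could be expressed; the numbers are only an illustration (e.g. 8 executors × 4 cores = 32 cores), not necessarily the answer the slide has in mind:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "8")   // illustrative: 8 executors...
  .set("spark.executor.cores", "4")       // ...of 4 cores each = 32 cores in total
  .set("spark.executor.memory", "14g")    // well under 64 GB per node, leaving room for the OS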

Page 32: Debugging & Tuning in Spark

Why Is Spark Debugging / Tuning Hard?

● Distributed
● Lazy
● Hard to benchmark
● Spark is sensitive

Page 33: Debugging & Tuning in Spark

Conclusion

● When in doubt, repartition!
● Avoid shuffles if you can
● Choose a reasonable partition count
● "Premature optimization is the root of all evil" -- Donald Knuth