Page 1: Debugging & Tuning in Spark

Debugging & Tuning in Spark

Shiao-An Yuan @sayuan

2016-08-11

Page 2: Debugging & Tuning in Spark

Spark Overview

● Cluster Manager (aka Master)
● Worker (aka Slave)

● Driver
● Executor

http://spark.apache.org/docs/latest/cluster-overview.html

Page 3: Debugging & Tuning in Spark

RDD (Resilient Distributed Dataset)

A fault-tolerant collection of elements that can be operated on in parallel

Page 4: Debugging & Tuning in Spark

Word Count

val sc: SparkContext = ...

val result = sc.textFile(file)      // RDD[String]
  .flatMap(_.split(" "))            // RDD[String]
  .map(_ -> 1)                      // RDD[(String, Int)]
  .groupByKey()                     // RDD[(String, Iterable[Int])]
  .map(x => (x._1, x._2.sum))       // RDD[(String, Int)]
  .collect()                        // Array[(String, Int)]

Page 5: Debugging & Tuning in Spark

Lazy, Transformation, Action, Job

(DAG diagram: flatMap → map → groupByKey → map → collect)

Page 6: Debugging & Tuning in Spark

Partition, Shuffle

(DAG diagram: flatMap → map → groupByKey → map → collect)

Page 7: Debugging & Tuning in Spark

Stage, Task

(DAG diagram: flatMap → map → groupByKey → map → collect)

Page 8: Debugging & Tuning in Spark

DAG (Directed Acyclic Graph)

● RDD operations
  ○ Transformation
  ○ Action

● Lazy
● Job
● Shuffle
● Stage
● Partition
● Task

Page 9: Debugging & Tuning in Spark

Objective

1. A correct and parallelizable algorithm
2. Parallelism
3. Reduce the overhead from parallelization

Page 10: Debugging & Tuning in Spark

Correctness and Parallelizability

● Use small input
● Run locally (sketch below)
  ○ --master local
  ○ --master local[4]
  ○ --master local[*]
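The same master setting can also be hard-coded while debugging; a minimal sketch (the app name is just a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Run locally on a small input before submitting to the cluster.
// "local[*]" uses every core of the machine; "local[4]" pins it to 4 threads.
val conf = new SparkConf()
  .setAppName("debug-local")   // placeholder name
  .setMaster("local[*]")
val sc = new SparkContext(conf)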

Page 11: Debugging & Tuning in Spark
Page 12: Debugging & Tuning in Spark

Non-RDD Operations

● Avoid long blocking operations on the driver

Page 13: Debugging & Tuning in Spark
Page 14: Debugging & Tuning in Spark

Data Skew

● Does repartition() come to the rescue?
● Hotspots
  ○ Choose another partition key
  ○ Filter out unreasonable data (sketch below)

● Trace the skew back to its source
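A hedged sketch of the "filter unreasonable data" idea, assuming a skewed pair RDD called `pairs` (the types and partition count are illustrative):

import org.apache.spark.rdd.RDD

// Drop keys that obviously should not exist (they often pile up in one partition),
// then spread the remaining records over more partitions.
def dropHotspots(pairs: RDD[(String, String)]): RDD[(String, String)] =
  pairs
    .filter { case (key, _) => key != null && key.nonEmpty }
    .repartition(400)   // illustrative count; tune for your data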

Page 17: Debugging & Tuning in Spark

Prefer reduceByKey() over groupByKey()

● reduceByKey() combines output before shuffling the data

● Also consider aggregateByKey()
● Use groupByKey() if you really know what you are doing
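Revisiting the word count from Page 4, a sketch of the same job with reduceByKey(), which combines per-partition counts before the shuffle:

import org.apache.spark.SparkContext

// Same result as groupByKey() + map(sum), but the per-key sums are
// combined map-side, so much less data crosses the network.
def wordCount(sc: SparkContext, file: String): Array[(String, Int)] =
  sc.textFile(file)
    .flatMap(_.split(" "))
    .map(_ -> 1)
    .reduceByKey(_ + _)
    .collect()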

Page 18: Debugging & Tuning in Spark
Page 19: Debugging & Tuning in Spark

Shuffle Spill

● Increase partition count
● spark.shuffle.spill=false (ignored since Spark 1.6; spilling can no longer be disabled)
● spark.shuffle.memoryFraction
● spark.executor.memory
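A sketch of where such settings would go; the values are illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4")  // legacy (pre-unified) memory manager setting
  .set("spark.executor.memory", "4g")          // illustrative value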

Page 21: Debugging & Tuning in Spark

Join

● partitionBy() (sketch below)
● repartitionAndSortWithinPartitions()
● spark.sql.autoBroadcastJoinThreshold (default 10 MB)
● Join it manually with mapPartitions()
  ○ Broadcast the small RDD
    ■ http://stackoverflow.com/a/17690254/406803
  ○ Query data from a database
    ■ https://groups.google.com/a/lists.datastax.com/d/topic/spark-connector-user/63ILfPqPRYI/discussion
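A sketch of partitionBy() used to co-partition both sides before a join, so the join itself does not trigger another full shuffle (types and partition count are illustrative):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

def coPartitionedJoin(left: RDD[(String, Int)],
                      right: RDD[(String, String)]): RDD[(String, (Int, String))] = {
  val partitioner = new HashPartitioner(200)   // same partitioner on both sides
  val leftP  = left.partitionBy(partitioner)
  val rightP = right.partitionBy(partitioner)
  leftP.join(rightP)                           // co-partitioned, so no extra shuffle here
}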

Page 22: Debugging & Tuning in Spark

Broadcast Small RDD

val smallRdd = ...   // small enough to fit in memory
val largeRdd = ...

// Collect the small side as a Map and broadcast it to every executor once.
val smallBroadcast = sc.broadcast(smallRdd.collectAsMap())

// Manual inner join: look up each key of the large RDD in the broadcast map,
// so the large RDD never has to be shuffled.
val joined = largeRdd.mapPartitions(iter => {
  val m = smallBroadcast.value
  for {
    (k, v) <- iter
    if m.contains(k)
  } yield (k, (v, m(k)))
}, preservesPartitioning = true)

Page 23: Debugging & Tuning in Spark

Query Data from Cassandra

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")

val connector = CassandraConnector(conf)

val joined = rdd.mapPartitions(iter => {
  connector.withSessionDo(session => {
    // Prepare the statement once per partition, then issue one lookup per element.
    val stmt = session.prepare("SELECT value FROM table WHERE key=?")
    iter.map {
      case (k, v) => (k, (v, session.execute(stmt.bind(k)).one()))
    }
  })
})

Page 24: Debugging & Tuning in Spark
Page 25: Debugging & Tuning in Spark

Persist

● Storage level
  ○ MEMORY_ONLY
  ○ MEMORY_AND_DISK
  ○ MEMORY_ONLY_SER
  ○ MEMORY_AND_DISK_SER
  ○ DISK_ONLY
  ○ …

● Kryo serialization
  ○ Much faster
  ○ Registration needed (sketch below)

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
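A sketch of picking a serialized storage level together with Kryo registration; the record classes are hypothetical, shown only to illustrate the registration call:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical record types.
case class MyKey(id: Long)
case class MyRecord(key: MyKey, value: String)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))

// A serialized level trades CPU for a smaller memory footprint.
def cache(records: RDD[MyRecord]): RDD[MyRecord] =
  records.persist(StorageLevel.MEMORY_ONLY_SER)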

Page 26: Debugging & Tuning in Spark

Common Failures

● Large shuffle blocks
  ○ java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    ■ Increase partition count
  ○ MetadataFetchFailedException, FetchFailedException
    ■ Increase partition count
    ■ Increase `spark.executor.memory`
    ■ …
  ○ java.lang.OutOfMemoryError: GC overhead limit exceeded
    ■ May be caused by shuffle spill

Page 27: Debugging & Tuning in Spark

java.lang.OutOfMemoryError: Java heap space

● Driver
  ○ Increase `spark.driver.memory`
  ○ Replace collect() where possible (sketch below)
    ■ take()
    ■ saveAsTextFile()

● Executor
  ○ Increase `spark.executor.memory`
  ○ Add more nodes
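A sketch of keeping a large result out of driver memory, assuming the word-count result from earlier (the output path is a placeholder):

import org.apache.spark.rdd.RDD

def inspectAndStore(result: RDD[(String, Int)], outputPath: String): Unit = {
  result.take(20).foreach(println)   // bring back only a small sample for inspection
  result.saveAsTextFile(outputPath)  // let the executors write the full result
}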

Page 28: Debugging & Tuning in Spark

java.io.IOException: No space left on device

● SPARK_WORKER_DIR
● SPARK_LOCAL_DIRS, spark.local.dir
● Shuffle files
  ○ Only deleted after the RDD object has been garbage-collected

Page 29: Debugging & Tuning in Spark

Other Tips

● Event logs
  ○ spark.eventLog.enabled=true
  ○ ${SPARK_HOME}/sbin/start-history-server.sh
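Event logging can also be set from code; a minimal sketch, where the log directory is an illustrative value that must exist and be readable by the History Server:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-event-logs")  // illustrative directory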

Page 30: Debugging & Tuning in Spark

Partitions

● Rule of thumb: ~128 MB per partition
● If #partitions <= 2000, but close, bump to just > 2000
● Increase #partitions by repartition()
● Decrease #partitions by coalesce()
● spark.sql.shuffle.partitions (default 200)

http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
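A sketch of the two resizing operations mentioned above; the counts are illustrative:

import org.apache.spark.rdd.RDD

def resize(rdd: RDD[String]): Unit = {
  val wider = rdd.repartition(2050)  // full shuffle; e.g. just above the 2000 threshold mentioned above
  val fewer = rdd.coalesce(50)       // merges partitions, avoiding a full shuffle
  println(s"${wider.getNumPartitions} partitions / ${fewer.getNumPartitions} partitions")
}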

Page 31: Debugging & Tuning in Spark

Executors, Cores, Memory!?

● 32 nodes
● 16 cores each
● 64 GB of RAM each
● If you have an application that needs 32 cores, what is the correct setting?

http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
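One way such a request could be expressed; the numbers are only an illustration (e.g. 8 executors × 4 cores = 32 cores), not necessarily the answer the slide has in mind:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "8")   // illustrative: 8 executors...
  .set("spark.executor.cores", "4")       // ...of 4 cores each = 32 cores in total
  .set("spark.executor.memory", "14g")    // well under 64 GB per node, leaving room for the OS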

Page 32: Debugging & Tuning in Spark

Why Is Spark Debugging / Tuning Hard?

● Distributed
● Lazy
● Hard to benchmark
● Spark is sensitive

Page 33: Debugging & Tuning in Spark

Conclusion

● When in doubt, repartition!
● Avoid shuffles if you can
● Choose a reasonable partition count
● "Premature optimization is the root of all evil" -- Donald Knuth