Optimizing Spark Applications with JVM- and OS-Level Tuning
Feb. 8, 2016 Tatsuhiro Chiba ([email protected])
IBM Research - Tokyo
Who am I?
Tatsuhiro Chiba ( 千葉 立寛 )
Staff Researcher at IBM Research - Tokyo
Research Interests
– Parallel and Distributed Systems and Middleware
– Parallel and Distributed Programming Languages
– High Performance Computing
Twitter: @tatsuhiro
Today's contents appear in:
– Appendix D of "Spark による実践データ解析" (O'Reilly Japan)
– "Workload Characterization and Optimization of TPC-H Queries on Apache Spark", IBM Research Reports
Summary – after applying JVM and OS tuning
Machine Spec: CPU: POWER8 3.3GHz (2 sockets x 12 cores), Memory: 1TB, Disk: 1TB, OS: Ubuntu 14.10 (kernel: 3.16.0-31-generic)
Optimized JVM Options: -Xmx24g -Xms24g -Xmn12g -Xgcthreads12 -Xtrace:none -Xnoloa -XlockReservation -Xnocompactgc -Xdisableexplicitgc -XX:-RuntimeInstrumentation -Xlp
Executor JVMs: 4
OS Settings: NUMA aware affinity=enabled, large page=enabled
Spark Version: 1.4.1
JVM Version: java version "1.8.0" (IBM J9 VM, build pxl6480sr2-20151023_01(SR2))
[Figure: Execution time (sec.) of TPC-H Q1, Q3, Q5, Q9, and Kmeans, original vs. optimized, with speedup (%).]
Benchmark 1 – Kmeans

// Kmeans driver code (Spark MLlib)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// input data is cached
val data = sc.textFile("file:///tmp/kmeans-data", 2)
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).persist()

// run Kmeans with a varying number of clusters
var bestK = (Double.MaxValue, 1)  // (best cost so far, best K); the slide initialized this to (100, 1)
for (k <- 2 to 11) {
  val clusters = new KMeans()
    .setK(k).setMaxIterations(5)
    .setRuns(1).setInitializationMode("random")
    .setEpsilon(1e-30).run(parsedData)
  // evaluate the clustering cost and keep the best K
  val error = clusters.computeCost(parsedData)
  if (bestK._1 > error) { bestK = (error, k) }
}
Kmeans application
– Varied the number of clusters 'K' over the same dataset
– The first Kmeans job takes much longer because it loads the data into memory
Synthetic data generator program
– Used BigDataBench, published at http://prof.ict.ac.cn/
– Generated a 6GB dataset containing over 65M data points
Benchmark 2 - TPC-H
TPC-H Benchmark on Spark SQL
– TPC-H is often used to benchmark SQL-on-Hadoop systems
– Spark SQL can run Hive QL directly through hiveserver2 (thrift server) and beeline (JDBC client)
– We modified the TPC-H queries published at https://github.com/rxin/TPC-H-Hive
Table data generator
– Used the DBGEN program and generated a 100GB dataset (scale factor = 100)
– Loaded the data into Hive tables in Parquet format with Snappy compression
TPC-H Q1 (Hive)

select
  l_returnflag, l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
  sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where l_shipdate <= '1998-09-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
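As an illustration of the Hive QL path, the same query can be issued programmatically through HiveContext (a minimal sketch for Spark 1.4, assuming the TPC-H tables are registered in the Hive metastore; only a shortened form of Q1 is shown):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("tpch-q1"))
val hc = new HiveContext(sc)   // exposes Hive QL and the Hive metastore
val q1 = hc.sql(
  """select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
    |from lineitem
    |where l_shipdate <= '1998-09-01'
    |group by l_returnflag, l_linestatus""".stripMargin)
q1.show()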
Machine & Software Spec and Spark Settings
Processor | # Cores | SMT | Memory | OS
POWER8 3.30 GHz * 2 | 24 cores (2 sockets * 12 cores) | 8 (192 hardware threads total) | 1TB | Ubuntu 14.10 (kernel 3.16.0-31)
Xeon E5-2699 v3 2.30 GHz | 36 cores (2 sockets * 18 cores) | 2 (72 hardware threads total) | 755GB | Ubuntu 15.04 (kernel 3.19.0-26)

Software | Version
Spark | 1.4.1, 1.5.2, 1.6.0
Hadoop (HDFS) | 2.6.0
Java | 1.8.0 (IBM J9 VM SR2)
Scala | 2.10.4
Default Spark Settings
– # of Executor JVMs: 1
– # of worker threads: 48
– Total heap size: 192GB (nursery = 48g, tenure = 144g)
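A hedged sketch of how these defaults map onto standard Spark 1.x properties (spark.executor.memory and spark.cores.max are documented properties; the values mirror the list above):

import org.apache.spark.SparkConf

// a minimal sketch of the default configuration: 1 executor, 48 threads, 192GB heap
val conf = new SparkConf()
  .set("spark.executor.memory", "192g")  // executor heap size (-Xmx)
  .set("spark.cores.max", "48")          // 48 worker threads in total (standalone mode)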
JVM Tuning – Heap Space Sizing

Garbage collection tuning points
– GC algorithms
– GC threads
– Heap sizing
Heap sizing is the simplest way to reduce GC overhead
– A bigger young (nursery) space achieved an over 30% improvement
But a too-small old (tenure) space may cause many global GCs
– Cached RDDs stay in the Java heap
[Figure: Execution time (sec.) of Kmeans and TPC-H Q9 with young space sizes -Xmn48g, -Xmn96g, and -Xmn144g.]
Young Space (-Xmn) | Execution Time (sec.) | GC Ratio (%) | Minor GC Avg. Pause (sec.) | # Minor GC | # Major GC
48g (default) | 400 | 20% | 2.1 | 39 | 1
96g | 306 | 18% | 3.4 | 22 | 1
144g | 300 | 14% | 3.6 | 14 | 0
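To reproduce measurements like these, the young-space size and verbose GC logging can be passed to the executors together (a minimal sketch using the same J9 flags that appear in the option table on the next slide):

import org.apache.spark.SparkConf

// a minimal sketch: vary -Xmn and log GC pauses for each executor JVM
val conf = new SparkConf()
  .set("spark.executor.memory", "192g")  // total heap (-Xmx)
  .set("spark.executor.extraJavaOptions",
    "-Xmn96g -verbose:gc -Xverbosegclog:/tmp/gc.log")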
JVM Tuning – Other Options
JVM option tuning points
– Monitor thread tuning
– GC tuning
– Java thread tuning
– JIT tuning, etc.
Result
– Proper JVM options improved application performance by over 20%

[Figure: Execution time (sec.) of Q1 and Q5 with option sets 0-4, with speedup over the baseline (%).]
# | JVM Options
Option 0 (baseline) | -Xmn96g -Xdump:heap:none -Xdump:system:none -XX:+RuntimeInstrumentation -agentpath:/path/to/libjvmti_oprofile.so -verbose:gc -Xverbosegclog:/tmp/gc.log -Xjit:verbose={compileStart,compileEnd},vlog=/tmp/jit.log
Option 1 (Monitor) | Option 0 + "-Xtrace:none"
Option 2 (GC) | Option 1 + "-Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc"
Option 3 (Thread) | Option 2 + "-XlockReservation"
Option 4 (JIT) | Option 3 + "-XX:-RuntimeInstrumentation"
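Putting the rows together, the cumulative option-4 stack could be handed to the executors as follows (a hedged sketch; the baseline's dump, verbose-GC, and profiling flags are omitted since they exist only for measurement):

import org.apache.spark.SparkConf

// a minimal sketch: the cumulative option-4 set from the table above
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-Xmn96g -Xtrace:none " +
    "-Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc " +
    "-XlockReservation -XX:-RuntimeInstrumentation")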
JVM Tuning – JVM Counts
Experiment
– Kept the number of worker threads and the total heap constant
– Changed the number of Executor JVMs:
  • 1 JVM: 48 worker threads & 192GB heap
  • 2 JVMs: 24 worker threads & 96GB heap each
  • 4 JVMs: 12 worker threads & 48GB heap each
Result
– Using a single big Executor JVM is not always best
– Dividing into smaller JVMs:
  • helps to reduce GC overhead
  • helps to reduce resource contention
Kmeans case
– The performance gap comes from the first Kmeans job, especially from data loading
– Once the RDD is loaded in memory, computation performance is similar (a configuration sketch follows the figure below)
[Figures: (left) Execution time (sec.) and improvement (%) of Q1, Q3, Q5, Q9, and Kmeans with 1, 2, and 4 Executor JVMs; (right) per-job execution time (sec.) of the Kmeans clustering iterations (K = 2, 3, .. 11) with 1, 2, and 4 JVMs.]
OS Tuning – NUMA aware process affinity
Setting NUMA aware process affinity for each Executor JVM helps to speed things up
– by reducing scheduling overhead
– by reducing cache misses and stall cycles
Result
– Achieved a 3 – 14% improvement in all benchmarks without any bad effects
[Diagram: four Spark Executor JVMs, 12 threads each, bound to NUMA nodes 0-3 across two sockets, each with local DRAM, launched via:
numactl -c [0-7],[8-15],[16-23],[24-31],[32-39],[40-47]]
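For instance, a single executor JVM could be pinned to one NUMA node like this (a sketch; --cpunodebind/--membind are standard numactl flags, and the java command line is illustrative rather than the exact one Spark generates):

numactl --cpunodebind=0 --membind=0 \
  java -Xmx48g ... org.apache.spark.executor.CoarseGrainedExecutorBackend ...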
[Figure: Execution time (sec.) of Q1, Q5, Q9, and Kmeans with NUMA affinity off vs. on, with speedup (%).]
OS Tuning – Large Page
How to use large pages
– Reserve large pages on Linux by changing a kernel parameter
– Append "-Xlp" to the Executor JVM options
Result
– Achieved a 3 – 5% improvement

[Figure: Kmeans execution time (sec.) with page sizes 64K vs. 16M, with NUMA affinity off and on.]
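A hedged sketch of the kernel-side setup (vm.nr_hugepages is the standard Linux huge-page knob; the page count is illustrative, and 16MB is the huge page size on this POWER8 system):

# reserve 16MB huge pages (illustrative count), then add -Xlp to the executor JVM options
sysctl -w vm.nr_hugepages=4096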
Comparison of Default and Optimized w/ 1.4.1, 1.5.2, and 1.6.0
Newer versions generally achieve better performance out of the box
JVM & OS tuning is still helpful for improving Spark performance
Tungsten & other new features (e.g. Unified Memory Management) can reduce GC overhead drastically
[Figures: Execution time (sec.) of Q1, Q3, Q5 (left) and Q9, Q19, Q21 (right) on Spark 1.4.1, 1.5.2, and 1.6.0, default vs. optimized.]
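For reference, Spark 1.6's unified memory manager can be toggled for comparisons like the one above (a sketch; spark.memory.useLegacyMode and spark.memory.fraction are documented 1.6 properties):

import org.apache.spark.SparkConf

// a minimal sketch: 1.6 defaults to unified memory management
val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "false")  // set to true to emulate pre-1.6 behavior
  .set("spark.memory.fraction", "0.75")        // 1.6 default share for execution + storage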