Optimizing Spark Applications with JVM- and OS-Level Tuning
Feb. 8, 2016 Tatsuhiro Chiba ([email protected])
IBM Research - Tokyo
Who am I?
Tatsuhiro Chiba ( 千葉 立寛 )
Staff Researcher at IBM Research - Tokyo
Research Interests
– Parallel and Distributed Systems and Middleware
– Parallel and Distributed Programming Languages
– High Performance Computing
Twitter: @tatsuhiro
Today's contents appear in:
– Appendix D of "Spark による実践データ解析" (O'Reilly Japan)
– "Workload Characterization and Optimization of TPC-H Queries on Apache Spark", IBM Research Reports
Summary – after applying JVM and OS tuning
Machine Spec: CPU: POWER8 3.3GHz (2 sockets x 12 cores), Memory: 1TB, Disk: 1TB, OS: Ubuntu 14.10 (kernel: 3.16.0-31-generic)
Optimized JVM Options: -Xmx24g -Xms24g -Xmn12g -Xgcthreads12 -Xtrace:none -Xnoloa -XlockReservation -Xnocompactgc -Xdisableexplicitgc -XX:-RuntimeInstrumentation -Xlp
Executor JVMs: 4
OS Settings: NUMA aware affinity=enabled, large page=enabled
Spark Version: 1.4.1
JVM Version: java version "1.8.0" (IBM J9 VM, build pxl6480sr2-20151023_01(SR2))
[Figure: Execution time (sec.) of TPC-H Q1, Q3, Q5, Q9, and Kmeans, original vs. optimized, with speedup (%).]
Benchmark 1 – Kmeans

// Kmeans driver code (Spark MLlib)
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// input data is cached
val data = sc.textFile("file:///tmp/kmeans-data", 2)
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).persist()

// run Kmeans with a varying number of clusters
var bestK = (Double.MaxValue, 1)  // (best cost so far, best K); the slide initialized this to (100, 1)
for (k <- 2 to 11) {
  val clusters = new KMeans()
    .setK(k).setMaxIterations(5)
    .setRuns(1).setInitializationMode("random")
    .setEpsilon(1e-30).run(parsedData)
  // evaluate the clustering cost and keep the best K
  val error = clusters.computeCost(parsedData)
  if (bestK._1 > error) { bestK = (error, k) }
}
Kmeans application
– Varied the number of clusters 'K' over the same dataset
– The first Kmeans job takes much longer because it loads the data into memory
Synthetic data generator program
– Used BigDataBench, published at http://prof.ict.ac.cn/
– Generated a 6GB dataset containing over 65M data points
Benchmark 2 - TPC-H
TPC-H Benchmark on Spark SQL
– TPC-H is often used to benchmark SQL-on-Hadoop systems
– Spark SQL can run Hive QL directly through hiveserver2 (thrift server) and beeline (JDBC client)
– We modified the TPC-H queries published at https://github.com/rxin/TPC-H-Hive
Table data generator
– Used the DBGEN program and generated a 100GB dataset (scale factor = 100)
– Loaded the data into Hive tables in Parquet format with Snappy compression
TPC-H Q1 (Hive)

select
  l_returnflag, l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
  sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
from lineitem
where l_shipdate <= '1998-09-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
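As an illustration of the Hive QL path, the same query can be issued programmatically through HiveContext (a minimal sketch for Spark 1.4, assuming the TPC-H tables are registered in the Hive metastore; only a shortened form of Q1 is shown):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("tpch-q1"))
val hc = new HiveContext(sc)   // exposes Hive QL and the Hive metastore
val q1 = hc.sql(
  """select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty
    |from lineitem
    |where l_shipdate <= '1998-09-01'
    |group by l_returnflag, l_linestatus""".stripMargin)
q1.show()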
Machine & Software Spec and Spark Settings
Processor | # Cores | SMT | Memory | OS
POWER8 3.30 GHz * 2 | 24 cores (2 sockets * 12 cores) | 8 (192 hardware threads total) | 1TB | Ubuntu 14.10 (kernel 3.16.0-31)
Xeon E5-2699 v3 2.30 GHz | 36 cores (2 sockets * 18 cores) | 2 (72 hardware threads total) | 755GB | Ubuntu 15.04 (kernel 3.19.0-26)

Software | Version
Spark | 1.4.1, 1.5.2, 1.6.0
Hadoop (HDFS) | 2.6.0
Java | 1.8.0 (IBM J9 VM SR2)
Scala | 2.10.4
Default Spark Settings
– # of Executor JVMs: 1
– # of worker threads: 48
– Total heap size: 192GB (nursery = 48g, tenure = 144g)
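A hedged sketch of how these defaults map onto standard Spark 1.x properties (spark.executor.memory and spark.cores.max are documented properties; the values mirror the list above):

import org.apache.spark.SparkConf

// a minimal sketch of the default configuration: 1 executor, 48 threads, 192GB heap
val conf = new SparkConf()
  .set("spark.executor.memory", "192g")  // executor heap size (-Xmx)
  .set("spark.cores.max", "48")          // 48 worker threads in total (standalone mode)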
JVM Tuning – Heap Space Sizing

Garbage collection tuning points
– GC algorithms
– GC threads
– Heap sizing
Heap sizing is the simplest way to reduce GC overhead
– A bigger young (nursery) space achieved an over 30% improvement
But a too-small old (tenure) space may cause many global GCs
– Cached RDDs stay in the Java heap
[Figure: Execution time (sec.) of Kmeans and TPC-H Q9 with young space sizes -Xmn48g, -Xmn96g, and -Xmn144g.]
Young Space (-Xmn) | Execution Time (sec.) | GC Ratio (%) | Minor GC Avg. Pause (sec.) | # Minor GC | # Major GC
48g (default) | 400 | 20% | 2.1 | 39 | 1
96g | 306 | 18% | 3.4 | 22 | 1
144g | 300 | 14% | 3.6 | 14 | 0
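To reproduce measurements like these, the young-space size and verbose GC logging can be passed to the executors together (a minimal sketch using the same J9 flags that appear in the option table on the next slide):

import org.apache.spark.SparkConf

// a minimal sketch: vary -Xmn and log GC pauses for each executor JVM
val conf = new SparkConf()
  .set("spark.executor.memory", "192g")  // total heap (-Xmx)
  .set("spark.executor.extraJavaOptions",
    "-Xmn96g -verbose:gc -Xverbosegclog:/tmp/gc.log")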
JVM Tuning – Other Options
JVM option tuning points
– Monitor thread tuning
– GC tuning
– Java thread tuning
– JIT tuning, etc.
Result
– Proper JVM options improved application performance by over 20%

[Figure: Execution time (sec.) of Q1 and Q5 with option sets 0-4, with speedup over the baseline (%).]
# | JVM Options
Option 0 (baseline) | -Xmn96g -Xdump:heap:none -Xdump:system:none -XX:+RuntimeInstrumentation -agentpath:/path/to/libjvmti_oprofile.so -verbose:gc -Xverbosegclog:/tmp/gc.log -Xjit:verbose={compileStart,compileEnd},vlog=/tmp/jit.log
Option 1 (Monitor) | Option 0 + "-Xtrace:none"
Option 2 (GC) | Option 1 + "-Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc"
Option 3 (Thread) | Option 2 + "-XlockReservation"
Option 4 (JIT) | Option 3 + "-XX:-RuntimeInstrumentation"
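Putting the rows together, the cumulative option-4 stack could be handed to the executors as follows (a hedged sketch; the baseline's dump, verbose-GC, and profiling flags are omitted since they exist only for measurement):

import org.apache.spark.SparkConf

// a minimal sketch: the cumulative option-4 set from the table above
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-Xmn96g -Xtrace:none " +
    "-Xgcthreads48 -Xnoloa -Xnocompactgc -Xdisableexplicitgc " +
    "-XlockReservation -XX:-RuntimeInstrumentation")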
JVM Tuning – JVM Counts
Experiment
– Kept the number of worker threads and the total heap constant
– Changed the number of Executor JVMs:
  • 1 JVM: 48 worker threads & 192GB heap
  • 2 JVMs: 24 worker threads & 96GB heap each
  • 4 JVMs: 12 worker threads & 48GB heap each
Result
– Using a single big Executor JVM is not always best
– Dividing into smaller JVMs:
  • helps to reduce GC overhead
  • helps to reduce resource contention
Kmeans case
– The performance gap comes from the first Kmeans job, especially from data loading
– Once the RDD is loaded in memory, computation performance is similar (a configuration sketch follows the figure below)
[Figures: (left) Execution time (sec.) and improvement (%) of Q1, Q3, Q5, Q9, and Kmeans with 1, 2, and 4 Executor JVMs; (right) per-job execution time (sec.) of the Kmeans clustering iterations (K = 2, 3, .. 11) with 1, 2, and 4 JVMs.]
OS Tuning – NUMA aware process affinity
Setting NUMA aware process affinity for each Executor JVM helps to speed things up
– by reducing scheduling overhead
– by reducing cache misses and stall cycles
Result
– Achieved a 3 – 14% improvement in all benchmarks without any bad effects
[Diagram: four Spark Executor JVMs, 12 threads each, bound to NUMA nodes 0-3 across two sockets, each with local DRAM, launched via:
numactl -c [0-7],[8-15],[16-23],[24-31],[32-39],[40-47]]
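For instance, a single executor JVM could be pinned to one NUMA node like this (a sketch; --cpunodebind/--membind are standard numactl flags, and the java command line is illustrative rather than the exact one Spark generates):

numactl --cpunodebind=0 --membind=0 \
  java -Xmx48g ... org.apache.spark.executor.CoarseGrainedExecutorBackend ...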
[Figure: Execution time (sec.) of Q1, Q5, Q9, and Kmeans with NUMA affinity off vs. on, with speedup (%).]
OS Tuning – Large Page
How to use large pages
– Reserve large pages on Linux by changing a kernel parameter
– Append "-Xlp" to the Executor JVM options
Result
– Achieved a 3 – 5% improvement

[Figure: Kmeans execution time (sec.) with page sizes 64K vs. 16M, with NUMA affinity off and on.]
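A hedged sketch of the kernel-side setup (vm.nr_hugepages is the standard Linux huge-page knob; the page count is illustrative, and 16MB is the huge page size on this POWER8 system):

# reserve 16MB huge pages (illustrative count), then add -Xlp to the executor JVM options
sysctl -w vm.nr_hugepages=4096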
Comparison of Default and Optimized w/ 1.4.1, 1.5.2, and 1.6.0
Newer versions generally achieve better performance out of the box
JVM & OS tuning is still helpful for improving Spark performance
Tungsten & other new features (e.g. Unified Memory Management) can reduce GC overhead drastically
[Figures: Execution time (sec.) of Q1, Q3, Q5 (left) and Q9, Q19, Q21 (right) on Spark 1.4.1, 1.5.2, and 1.6.0, default vs. optimized.]
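For reference, Spark 1.6's unified memory manager can be toggled for comparisons like the one above (a sketch; spark.memory.useLegacyMode and spark.memory.fraction are documented 1.6 properties):

import org.apache.spark.SparkConf

// a minimal sketch: 1.6 defaults to unified memory management
val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "false")  // set to true to emulate pre-1.6 behavior
  .set("spark.memory.fraction", "0.75")        // 1.6 default share for execution + storage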