Making sense of performance and identifying stragglers in Data Analytics Framework CSCI 8780 Advanced Distributed Systems Manish Ranjan and Narita Pandhe
Apr 14, 2017
Introduction
- Large-scale data analytics has become widespread
- Much research is devoted to improving the performance of data analytics frameworks
- But comparatively little effort has been spent on identifying the performance bottlenecks
More resource efficient
Faster
Experiments
What Cluster Configuration did we use?
- 1 master, 6 slaves
- Master config: 64-bit, 8GB RAM, 2 cores, 50GB SSD
- Slave config (each): 64-bit, 2GB RAM, 1 core, 30GB SSD
Config-related modifications: e.g. replication + SSDs
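To make the constraints of this cluster concrete, the sketch below adds up the worker resources listed above and estimates how much raw disk a replicated input dataset occupies in HDFS. The replication factor of 3 is an assumption (it is the HDFS default for `dfs.replication`); the slides do not state the value that was actually used. `NodeSpec`, `worker_totals` and `raw_hdfs_footprint_gb` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    ram_gb: int
    cores: int
    ssd_gb: int


# Values taken from the cluster configuration slide.
MASTER = NodeSpec(ram_gb=8, cores=2, ssd_gb=50)
SLAVE = NodeSpec(ram_gb=2, cores=1, ssd_gb=30)
NUM_SLAVES = 6

# Assumption: HDFS default replication factor; not stated in the slides.
REPLICATION = 3


def worker_totals(spec: NodeSpec, count: int) -> NodeSpec:
    """Aggregate resources across `count` identical worker nodes."""
    return NodeSpec(spec.ram_gb * count, spec.cores * count, spec.ssd_gb * count)


def raw_hdfs_footprint_gb(dataset_gb: int, replication: int = REPLICATION) -> int:
    """Raw disk used by one dataset once every block is replicated."""
    return dataset_gb * replication


totals = worker_totals(SLAVE, NUM_SLAVES)
print(f"workers: {totals.ram_gb}GB RAM, {totals.cores} cores, {totals.ssd_gb}GB SSD")
for size_gb in (1, 5, 10):
    print(f"{size_gb}GB input -> {raw_hdfs_footprint_gb(size_gb)}GB raw HDFS storage")
```

This only counts the replicated input; sort output and intermediate shuffle data need additional space on the same disks.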
First: Benchmarking the NameNode
To test the NameNode hardware and configuration first: NNBench
What it does:
Generates a lot of HDFS-related requests
Why it does it:
To put a high HDFS management stress on the NameNode
How it does it:
Simulates requests for creating, reading, renaming and deleting files on HDFS
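The four request types above correspond to NNBench's `create_write`, `open_read`, `rename` and `delete` operations. The following sketch only assembles the command line for one such run, without executing it. The jar name, base directory and numeric values are assumptions chosen for illustration (the test jar's name and location depend on the Hadoop version), and `nnbench_command` is a hypothetical helper, not part of Hadoop.

```python
import shlex

# Assumption: name/location of the Hadoop test jar varies by version and distribution.
NNBENCH_JAR = "hadoop-test.jar"

# NNBench operations matching create, read, rename and delete requests.
OPERATIONS = ("create_write", "open_read", "rename", "delete")


def nnbench_command(operation, num_files=1000, maps=12, reduces=6,
                    base_dir="/benchmarks/NNBench"):
    """Build (but do not run) an NNBench invocation as an argument list."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown NNBench operation: {operation}")
    return [
        "hadoop", "jar", NNBENCH_JAR, "nnbench",
        "-operation", operation,
        "-maps", str(maps),
        "-reduces", str(reduces),
        "-numberOfFiles", str(num_files),
        "-baseDir", base_dir,
    ]


for op in OPERATIONS:
    print(shlex.join(nnbench_command(op)))
```

Running `create_write` first matters in practice, since the read, rename and delete operations act on files that an earlier run created under the same base directory.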
What Workload did we use?
- TeraSort benchmark suite
- Goal of TeraSort: sort 1TB of data (or any other amount of data you want) as fast as possible.
- Limited by our cluster configuration, we performed several experiments with data of size 1GB, 5GB and 10GB.
- The TeraSort benchmark can also be used to iron out your Hadoop configuration
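TeraGen takes the number of rows to generate rather than a size in bytes, and each row is 100 bytes, so the dataset sizes above have to be converted before a run. The sketch below does that conversion and lists the three stages of the suite (`teragen`, `terasort`, `teravalidate`) as argument lists. It assumes decimal gigabytes; the jar name and HDFS paths are hypothetical placeholders.

```python
ROW_BYTES = 100   # TeraGen writes fixed 100-byte rows
GB = 10 ** 9      # assumption: decimal gigabytes


def teragen_rows(size_gb: int) -> int:
    """Number of 100-byte rows TeraGen must write for `size_gb` gigabytes."""
    return size_gb * GB // ROW_BYTES


def terasort_commands(size_gb, jar="hadoop-examples.jar", root="/benchmarks/terasort"):
    """Argument lists for the generate, sort and validate stages (not executed)."""
    data_in = f"{root}/{size_gb}GB-input"
    data_out = f"{root}/{size_gb}GB-output"
    report = f"{root}/{size_gb}GB-validate"
    return [
        ["hadoop", "jar", jar, "teragen", str(teragen_rows(size_gb)), data_in],
        ["hadoop", "jar", jar, "terasort", data_in, data_out],
        ["hadoop", "jar", jar, "teravalidate", data_out, report],
    ]


for size_gb in (1, 5, 10):
    print(f"{size_gb}GB -> {teragen_rows(size_gb):,} rows")
```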
Hadoop
Nodes: i-6c76c1da (M), i-40684ef0 (s1), i-41684ef1 (s2), i-42684ef2 (s3), i-43684ef3 (s4), i-4e684efe (s5), i-4f684eff (s6)
Red: s6, Dark Green: s4
Observations for 10GB
Red: s6, Dark Green: s4
Identified Stragglers
Spark
Orange: s2, Red: s6
Hadoop vs. Spark
Red: s6, Bright Blue: s5, Orange: s2
Conclusions
- A straggler task spends an unusually long amount of time in a particular part of task execution.
- It is usually not too hard to find a straggler in a specific execution; what is hard is finding one consistently.
- Still, we were lucky enough to spot a few even on a modestly sized cluster, which emphasizes the need to understand the cluster's meta information well, e.g. DFS disk read time, shuffle write time, shuffle read time, and Java's garbage collection.
- Since Spark:
- often breaks jobs into many more tasks
- has much lower task launch overhead than Hadoop
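The idea of a straggler as a task that spends unusually long in one part of its execution can be sketched as a two-step check: flag tasks whose total duration is well above the median, then see which measured phase accounts for the excess. The 1.5x-median threshold, the field names and the sample numbers below are all assumptions made for illustration; they are not measurements from the experiments in this deck.

```python
from statistics import median

# Assumption: a task counts as a straggler if it runs longer than 1.5x the median.
STRAGGLER_FACTOR = 1.5

# Per-task phase timings considered when explaining a straggler (seconds).
PHASES = ("dfs_read", "shuffle_write", "shuffle_read", "gc")


def find_stragglers(tasks, factor=STRAGGLER_FACTOR):
    """Return tasks whose duration exceeds `factor` times the median duration."""
    med = median(t["duration"] for t in tasks)
    return [t for t in tasks if t["duration"] > factor * med]


def dominant_phase(task, tasks):
    """Phase in which `task` exceeds the per-phase median by the largest amount."""
    excess = {p: task[p] - median(t[p] for t in tasks) for p in PHASES}
    return max(excess, key=excess.get)


# Made-up sample data (hypothetical nodes and timings).
tasks = [
    {"id": 0, "node": "node-a", "duration": 10.0, "dfs_read": 2.0,
     "shuffle_write": 3.0, "shuffle_read": 3.0, "gc": 0.5},
    {"id": 1, "node": "node-b", "duration": 11.0, "dfs_read": 2.2,
     "shuffle_write": 3.1, "shuffle_read": 3.2, "gc": 0.6},
    {"id": 2, "node": "node-c", "duration": 10.5, "dfs_read": 2.1,
     "shuffle_write": 3.0, "shuffle_read": 3.1, "gc": 0.5},
    {"id": 3, "node": "node-d", "duration": 24.0, "dfs_read": 2.3,
     "shuffle_write": 3.2, "shuffle_read": 15.0, "gc": 0.7},
]

for t in find_stragglers(tasks):
    print(f"task {t['id']} on {t['node']}: slow in {dominant_phase(t, tasks)}")
```

A median-based threshold is used here because a single very slow task barely moves the median, whereas it would inflate a mean-based cutoff.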
References
- Making Sense of Performance in Data Analytics Frameworks, Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun, UC Berkeley, ICSI, VMware, Seoul National University
- No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics, https://www.cs.duke.edu/starfish/files/socc11-cluster-sizing.pdf
- http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
- https://github.com/ehiggs/spark-terasort
- aws.amazon.com