Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Jan 08, 2017

Shelan Perera
Transcript
Page 1: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Reproducible distributed experiments on cloud

Apache Flink vs Apache Spark

Shelan Perera, Ashansa Perera, Kamal Hakimzadeh

Page 2: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

“Reproducing experiments with minimal effort”

Page 3: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Spark and Flink

▷ Batch Processing vs. Stream Processing

▷ Micro Batching vs. Natural Data Flow

▷ Good fit for scalable deployment in the cloud

Page 4: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Motivation

▷ Validate Performance claims

▷ Remove deployment overhead

▷ Design reproducible experiments

Page 5: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Karamel => “Framework for reproducible distributed experiments”

Page 6: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Benchmark - Batch

Teragen - To generate data (Hadoop)

Terasort - Benchmarking algorithm (Spark, Flink)

Page 8: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Our Deployment

1 Namenode ⇒ Master (low processing)

2 Worker nodes ⇒ Slaves (high processing)

Page 9: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Deployment

Karamel (driven by a Karamel config) provisions:

EC2 Master: Hadoop NameNode, Spark Master, Flink JobManager

EC2 Slave (x 2): Hadoop DataNode, Spark Worker, Flink TaskManager

Page 10: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Configuration

                     Master / Namenode   Slave / Worker
                     (m3.xlarge)         (i2.4xlarge)
CPU (GHz)            2.6                 2.5
No. of vCPUs         4                   16
Memory (GB)          16                  122
Storage: SSD (GB)    80                  1600

Page 11: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Experiment

Hadoop MR: Teragen ⇒ HDFS ⇒ Spark/Flink: Terasort

Input sizes: 200 / 400 / 600 GB
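TeraGen is sized by row count rather than by bytes, and each row it writes is 100 bytes, so the three input sizes translate into row counts as sketched below. This is only an illustration: the helper name is hypothetical, and the slides do not say whether "GB" means 10^9 or 2^30 bytes, so the decimal default is an assumption.

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(size_gb, bytes_per_gb=10**9):
    """Rows to request from TeraGen for roughly size_gb of data.

    bytes_per_gb is an assumption: pass 2**30 if the sizes are binary GB.
    """
    return (size_gb * bytes_per_gb) // TERAGEN_ROW_BYTES


# Row counts for the three input sizes used in the experiment.
for size_gb in (200, 400, 600):
    print(size_gb, "GB ->", teragen_rows(size_gb), "rows")
```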

Page 12: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Results

Batch Processing

Page 13: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Application Performance

Page 14: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Flink: 1.5x faster than Spark

Page 15: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Mainly because...

▷ Spark: does not overlap stages

▷ Flink: pipelines stages
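A toy timing model can show why overlapping stages helps. It is not how either engine actually schedules work, and the chunk count and per-stage costs are made-up numbers, not measurements from this experiment. The pipelined estimate is the textbook one: fill the pipeline once, then let the slowest stage set the pace.

```python
def staged_time(chunks, stage_costs):
    """Each stage processes every chunk before the next stage starts."""
    return sum(chunks * cost for cost in stage_costs)


def pipelined_time(chunks, stage_costs):
    """Stages overlap: a chunk moves on as soon as the previous stage emits it.

    The first chunk pays every stage cost; each later chunk adds only the
    cost of the slowest stage.
    """
    return sum(stage_costs) + (chunks - 1) * max(stage_costs)


# Hypothetical workload: 10 chunks, two stages costing 1 time unit per chunk.
print("staged:   ", staged_time(10, [1.0, 1.0]))
print("pipelined:", pipelined_time(10, [1.0, 1.0]))
```

In this model the benefit grows with the number of chunks and is largest when the stage costs are balanced.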

Page 16: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Collectl-Monitor

● Tool used to collect and draw results.

● https://github.com/shelan/collectl-monitoring

Page 17: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

System Performance - CPU (%)

Page 18: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

System Performance - Memory (GB)

Page 19: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

System Performance - Disk (MB/s)

Page 20: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

System Performance - Network (KB/s)

Page 21: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Load Balancing - Workers (CPU %)

Page 22: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Load Balancing - Workers (CPU %)

Page 23: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Outcome

▷ Performance Comparison Results

▷ Karamel experiments to reproduce the same results with minimal effort

Page 24: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

How not to reproduce

“our problems”

Page 25: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

EC2 advertises 800 GB disks, but df reports only 30 GB.

On I2 or R3 instances you must partition the disks and create a file system manually.

Page 26: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Large Spark or Flink batch applications can fail for lack of disk space

Point the Flink temp directory and the Spark local directory at a partition with at least enough space to store the total input.
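One way to catch this before launching a job is to check the free space on the partition that holds the scratch directory. The sketch below is only an assumption about how such a check might look: the function name is hypothetical and "." stands in for the real directory. In Spark the directory is set with the spark.local.dir property; the name of the Flink key has changed between versions, so check the configuration documentation for the release in use.

```python
import shutil


def has_enough_space(directory, required_bytes):
    """True if the filesystem holding `directory` has at least required_bytes free."""
    return shutil.disk_usage(directory).free >= required_bytes


# Hypothetical pre-flight check for the largest run (600 GB, decimal GB assumed).
# "." is a placeholder for the configured Flink temp / Spark local directory.
input_bytes = 600 * 10**9
print("enough space:", has_enough_space(".", input_bytes))
```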

Page 27: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Reproducing experiments on EC2 may cost you a lot

Karamel also supports spot instances, which can reduce the cost by 10x

Page 28: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

IncompatibleClassChangeError when running StreamBench built for MR2 on Hadoop 2.x

No dependency on a previous version was declared explicitly, but one of the dependencies (Mahout) had internal references to a Hadoop 1.x jar
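A quick way to spot this kind of conflict is to scan the jars that end up on the classpath for a Hadoop 1.x artifact. The sketch below assumes the Hadoop 1.x jar follows the hadoop-core-1.x.y.jar naming; the function name and the jar list are hypothetical examples, not the actual StreamBench classpath.

```python
import re

# Assumption: Hadoop 1.x is packaged as a single hadoop-core-1.x.y.jar.
HADOOP1_JAR = re.compile(r"^hadoop-core-1\.\d+(\.\d+)*\.jar$")


def find_hadoop1_jars(jar_names):
    """Return the jar file names that look like Hadoop 1.x artifacts."""
    return [name for name in jar_names if HADOOP1_JAR.match(name)]


# Hypothetical classpath listing, for illustration only.
classpath = [
    "hadoop-common-2.7.1.jar",
    "mahout-core-0.9.jar",
    "hadoop-core-1.2.1.jar",
]
print(find_hadoop1_jars(classpath))
```

Once the offending jar is identified, the usual remedy is to exclude it from the transitive dependency in the build configuration.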

Page 29: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Summary

▷ Introducing reproducible experiments on cloud

▷ Performance Comparison of Spark and Flink

▷ Reproducible experiments are available online (https://github.com/karamel-lab)

Page 30: Apache Flink vs Apache Spark - Reproducible experiments on cloud.

Thanks ..!!