Top Banner
EFFECTIVE SPARK WITH ALLUXIO Spark Summit EU Jiri Simsa, Alluxio, Inc.
39

Spark Summit EU talk by Jiri Simsa

Apr 16, 2017

Download

Data & Analytics

Spark Summit
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Spark Summit EU talk by Jiri Simsa

EFFECTIVE SPARK WITH ALLUXIO

Spark Summit EUJiri Simsa, Alluxio, Inc.

Page 2: Spark Summit EU talk by Jiri Simsa

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

2

Page 3: Spark Summit EU talk by Jiri Simsa

3

BIG DATA ECOSYSTEM YESTERDAY

Page 4: Spark Summit EU talk by Jiri Simsa

BIG DATA ECOSYSTEM TODAY

3

Page 5: Spark Summit EU talk by Jiri Simsa

BIG DATA ECOSYSTEM ISSUES

3

Page 6: Spark Summit EU talk by Jiri Simsa

BIG DATA ECOSYSTEM WITH ALLUXIO

FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System

GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface

3

Page 7: Spark Summit EU talk by Jiri Simsa

BIG DATA ECOSYSTEM WITH ALLUXIO

FUSE Compatible File SystemHadoop Compatible File System Native Key-Value InterfaceNative File System

Unifying Data at Memory Speed

GlusterFS InterfaceAmazon S3 Interface Swift InterfaceHDFS Interface

3

Page 8: Spark Summit EU talk by Jiri Simsa

WHY ALLUXIO

4

Co-located compute and data with memory-speed access to data

Virtualized across different storage systems under a unified namespace

Scale-out architecture

File system API, software only

Page 9: Spark Summit EU talk by Jiri Simsa

5

Unification

New workflows across any data in any storage system

Orders of magnitude improvement in run time

Choice in compute and storage – grow each independently, buy only what is needed

Performance Flexibility

BENEFITS

Page 10: Spark Summit EU talk by Jiri Simsa

6

Popular Open Source Projects’ Growth

Spark Kafka Cassandra HDFS

FASTEST GROWING BIG DATA PROJECTS

Page 11: Spark Summit EU talk by Jiri Simsa

6

• Fastest growing open-source project in the big data ecosystem

• 300+ contributors from 100+ organizations

• Running in large production clusters

Popular Open Source Projects’ Growth

Alluxio

Spark Kafka Cassandra HDFS

FASTEST GROWING BIG DATA PROJECTS

Page 12: Spark Summit EU talk by Jiri Simsa

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

7

Page 13: Spark Summit EU talk by Jiri Simsa

ACCELERATE I/O TO/FROM REMOTE STORAGE

Baidu’s PMs and analysts run

interactive queries to gain insights

into their products and business

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

ALLUXIO

Baidu File System

8

Page 14: Spark Summit EU talk by Jiri Simsa

ACCELERATE I/O TO/FROM REMOTE STORAGE

The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.

- Baidu

RESULTS

• Data queries are now 30x faster with Alluxio

• Alluxio cluster runs stably, providing over 50TB of RAM space

• By using Alluxio, batch queries usually lasting over 15 minutes were transformed into an interactive query taking less than 30 seconds

Baidu’s PMs and analysts run

interactive queries to gain insights

into their products and business

• 200+ nodes deployment

• 2+ petabytes of storage

• Mix of memory + HDD

ALLUXIO

Baidu File System

8

Page 15: Spark Summit EU talk by Jiri Simsa

SHARE DATA ACROSS JOBS AT MEMORY SPEED

Barclays uses query and machine

learning to train models for risk

management

• 6 node deployment

• 1TB of storage

• Memory only

ALLUXIO

Relational Database

9

Page 16: Spark Summit EU talk by Jiri Simsa

SHARE DATA ACROSS JOBS AT MEMORY SPEED

Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.

- Barclays

RESULTS

• Barclays workflow iteration time decreased from hours to seconds

• Alluxio enabled workflows that were impossible before

• By keeping data only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds

Barclays uses query and machine

learning to train models for risk

management

• 6 node deployment

• 1TB of storage

• Memory only

ALLUXIO

Relational Database

9

Page 17: Spark Summit EU talk by Jiri Simsa

MANAGE DATA ACROSS STORAGE SYSTEMS

• 200+ nodes deployment

• 6 billion logs (4.5 TB) daily

• Mix of Memory + HDD

ALLUXIO

Qunar uses real-time machine

learning for their website ads.

10

Page 18: Spark Summit EU talk by Jiri Simsa

MANAGE DATA ACROSS STORAGE SYSTEMS

We’ve been running Alluxio in production for over 9 months, Alluxio’s unified namespace enable different applications and frameworks to easily interact with data from different storage systems.

- Qunar

RESULTS

• Data sharing among Spark Streaming, Spark batch and Flink jobs provide efficient data sharing

• Improved the performance of their system with 15x – 300x speedups

• Tiered storage feature manages storage resources including memory and HDD

• 200+ nodes deployment

• 6 billion logs (4.5 TB) daily

• Mix of Memory + HDD

ALLUXIO

Qunar uses real-time machine

learning for their website ads.

10

Page 19: Spark Summit EU talk by Jiri Simsa

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

11

Page 20: Spark Summit EU talk by Jiri Simsa

CONSOLIDATING MEMORY

12

Storage Engine & Execution Engine Same Process

• Two copies of data in memory – double the memory used • Inter-process Sharing Slowed Down by Network / Disk I/O

Spark Compute

Spark Storage

block 1

block 3

HDFS / Amazon S3block 1

block 3

block 2

block 4

Spark Compute

Spark Storage

block 1

block 3

Page 21: Spark Summit EU talk by Jiri Simsa

CONSOLIDATING MEMORY

13

Storage Engine & Execution Engine Different process

• Half the memory used • Inter-process Sharing Happens at Memory Speed

Spark Compute

Spark Storage

HDFS / Amazon S3block 1

block 3

block 2

block 4

HDFS disk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

Spark Compute

Spark Storage

Page 22: Spark Summit EU talk by Jiri Simsa

DATA RESILIENCE DURING CRASH

14

Spark Compute

Spark Storageblock 1

block 3

HDFS / Amazon S3block 1

block 3

block 2

block 4

Storage Engine & Execution Engine Same Process

Page 23: Spark Summit EU talk by Jiri Simsa

DATA RESILIENCE DURING CRASH

15

CRASH

Spark Storageblock 1

block 3

HDFS / Amazon S3block 1

block 3

block 2

block 4

• Process Crash Requires Network and/or Disk I/O to Re-read Data

Storage Engine & Execution Engine Same Process

Page 24: Spark Summit EU talk by Jiri Simsa

DATA RESILIENCE DURING CRASH

16

CRASH

HDFS / Amazon S3block 1

block 3

block 2

block 4

Storage Engine & Execution Engine Same Process

• Process Crash Requires Network and/or Disk I/O to Re-read Data

Page 25: Spark Summit EU talk by Jiri Simsa

DATA RESILIENCE DURING CRASH

17

Spark Compute

Spark Storage

HDFS / Amazon S3block 1

block 3

block 2

block 4

HDFS disk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

Storage Engine & Execution Engine Different process

Page 26: Spark Summit EU talk by Jiri Simsa

DATA RESILIENCE DURING CRASH

18

• Process Crash – Data is Re-read at Memory Speed

HDFS / Amazon S3block 1

block 3

block 2

block 4

HDFS disk

block 1

block 3

block 2

block 4Alluxio

block 1

block 3 block 4

CRASH Storage Engine & Execution Engine Different process

Page 27: Spark Summit EU talk by Jiri Simsa

ACCESSING ALLUXIO DATA FROM SPARK

19

Writing Data Write to an Alluxio file

Reading Data Read from an Alluxio file

Page 28: Spark Summit EU talk by Jiri Simsa

CODE EXAMPLE FOR SPARK RDDS

20

Writing RDD

to Alluxio

rdd.saveAsTextFile(alluxioPath) rdd.saveAsObjectFile(alluxioPath)

Reading RDD

from Alluxio

rdd = sc.textFile(alluxioPath) rdd = sc.objectFile(alluxioPath)

Page 29: Spark Summit EU talk by Jiri Simsa

CODE EXAMPLE FOR SPARK DATAFRAMES

21

Writing DataFrame

to Alluxiodf.write.parquet(alluxioPath)

Reading DataFrame

from Alluxiodf = sc.read.parquet(alluxioPath)

Page 30: Spark Summit EU talk by Jiri Simsa

OUTLINE

• Alluxio Overview

• Alluxio + Spark Use Cases

• Using Alluxio with Spark

• Performance Evaluation

22

Page 31: Spark Summit EU talk by Jiri Simsa

ENVIRONMENT

Spark 2.0.0 + Alluxio 1.2.0

Single r3.2xlarge instance (61GB RAM)

Comparisons: • Alluxio • Spark Storage Level: MEMORY_ONLY • Spark Storage Level: MEMORY_ONLY_SER • Spark Storage Level: DISK_ONLY

23

Page 32: Spark Summit EU talk by Jiri Simsa

Reading Cached RDDTi

me

[sec

onds

]

0

75

150

225

300

RDD Size [GB]

0 13 25 38 50

Alluxio (textFile) Alluxio (objectFile) DISK_ONLYMEMORY_ONLY_SER MEMORY_ONLY

24

Page 33: Spark Summit EU talk by Jiri Simsa

25

New Context: Read 50 GB RDD (SSD)

No Alluxio

Alluxio (objectFile)

Alluxio (textFile)

Time [seconds]

0 55 110 165 220

2x speedup

4x speedup

Page 34: Spark Summit EU talk by Jiri Simsa

26

New Context: Read 50 GB RDD (S3)

No Alluxio

Alluxio (objectFile)

Alluxio (textFile)

Time [seconds]

0 200 400 600 800

7x speedup

16x speedup

Page 35: Spark Summit EU talk by Jiri Simsa

27

Reading CACHED DATAFRAME (parquet)Ti

me

[sec

onds

]

0

62.5

125

187.5

250

DataFrame Size [GB]

0 13 25 38 50

Alluxio (textFile) DISK_ONLY MEMORY_ONLY_SERMEMORY_ONLY

Page 36: Spark Summit EU talk by Jiri Simsa

28

New Context: Read 50 GB DATAFRAME (SSD)

No Alluxio

Alluxio

Time [seconds]

0 60 120 180 240

2.5x speedup

Page 37: Spark Summit EU talk by Jiri Simsa

29

New CONTEXT: Read 50 GB DataFrame (S3)

No Alluxio

Alluxio

Time [seconds]

0 250 500 750 1000 1250 1500 1750

10x speedup (avg), 17x speedup (peak)

Page 38: Spark Summit EU talk by Jiri Simsa

CONCLUSION

30

• Easy to use with Spark

• Better Performance

• Predictable Performance

Page 39: Spark Summit EU talk by Jiri Simsa

Contact: [email protected]

Twitter: @jsimsa

Websites: www.alluxio.com and www.alluxio.org

Thank you!

31