UTILIZING ACCELERATORS TO SPEEDUP ETL, ML, AND DL APPLICATIONS
Jason Lowe and Robert Evans, 05/19/2020
2
Accelerated ETL
Accelerated SQL/Dataframe
Accelerated Shuffle
What's Next
AGENDA
3
ACCELERATED ETL?
https://www.piqsels.com/en/public-domain-photo-zrkia
Can a GPU make an elephant fast?
4
YES!
TPCx-BB Like Benchmark Results (10TB Dataset, Two-Node DGX-2 Cluster)*
Environment: Two DGX-2 (96 CPU Cores, 1.5TB Host memory, 16 V100 GPUs, 512 GB GPU Memory)
* Not official or complete TPCx-BB runs (ETL power only).
Query Time: GPU vs CPU (Mins)
            Query #5   Query #16   Query #21   Query #22
CPU (mins)    25.95       6.16        7.13        3.80
GPU (mins)     1.31       1.16        0.56        0.14
5
MODERN ML/DL WORKFLOW
[Pipeline: Data Sources → Ingest → Load → Transform → Data Store → Model Training; ingest, load, and transform run on CPU compute, training on GPU compute]
6
APACHE SPARK 2.X
DISTRIBUTED, SCALE-OUT DATA SCIENCE AND AI APPLICATIONS
ACCELERATED ML/DL FRAMEWORKS: XGBoost, TensorFlow, PyTorch, Horovod
APACHE SPARK COMPONENTS: Spark SQL/DF, GraphX, Streaming, MLlib
SPARK 2.x CORE
CLUSTER MANAGEMENT/DEPLOYMENT (YARN, K8S, Standalone)
CPU Infrastructure
7
SPARK 3.X IS A UNIFIED AI PLATFORM
END-TO-END APACHE SPARK 3.0 PIPELINE
DISTRIBUTED, SCALE-OUT DATA SCIENCE AND AI APPLICATIONS
ACCELERATED ML/DL FRAMEWORKS: XGBoost, TensorFlow, PyTorch, Horovod
APACHE SPARK COMPONENTS: Spark SQL/DF, GraphX, Streaming, MLlib
SPARK 3.0 CORE with the RAPIDS Accelerator for Apache Spark
CLUSTER MANAGEMENT/DEPLOYMENT (YARN, K8S, Standalone)
GPU-Accelerated Infrastructure
8
ETL + ML/DL WORKFLOW
[Pipeline: Data Sources → Ingest → Load → Transform → Data Store → Model Training, end to end on GPU compute]
9
DEEP LEARNING RECOMMENDATION MACHINES
Example use case: the Criteo dataset
Anonymized 7-day clickstream dataset (1 TB)
Convert high-cardinality string categorical data to contiguous integer IDs
The DLRM GitHub repo has scripts for this out of the box
https://medium.com/analytics-vidhya/deep-learning-recommendation-machines-dlrm-4fec2a5e7ef8
10
DLRM ON CRITEO DATASET (PAST)
* Extrapolated; we couldn't convince anyone to wait that long
ETL & Training Run Time for CPU & GPU, Criteo Dataset (1TB)
ETL (1-core CPU)*          144.0 hours
Spark ETL (96-core CPU)     12.1 hours
Training (96-core CPU)      45.0 hours
Training (1 V100)            0.7 hours
11
DLRM ETL ON CRITEO DATASET (PRESENT)
Spark ETL for Criteo Dataset (1TB)
Spark ETL (96-core CPU)     12.1 hours
Spark ETL (1 V100)           2.3 hours
Spark ETL (8 V100)           0.5 hours
12
DLRM END-TO-END ON CRITEO DATASET (PRESENT)
Spark ETL + Training for Criteo Dataset (1TB)
Configuration                                                 ETL (hrs)  Training (hrs)
Original CPU (1 core for ETL, 96-core CPU for training)         144.0        45.0
Spark CPU (96 cores for ETL & training)                          12.1        45.0
Spark CPU (96 cores for ETL) & Spark GPU (1-V100 training)       12.1         0.7
Spark GPU (8 V100 for ETL & 1 V100 for training)                  0.5         0.7
160x faster than original, 48x faster than CPU (4% the cost)
10x faster than typical (1/6th the cost)
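The speedup figures quoted for the end-to-end pipeline follow directly from the per-phase run times. A quick arithmetic check (taking "typical" to mean the CPU-ETL-plus-1-V100-training configuration, which is our reading of the slide):

```python
# Sanity-check the quoted speedups from the run-time table (hours).
original = 144.0 + 45.0   # 1-core ETL + 96-core CPU training
all_cpu  = 12.1 + 45.0    # 96-core Spark ETL + 96-core CPU training
typical  = 12.1 + 0.7     # 96-core Spark ETL + 1-V100 training
all_gpu  = 0.5 + 0.7      # 8-V100 Spark ETL + 1-V100 training

print(round(original / all_gpu))  # 158, quoted as "160x faster than original"
print(round(all_cpu / all_gpu))   # 48, "48x faster than CPU"
print(round(typical / all_gpu))   # 11, "10x faster than typical"
```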
13
“The more you buy, the more you save.”
— Jensen Huang, GTC 2018
14
RAPIDS ACCELERATOR FOR APACHE SPARK (PLUGIN)
DISTRIBUTED SCALE-OUT SPARK APPLICATIONS
Spark SQL API | DataFrame API | Spark Shuffle
APACHE SPARK CORE
RAPIDS Accelerator for Spark:
  if gpu_enabled(operation, data_type)
    call out to RAPIDS
  else
    execute standard Spark operation
  ● Custom implementation of Spark Shuffle
  ● Optimized to use RDMA and GPU-to-GPU direct communication
JNI bindings: mapping from Java/Scala to C++
UCX Libraries | RAPIDS C++ Libraries
CUDA
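The per-operation fallback described above can be sketched in plain Python. The support table and function names here are hypothetical illustrations, not the plugin's real API:

```python
# Hypothetical sketch of the plugin's decision per physical-plan operator:
# swap in a GPU version only when both the operation and its data type are
# supported; otherwise keep the stock Spark (CPU) operator.
GPU_SUPPORTED = {
    ("sum", "double"), ("max", "double"), ("filter", "double"),
    ("sum", "int"), ("max", "int"), ("filter", "int"),
}

def gpu_enabled(operation, data_type):
    return (operation, data_type) in GPU_SUPPORTED

def plan_operator(operation, data_type):
    if gpu_enabled(operation, data_type):
        return f"Gpu{operation.capitalize()}"   # call out to RAPIDS
    return f"Cpu{operation.capitalize()}"       # standard Spark operation

print(plan_operator("sum", "double"))  # GpuSum
print(plan_operator("sum", "string"))  # CpuSum (unsupported type falls back)
```

Because the decision is made operator by operator, a single query plan can mix GPU and CPU operators, which is also why the plugin must insert conversions at the columnar/row boundaries between them.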
15
SQL/DATAFRAME PLUGIN
16
No Code Changes
Same SQL and DataFrame code.
17
WHAT WE SUPPORT
!
%
&
*
+
-
/
<
<=
<=>
=
==
>
>=
^
abs
acos
and
asin
atan
avg
bigint
boolean
cast
cbrt
ceil
ceiling
coalesce
concat
cos
cosh
cot
count
cube
current_date
current_timestamp
date
datediff
day
dayofmonth
degrees
double
e
exp
expm1
first
first_value
float
floor
from_unixtime
hour
if
ifnull
in
initcap
input_file_block_length
input_file_block_start
input_file_name
int
isnan
isnotnull
isnull
last
last_value
lcase
like
ln
locate
log
log10
log1p
log2
lower
max
mean
min
minute
mod
monotonically_increasing_id
month
nanvl
negative
not
now
nullif
nvl
nvl2
or
pi
posexplode*
position
pow
power
radians
rand*
regexp_replace*
replace
rint
rollup
row_number
second
shiftleft
shiftright
shiftrightunsigned
sign
signum
sin
sinh
smallint
spark_partition_id
sqrt
string
substr
substring
sum
tan
tanh
timestamp
tinyint
trim
ucase
upper
when
window
year
|
~
CSV Reading*
Orc Reading
Orc Writing
Parquet Reading
Parquet Writing
ANSI casts
TimeSub for time ranges
startswith
endswith
contains
limit
order by
group by
filter
union
repartition
equi-joins
select
and growing…
18
IS THIS A SILVER BULLET? NO
Small amounts of data
  A few hundred MB per partition is needed for the GPU
Highly cache-coherent processing
Data movement
  Slow I/O (networking, disks, etc.)
  Going back and forth to the CPU (UDFs)
  Shuffle
Limited GPU memory
[Chart: bandwidth in MB/s, log scale: 160; 550; 1,250; 3,500; 12,288; 24,576; 25,600; 46,080; 307,200; 1,048,576]
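The few-hundred-MB-per-partition guideline can be applied by sizing the partition count from the input data. This is a sketch; the 256 MB target is an assumption for illustration, not a plugin setting:

```python
import math

def target_partitions(total_bytes, target_partition_bytes=256 * 1024**2):
    """Partition count that keeps each partition near the target size.

    GPUs amortize kernel-launch and transfer overhead best over large
    batches, so fewer, bigger partitions beat many tiny ones.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# A 1 TB input at ~256 MB per partition:
print(target_partitions(1024**4))  # 4096
```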
19
BUT IT CAN BE AMAZING
What the SQL plugin excels at:
High-cardinality joins
High-cardinality aggregates
High-cardinality sorts
Window operations (especially on large windows)
Complicated processing
Transcoding (writing Parquet and ORC is hard; reading CSV is hard)
20
HOW DOES IT WORK
21
SPARK SQL & DATAFRAME COMPILATION FLOW
DataFrame
Logical Plan
Physical Plan
RDD[InternalRow]
bar.groupBy(col("product_id"), col("ds"))
   .agg((max(col("price")) - min(col("price"))).alias("range"))

SELECT product_id, ds, max(price) - min(price) AS range FROM bar GROUP BY product_id, ds
QUERY
CPU PHYSICAL PLAN
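The example query computes, per (product_id, ds) group, the spread between the highest and lowest price. Its semantics can be sketched without Spark in plain Python (the sample rows are made up):

```python
from collections import defaultdict

# Made-up sample of the `bar` table: (product_id, ds, price)
rows = [
    (1, "2020-05-01", 10.0),
    (1, "2020-05-01", 14.5),
    (2, "2020-05-01", 3.0),
    (1, "2020-05-02", 7.0),
]

# GROUP BY product_id, ds; aggregate max(price) - min(price) AS range
prices = defaultdict(list)
for product_id, ds, price in rows:
    prices[(product_id, ds)].append(price)

ranges = {key: max(ps) - min(ps) for key, ps in prices.items()}
print(ranges[(1, "2020-05-01")])  # 4.5
print(ranges[(2, "2020-05-01")])  # 0.0
```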
22
SPARK SQL & DATAFRAME COMPILATION FLOW
DataFrame
Logical Plan
Physical Plan
RDD[InternalRow]
bar.groupBy(col("product_id"), col("ds"))
   .agg((max(col("price")) - min(col("price"))).alias("range"))

SELECT product_id, ds, max(price) - min(price) AS range FROM bar GROUP BY product_id, ds
QUERY
GPU PHYSICAL PLAN
RDD[ColumnarBatch]
RAPIDS SQL Plugin
23
SPARK SQL & DATAFRAME COMPILATION FLOW
CPU Physical Plan vs GPU Physical Plan
CPU physical plan:
Read Parquet File → First Stage Aggregate → Shuffle Exchange → Second Stage Aggregate → Write Parquet File

GPU physical plan:
Read Parquet File → First Stage Aggregate → Convert to Row Format → Shuffle Exchange → Combine Shuffle Data → Second Stage Aggregate → Convert to Row Format → Write Parquet File
24
ETL TECHNOLOGY STACK
Python side: Dask cuDF → cuDF, Pandas → Python → Cython
Spark side: Spark DataFrames, Scala, PySpark → Java → JNI bindings
Common base: cuDF C++ → CUDA Libraries → CUDA
25
ACCELERATED SHUFFLE
26
SPARK SHUFFLE: Data Exchange Between Stages
Stage 1: Task 0, Task 1, Task 2
Stage 2: Task 0, Task 1
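The all-to-all exchange between stages can be sketched in plain Python: each Stage 1 (map) task hash-partitions its records by key, and each Stage 2 (reduce) task reads exactly one bucket from every map task's output. This is a toy model of the mechanism, not Spark's implementation:

```python
# Minimal sketch of a hash-partitioned shuffle.
NUM_REDUCERS = 2

def map_side_partition(records):
    """Split one map task's (key, value) records into per-reducer buckets."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for key, value in records:
        buckets[hash(key) % NUM_REDUCERS].append((key, value))
    return buckets

# Two map tasks produce partitioned output...
map_outputs = [
    map_side_partition([("a", 1), ("b", 2)]),
    map_side_partition([("a", 3), ("c", 4)]),
]

# ...and reducer r gathers bucket r from every map task.
reducer_inputs = [
    [rec for out in map_outputs for rec in out[r]] for r in range(NUM_REDUCERS)
]

# Every record lands at exactly one reducer, and all records sharing a key
# land at the same reducer.
assert sum(len(r) for r in reducer_inputs) == 4
```

This many-to-many data movement is exactly the phase the accelerated shuffle targets: on the CPU path every one of those transfers is staged through host memory.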
27
SPARK SHUFFLE: CPU-Centric Data Movement
[Diagram: GPU 0 and GPU 1 move shuffle data over the PCI-e bus through the CPU to local storage and the network]
28
ACCELERATED SPARK SHUFFLE: GPU-Centric Data Movement
[Diagram: GPU 0 and GPU 1 exchange data directly over NVLink; GPUDirect Storage reaches local storage and RDMA reaches the network over the PCI-e bus, bypassing the CPU]
29
ACCELERATED SPARK SHUFFLE: Shuffling Spilled Data
[Diagram: data spilled from GPU 0 and GPU 1 into host memory (and local storage) is transferred over RDMA via the network]
30
UCX LIBRARY
Unified Communication X
Abstracts communication transports
Selects best available route(s) between endpoints
TCP, RDMA, Shared Memory, GPU
Zero-copy GPU memory transfers over RDMA
RDMA requires network support (IB or RoCE)
http://openucx.org
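The "selects best available route" behavior can be illustrated with a toy priority table. The transport names and ordering below are illustrative only; real UCX selection also weighs message size, memory locality, and device topology:

```python
# Toy model of transport selection: prefer the fastest transport that both
# endpoints support, falling back toward TCP.
TRANSPORT_PRIORITY = ["nvlink", "rdma", "shared_memory", "tcp"]

def select_transport(local_caps, remote_caps):
    common = set(local_caps) & set(remote_caps)
    for transport in TRANSPORT_PRIORITY:
        if transport in common:
            return transport
    raise RuntimeError("no common transport between endpoints")

print(select_transport({"rdma", "tcp"}, {"rdma", "tcp", "nvlink"}))  # rdma
print(select_transport({"tcp"}, {"rdma", "tcp"}))                    # tcp
```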
31
ACCELERATED SHUFFLE RESULTS: Inventory Pricing Query
Query duration in seconds:
CPU        228
GPU         45
GPU+UCX      8.4
32
ACCELERATED SHUFFLE RESULTS: ETL for Logistic Regression Model
Query duration in seconds:
CPU       1556
GPU        172
GPU+UCX     79
33
WHAT’S NEXT?
34
WHAT’S NEXT
COMING SOON
Open Source/Spark 3.0 Release
Nested types: Arrays, Structs, and Maps
Decimal type
More operators
FURTHER OUT
GPUDirect Storage
Time zone support for timestamps (only UTC for now)
Higher-order functions
UDFs
35
WHERE TO GET MORE INFO
Learn more about the RAPIDS Accelerator for Apache Spark
Visit: NVIDIA.com/Spark
Please use the “contact us” form to get in touch with NVIDIA’s Spark team
Listen to how Adobe Email Marketing Intelligent Services leverages the RAPIDS Accelerator & Spark 3.0 on Databricks
Upcoming Spark+AI Summit Sessions on GPU support for Apache Spark 3.0:
Deep Dive into GPU Support in Apache Spark 3.x
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Preview of Spark 3.0 GPU Features: NVIDIA.com/Spark-Book
QUESTIONS
37
BACKUP SLIDES
38
FAQS
Q: What are the minimum requirements?
A: The RAPIDS accelerator requires:
Apache Spark 3.0
RAPIDS cuDF 0.14
CUDA 10.1 or later
NVIDIA GPU with Pascal architecture or later
Ubuntu 16.04+ or CentOS 7+
39
FAQS
Q: Do all cluster nodes require GPUs?
A: All Spark executors running with the RAPIDS accelerator require their own GPU.
The Spark driver process does not require a node with a GPU.
Q: Can I run more than one executor per GPU?
A: No, there must be a one-to-one mapping between Spark executors and GPUs.
You can run more than one concurrent task per executor.
40
FAQS
Q: Will the RAPIDS accelerator work in the cloud?
A: Yes, if the VM environment meets the minimum requirements.
Q: Will the RAPIDS accelerator be available for Apache Spark 2.x?
A: No. The columnar processing APIs added in Apache Spark 3.0 are required.
Q: How can I tell if an operation is being accelerated?
A: Accelerated operations appear in the query explanation and SQL UI.
41
RAPIDS ACCELERATOR CONFIGURATION
spark.rapids.sql.enabled is the master enable switch
spark.rapids.sql.explain enables logging of operations that are not accelerated
spark.rapids.sql.concurrentGpuTasks controls the concurrent task count per GPU
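A hedged example of setting these on the command line. The values are illustrative for a small cluster, not recommendations, and the plugin class name assumes the standard RAPIDS Accelerator deployment:

```shell
# Illustrative spark-shell invocation enabling the RAPIDS Accelerator
# and the three configs above.
./bin/spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  --conf spark.rapids.sql.concurrentGpuTasks=2
```

With explain set as above, the driver log reports each operation that stayed on the CPU and why, which is the quickest way to find unsupported expressions in a query.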
42
SPARK ACCELERATOR-AWARE SCHEDULING
Tracking JIRA: SPARK-24615
Request executor and driver resources (GPU, FPGA, etc.)
Resource discovery
Specify task resources
API to determine assigned resources
YARN, Kubernetes, and Standalone
43
SPARK ACCELERATOR-AWARE SCHEDULING
Sample command line:
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpuResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpuResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh
44
SPARK STAGE LEVEL SCHEDULING
Tracking JIRA: SPARK-27495
Specify task resource requirements per RDD operation
Dynamically allocates containers to meet resource requirements
Schedules tasks on appropriate containers