UTILIZING ACCELERATORS TO SPEEDUP ETL, ML, AND DL APPLICATIONS
Jason Lowe and Robert Evans, 05/19/2020
2
Accelerated ETL
Accelerated SQL/Dataframe
Accelerated Shuffle
What's Next
AGENDA
3
ACCELERATED ETL?
https://www.piqsels.com/en/public-domain-photo-zrkia
Can a GPU make an elephant fast?
4
YES!
TPCx-BB Like Benchmark Results (10TB Dataset, Two-Node DGX-2 Cluster)*
Environment: Two DGX-2 (96 CPU Cores, 1.5TB Host memory, 16 V100 GPUs, 512 GB GPU Memory)
* Not official or complete TPCx-BB runs (ETL power only).
Query Time: GPU vs CPU (Mins)
            Query #5   Query #16   Query #21   Query #22
CPU (mins)    25.95       6.16        7.13        3.80
GPU (mins)     1.31       1.16        0.56        0.14
5
MODERN ML/DL WORKFLOW
[Pipeline: Data Sources → Ingest → Load → Transform → Data Store → Model Training; ingest, load, and transform run on CPU compute, training on GPU compute]
6
APACHE SPARK 2.X
DISTRIBUTED, SCALE-OUT DATA SCIENCE AND AI APPLICATIONS
ACCELERATED ML/DL FRAMEWORKS: XGBoost, TensorFlow, PyTorch, Horovod
APACHE SPARK COMPONENTS: Spark SQL/DF, GraphX, Streaming, MLlib
SPARK 2.x CORE
CLUSTER MANAGEMENT/DEPLOYMENT (YARN, K8S, Standalone)
CPU Infrastructure
7
SPARK 3.X IS A UNIFIED AI PLATFORM
END-TO-END APACHE SPARK 3.0 PIPELINE
DISTRIBUTED, SCALE-OUT DATA SCIENCE AND AI APPLICATIONS
ACCELERATED ML/DL FRAMEWORKS: XGBoost, TensorFlow, PyTorch, Horovod
APACHE SPARK COMPONENTS: Spark SQL/DF, GraphX, Streaming, MLlib
SPARK 3.0 CORE with the RAPIDS Accelerator for Apache Spark
CLUSTER MANAGEMENT/DEPLOYMENT (YARN, K8S, Standalone)
GPU-Accelerated Infrastructure
8
ETL + ML/DL WORKFLOW
[Pipeline: Data Sources → Ingest → Load → Transform → Data Store → Model Training, end to end on GPU compute]
9
DEEP LEARNING RECOMMENDATION MACHINES
Example use case: the Criteo dataset
Anonymized 7-day clickstream dataset (1 TB)
Convert high-cardinality string categorical data to contiguous integer IDs
The DLRM GitHub repo has scripts for this out of the box
https://medium.com/analytics-vidhya/deep-learning-recommendation-machines-dlrm-4fec2a5e7ef8
10
DLRM ON CRITEO DATASET (PAST)
* Extrapolated; we couldn't convince anyone to wait that long
ETL & Training Run Time for CPU & GPU, Criteo Dataset (1TB)
ETL (1-core CPU)*          144.0 hours
Spark ETL (96-core CPU)     12.1 hours
Training (96-core CPU)      45.0 hours
Training (1 V100)            0.7 hours
11
DLRM ETL ON CRITEO DATASET (PRESENT)
Spark ETL for Criteo Dataset (1TB)
Spark ETL (96-core CPU)     12.1 hours
Spark ETL (1 V100)           2.3 hours
Spark ETL (8 V100)           0.5 hours
12
DLRM END-TO-END ON CRITEO DATASET (PRESENT)
Spark ETL + Training for Criteo Dataset (1TB)
Configuration                                                 ETL (hrs)  Training (hrs)
Original CPU (1 core for ETL, 96-core CPU for training)         144.0        45.0
Spark CPU (96 cores for ETL & training)                          12.1        45.0
Spark CPU (96 cores for ETL) & Spark GPU (1-V100 training)       12.1         0.7
Spark GPU (8 V100 for ETL & 1 V100 for training)                  0.5         0.7
160x faster than original, 48x faster than CPU (4% the cost)
10x faster than typical (1/6th the cost)
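The speedup figures quoted for the end-to-end pipeline follow directly from the per-phase run times. A quick arithmetic check (taking "typical" to mean the CPU-ETL-plus-1-V100-training configuration, which is our reading of the slide):

```python
# Sanity-check the quoted speedups from the run-time table (hours).
original = 144.0 + 45.0   # 1-core ETL + 96-core CPU training
all_cpu  = 12.1 + 45.0    # 96-core Spark ETL + 96-core CPU training
typical  = 12.1 + 0.7     # 96-core Spark ETL + 1-V100 training
all_gpu  = 0.5 + 0.7      # 8-V100 Spark ETL + 1-V100 training

print(round(original / all_gpu))  # 158, quoted as "160x faster than original"
print(round(all_cpu / all_gpu))   # 48, "48x faster than CPU"
print(round(typical / all_gpu))   # 11, "10x faster than typical"
```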
13
“The more you buy, the more you save.”
— Jensen Huang, GTC 2018
14
RAPIDS ACCELERATOR FOR APACHE SPARK (PLUGIN)
DISTRIBUTED SCALE-OUT SPARK APPLICATIONS
Spark SQL API | DataFrame API | Spark Shuffle
APACHE SPARK CORE
RAPIDS Accelerator for Spark:
  if gpu_enabled(operation, data_type)
    call out to RAPIDS
  else
    execute standard Spark operation
  ● Custom implementation of Spark Shuffle
  ● Optimized to use RDMA and GPU-to-GPU direct communication
JNI bindings: mapping from Java/Scala to C++
UCX Libraries | RAPIDS C++ Libraries
CUDA
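The per-operation fallback described above can be sketched in plain Python. The support table and function names here are hypothetical illustrations, not the plugin's real API:

```python
# Hypothetical sketch of the plugin's decision per physical-plan operator:
# swap in a GPU version only when both the operation and its data type are
# supported; otherwise keep the stock Spark (CPU) operator.
GPU_SUPPORTED = {
    ("sum", "double"), ("max", "double"), ("filter", "double"),
    ("sum", "int"), ("max", "int"), ("filter", "int"),
}

def gpu_enabled(operation, data_type):
    return (operation, data_type) in GPU_SUPPORTED

def plan_operator(operation, data_type):
    if gpu_enabled(operation, data_type):
        return f"Gpu{operation.capitalize()}"   # call out to RAPIDS
    return f"Cpu{operation.capitalize()}"       # standard Spark operation

print(plan_operator("sum", "double"))  # GpuSum
print(plan_operator("sum", "string"))  # CpuSum (unsupported type falls back)
```

Because the decision is made operator by operator, a single query plan can mix GPU and CPU operators, which is also why the plugin must insert conversions at the columnar/row boundaries between them.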
15
SQL/DATAFRAME PLUGIN
16
No Code Changes
Same SQL and DataFrame code.
17
WHAT WE SUPPORT
!
%
&
*
+
-
/
<
<=
<=>
=
==
>
>=
^
abs
acos
and
asin
atan
avg
bigint
boolean
cast
cbrt
ceil
ceiling
coalesce
concat
cos
cosh
cot
count
cube
current_date
current_timestamp
date
datediff
day
dayofmonth
degrees
double
e
exp
expm1
first
first_value
float
floor
from_unixtime
hour
if
ifnull
in
initcap
input_file_block_length
input_file_block_start
input_file_name
int
isnan
isnotnull
isnull
last
last_value
lcase
like
ln
locate
log
log10
log1p
log2
lower
max
mean
min
minute
mod
monotonically_increasing_id
month
nanvl
negative
not
now
nullif
nvl
nvl2
or
pi
posexplode*
position
pow
power
radians
rand*
regexp_replace*
replace
rint
rollup
row_number
second
shiftleft
shiftright
shiftrightunsigned
sign
signum
sin
sinh
smallint
spark_partition_id
sqrt
string
substr
substring
sum
tan
tanh
timestamp
tinyint
trim
ucase
upper
when
window
year
|
~
CSV Reading*
Orc Reading
Orc Writing
Parquet Reading
Parquet Writing
ANSI casts
TimeSub for time ranges
startswith
endswith
contains
limit
order by
group by
filter
union
repartition
equi-joins
select
and growing…
18
IS THIS A SILVER BULLET? NO
Small amounts of data
  A few hundred MB per partition is needed for the GPU
Highly cache-coherent processing
Data movement
  Slow I/O (networking, disks, etc.)
  Going back and forth to the CPU (UDFs)
  Shuffle
Limited GPU memory
[Chart: bandwidth in MB/s, log scale: 160; 550; 1,250; 3,500; 12,288; 24,576; 25,600; 46,080; 307,200; 1,048,576]
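The few-hundred-MB-per-partition guideline can be applied by sizing the partition count from the input data. This is a sketch; the 256 MB target is an assumption for illustration, not a plugin setting:

```python
import math

def target_partitions(total_bytes, target_partition_bytes=256 * 1024**2):
    """Partition count that keeps each partition near the target size.

    GPUs amortize kernel-launch and transfer overhead best over large
    batches, so fewer, bigger partitions beat many tiny ones.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# A 1 TB input at ~256 MB per partition:
print(target_partitions(1024**4))  # 4096
```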
19
BUT IT CAN BE AMAZING
What the SQL plugin excels at:
High-cardinality joins
High-cardinality aggregates
High-cardinality sorts
Window operations (especially on large windows)
Complicated processing
Transcoding (writing Parquet and ORC is hard; reading CSV is hard)
20
HOW DOES IT WORK
21
SPARK SQL & DATAFRAME COMPILATION FLOW
DataFrame
Logical Plan
Physical Plan
RDD[InternalRow]
bar.groupBy(col("product_id"), col("ds"))
   .agg((max(col("price")) - min(col("price"))).alias("range"))

SELECT product_id, ds, max(price) - min(price) AS range FROM bar GROUP BY product_id, ds
QUERY
CPU PHYSICAL PLAN
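The example query computes, per (product_id, ds) group, the spread between the highest and lowest price. Its semantics can be sketched without Spark in plain Python (the sample rows are made up):

```python
from collections import defaultdict

# Made-up sample of the `bar` table: (product_id, ds, price)
rows = [
    (1, "2020-05-01", 10.0),
    (1, "2020-05-01", 14.5),
    (2, "2020-05-01", 3.0),
    (1, "2020-05-02", 7.0),
]

# GROUP BY product_id, ds; aggregate max(price) - min(price) AS range
prices = defaultdict(list)
for product_id, ds, price in rows:
    prices[(product_id, ds)].append(price)

ranges = {key: max(ps) - min(ps) for key, ps in prices.items()}
print(ranges[(1, "2020-05-01")])  # 4.5
print(ranges[(2, "2020-05-01")])  # 0.0
```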
22
SPARK SQL & DATAFRAME COMPILATION FLOW
DataFrame
Logical Plan
Physical Plan
RDD[InternalRow]
bar.groupBy(col("product_id"), col("ds"))
   .agg((max(col("price")) - min(col("price"))).alias("range"))

SELECT product_id, ds, max(price) - min(price) AS range FROM bar GROUP BY product_id, ds
QUERY
GPU PHYSICAL PLAN
RDD[ColumnarBatch]
RAPIDS SQL Plugin
23
SPARK SQL & DATAFRAME COMPILATION FLOW
CPU Physical Plan vs GPU Physical Plan
CPU physical plan:
Read Parquet File → First Stage Aggregate → Shuffle Exchange → Second Stage Aggregate → Write Parquet File

GPU physical plan:
Read Parquet File → First Stage Aggregate → Convert to Row Format → Shuffle Exchange → Combine Shuffle Data → Second Stage Aggregate → Convert to Row Format → Write Parquet File
24
ETL TECHNOLOGY STACK
Python side: Dask cuDF → cuDF, Pandas → Python → Cython
Spark side: Spark DataFrames, Scala, PySpark → Java → JNI bindings
Common base: cuDF C++ → CUDA Libraries → CUDA
25
ACCELERATED SHUFFLE
26
SPARK SHUFFLE: Data Exchange Between Stages
Stage 1: Task 0, Task 1, Task 2
Stage 2: Task 0, Task 1
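The all-to-all exchange between stages can be sketched in plain Python: each Stage 1 (map) task hash-partitions its records by key, and each Stage 2 (reduce) task reads exactly one bucket from every map task's output. This is a toy model of the mechanism, not Spark's implementation:

```python
# Minimal sketch of a hash-partitioned shuffle.
NUM_REDUCERS = 2

def map_side_partition(records):
    """Split one map task's (key, value) records into per-reducer buckets."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for key, value in records:
        buckets[hash(key) % NUM_REDUCERS].append((key, value))
    return buckets

# Two map tasks produce partitioned output...
map_outputs = [
    map_side_partition([("a", 1), ("b", 2)]),
    map_side_partition([("a", 3), ("c", 4)]),
]

# ...and reducer r gathers bucket r from every map task.
reducer_inputs = [
    [rec for out in map_outputs for rec in out[r]] for r in range(NUM_REDUCERS)
]

# Every record lands at exactly one reducer, and all records sharing a key
# land at the same reducer.
assert sum(len(r) for r in reducer_inputs) == 4
```

This many-to-many data movement is exactly the phase the accelerated shuffle targets: on the CPU path every one of those transfers is staged through host memory.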
27
SPARK SHUFFLE: CPU-Centric Data Movement
[Diagram: GPU 0 and GPU 1 move shuffle data over the PCI-e bus through the CPU to local storage and the network]
28
ACCELERATED SPARK SHUFFLE: GPU-Centric Data Movement
[Diagram: GPU 0 and GPU 1 exchange data directly over NVLink; GPUDirect Storage reaches local storage and RDMA reaches the network over the PCI-e bus, bypassing the CPU]
29
ACCELERATED SPARK SHUFFLE: Shuffling Spilled Data
[Diagram: data spilled from GPU 0 and GPU 1 into host memory (and local storage) is transferred over RDMA via the network]
30
UCX LIBRARY
Unified Communication X
Abstracts communication transports
Selects best available route(s) between endpoints
TCP, RDMA, Shared Memory, GPU
Zero-copy GPU memory transfers over RDMA
RDMA requires network support (IB or RoCE)
http://openucx.org
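The "selects best available route" behavior can be illustrated with a toy priority table. The transport names and ordering below are illustrative only; real UCX selection also weighs message size, memory locality, and device topology:

```python
# Toy model of transport selection: prefer the fastest transport that both
# endpoints support, falling back toward TCP.
TRANSPORT_PRIORITY = ["nvlink", "rdma", "shared_memory", "tcp"]

def select_transport(local_caps, remote_caps):
    common = set(local_caps) & set(remote_caps)
    for transport in TRANSPORT_PRIORITY:
        if transport in common:
            return transport
    raise RuntimeError("no common transport between endpoints")

print(select_transport({"rdma", "tcp"}, {"rdma", "tcp", "nvlink"}))  # rdma
print(select_transport({"tcp"}, {"rdma", "tcp"}))                    # tcp
```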
31
ACCELERATED SHUFFLE RESULTS: Inventory Pricing Query
Query duration in seconds:
CPU        228
GPU         45
GPU+UCX      8.4
32
ACCELERATED SHUFFLE RESULTS: ETL for Logistic Regression Model
Query duration in seconds:
CPU       1556
GPU        172
GPU+UCX     79
33
WHAT’S NEXT?
34
WHAT’S NEXT
COMING SOON
Open Source/Spark 3.0 Release
Nested types: Arrays, Structs, and Maps
Decimal type
More operators
FURTHER OUT
GPUDirect Storage
Time zone support for timestamps (only UTC for now)
Higher-order functions
UDFs
35
WHERE TO GET MORE INFO
Learn more about the RAPIDS Accelerator for Apache Spark
Visit: NVIDIA.com/Spark
Please use the “contact us” form to get in touch with NVIDIA’s Spark team
Listen to how Adobe Email Marketing Intelligent Services leverages the RAPIDS Accelerator & Spark 3.0 on Databricks
Upcoming Spark+AI Summit Sessions on GPU support for Apache Spark 3.0:
Deep Dive into GPU Support in Apache Spark 3.x
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Preview of Spark 3.0 GPU Features: NVIDIA.com/Spark-Book
QUESTIONS
37
BACKUP SLIDES
38
FAQS
Q: What are the minimum requirements?
A: The RAPIDS accelerator requires:
Apache Spark 3.0
RAPIDS cuDF 0.14
CUDA 10.1 or later
NVIDIA GPU with Pascal architecture or later
Ubuntu 16.04+ or CentOS 7+
39
FAQS
Q: Do all cluster nodes require GPUs?
A: All Spark executors running with the RAPIDS accelerator require their own GPU.
The Spark driver process does not require a node with a GPU.
Q: Can I run more than one executor per GPU?
A: No, there must be a one-to-one mapping between Spark executors and GPUs.
You can run more than one concurrent task per executor.
40
FAQS
Q: Will the RAPIDS accelerator work in the cloud?
A: Yes, if the VM environment meets the minimum requirements.
Q: Will the RAPIDS accelerator be available for Apache Spark 2.x?
A: No. The columnar processing APIs added in Apache Spark 3.0 are required.
Q: How can I tell if an operation is being accelerated?
A: Accelerated operations appear in the query explanation and SQL UI.
41
RAPIDS ACCELERATOR CONFIGURATION
spark.rapids.sql.enabled is the master enable switch
spark.rapids.sql.explain enables logging of operations that are not accelerated
spark.rapids.sql.concurrentGpuTasks controls the concurrent task count per GPU
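A hedged example of setting these on the command line. The values are illustrative for a small cluster, not recommendations, and the plugin class name assumes the standard RAPIDS Accelerator deployment:

```shell
# Illustrative spark-shell invocation enabling the RAPIDS Accelerator
# and the three configs above.
./bin/spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  --conf spark.rapids.sql.concurrentGpuTasks=2
```

With explain set as above, the driver log reports each operation that stayed on the CPU and why, which is the quickest way to find unsupported expressions in a query.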
42
SPARK ACCELERATOR-AWARE SCHEDULING
Tracking JIRA: SPARK-24615
Request executor and driver resources (GPU, FPGA, etc.)
Resource discovery
Specify task resources
API to determine assigned resources
YARN, Kubernetes, and Standalone
43
SPARK ACCELERATOR-AWARE SCHEDULING
Sample command line:
./bin/spark-shell --master yarn --executor-cores 2 \
  --conf spark.driver.resource.gpu.amount=1 \
  --conf spark.driver.resource.gpu.discoveryScript=/opt/spark/getGpuResources.sh \
  --conf spark.executor.resource.gpu.amount=2 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpuResources.sh \
  --conf spark.task.resource.gpu.amount=1 \
  --files examples/src/main/scripts/getGpusResources.sh
44
SPARK STAGE LEVEL SCHEDULING
Tracking JIRA: SPARK-27495
Specify task resource requirements per RDD operation
Dynamically allocates containers to meet resource requirements
Schedules tasks on appropriate containers