A Scalable Artificial Intelligence Data Pipeline for ... · Sanhita Sarkar, Ph.D., Global Director, Analytics Software Development · Track: Machine Learning · September 25, 2019
Transcript
Sanhita Sarkar, Ph.D., Global Director, Analytics Software Development
Aggregated vs. Disaggregated Architecture for AI

[Diagram: an aggregated architecture with GPU server(s) collocated with flash in front of object storage, compared with a disaggregated architecture in which a pool of GPUs, a shared pool of NVMe™ flash storage, and object storage are connected over the data network, serving data preparation, model training, and model serving against a centralized data repository.]

Aggregated architecture (GPU server(s) collocated with flash, object storage behind):
• Model training is limited to the flash storage capacity integrated in the GPU servers, which incurs repeated data transfers from the object storage once the data grows beyond the capacity of the servers.
• Incurs delays in model serving and inference.

Disaggregated architecture (pool of GPUs, shared pool of NVMe™ flash storage, object storage):
• Model training can scale independently on a disaggregated pool of GPUs, shared flash, and object storage, with no subsequent data transfers.
• Inference by the model-serving client is faster due to immediate access to trained models on the shared flash storage.
Image Training Performance: Disaggregated Flash and GPUs
• On a disaggregated architecture comprising an NVMe™ All-Flash Array and a single 8-GPU server, training performance with most AI models scales almost linearly up to 4 GPUs and reaches ~7x with 8 GPUs; the exceptions are AlexNet and LeNet, whose training performance saturates at 1 GPU.
• On a disaggregated architecture comprising an NVMe™ All-Flash Array and multiple GPU servers, training performance scales linearly with the number of servers, irrespective of the choice of AI model.
I/O Throughput: Image Training on Disaggregated Flash and GPUs
• On a disaggregated architecture comprising a single 8-GPU server and an NVMe™ All-Flash Array:
  • The average I/O throughput during training with the ResNet-50 model (compute intensive) is ~653 MB/s, with GPU utilization at ~100% (the image dataset is 164 GB, each image being ~100 KB).
  • The average I/O throughput during training with the LeNet model (I/O intensive) is ~2.2 GB/s, with GPU utilization at ~20%. The LeNet model therefore requires ~3.4x the I/O throughput of ResNet-50. (One way such metrics might be sampled is sketched after the charts below.)
[Charts: I/O throughput (MB/s) and GPU usage over time during image training with ResNet-50 (8 GPUs) and with LeNet (8 GPUs).]
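The slides do not show how these throughput and utilization figures were collected. As one illustrative approach (not the presenter's tooling), the following Python sketch samples GPU utilization via nvidia-smi and block-device read throughput from /proc/diskstats; the device name and sampling interval are assumptions.

```python
import subprocess
import time

DEVICE = "nvme0n1"   # hypothetical block device backing the training data
INTERVAL_S = 5       # sampling interval (assumption)

def gpu_utilization():
    """Return per-GPU utilization (%) as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    return [int(v) for v in out.split()]

def sectors_read(device):
    """Return cumulative sectors read for a block device from /proc/diskstats."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[5])   # field 6: sectors read
    raise ValueError(f"device {device} not found")

prev = sectors_read(DEVICE)
while True:
    time.sleep(INTERVAL_S)
    cur = sectors_read(DEVICE)
    read_mb_s = (cur - prev) * 512 / INTERVAL_S / 1e6   # sectors are 512 bytes
    prev = cur
    print(f"read throughput ~{read_mb_s:.0f} MB/s, GPU util {gpu_utilization()} %")
```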
Image Inference Performance on Disaggregated Flash and GPUs
• Inference throughput is measured as the aggregated images/sec inference results with ImageNet datasets across multiple GPU containers. (A toy aggregation sketch appears after the charts below.)
• On a disaggregated architecture comprising an NVMe™ All-Flash Array and a single 8-GPU server, results show that the inference image-processing rates are ~3.3x to ~4x the training rates of the corresponding TensorFlow models.
• On a disaggregated architecture comprising an NVMe™ All-Flash Array and multiple GPU servers, users have the flexibility to run mixed AI workloads for training and inference by dedicating one or two GPUs out of every 8 to inference, with the rest allocated to training.
[Charts: image inference throughput (images/sec, higher is better) on an 8-GPU server with 1, 2, 4, and 8 GPUs, and inference throughput by model (Inception-V4, ResNet-152, VGG-16, Inception-V3, ResNet-50) for 8, 16, 32, and 64 GPUs.]
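The aggregation across containers is straightforward. As a purely hypothetical illustration (the file layout and field names below are assumptions, not from the slides), per-container images/sec results could be summed like this:

```python
import glob
import json

# Hypothetical layout: each GPU container writes its result as JSON, e.g.
#   /results/container-3.json  ->  {"model": "ResNet-50", "images_per_sec": 1480.2}
total = 0.0
for path in glob.glob("/results/container-*.json"):
    with open(path) as f:
        total += json.load(f)["images_per_sec"]

print(f"aggregated inference throughput: {total:.0f} images/sec")
```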
Example Configurations: GPU servers and an NVMe™ All-Flash Array
• An example allocation strategy for NVMe All-Flash Arrays when executing AI workloads: 30% of the I/O bandwidth is reserved for model training, and the remaining 70% is left for phases such as data preparation, inference, and other activities.
• With this allocation strategy, example configurations are derived from the I/O throughput required on a disaggregated architecture comprising a single NVMe All-Flash Array and an 8-GPU server, using the ResNet-50 and LeNet models for training (a back-of-the-envelope sizing sketch follows below):
  • A single NVMe All-Flash Array can scale up to twelve 8-GPU servers running the ResNet-50 model for the training phase, with 100% utilization of the GPUs.
  • With the LeNet model, a single NVMe All-Flash Array can optimally scale up to three 8-GPU servers for the training phase.
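As a sanity check on these configurations, the array bandwidth implied by them can be derived from the per-server training throughputs reported earlier (~653 MB/s for ResNet-50, ~2.2 GB/s for LeNet) and the 30% training allocation. The sketch below reuses only numbers from the slides; the function name and rounding are illustrative.

```python
def required_array_bw_gbps(n_servers, per_server_gbps, training_share=0.30):
    """Total array read bandwidth (GB/s) needed so that `n_servers` 8-GPU servers
    can train while consuming only `training_share` of the array's bandwidth."""
    return n_servers * per_server_gbps / training_share

# Per-server throughputs taken from the earlier I/O throughput slide.
resnet50_gbps = 0.653   # ~653 MB/s per 8-GPU server (ResNet-50)
lenet_gbps    = 2.2     # ~2.2 GB/s per 8-GPU server (LeNet)

print(required_array_bw_gbps(12, resnet50_gbps))  # ~26 GB/s for twelve ResNet-50 servers
print(required_array_bw_gbps(3, lenet_gbps))      # ~22 GB/s for three LeNet servers
```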
Data Ingestion Performance with Kafka to an NVMe™ All-Flash Array
• Each connector (known as a sink connector) is dedicated to one Kafka topic, i.e., the number of connectors equals the number of topics/files.
• Write throughput increases linearly from 8 to 16 connectors, and by 1.5x from 16 to 32 connectors.
• With 4 Kafka Connect worker nodes and 32 sink connectors pointing to a single NVMe All-Flash Array, the average write throughput achieved is 3.2 GB/s.
• This test helps determine the number of connectors to configure in the Kafka Connect cluster, based on the number of flash arrays, the input ingestion rates, and the available I/O throughput from the flash arrays.
• 1 JVM is used for each worker node in the Kafka Connect cluster, with a JVM heap size of 64 GB. (A sketch of registering such a sink connector follows below.)
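The slides do not name the specific sink connector used to write to the flash array. As a minimal sketch, Kafka's bundled FileStreamSinkConnector can stand in for a connector that writes one topic to one file on the array; the Connect REST endpoint, topic names, and mount path below are assumptions.

```python
import json
import requests

CONNECT_URL = "http://connect-worker-1:8083"   # hypothetical Connect REST endpoint

def register_sink(topic, out_path):
    """Register one sink connector per topic, writing that topic to a file
    on the NVMe All-Flash Array (mounted here at a hypothetical /mnt/flash)."""
    config = {
        "name": f"flash-sink-{topic}",
        "config": {
            # Stand-in connector class; the talk's actual sink is not specified.
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            "tasks.max": "1",
            "topics": topic,
            "file": out_path,
        },
    }
    resp = requests.post(f"{CONNECT_URL}/connectors",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(config))
    resp.raise_for_status()

# 32 topics -> 32 sink connectors, spread across the Connect worker nodes.
for i in range(32):
    register_sink(f"ingest-{i}", f"/mnt/flash/ingest-{i}.data")
```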
Performance: Object Notifications with Respect to Object Puts
• The benchmark comprises 3 clients working as load generators, performing a total of 2.4M 'put' operations in an object storage using 300, 600, and 1,200 connections. The average notification rates for these object puts are measured simultaneously, with a batch size of 1,000, followed by measurements of the Kafka-Elasticsearch Connector throughput and latency.
• The average rate of object notifications lags slightly behind the average rate of puts, by approximately 572 and 730 notifications/sec with 600 and 1,200 connections respectively, while there is no lag with 300 connections.
• The Kafka-Elasticsearch Connector consumes messages at the same average rate of 1.7 MB/s to 2.1 MB/s at which the object notification service (as a Kafka producer) sends them to the Kafka topic, with no visible latency in the Kafka cluster for acknowledging the notifications or in the Kafka-Elasticsearch Connector as a consumer.
• Optimal throughput with no latency is achieved with a minimal Kafka configuration: 2 nodes in the Kafka cluster, 1-2 Connect worker nodes with a total of 4 connectors, and 4-8 partitions of the Kafka topic. CPU usage is 10-15% in both the Kafka cluster and the Connect worker node(s), with a Kafka JVM heap size of 32 GB. (A sketch of setting up this notification-to-Elasticsearch path follows below.)
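A minimal sketch of this setup, assuming hypothetical host names and topic name: the notification topic is created with 8 partitions (the upper end of the 4-8 range above), and a Kafka-Elasticsearch sink connector is registered through the Connect REST API. The connector class shown is the Confluent Elasticsearch sink; the exact connector used in the benchmark is not specified in the slides.

```python
import json
import requests
from kafka.admin import KafkaAdminClient, NewTopic   # kafka-python

BOOTSTRAP = "kafka-1:9092"                     # hypothetical broker endpoint
CONNECT_URL = "http://connect-worker-1:8083"   # hypothetical Connect REST endpoint

# 1. Create the notification topic with 8 partitions; replication factor 2
#    assumes the 2-node Kafka cluster described above.
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
admin.create_topics([NewTopic(name="object-notifications",
                              num_partitions=8, replication_factor=2)])

# 2. Register a Kafka-Elasticsearch sink connector for the topic.
connector = {
    "name": "object-notifications-es-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "tasks.max": "4",          # parallelism; the slide's setup used 4 connectors in total
        "topics": "object-notifications",
        "connection.url": "http://es-node-1:9200",
        "key.ignore": "true",
    },
}
requests.post(f"{CONNECT_URL}/connectors",
              headers={"Content-Type": "application/json"},
              data=json.dumps(connector)).raise_for_status()
```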
Elasticsearch Implementation for Metadata Indexing and Search: Up to 3 Billion Object Notification Messages

[Diagram: clients reach a 4-node Elasticsearch cluster through a load balancer. Each ES instance holds 2 primary (leader) shards and 2 replica shards, along with its indexed data files and transaction log files; e.g., ES instance 1 hosts the shard 1 and shard 2 leaders plus the shard 7 and shard 8 replicas, while node 4 hosts the shard 7 and shard 8 leaders plus the shard 1 and shard 2 replicas.]
1. Dividing an index into multiple shards speeds up index creation, but slows down query response times due to the overhead of aggregating results from all the shards.
2. Optimal shard configurations for indexing and querying 3 billion object notification messages of 800 bytes each were determined:
  • 2 primary shards (and 2 replicas) per Elasticsearch instance;
  • As the number of messages grows, a maximum of 375M messages (~280 GB) per shard provides optimal performance for both querying and indexing. (A sketch of creating an index with these settings follows below.)
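With 4 Elasticsearch instances and 2 primary shards per instance, the index would be created with 8 primary shards and 1 replica per shard. Below is a minimal sketch against the Elasticsearch REST API; the index name and the mapping fields for an ~800-byte notification message are assumptions, not taken from the slides.

```python
import requests

ES_URL = "http://es-node-1:9200"   # hypothetical cluster endpoint

index_body = {
    "settings": {
        "number_of_shards": 8,     # 2 primary shards per instance x 4 instances
        "number_of_replicas": 1,   # each primary has one replica on another node
    },
    "mappings": {
        # Hypothetical fields for an ~800-byte object notification message.
        "properties": {
            "bucket":     {"type": "keyword"},
            "object_key": {"type": "keyword"},
            "event_type": {"type": "keyword"},
            "size_bytes": {"type": "long"},
            "event_time": {"type": "date"},
        }
    },
}

resp = requests.put(f"{ES_URL}/object-notifications", json=index_body)
resp.raise_for_status()
```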
Elasticsearch Performance: Index Creation Times for Up to 1 Billion Object Notification Messages
• This benchmark measures the index creation times in Elasticsearch as the number of object notification messages grows to 1 billion. Tests were performed with 4 nodes and 8 shards in the Elasticsearch cluster, in batches of 25M puts in an object storage with a 2-hour interval between batches. The average rate of puts in the object storage and the average rate of notification messages during this benchmark were 1,100 puts/sec and 1,092 messages/sec, respectively.
• Results show that the index creation rate of Elasticsearch matches the object notification message rate.
[Chart: index creation time (hours) vs. number of object notification messages in the Elasticsearch index, from 100M to 1,000M, at 1,100 avg. puts/sec and 1,092 avg. object notification messages/sec.]
Elasticsearch Performance: Query Response Times and Throughput, Considering Object Notification Messages Growing to 3 Billion over Time
• This benchmark executes a suite of 1,100 mixed queries, comprising simple and complex queries, against the object notification messages. Messages were simulated up to 3 billion from the existing 1 billion messages, to measure query response times at 3 billion and to arrive at a reference implementation. (An illustrative simple/complex query pair is sketched after the chart below.)
• For the same number of nodes, the median response times of Elasticsearch for mixed queries increase with an increasing number of shards, with a corresponding reduction in throughput, due to the overhead of aggregating query results from all shards.
• For the same number of shards, the median response times decrease by ~20%-38% going from 2 to 4 nodes, with a corresponding increase in throughput of 14%-106%; the throughput difference is significant at 3 billion messages.
• Optimal query response times and throughput are achieved with 4 nodes and 8 shards in the Elasticsearch cluster.
[Chart: Elasticsearch median response times (ms) for mixed queries, comparing 2 nodes (8 shards), 4 nodes (8 shards), and 4 nodes (16 shards).]
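The slides do not show the query suite itself. As a hedged illustration against the hypothetical index and field names sketched earlier, a simple term lookup and a more complex filtered aggregation might look like this:

```python
import requests

ES_URL = "http://es-node-1:9200"
INDEX = "object-notifications"   # hypothetical index from the earlier sketch

# Simple query: look up notifications for one object key.
simple = {"query": {"term": {"object_key": "images/cat-000123.jpg"}}}

# Complex query: per-bucket object counts and total bytes for puts in a time range.
complex_q = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"event_type": "put"}},
                {"range": {"event_time": {"gte": "now-7d/d"}}},
            ]
        }
    },
    "aggs": {
        "per_bucket": {
            "terms": {"field": "bucket"},
            "aggs": {"total_bytes": {"sum": {"field": "size_bytes"}}},
        }
    },
}

for body in (simple, complex_q):
    resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=body)
    resp.raise_for_status()
    print(resp.json()["took"], "ms")   # Elasticsearch reports per-query time in "took"
```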
Summary and Best Practices
• Implementing a disaggregated architecture of GPU compute, a shared pool of NVMe™ All-Flash Arrays, and object storage system(s) has multiple benefits when executing AI workloads:
  • Subsequent data transfers in and out of the local SSDs of GPU servers can be avoided as the data grows beyond capacity.
  • Inference is faster due to immediate access to trained models on the shared flash storage.
  • Businesses can independently scale GPU servers and shared flash arrays to meet the changing needs of their AI workloads, with the flexibility to run mixed AI workloads for training and inference.
  • With a preferred allocation strategy for the I/O bandwidth, multiple teams can share and scale the flash arrays to serve multiple GPU servers in a cost-effective manner.
  • A high-capacity object storage system can serve both as a landing zone for ingested data and as an archival solution.
• As a best practice for attaining optimal ingestion performance with Kafka to NVMe All-Flash Arrays:
  • The number of connectors and worker nodes in the Kafka Connect cluster needs to be tuned based on the number of flash arrays, the input ingestion rates, and the available I/O throughput from the arrays.
  • Based on the I/O throughput requirement, high-speed network interfaces and topology need to be configured for the Kafka cluster, the worker nodes of the Kafka Connect cluster, and the flash array(s) to eliminate network bottlenecks.
• As a best practice for attaining optimal performance of object notifications and Elasticsearch:
  • A minimal Kafka configuration is sufficient: 2 nodes in the Kafka cluster, 1-2 Connect worker nodes with a total of 4 connectors, and 4-8 partitions of the Kafka topic.
  • The object notification message rate closely tracks the object put rate, and the index creation rate of Elasticsearch matches the object notification message rate.
  • Optimal indexing and search performance is achievable with 4 nodes and 8 shards in an Elasticsearch cluster for up to 3 billion object notification messages.