Dell EMC Technical White Paper
Deep Learning Inference on PowerEdge R7425
Abstract
This whitepaper looks at the performance and efficiency of deep learning
inference when using the Dell EMC PowerEdge R7425 server with the NVIDIA
T4-16GB GPU. The objective is to show how the PowerEdge R7425 can be
used as a scale-up inferencing server to run production-level deep learning
inference workloads.
Mar-19
Revisions
Date Description
Mar-19 Initial release
Acknowledgements
This paper was produced by the following persons:
Authors:
Bhavesh Patel, Dell EMC Server Advanced Engineering
Vilmara Sanchez, Dell EMC Server Advanced Engineering
Contributor: Josh Anderson, Dell EMC Server System Engineering
❖ Nvidia TensorRT™ support team & Nvidia account team for their expedited support.
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the
information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
1 Overview
This paper discusses deep learning inference using the NVIDIA T4-16GB GPU and TensorRT™. The
NVIDIA T4-16GB GPU is based on NVIDIA's latest Turing architecture, which significantly boosts
graphics performance with a new streaming multiprocessor that improves shader execution
efficiency and a new memory system architecture that supports GDDR6 memory technology.
Turing's Tensor Cores provide higher throughput and lower latency for AI inference applications.
The Dell EMC PowerEdge R7425 is based on AMD's EPYC™ architecture; because EPYC™ supports
a high number of PCIe Gen3 x16 lanes, the server can be used as a scale-up inference server.
This makes it a strong fit for large production AI workloads where both throughput and latency
are important.
In this paper we tested the inference optimization tool NVIDIA TensorRT™ 5 on the Dell EMC
PowerEdge R7425 server to accelerate CNN image classification applications and to demonstrate
its capability to provide higher throughput and lower latency for neural network models such as
ResNet50. During the tests, we ran image classification inference in different precision modes on
the R7425 using the NVIDIA T4-16GB GPU, with both implementations of TensorRT™: the native
TensorRT™ C++ API and the TensorFlow-TensorRT™ (TF-TRT) integration library. TensorFlow was
used as the primary framework for the pre-trained models, and optimized performance was
compared in terms of throughput (images/sec) and latency (milliseconds).
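To illustrate the TF-TRT path, below is a minimal sketch using the TensorFlow 1.x contrib integration that shipped in the TensorRT™ 5 timeframe; the frozen-graph file and node names are illustrative placeholders, not the exact ones used in our tests.

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt

# Load a frozen (pre-trained) TensorFlow graph; file name is a placeholder.
with tf.gfile.GFile('resnet50_frozen.pb', 'rb') as f:
    frozen_graph_def = tf.GraphDef()
    frozen_graph_def.ParseFromString(f.read())

# Rewrite the graph: TensorRT-compatible subgraphs are replaced with
# optimized TensorRT engine nodes, the rest stays in TensorFlow.
trt_graph_def = trt.create_inference_graph(
    input_graph_def=frozen_graph_def,
    outputs=['logits'],                   # assumed output node name
    max_batch_size=32,
    max_workspace_size_bytes=1 << 30,     # 1 GB GPU workspace
    precision_mode='FP16')                # 'FP32', 'FP16', or 'INT8'

# Run inference on the optimized graph like any other TensorFlow graph.
with tf.Graph().as_default(), tf.Session() as sess:
    tf.import_graph_def(trt_graph_def, name='')
    logits = sess.graph.get_tensor_by_name('logits:0')
    # preds = sess.run(logits, feed_dict={'input:0': batch_of_images})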
2 Introduction
2.1 Deep Learning
Deep learning consists of two phases: training and inference. As illustrated in Figure 1, training
involves learning a neural network model from a given training dataset over a certain number of
iterations, guided by a loss function. The output of this phase, the learned model, is then used in
the inference phase to make predictions on new data [1].
The major difference between training and inference is that training employs both forward
propagation and backward propagation, whereas inference consists mostly of forward
propagation. To generate models with good accuracy, the training phase involves many
iterations and substantial volumes of training data, thus requiring many-core CPUs or GPUs to
accelerate performance.
Figure 1. Deep Learning Phases
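To make the difference between the two phases concrete, the toy sketch below (NumPy, with made-up shapes and data) contrasts a single training step, forward pass plus backward update, with an inference call, which is a forward pass only.

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2)) * 0.1      # model parameters

def forward(x):
    return 1.0 / (1.0 + np.exp(-x @ W))    # sigmoid output layer

# Training step: forward pass, then backward pass to update the weights.
x = rng.standard_normal((8, 4))            # toy training batch
y = rng.integers(0, 2, (8, 2))             # toy labels
pred = forward(x)
grad = x.T @ (pred - y) / len(x)           # cross-entropy gradient w.r.t. W
W -= 0.1 * grad                            # gradient-descent update

# Inference: forward propagation only, on new data.
new_x = rng.standard_normal((1, 4))
print(forward(new_x))                      # prediction from the learned model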
2.2 Deep Learning Inferencing
After a model is trained, the generated model may be deployed (forward propagation only), e.g.,
on FPGAs, CPUs, or GPUs, to perform a specific business-logic function or task such as
identification, classification, recognition, and segmentation. See Figure 2.
The focus of this whitepaper is the ability of the Dell EMC PowerEdge R7425 with NVIDIA T4-16GB GPUs to accelerate image classification and deliver high-performance inference throughput and low latency using various implementations of TensorRT™, an excellent tool for speeding up inference.
Figure 2. Inference Workflow
2.3 What is TensorRT™?
The core of TensorRT™ is a C++ library that facilitates high performance inference on NVIDIA
graphics processing units (GPUs). It is designed to work in a complementary fashion with training
frameworks such as TensorFlow, Caffe, PyTorch, MXNet, etc. It focuses specifically on running an
already trained network quickly and efficiently on a GPU for generating a result (a process that is
referred to in various places as scoring, detecting, regression, or inference).
Some training frameworks such as TensorFlow have integrated TensorRT™ so that it can be used
to accelerate inference within the framework. Alternatively, TensorRT™ can be used as a library
within a user application. It includes parsers for importing existing models from Caffe, ONNX, or
TensorFlow, and C++ and Python APIs for building models programmatically.
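As a minimal sketch of the library path, the snippet below builds an engine with the TensorRT™ 5 Python API and its ONNX parser; the model file name is an illustrative placeholder.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, \
        builder.create_network() as network, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
    # Import the trained model into a TensorRT network definition.
    with open('resnet50.onnx', 'rb') as f:
        if not parser.parse(f.read()):
            raise RuntimeError('Failed to parse the ONNX model')
    builder.max_batch_size = 32
    builder.max_workspace_size = 1 << 30   # scratch space for kernel selection
    engine = builder.build_cuda_engine(network)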
Figure 3. TensorRT™ Scheme: a high-performance neural network inference optimizer and runtime engine for production deployment. Source: Nvidia
TensorRT™ optimizes the network by combining layers and optimizing kernel selection to
improve latency, throughput, power efficiency, and memory consumption. If the application
specifies it, TensorRT™ will additionally optimize the network to run in lower precision, further
increasing performance and reducing memory requirements.
As Figure 3 shows, TensorRT™ is part high-performance inference optimizer and part runtime
engine. It takes in neural networks trained on popular frameworks, optimizes the network
computation, and generates a lightweight runtime engine, which is the only thing you need to
deploy to your production environment; that engine then maximizes throughput and minimizes
latency on the GPU platform.
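Continuing the sketch above (same builder, network, and logger), reduced precision is requested with the TensorRT™ 5 builder flags, and the resulting engine is serialized so that only the plan file needs to ship to production; file names are illustrative.

# Request reduced precision at build time.
builder.fp16_mode = True                   # allow FP16 kernels where faster
engine = builder.build_cuda_engine(network)

# Serialize: the plan file is the only artifact needed for deployment.
with open('resnet50_fp16.engine', 'wb') as f:
    f.write(engine.serialize())

# In production: deserialize the engine and run inference with it.
with open('resnet50_fp16.engine', 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())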
6.1 Percentage of GPU Utilization vs GPU Memory Utilization
The calibration process consists of computing the optimal scaling factors to convert the model's
weights from FP32 to INT8; this is a lengthy process that can take around an hour. In Figure 25
we can see the percentage of GPU utilization (green) and memory utilization (gray) during the
first hour, corresponding to the calibration process; once the inference graph is calibrated and
optimized, it is used to run inference. In this test, we wanted to show that the calibration
process was conducted only once, on GPU 0; the optimized inference graph was then saved to
the calibration cache and reused for inference on the remaining GPUs.
Figure 25. GPU Utilization versus Memory Utilization
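The cache-reuse behavior described above follows the standard TensorRT™ INT8 calibrator pattern; below is a minimal sketch of such a calibrator, assuming pycuda is available, with illustrative batch shapes and file names.

import os
import numpy as np
import pycuda.autoinit      # noqa: F401  (initializes a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class Int8Calibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative batches to TensorRT and caches the resulting
    INT8 scale factors so that calibration only has to run once."""

    def __init__(self, batches, cache_file='calibration.cache'):
        super().__init__()
        self.batch_size = batches[0].shape[0]
        self.device_input = cuda.mem_alloc(batches[0].nbytes)
        self.batches = iter(batches)       # list of NumPy arrays
        self.cache_file = cache_file

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                    # no more data: calibration is done
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        # Reusing this cache is what lets the other GPUs skip the
        # roughly one-hour calibration step.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)

# Build-time usage with the TensorRT 5 builder:
#   builder.int8_mode = True
#   builder.int8_calibrator = Int8Calibrator(calibration_batches)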
6.2 Core GPU Temperature
Figure 26. NVIDIA T4 GPU Temperature
In Figure 26, we can see the GPU temperature during the calibration process on GPU 0 and how
it increases gradually when inference is running. There were sleep periods before initiating
inference on each subsequent GPU, and the temperature can be seen decreasing during those
periods. The power draw follows the same pattern.
6.3 Power Draw
Figure 27. NVIDIA T4 GPU Power Consumption
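The utilization, temperature, and power curves in Figures 25 through 27 can be reproduced with NVIDIA's standard monitoring interfaces; the sketch below samples the same metrics programmatically, assuming the pynvml NVML bindings are installed.

import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

for _ in range(10):                        # sample once per second
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h)   # .gpu / .memory, in %
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0   # mW to W
        print('GPU%d: util=%d%% mem=%d%% temp=%dC power=%.1fW'
              % (i, util.gpu, util.memory, temp, power))
    time.sleep(1)

pynvml.nvmlShutdown()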
7 Conclusion and Future Work
1. TensorRT™ is an excellent tool to speed up inference, and the results obtained in this project demonstrated the ability of the Dell EMC PowerEdge R7425 with the T4-16GB GPU to accelerate image classification and deliver high-performance inference throughput and low latency in production-level environments.
2. Comparing TensorRT™ with native TensorFlow (GPU enabled), the TensorRT™ C++ API in INT8 mode speeds up ResNet50 inference by ~11X over TensorFlow FP32 inference on the GPU.
3. When comparing the different TensorRT™ implementations, optimized inference with the
TensorRT™ C++ API delivers around 2.7X the throughput of the TF-TRT integration, at lower
latency. Because of this better performance, the native TensorRT™ C++ API implementation
is highly suitable for production environments where both throughput and latency must be
considered.
4. The Dell EMC PowerEdge R7425 with the NVIDIA T4-16GB GPU performed ~1.8X faster
than with the NVIDIA P4-8GB GPU.
5. After the optimized inference model is generated, it can be deployed into the production
environment. For deploying models in production we are exploring the TensorRT Inference
Server (TRTIS), which can run multiple models (and/or multiple instances of the same
model) on multiple GPUs.
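As an illustrative preview of that deployment path, TRTIS serves a serialized TensorRT™ engine described by a small model configuration file; the model name, tensor names, and dimensions below are placeholders, not a tested configuration.

name: "resnet50_trt"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "prob"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  { count: 2, kind: KIND_GPU }   # two engine instances per GPU
]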