Deep Learning: Convergence of HPC and Hyperscale

Feb 19, 2017

Transcript
Page 1: Deep Learning: Convergence of HPC and Hyperscale

Deep Learning: Convergence of HPC and Hyperscale

Sumit Sanyal

CEO

Page 2: Deep Learning: Convergence of HPC and Hyperscale

Technology Revolution: Deep Neural Nets

•  A new paradigm in programming
   •  DNNs remarkably effective at tackling many problems
   •  Designing new NN architectures as opposed to "programming"
   •  Training as opposed to "compiling"
   •  Trained weights as the "new" binaries (see the sketch after this list)
•  Big Neural Nets required to process Big Data
   •  Videos, images, speech and text
   •  DNNs are significantly increasing recognition accuracies, which had stagnated for decades
   •  Used to structure Big Data
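The "training instead of compiling" idea can be made concrete with a small sketch. This is an assumed toy example (NumPy only), not code from the deck: a tiny two-layer network is trained with SGD, and the learned weights are then saved as the deployable artifact – the "new binaries" that a recognition service would later load and run with fixed weights.

```python
# Minimal sketch (assumed example, not from the deck): "training instead of
# compiling". A tiny two-layer network is fit to toy data with SGD, and the
# trained weights -- the "new binaries" -- are saved as the artifact a
# recognition service would later load and run with fixed weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                 # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy labels

W1, b1 = rng.normal(size=(8, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)

def forward(X):
    h = np.maximum(X @ W1 + b1, 0.0)          # linear algebra + ReLU
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output
    return h, p

lr = 0.5
for step in range(500):                       # "training" replaces "compiling"
    h, p = forward(X)
    grad_out = (p - y) / len(X)               # gradient of the cross-entropy loss
    gW2, gb2 = h.T @ grad_out, grad_out.sum(0)
    grad_h = (grad_out @ W2.T) * (h > 0)
    gW1, gb1 = X.T @ grad_h, grad_h.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

np.savez("model_weights.npz", W1=W1, b1=b1, W2=W2, b2=b2)  # the deployable artifact
```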

Page 3: Deep Learning: Convergence of HPC and Hyperscale

Neural network development

[Flowchart] Make a guess at the NN architecture, then train on a labeled dataset and wait 1-7+ days. Does it perform well on the training data? If no, design a bigger network and iterate. If yes, run on the test data: does it perform well? If no, get more training data and iterate. If yes: DONE! (A skeleton of this loop is sketched below.)

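A hedged skeleton of the development loop above. The model, the synthetic dataset, and the 0.95 accuracy target are hypothetical stand-ins (a small scikit-learn MLP), not the deck's actual workflow:

```python
# Hedged sketch of the iterative loop in the flowchart above; every component
# here is a stand-in, only the structure of the loop is the point.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def make_data(n, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 10))
    y = (X[:, :5].sum(axis=1) > 0).astype(int)    # easy synthetic labels
    return X, y

X, y = make_data(2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

width, target = 4, 0.95
for attempt in range(8):                          # bounded, for the sketch
    model = MLPClassifier(hidden_layer_sizes=(width,), max_iter=500,
                          random_state=0).fit(X_tr, y_tr)   # "wait 1-7+ days" at real scale
    if model.score(X_tr, y_tr) < target:          # weak on the training data?
        width *= 2                                # design a bigger network, iterate
        continue
    if model.score(X_te, y_te) < target:          # weak on the held-out test data?
        X_new, y_new = make_data(2000, seed=attempt + 1)   # stand-in for more labels
        X_tr, y_tr = np.vstack([X_tr, X_new]), np.hstack([y_tr, y_new])
        continue
    print(f"DONE after {attempt + 1} iteration(s); hidden width = {width}")
    break
```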

Page 4: Deep Learning: Convergence of HPC and Hyperscale

Compute vs. IO

•  Deep Neural Nets require a lot of compute cycles!
   •  Ratio of (compute cycles : IO bandwidth) is significantly higher than for non-AI algorithms
•  An example – image classification:
   •  Training AlexNet (for image classification) requires ~27,000 FLOPs/input data byte
   •  Training VGG requires ~150,000 FLOPs/data byte (see the back-of-envelope sketch after this list)
•  R³/R² → Volume (compute) / Surface (IO BW)
   •  Significantly higher for Deep Nets
•  Power dissipation challenges
   •  Compute density limited by DC cooling capacity
   •  At ~1 µW/MHz (current state of the art in 28 nm), this works out to 300 Watts!

AI is no longer bored :)
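A back-of-envelope reading of the FLOPs-per-byte figures above; the 10 TFLOP/s sustained accelerator throughput is an assumed round number, not a figure from the deck:

```python
# Back-of-envelope sketch: how much input bandwidth is needed to keep an
# accelerator busy, given the FLOPs-per-input-byte figures quoted above.
# The 10 TFLOP/s sustained throughput is an assumed round number.
sustained_flops = 10e12            # assumed accelerator throughput, FLOP/s

for name, flops_per_byte in [("AlexNet", 27_000), ("VGG", 150_000)]:
    bytes_per_sec = sustained_flops / flops_per_byte
    print(f"{name}: ~{bytes_per_sec / 1e6:.0f} MB/s of input data saturates "
          f"{sustained_flops / 1e12:.0f} TFLOP/s")
# AlexNet: ~370 MB/s, VGG: ~67 MB/s -- tiny IO relative to the compute,
# i.e. the (compute : IO) ratio is far higher than for typical non-AI workloads.
```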

Page 5: Deep Learning: Convergence of HPC and Hyperscale

Neural Net Computations

•  All Deep Neural Net implementations have the following properties:
   •  Small set of non-linear transforms
   •  Small set of linear algebra primitives (illustrated in the sketch below)
   •  Relatively modest dynamic range of weight/data values
   •  Very regular/repetitive data flows
   •  Only persistent memory requirement is for weights
      •  Updated while learning, fixed for recognition
•  Variance in the size of the net across applications is >10⁵

Compute cycles will be commoditized, not computers!
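The "small set of primitives" point can be illustrated with a minimal sketch (an assumed toy example, not from the deck): every dense layer in the forward pass is one GEMM followed by one of a handful of pointwise non-linear transforms, and the layer weights are the only persistent state.

```python
# Illustrative sketch: the entire forward pass of a deep net reduces to one
# linear-algebra primitive (GEMM) plus a small set of pointwise non-linear
# transforms, applied in a very regular, repetitive flow.
import numpy as np

# The small set of non-linear transforms
relu    = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, layers):
    """layers: list of (W, b, nonlinearity); the weights are the only persistent state."""
    for W, b, f in layers:
        x = f(x @ W + b)          # GEMM + pointwise non-linearity, repeated
    return x

rng = np.random.default_rng(0)
dims = [784, 256, 128, 10]        # layer widths for a toy classifier
layers = [(rng.normal(scale=0.05, size=(m, n)), np.zeros(n),
           relu if n != dims[-1] else softmax)
          for m, n in zip(dims[:-1], dims[1:])]

probs = forward(rng.normal(size=(32, 784)), layers)   # batch of 32 "images"
print(probs.shape, probs.sum(axis=1)[:3])             # (32, 10), rows sum to 1
```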

Page 6: Deep Learning: Convergence of HPC and Hyperscale

Why minds.ai?

minds.ai was founded on the basis that Deep Learning is the next frontier for HPC – it’s what we do

“…around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field.” – Andrew Ng, Feb 2016

HPC Design Expertise + Neural Network Design Expertise + Image and Video Recognition Expertise + GPU Programming Expertise

Page 7: Deep Learning: Convergence of HPC and Hyperscale

7 levels of parallelism

•  Instruction level – SIMD, VLIW, etc.
•  Thread level – warps
•  Processor level – many cores
•  Server level – many GPUs in a server
•  Cluster level – many servers with high-BW interconnect
•  Data Center level
•  Planet level :)

Page 8: Deep Learning: Convergence of HPC and Hyperscale

ASICs for DNNs

•  Exponential growth in Data Centers
•  Commoditization of Enterprise silicon
   •  Traditional mobile players announcing ASICs for enterprise compute
•  Higher demands for compute density
   •  GPGPUs have won the first round
   •  Dennardian scaling is breaking down
•  Power dissipation will emerge as a major challenge
   •  Chip level, server level and DC level

Page 9: Deep Learning: Convergence of HPC and Hyperscale

Age of dark silicon

Transistor property             | Dennardian | Post-Dennardian
# of transistors (Q)            | S²         | S²
Peak clock frequency (F)        | S          | S
Capacitance (C)                 | 1/S        | 1/S
Supply voltage² (Vdd²)          | 1/S²       | 1
Dynamic power (Q·F·C·Vdd²)      | 1          | S²
Active silicon                  | 1          | 1/S²

* S is the ratio of feature size between next-generation processes
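A quick numeric reading of the table above; S ≈ 1.4 per process generation is an assumed typical value (roughly a full node shrink), not a number from the deck:

```python
# Sketch of what the table above implies numerically. S is the linear
# feature-size ratio between successive process nodes; S = 1.4 is an assumed
# typical value, not a figure taken from the deck.
S = 1.4
transistors = S**2          # grows the same way in both regimes
frequency   = S
capacitance = 1 / S

vdd2_dennard = 1 / S**2     # Dennardian: supply voltage keeps scaling down
vdd2_post    = 1.0          # Post-Dennardian: Vdd is effectively stuck

power_dennard = transistors * frequency * capacitance * vdd2_dennard  # -> 1.0
power_post    = transistors * frequency * capacitance * vdd2_post     # -> S**2

print(f"Dennardian dynamic power per generation:      x{power_dennard:.2f}")
print(f"Post-Dennardian dynamic power per generation: x{power_post:.2f}")
print(f"Active silicon at a fixed power budget:       x{1/power_post:.2f} (dark silicon)")
```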

Page 10: Deep Learning: Convergence of HPC and Hyperscale

Heterogeneity in Enterprise silicon

•  Dark silicon will drive heterogeneity
   •  Multi-core architectures with different instruction sets
   •  Power-aware scheduling across cores
   •  A decreasing fraction of the chip can run at full clock frequency
•  Specialized silicon for server blades
   •  Bridges to intra-server and inter-server communications
   •  Last-level caching support, caching across MPI
   •  Distributed compute in network interfaces

Page 11: Deep Learning: Convergence of HPC and Hyperscale

7 levels of parallelism

•  Instruction level – SIMD, VLIW, etc.
•  Thread level – warps
•  Processor level – many cores
•  Server level – many GPUs in a server
•  Cluster level – many servers with high-BW interconnect
•  Data Center level
•  Planet level :)

Page 12: Deep Learning: Convergence of HPC and Hyperscale

Heterogeneity in Hyperscale

•  Deep Learning is driving the convergence of High Performance Computing (HPC) and Hyperscale (Data Centers)
•  Traditional HPC ecosystems: expensive and bleeding edge
•  DC infrastructure: commodity and homogeneous
   •  Single or dual CPU servers common
•  All of this is changing
   •  GPGPUs now common in DCs, despite initial resistance
   •  InfiniBand penetration has reached a tipping point
   •  Dense compute clusters require high-bandwidth interconnects

Page 13: Deep Learning: Convergence of HPC and Hyperscale

Server Architectures

•  Intra-server vs. inter-server bandwidths
   •  Inter-server bandwidths will grow faster than intra-server
   •  Leads to larger, denser servers
•  8 or more GPGPUs per server for DNN training jobs (a server-level data-parallel sketch follows this list)
•  Will co-exist with CPU-based servers for search and database operations
•  Many kinds of servers; one size fits all does not work
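A minimal sketch of the server-level data parallelism implied by 8 GPUs per server. This is an assumed illustration, with NumPy standing in for the per-device compute: the mini-batch is sharded across the local GPUs and their gradients are averaged.

```python
# Sketch of server-level data parallelism (assumed illustration, not the
# deck's design): a mini-batch is sharded across 8 GPUs in one server, each
# "GPU" computes the gradient on its shard, and the gradients are averaged.
import numpy as np

rng = np.random.default_rng(0)
N_GPUS, BATCH, DIM = 8, 512, 1024
W = rng.normal(scale=0.01, size=(DIM, 1))            # replicated on every GPU
X = rng.normal(size=(BATCH, DIM))
y = rng.normal(size=(BATCH, 1))

def shard_gradient(Xs, ys, W):
    """Gradient of mean squared error on one shard (one GPU's share)."""
    err = Xs @ W - ys
    return Xs.T @ err / len(Xs)

# Each of the 8 GPUs handles BATCH // N_GPUS examples.
grads = [shard_gradient(Xs, ys, W)
         for Xs, ys in zip(np.array_split(X, N_GPUS), np.array_split(y, N_GPUS))]

avg_grad = np.mean(grads, axis=0)     # all-reduce over the intra-server fabric
W -= 0.1 * avg_grad                   # identical update applied on every replica
print("update norm:", float(np.linalg.norm(avg_grad)))
```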

Page 14: Deep Learning: Convergence of HPC and Hyperscale

Training Server Reference Design

[Block diagram: 2× CPU, 8× GPU, 2× IB NIC, 2× SSD, Ethernet, PCIe Root Hub]

Page 15: Deep Learning: Convergence of HPC and Hyperscale

GPU Training Cluster Architecture

[Diagram: up to 7 Server Nodes connected via an IB Switch and an Ethernet Switch, plus a RAID & Access node]

•  Scalable up to 7 Server Nodes
•  8 GPUs per Node
•  ~2.5 kW per Server Node
•  100 Gbps IB interconnect (a cluster-level all-reduce sketch follows this list)
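A hedged sketch of cluster-level data parallelism across server nodes over the IB fabric, using mpi4py and NumPy as stand-ins for the training framework (which the deck does not specify): each rank computes a local gradient and an MPI all-reduce averages it across nodes.

```python
# Hedged sketch of cluster-level gradient averaging over the IB interconnect.
# Launch with e.g. `mpirun -np 7 python allreduce_sketch.py` (one rank per
# node, or one per GPU); the gradient here is random data standing in for a
# real model's gradients.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Pretend each rank computed a gradient for a 10M-parameter model on its own
# shard of the data (data parallelism: same model, different data per node).
local_grad = np.random.default_rng(rank).normal(size=10_000_000).astype(np.float32)

avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)   # crosses the 100 Gbps IB fabric
avg_grad /= size                                   # average over all nodes

if rank == 0:
    print(f"{size} ranks averaged a {local_grad.nbytes / 1e6:.0f} MB gradient")
```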

Page 16: Deep Learning: Convergence of HPC and Hyperscale

7 levels of parallelism

•  Instruction level – SIMD, VLIW, etc.
•  Thread level – warps
•  Processor level – many cores
•  Server level – many GPUs in a server
•  Cluster level – many servers with high-BW interconnect
•  Data Center level
•  Planet level :)

Page 17: Deep Learning: Convergence of HPC and Hyperscale

Data Center Architectures

•  Computing super-clusters
   •  10⁵ variability in the size of compute "jobs"
   •  Large number of collocated servers running the same job
   •  High-BW, low-latency interconnect
   •  Clusters could grow to a significant fraction of a Data Center
•  Architecture of clusters will be hierarchical and heterogeneous
   •  Edge servers for security and data management
   •  Dedicated RAID servers
   •  Dedicated compute servers
   •  Control and management nodes
•  Multi-DC training clusters for big models are technically feasible
   •  The scientific community has performed transcontinental simulations

Page 18: Deep Learning: Convergence of HPC and Hyperscale

Thank you.

Page 19: Deep Learning: Convergence of HPC and Hyperscale

Contacts

•  Sumit Sanyal, Founder and CEO – [email protected]
•  Steve Kuo, Co-Founder – [email protected]


Page 20: Deep Learning: Convergence of HPC and Hyperscale

Accelerated Deep Learning