Page 1: S5429_LanceBrown

GTC 2015 – Session S5429

Creating Dense Mixed GPU and FPGA Systems With Tegra K1s Using OpenCL & CUDA

Lance Brown, Director - HPC

ColoradoEngineering.com

[email protected]

719-641-7287 Cell

27 March 2015 ColoradoEngineering.com - Public Release 1

Page 2: S5429_LanceBrown

We Can Solve Really Cool Problems Now

• Heterogeneous computing is more than CPU + GPU

• ARM processors changed the game

  • NVIDIA - GPU + ARM - CUDA

  • TI - DSP + ARM - OpenCL

  • Altera - FPGA + ARM - OpenCL

• Scalable from handheld to Enterprise & HPC


Page 3: S5429_LanceBrown

Why Listen to CEI?

• Been using FPGAs since 1985

• Been solving massively parallel problems for over 30 years

• We have designed, and are designing, multiple 24- and 32-layer boards featuring Altera FPGAs & NVIDIA GPUs

• Early adopters of new technologies and experts at marrying existing technologies in new ways


Page 4: S5429_LanceBrown

Game Changer #1

Altera’s Hard Floating Point Unit IP & OpenCL

• FPGAs have traditionally supported soft floating point

• Altera introduced IEEE 754 Hard Floating Point with Arria 10

• Arria 10 FPGAs are rated from 140 GigaFLOPS (GFLOPS) to 1.5 TeraFLOPS (TFLOPS)

• Details at: https://www.altera.com/en_US/pdfs/literature/po/bg-floating-point-fpga.pdf

• OpenCV & Suricata Implementations Using OpenCL

• Partial Reconfiguration for Streamlined OpenCL Development

• Next generation (Stratix 10) on Intel’s 14 nm Tri-Gate (FinFET) process; Arria 10 itself is built on TSMC 20 nm


Page 5: S5429_LanceBrown

Game Changer #2

NVIDIA Makes Tegra K1 Available

• GPU + ARM @ low power

• Very important – camera interfaces galore

• Can do significant processing at each edge node now

• Jetson Kit – awesome eval kit & affordable

• More importantly – chipset available through Arrow!

• Details at: https://developer.nvidia.com/hardware-design-and-development


Page 6: S5429_LanceBrown

CEI’s Epiphany – Ultimate CV Platform

Altera Arria 10 & NVIDIA Tegra K1?

Altera Arria 10 (1500 GFLOPS) + NVIDIA Tegra K1 (326 GFLOPS)
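As a rough sanity check on the two peak figures above, they can simply be added. The sketch below assumes the dual-TK1 configuration of the first board design (one Arria 10 plus two Tegra K1s) and treats the quoted peak ratings as additive. The result is a theoretical ceiling, not a measured throughput, and the function name is illustrative only.

```python
# Peak ratings quoted on the slide (vendor peak figures, not measurements).
ARRIA10_GFLOPS = 1500   # top-end Arria 10
TEGRA_K1_GFLOPS = 326   # one Tegra K1 GPU


def combined_peak_gflops(num_fpgas: int, num_tegras: int) -> int:
    """Sum of theoretical peaks; an upper bound, not a sustained rate."""
    return num_fpgas * ARRIA10_GFLOPS + num_tegras * TEGRA_K1_GFLOPS


if __name__ == "__main__":
    # Assumed configuration: one Arria 10 plus two Tegra K1 modules.
    print(combined_peak_gflops(1, 2))  # 2152
```

Real applications will land well below this sum, since the FPGA and the GPUs rarely run at peak simultaneously and data has to move between them.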

Page 7: S5429_LanceBrown

First Union – Dual TK1s + Arria 10

HPC-A10-K1GPU

[Block diagram: HPC-A10-K1GPU board. Labels recoverable from the figure:]

• HPC-A10 (Arria 10): x8 PCIe Gen3, GigE, 2/4 GB Micron HMC, 2 × QDR II+ (144 Mb, 1334 MT/s), 2 × QSFP+ (each 1 × 40 GbE or 4 × 10 GbE), USB Blaster, DisplayPort source and sink, 2 × USB 3.0, SMA, SMA CLK-IN, optional VITA 57 FMC HPC, K61 health monitoring

• PCIe switch: x4 PCIe Gen2 links, plus an extra x4 PCIe Gen2

• 2 × Tegra K1 System-On-Module (TK1-SOM): 16/32/64 GB eMMC, 2/4/8 Gbit DDR3, USB, GigE, HDMI

[Inset: TK1-SOM Tegra K1 System-On-Module, 2 inches × 2 inches, available stand-alone. 16/32/64 GB eMMC, 1/2/4 GB DDR3L, USB 2.0, GigE, HDMI, JTAG, UART; external power, x4 PCIe Gen2, clocks, I2C]

Page 8: S5429_LanceBrown

HPC-A10-K1GPU

Design Details

• NVIDIA GPUDirect Support

• TK1s are root nodes

• TK1s can be field upgraded

• 8 high-speed 10 GbE ports

• CUDA on TK1

• OpenCL on Arria 10

• 2 GB/s to each TK1

• HMC is 17X faster than DDR3

• 12 to 25 Camera/Sensor I/Os
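The "2 GB/s to each TK1" figure follows from the x4 PCIe Gen2 link shown in the block diagram. A minimal sketch of that arithmetic, using the standard per-lane rates and line encodings for PCIe Gen2 and Gen3; the function and table names are illustrative, and packet/protocol overhead is ignored:

```python
# Per-lane signalling rate (GT/s) and line-code efficiency per PCIe generation.
PCIE_GENERATIONS = {
    2: (5.0, 8 / 10),      # Gen2: 5 GT/s, 8b/10b encoding
    3: (8.0, 128 / 130),   # Gen3: 8 GT/s, 128b/130b encoding
}


def pcie_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """One-direction link bandwidth in GB/s, before protocol overhead."""
    gtps, efficiency = PCIE_GENERATIONS[gen]
    return gtps * efficiency * lanes / 8  # bits -> bytes


if __name__ == "__main__":
    print(f"x4 Gen2: {pcie_bandwidth_gbytes(2, 4):.2f} GB/s")  # 2.00
    print(f"x8 Gen3: {pcie_bandwidth_gbytes(3, 8):.2f} GB/s")  # 7.88
```

The second line corresponds to the board's x8 PCIe Gen3 host connection; both numbers are per direction.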


Page 9: S5429_LanceBrown

Single Node

• 1 to 21 Cameras/Sensors

• Makes dumb cameras smart

• 10/40 GbE Sensors

• OpenCL on FPGA

• CUDA on Tegra

[Figure: a single node with 21 inputs labeled C (cameras/sensors), two 4 × 10 GbE port groups, DisplayPort, two USB/GigE connections, and FMC]

Page 10: S5429_LanceBrown

Tesla K80s + HPC-A10-K1GPU

[Figure: the single-node board (inputs labeled C, two 4 × 10 GbE port groups, DisplayPort, two USB/GigE connections, FMC) connected to four Tesla K80s via GPUDirect]

Page 11: S5429_LanceBrown

Sensor Gateway

Smart Host Bus Adapter (HBA)

[Figure: a sensor cloud (radar, MRI, PET, camera, EW, etc.) feeding boards with 40 GbE ports and FMC, which connect to Tesla K80 clusters]

Page 12: S5429_LanceBrown

Programming FPGAs with OpenCL

• Easy to do now

• https://youtu.be/o5WtYiY5Hao

• Proficient in a day or two

• CAPI support too

• 95% to 99% as efficient as VHDL

Page 13: S5429_LanceBrown

EDGE Node Processing

• Process on the EDGE using GRID

• Distributed deep learning node

• Low cost

• 4G enabled

• Fusion of Radar, EO, IO and Sound

• Download apps from Google Play

• Feedback to Tesla K80s via GRID

• SmartCity Ready

• Military Level Device Security Built-in

[Figure: edge node block diagram. NVIDIA Tegra K1/X1 (computer vision, video compression) with four 5 MP cameras; 24 GHz radar system (motion detection, camera queuing) with four patch antennas; four directional mics; COMMS (alerts, streaming video, 4G LTE, WiFi, Bluetooth, USB); Altera Cyclone V (appliance security)]

Page 14: S5429_LanceBrown

Distributed Aperture System

Distributed Sensors

• Large vehicle/Military ADAS

• SA360 systems

• Retrofit casino camera systems

• Make any sensor system smart

• Tegra K1/X1s are scalable

• Mixture of CUDA & OpenCL

[Figure: distributed aperture system. An Altera Arria 10 SoC (x2 ARM, OpenCL) with 4/8 GB HMC and QDR-II+ or QDR-IV connects over x4 Gen2 PCIe (2 GB/s each) to nine NVIDIA Tegra X1 nodes. Each Tegra X1 (x4 ARM, CUDA/Linux, OpenCV, H.264/H.265) has 64 GB eMMC, 8 GB DDR4, USB3 or GigE, and HDMI. Also shown: removable SATA storage, 40/10 GbE ports, and a main display GPU.]

Page 15: S5429_LanceBrown

Challenges

Hardware, Interconnects & Software

• FPGA + GPU

  • CUDA, OpenCL or CUDA + OpenCL

  • Working with MDA & AFRL on solutions

• Bandwidth

  • Tegra K1/X1 are x4 Gen2 PCIe, which limits the number and resolution of sensors attached to the Tegra

  • More processing has to be done on the Tegra, but that is okay since Tegras keep increasing in power every year

  • Gen3 PCIe would be awesome

  • PCIe backplane: using 40 GbE ports eliminates the PCIe bottleneck

• Root Nodes

  • Tegra wants to be the root complex, so non-transparent switches need to be used

  • If Tegra could be an endpoint, a whole new world would open up
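The bandwidth limit can be made concrete with a back-of-the-envelope budget. The sketch below assumes uncompressed 5 MP frames (the camera size shown on the edge-node slide), 2 bytes per pixel, and 30 frames per second. The pixel depth and frame rate are illustrative assumptions, not figures from the talk, and protocol overhead is ignored.

```python
import math


def camera_stream_gbytes(megapixels: float, bytes_per_pixel: int, fps: int) -> float:
    """Raw (uncompressed) data rate of one camera in GB/s."""
    return megapixels * 1e6 * bytes_per_pixel * fps / 1e9


def max_cameras(link_gbytes: float, stream_gbytes: float) -> int:
    """How many such streams fit in the link, ignoring protocol overhead."""
    return math.floor(link_gbytes / stream_gbytes)


if __name__ == "__main__":
    # Assumed camera: 5 MP, 16-bit pixels, 30 fps.
    stream = camera_stream_gbytes(5, 2, 30)
    print(f"{stream:.2f} GB/s per camera")   # 0.30 GB/s per camera
    print(max_cameras(2.0, stream))          # x4 Gen2 PCIe at 2 GB/s -> 6
    print(max_cameras(40 / 8, stream))       # one 40 GbE port, 5 GB/s raw -> 16
```

Under these assumptions a single Tegra link saturates at about six raw camera streams, which is why resolution, frame rate, or on-sensor preprocessing has to give when more sensors are attached.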


Page 16: S5429_LanceBrown

Future Architectures

Even Cooler Designs Possible

• Altera

  • Arria 10 SoC

    • Eliminates need for x86 CPU to run OpenCL

    • Truly stand-alone appliances

    • 100 GbE interfaces

  • Stratix 10 and Stratix 10 SoC

    • >10 TFLOPS for 100 W

    • Details: https://www.altera.com/products/fpga/stratix-series/stratix-10/overview.html

• NVIDIA VOLTA

  • Looking for NVLink intermingling with FPGAs

• Virtual FPGAs + Virtual GPUs

  • Allow instant scaling and data protection


Page 17: S5429_LanceBrown

Summary

• GPU + FPGA can solve amazing and fun problems

• Tegra K1/X1 provide incredible capability at low cost, which reduces the size of the FPGA needed

• OpenCL and Hard Floating Point IP make the Altera FPGAs a great partner with NVIDIA GPUs

• CEI is making scalable solutions to allow application developers to deploy from handheld to enterprise/HPC


Page 18: S5429_LanceBrown

Hardware & Software Capabilities

• Enterprise & Embedded SW

• Net Centric, SOA, web services, J2EE, SQL

• C/C++

• CUDA & OpenCL

• Embedded real time code, RTOS, hardware drivers, Fault Detection / Fault Isolation, etc.

• Simulations, APIs, and GUIs

• Cognitive Software

• Device Drivers

• National Instruments LabVIEW

• DO-178C

• FPGA designs (VHDL/Verilog/Simulink)

• RF Design

▪ System / Subsystem Designs

▪ 30+ complex board designs

▪ 32 layer PCBs with blind and buried vias

▪ High speed (100s MHz x GHz)

▪ Analog (RF & I/Q Receivers)

▪ Digital (FPGAs, DSPs, general purpose)

▪ ADC and DAC

▪ Standard and custom IO (busses, fabrics, SerDes, etc.)

▪ Ruggedization and thermal management

▪ CSWaP

▪ Serial I/O (e.g. PCIe, SerDes)

▪ DO-254


Page 19: S5429_LanceBrown

For More Information on Standard Products and Custom Engineering Services

Call Us – 719-388-8582 Office

Email Us – [email protected] Us – Colorado Springs, CO (Sunny 300+ Days)

Browse Us – www.ColoradoEngineering.com
