Mike Muller CTO
Is there anything new in heterogeneous computing?
Evolution
[Figure: timeline of computing, 1960 to 2020. Labels: Computing, Embedded, PC, Mobile Computing, Cloud Server, IOT, Wearable Intelligence, Consumer Smart Appliances. Year markers: 77, 82, 89, 93, 97, 07, 10, 13.]
What’s the Innovation?
MEMS
CCD
Wireless 3G
Semiconductor Process?
Media GPS
Social Media?
Mobility Trends: CMOS
[Figure: mobility in cm2/(V·s), log scale from 10 to 10,000, against year, 1990 to 2025. Series: NMOS, PMOS. Process nodes: 14nm, 10nm, 7nm, 5nm, 3.5nm.
Patterning: LELE, SADP, LELELE, EUV, SAQP, EUV LELE, EUV + DWEB, EUV + DSA
Switches: Planar CMOS, HKMG, Strain, FinFET, HNW, III-V, Ge, VNW, 2D: C, MoS, spintronics, NEMS
Interconnect: Al wires, Cu wires, 3DIC, Opto I/O, Opto int, Seq. 3D, Graphene wire, CNT via]
Printing: Moore’s Law and Ink Jets
[Figure: ink-jet scaling against year, 1980 to 2020, log scale from 1E-3 to 1E11. Series: Drops/Second and 1/Size (pL-1). Annotations: 10 nozzles, 10,000 nozzles, 100’s microns, 10’s microns.]
Printing and Imprinting Thin Film Transistors (TFT)
Can be transparent, bio-degradable and even ingestible
Unit cost 1000x less than mainstream CMOS: CMOS @ $40,000/m2 vs. TFT @ $10/m2
Printing
  CAPEX can be less than $1,000
  350dpi = 200um @ 20 m/s
  Can print batteries, antenna
  Mainly organic at ~20 volts
Imprint
  CAPEX: a $2M DVD press is high volume
  Better controllability, hence higher density and performance
  1um today, scaling to 50nm features as used today for BluRay discs
  Mainly inorganic, NMOS only, at ~2 volts
Mobility Trends: CMOS & Thin Film Transistors
[Figure: mobility in cm2/(V·s), log scale from 0.00001 to 10000, against year, 1990 to 2025. Series: Conventional NMOS, Conventional PMOS, TFT. CPU annotations: ARM1, 3µ, 6MHz; CortexM0, 2µ, 20kHz.]
Top Right and Bottom Left
1998 Manual Partitioning
C & Assembler
ARM + DSP
2013 Manual Partitioning
C++ & OpenCL/RenderScript
ARM + GPU

                       Vector Add   Reduction   Matrix Mul
GPU OpenCL on GPU      1.00         1.00        1.00
GPU OpenCL on FPGA     0.14         0.02        0.89
FPGA OpenCL on FPGA    1.71         1.62        31.85
Is There Anything New in Heterogeneous Computing?
How Do People Program?
Simple, old-school ray tracer
Start with C++ code and accelerate the code with Heterogeneous Systems

    // Serial version: trace one pixel at a time.
    void traceScreen() {
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                Ray ray = generateRay(x, y);
                IntersectableObject *obj = traceRay(ray);
                framebuffer[y][x] = colorPixelForObject(obj);
            }
        }
    }

    // Parallel version: the loop body becomes a lambda handed to par_for_2D.
    void traceScreen() {
        par_for_2D(height, width, [&](int y, int x) {
            Ray ray = generateRay(x, y);
            IntersectableObject *obj = traceRay(ray);
            framebuffer[y][x] = colorPixelForObject(obj);
        });
    }
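The slides do not show how par_for_2D is defined. As a minimal sketch, assuming a CPU fallback built on an OpenMP parallel loop (the helper's name and argument order are taken from the call above; everything else is an assumption made for illustration), it could look like this:

```cpp
#include <cassert>

// Hypothetical helper, not shown on the slides: calls body(y, x) once for
// every pixel of a height x width grid.
// When the compiler has OpenMP enabled (e.g. -fopenmp), the collapsed loop
// nest is shared out across threads; without it the pragma is ignored and
// the loops simply run serially.
// Assumption: body(y, x) is safe to call concurrently for distinct pixels.
template <typename Body>
void par_for_2D(int height, int width, Body body) {
    assert(height >= 0 && width >= 0);
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            body(y, x);
        }
    }
}
```

The point of the pattern is that the caller's code keeps its shape: the loop body moves into a lambda, and the choice of where and how to run it sits behind the helper.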
[Figure: programmer populations. Labels: Mobile, Web, Embedded ~200k, Desktop, ~20M Programmers.]
Moving the Code onto OpenCL 1.x
Need to make the following changes:
a) Get rid of all the pointers, both in the scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to OpenCL C
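Points a), c) and d) tend to be tackled together. As a rough illustration, assuming a hypothetical scene with just two shape types (the names Shape, ShapeKind, Ray and intersect are invented for this sketch and are not taken from the ray tracer on the slides), a class hierarchy with a virtual intersect method can be flattened into one tagged struct, with a switch in place of the virtual call and array indices in place of stored pointers. It is written in C++ here; an OpenCL C port would need the same data layout.

```cpp
#include <cassert>
#include <cmath>

// All names below are hypothetical; the slides do not show the scene classes.
enum ShapeKind { SHAPE_SPHERE, SHAPE_PLANE };

// One plain struct replaces the class hierarchy (point d).
// Fields that a given kind does not use are ignored.
struct Shape {
    ShapeKind kind;
    float cx, cy, cz;   // sphere centre, or a point on the plane
    float radius;       // sphere only
    float nx, ny, nz;   // unit normal, plane only
};

struct Ray {
    float ox, oy, oz;   // origin
    float dx, dy, dz;   // unit direction
};

// Returns the distance along the ray to the first hit, or -1 on a miss.
// The switch on `kind` stands in for the virtual call (point c).
inline float intersect(const Shape &s, const Ray &r) {
    switch (s.kind) {
    case SHAPE_SPHERE: {
        float lx = r.ox - s.cx, ly = r.oy - s.cy, lz = r.oz - s.cz;
        float b = lx * r.dx + ly * r.dy + lz * r.dz;
        float c = lx * lx + ly * ly + lz * lz - s.radius * s.radius;
        float disc = b * b - c;
        if (disc < 0.0f) return -1.0f;
        float t = -b - std::sqrt(disc);   // entry point only
        return t >= 0.0f ? t : -1.0f;
    }
    case SHAPE_PLANE: {
        float denom = r.dx * s.nx + r.dy * s.ny + r.dz * s.nz;
        if (std::fabs(denom) < 1e-6f) return -1.0f;   // ray parallel to plane
        float t = ((s.cx - r.ox) * s.nx + (s.cy - r.oy) * s.ny +
                   (s.cz - r.oz) * s.nz) / denom;
        return t >= 0.0f ? t : -1.0f;
    }
    }
    return -1.0f;
}

// The scene is a flat array with no pointers stored inside it, and the
// result is an index into that array rather than an object pointer (point a).
inline int traceRay(const Shape *scene, int count, const Ray &r) {
    int best = -1;
    float bestT = -1.0f;
    for (int i = 0; i < count; ++i) {
        float t = intersect(scene[i], r);
        if (t >= 0.0f && (best < 0 || t < bestT)) {
            best = i;
            bestT = t;
        }
    }
    return best;   // -1 means nothing was hit
}
```

The cost of this style is visible even at this size: every new shape type means touching the struct, the enum and the switch, which is exactly the kind of rewrite the list above is counting.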
Moving the Code onto OpenCL 2
Need to make the following changes:
a) Get rid of all the pointers, both in the scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to OpenCL C
OpenCL 2 solves point a) with shared address space, but not the rest
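Point e), removing recursion, is one of the changes a shared address space does not help with. One common workaround, sketched here under stated assumptions, is to store the CSG tree in postfix order and evaluate it in a single loop with a small fixed-size stack. The CsgNode layout, the MAX_STACK limit and the point-containment query are all illustrative inventions; the slides do not show CSGObject, and its real traversal may differ.

```cpp
#include <cassert>

// Hypothetical node types for a CSG tree built from spheres.
enum CsgOp { CSG_SPHERE, CSG_UNION, CSG_INTERSECT, CSG_SUBTRACT };

struct CsgNode {
    CsgOp op;
    float cx, cy, cz, radius;   // used by CSG_SPHERE nodes only
};

const int MAX_STACK = 32;       // assumed upper bound on expression depth

// Tests whether point (px, py, pz) lies inside the CSG solid.
// `nodes` holds the tree in postfix order: both operands come before the
// operator that combines them, so no recursion is needed.
inline bool containsPoint(const CsgNode *nodes, int count,
                          float px, float py, float pz) {
    bool stack[MAX_STACK];
    int top = 0;
    for (int i = 0; i < count; ++i) {
        const CsgNode &n = nodes[i];
        if (n.op == CSG_SPHERE) {
            assert(top < MAX_STACK);
            float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
            stack[top++] = dx * dx + dy * dy + dz * dz <= n.radius * n.radius;
        } else {
            assert(top >= 2);
            bool rhs = stack[--top];
            bool lhs = stack[--top];
            bool result = false;
            switch (n.op) {
            case CSG_UNION:     result = lhs || rhs;  break;
            case CSG_INTERSECT: result = lhs && rhs;  break;
            case CSG_SUBTRACT:  result = lhs && !rhs; break;
            default:            break;
            }
            stack[top++] = result;
        }
    }
    assert(top == 1);           // a well-formed expression leaves one value
    return stack[0];
}
```

For example, a large sphere with a smaller one carved out of it is the three-node sequence sphere A, sphere B, CSG_SUBTRACT.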
Moving the Code onto C++ AMP
Need to make the following changes:
a) Get rid of all the pointers, both in the scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as C++ AMP cannot call into the C++ standard library
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to OpenCL C
C++ AMP solves points d), f) and g), but not the rest
Moving the Code onto HSA
Need to make the following changes:
a) Get rid of all the pointers, both in the scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as HSAIL does not understand C++ data type internals
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to a language on top of HSAIL
HSA solves points a), c), d), e) and soon f)
What Makes GPUs Good For Power Efficient Compute?
Relaxed single-threaded performance
  No dynamic scheduling
  No branch prediction
  No register renaming, no result forwarding
  Longer pipelines
  Lower clock frequencies
Multi-threading
  Tolerate long latencies to memory
Increasing the ALU/control ratio
  Short-vectors exposed to programmers
  SIMT/Warp/VLIW/Wavefront based execution
Heterogeneous Compute, Homogeneous Architecture
How about a SIMTish ARM?
  Familiar programming model, C++ and OpenMP
  Fewer seams
  Sharing data structures and function pointers/vtables
[Figure: LITTLE, big and Throughput cores. Block labels: SIMT Queue, Load/Store Pipe, FP Pipe, Integer Pipe, Write. Marked RESEARCH.]
Moving the Code onto a Warped ARM
Need to make the following changes:
Get rid of all the pointers, both in the scene vector and internally in CSGObject
Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
Get rid of the virtual function calls
Change the classes to structs
Get rid of recursion in CSGObject
Avoid accessing the global scene variable in accelerated code
Port the code base to OpenCL C
Performance vs Effort
We’ve implemented SGEMM, a matrix-matrix multiplication benchmark, in various ways, to investigate the tradeoff between programmer effort and performance payoff
SGEMM version                                 Speedup   Effort
ARM in C                                      1x        Low
ARM in C with NEON intrinsics, prefetching    15x       Medium - High
ARM in assembly with NEON, prefetching        26x       High
SIMTish ARM in C                              35x       Low
SIMTish ARM in C, unrolled                    44x       Low - Medium
Mali GPU x 4 way                              136x      High
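For orientation, the low-effort "in C" rows presumably start from something like a plain triple loop. The sketch below is an assumption about what such a baseline looks like, not the benchmark code behind the table: it uses row-major storage, computes only C = A * B, and leaves out the alpha and beta scaling factors of the full BLAS SGEMM routine. The function name sgemm_naive is invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-precision matrix multiply, C = A * B, with all matrices row-major.
// A is m x k, B is k x n, and C is resized to m x n.
// Deliberately untuned: no blocking, no SIMD intrinsics, no prefetching.
inline void sgemm_naive(std::size_t m, std::size_t n, std::size_t k,
                        const std::vector<float> &A,
                        const std::vector<float> &B,
                        std::vector<float> &C) {
    assert(A.size() == m * k);
    assert(B.size() == k * n);
    C.assign(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}
```

Each element of C is independent of the others, which is why the same loop nest maps naturally onto wide or many-threaded hardware; the higher-effort rows in the table come from hand-tuning memory access and vector use on top of this structure.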
Scale Needs Standards
IPv4
IPv6 Sonosnet
Works for geeks…
No proper orchestration
Battle for the apps platform
Needs home IT support
Or only single manufacturer
Imagine that there were 1000 of these connected devices…
Functional Becomes the Internet of things
Functional Little Data
[Figure: Little Data. Labels: Mike, Life Insurance, Gym, Car Insurance, My Data, Their Data.]
Rob Curtis, Haymakers, Cambridge
Picture by Keith Jones
Sharing Needs Trust
IOT Medical Devices
First implantable Pacemaker 1958
Can a pacemaker be hacked to kill? Or is that just a plot line in a US TV series?
RF interface for adjusting settings
First hacked in 2008
  “Sustained effort by a team of specialists” – The New York Times
  Range a few cm
Today
  MIT grad students
  One weekend
  Range 50 feet
Trust Needs Security
It’s a Heterogeneous Future
[Figure: Reach, with markers Today and The future. Labels: Fixed Telephony Networks, Mobile Telephony, Internet / broadband, Mobile internet, Applications, SaaS, M2M, Sensors & Actuators Networks, Smart Everything, Open Data and Objects.]
Scale Needs Standards
Sharing Needs Trust
Trust Needs Security