Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by Mike Muller, Chief Technology Officer, ARM

Mike Muller CTO

Is there anything new in heterogeneous computing?

Evolution

[Figure: timeline from 1960 to 2020 showing Computing, Embedded, Mobile Computing, Cloud Server, Consumer Smart Appliances and Wearable Intelligence]

What’s the Innovation?

[Figure: timeline from 1990 to 2025 with candidate innovations: Wireless, 3G, Media, GPS, Semiconductor Process?, Social Media?]

Mobility Trends: CMOS

[Figure: CMOS roadmap across the 14nm, 10nm, 7nm, 5nm and 3.5nm nodes, with NMOS and PMOS mobility trends]

Patterning: LELELE, EUV, EUV LELE, EUV + DWEB, EUV + DSA

Switches: Planar CMOS, HKMG, Strain, FinFET, HNW, III-V, Ge, VNW, 2D: C, MoS, spintronics, NEMS

Interconnect: Al wires, Cu wires, 3DIC, Opto I/O, Opto int, Seq. 3D, Graphene wire, CNT via

Printing: Moore’s Law and Ink Jets

[Figure: ink-jet scaling from 1980 to 2020. Annotations: 10 nozzles, 10,000 nozzles; 100s of microns, 10s of microns. Axes: Drops/Second and 1/Size (pL^-1)]

Printing and Imprinting Thin Film Transistors (TFT)

Can be transparent, bio-degradable and even ingestible

Unit cost 1000x less than mainstream CMOS: CMOS @ $40,000/m2 vs. TFT @ $10/m2

Printing:
  CAPEX can be less than $1,000
  350dpi = 200um @ 20 m/s
  Can print batteries, antenna
  Mainly organic at ~20 volts

Imprint:
  CAPEX: a $2M DVD press is high volume
  Better controllability, hence higher density and performance
  1um today, scaling to the 50nm features used today for BluRay discs
  Mainly inorganic, NMOS only, at ~2 volts
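As a quick check of the cost figures quoted on this slide, the ratio of the two per-area prices can be worked out directly. This compares cost per unit area only; it assumes nothing about how much area a given circuit needs in each technology.

```latex
\frac{\text{CMOS cost per area}}{\text{TFT cost per area}}
  = \frac{\$40{,}000/\mathrm{m}^2}{\$10/\mathrm{m}^2}
  = 4000
```

Per square metre, the quoted prices differ by a factor of 4,000, which is the same order of magnitude as the slide's rounder "1000" figure.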

Mobility Trends: CMOS & Thin Film Transistors

[Figure: mobility versus year, 1990 to 2025, for conventional NMOS and conventional PMOS; annotated with ARM1 (3µ) and Cortex-M0 (2µ)]

Top Right and Bottom Left

1998 Manual Partitioning: C & Assembler (ARM, DSP)

2013 Manual Partitioning: C++ & OpenCL/RenderScript (ARM, GPU)

Code / target          Vector Add   Reduction   Matrix Mul
GPU OpenCL on GPU      1.00         1.00        1.00
GPU OpenCL on FPGA     0.14         0.02        0.89
FPGA OpenCL on FPGA    1.71         1.62        31.85

Is There Anything New in Heterogeneous Computing?

How Do People Program?

Simple, old-school ray tracer

Start with C++ code and accelerate the code with Heterogeneous Systems

    void traceScreen() {
        // Serial version: visit every pixel in row-major order.
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                Ray ray = generateRay(x, y);
                IntersectableObject *obj = traceRay(ray);
                framebuffer[y][x] = colorPixelForObject(obj);
            }
        }
    }

    void traceScreen() {
        // Parallel version: the loop body becomes a lambda run once per pixel.
        par_for_2D(height, width, [&](int y, int x) {
            Ray ray = generateRay(x, y);
            IntersectableObject *obj = traceRay(ray);
            framebuffer[y][x] = colorPixelForObject(obj);
        });
    }
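The slides call par_for_2D without showing it. As a minimal sketch of what such a helper might look like on a plain multi-core CPU, the version below splits the rows across std::thread workers. The name and argument order are taken from the call above; the implementation, the row-interleaved scheduling and the use of std::thread are assumptions, not the speaker's code.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical helper (not shown in the slides): calls body(y, x) once for
// every cell of a height x width grid, spreading rows over CPU threads.
template <typename Body>
void par_for_2D(int height, int width, Body body) {
    unsigned hw = std::thread::hardware_concurrency();  // may report 0
    int workers = static_cast<int>(hw == 0 ? 1 : hw);
    workers = std::max(1, std::min(workers, height));

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([=, &body] {
            // Worker w handles rows w, w + workers, w + 2 * workers, ...
            for (int y = w; y < height; y += workers)
                for (int x = 0; x < width; ++x)
                    body(y, x);
        });
    }
    for (auto &t : pool) t.join();  // wait until every pixel is done
}
```

The body must be safe to run concurrently for different (y, x) pairs. The ray tracer's lambda meets that condition as long as each call writes only its own framebuffer element.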

[Figure: programmer populations by platform: Mobile, Web, Desktop and Embedded (~200k); ~20M programmers]

Moving the Code onto OpenCL 1.x

Need to make the following changes:

a) Get rid of all the pointers, both in scene vector and internally in CSGObject

b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals

c) Get rid of the virtual function calls

d) Change the classes to structs

e) Get rid of recursion in CSGObject

f) Avoid accessing the global scene variable in accelerated code

g) Port the code base to OpenCL C
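To give a feel for what changes a), c), d), e) and f) do to the code, here is a small sketch of a CSG point-containment test written in the restricted style the list demands. The slides do not show CSGObject, so every name here (CsgNode, NodeKind, containsPoint, MAX_STACK) is hypothetical, and the post-order node layout is one possible encoding among several. It is written in C++ for readability rather than in OpenCL C.

```cpp
// Hypothetical flattened scene: plain structs, no pointers stored in the
// data, no virtual functions, no recursion. Nodes are kept in post-order,
// so both operands of an operator appear before the operator itself.
enum NodeKind { SPHERE, UNION_OP, INTERSECT_OP, DIFFERENCE_OP };

struct CsgNode {
    NodeKind kind;
    float cx, cy, cz, r;  // sphere centre and radius; ignored for operators
};

const int MAX_STACK = 32;  // assumed upper bound on pending operands

// Returns true if (px, py, pz) lies inside the solid described by
// nodes[0..count). Malformed input (stack underflow or overflow) gives false.
// The scene is passed in as an argument rather than read from a global.
bool containsPoint(const CsgNode *nodes, int count,
                   float px, float py, float pz) {
    bool stack[MAX_STACK];  // explicit stack replaces the recursion
    int top = 0;
    for (int i = 0; i < count; ++i) {
        const CsgNode &n = nodes[i];
        if (n.kind == SPHERE) {
            if (top == MAX_STACK) return false;
            float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
            stack[top++] = dx * dx + dy * dy + dz * dz <= n.r * n.r;
            continue;
        }
        if (top < 2) return false;
        bool b = stack[--top];
        bool a = stack[--top];
        switch (n.kind) {  // tag dispatch replaces the virtual call
            case UNION_OP:      stack[top++] = a || b;  break;
            case INTERSECT_OP:  stack[top++] = a && b;  break;
            case DIFFERENCE_OP: stack[top++] = a && !b; break;
            default:            return false;
        }
    }
    return top == 1 && stack[0];
}
```

Even for a test this small, the class hierarchy, the pointer-linked tree and the recursive traversal have all been replaced by hand, which is the porting cost the list is pointing at.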

Moving the Code onto OpenCL 2

b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals

OpenCL 2 solves point a) with shared address space, but not the rest

Moving the Code onto C++ AMP

b) Rewrite the use of std::vector, as C++ AMP cannot call into C++ standard library

C++ AMP solves points d), f) and g), but not the rest

Moving the Code onto HSA

b) Rewrite the use of std::vector, as HSAIL does not understand C++ data type internals

g) Port the code base to a language on top of HSAIL

HSA solves points a), c), d), e) and soon f)

What Makes GPUs Good For Power Efficient Compute?

Relaxed single-threaded performance:
  No dynamic scheduling
  No branch prediction
  No register renaming, no result forwarding
  Longer pipelines
  Lower clock frequencies

Multi-threading:
  Tolerate long latencies to memory

Increasing the ALU/control ratio:
  Short-vectors exposed to programmers
  SIMT/Warp/VLIW/Wavefront based execution

LITTLE big

Heterogeneous Compute Homogeneous Architecture

How about a SIMTish ARM?

Familiar programming model, C++ and OpenMP

Fewer seams

Sharing data structures and function pointers/vtables

[Figure: throughput core with Integer Pipe, FP Pipe and Load/Store Pipe; marked RESEARCH]

Moving the Code onto a Warped ARM

Need to make the following changes:

Get rid of all the pointers, both in scene vector and internally in CSGObject

Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals

Get rid of the virtual function calls

Change the classes to structs

Get rid of recursion in CSGObject

Avoid accessing the global scene variable in accelerated code

Port the code base to OpenCL C

Performance vs Effort

We’ve implemented SGEMM, a matrix-matrix multiplication benchmark, in several ways to investigate the trade-off between programmer effort and performance payoff.

SGEMM version                                 Speedup   Effort
ARM in C                                      1x        Low
ARM in C with NEON intrinsics, prefetching    15x       Medium - High
ARM in assembly with NEON, prefetching        26x       High
SIMTish ARM in C                              35x       Low
SIMTish ARM in C, unrolled                    44x       Low - Medium
Mali GPU x 4 way                              136x      High
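For reference, the kind of code behind a low-effort baseline row typically looks like the textbook triple loop below. The slides do not show their SGEMM source, so this is only a sketch of the general shape: the function name sgemm_naive, the row-major layout and the std::vector interface are assumptions, and it is written in C++ rather than C.

```cpp
#include <cstddef>
#include <vector>

// Textbook single-precision matrix multiply: C = alpha * A * B + beta * C.
// All matrices are row-major; A is m x k, B is k x n, C is m x n.
void sgemm_naive(std::size_t m, std::size_t n, std::size_t k, float alpha,
                 const std::vector<float> &A, const std::vector<float> &B,
                 float beta, std::vector<float> &C) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;  // dot product of row i of A and column j of B
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

The higher-effort rows in the table restructure this same loop nest by hand, with vector intrinsics, prefetching or a separate GPU kernel. The "SIMTish ARM in C" rows are the ones that keep the source close to this form.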

Scale Needs Standards

IPv6 Sonosnet

Works for geeks…

No proper orchestration

Battle for the apps platform

Needs home IT support

Or only a single manufacturer

Imagine that there were 1,000 of these connected devices…

Functional Becomes the Internet of things

Functional Little Data

Life Insurance

Car Insurance

My Data

Their Data

X X Rob Curtis Haymakers Cambridge

Picture by Keith Jones

Sharing Needs Trust

IOT Medical Devices

First implantable pacemaker: 1958

Can a pacemaker be hacked to kill, or is that just a plot line in a US TV series?

RF interface for adjusting settings

First hacked in 2008: “Sustained effort by a team of specialists” (The New York Times); range a few cm

Today: MIT grad students, one weekend, range 50 feet

Trust Needs Security

It’s a Heterogeneous Future

[Figure labels: Fixed Telephony Networks; Mobile Telephony; Internet / broadband; Mobile internet; Sensors & Actuators Networks; Applications; SaaS; M2M; Open Data and Objects; Smart Everything; The future R]

Scale Needs Standards

Sharing Needs Trust

Trust Needs Security
