Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by Mike Muller, Chief Technology Officer, ARM

Post on 13-May-2015

DESCRIPTION

Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.

Transcript

Mike Muller CTO

Is there anything new in heterogeneous computing?

Evolution

[Timeline figure, 1960 to 2020: Computing, Embedded, PC, Mobile Computing, Consumer Smart Appliances, Cloud Server, IOT, Wearable Intelligence. Year markers on the figure: 77, 82, 89, 93, 97, 07, 10, 13.]

What’s the Innovation?

[Slide labels: MEMS; CCD; Wireless 3G; Media; GPS; Semiconductor Process?; Social Media?]

Mobility Trends: CMOS

[Figure: NMOS and PMOS carrier mobility in cm2/(V·s), axis ticks from 10 to 10,000, plotted from 1990 to 2025, with process nodes 14nm, 10nm, 7nm, 5nm and 3.5nm marked.]

Patterning: LELE, SADP, LELELE, EUV, SAQP, EUV LELE, EUV + DWEB, EUV + DSA

Switches: Planar CMOS, Strain, HKMG, FinFET, HNW, III-V, Ge, VNW, 2D: C, MoS2, spintronics, NEMS

Interconnect: Al wires, Cu wires, 3DIC, Opto I/O, Opto int, Seq. 3D, Graphene wire, CNT via

Printing: Moore’s Law and Ink Jets

[Figure: ink-jet scaling from 1980 to 2020 on a log axis running from 1E-3 to 1E11, showing Drops/Second and 1/Size (pL-1). Annotations: 10 nozzles, 10,000 nozzles, 100's microns, 10's microns.]

Printing and Imprinting Thin Film Transistors (TFT)

Can be transparent, bio-degradable and even ingestible

Unit cost ~1000x less than mainstream CMOS: CMOS @ $40,000/m2 vs. TFT @ $10/m2

Printing:

CAPEX can be less than $1,000

350dpi = 200um @ 20 m/s

Can print batteries, antenna

Mainly organic at ~20 volts

Imprint:

CAPEX: a $2M DVD press is high volume

Better controllability, hence higher density and performance

1um today, scaling to 50nm features as used today for BluRay discs

Mainly inorganic, NMOS only, at ~2 volts

Mobility Trends: CMOS & Thin Film Transistors

[Figure: carrier mobility in cm2/(V·s), log axis from 0.00001 to 10000, plotted from 1990 to 2025. Labels: Conventional NMOS, Conventional PMOS, TFT, CPU.]

ARM1: 3µ, 6MHz

CortexM0: 2µ, 20kHz

Top Right and Bottom Left

1998: Manual Partitioning, C & Assembler (ARM, DSP)

2013: Manual Partitioning, C++ & OpenCL/RenderScript (ARM, GPU)

Configuration          Vector Add   Reduction   Matrix Mul
GPU OpenCL on GPU      1.00         1.00        1.00
GPU OpenCL on FPGA     0.14         0.02        0.89
FPGA OpenCL on FPGA    1.71         1.62        31.85

Is There Anything New in Heterogeneous Computing?

How Do People Program?

Simple, old-school ray tracer

Start with C++ code and accelerate the code with Heterogeneous Systems

void traceScreen() {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Ray ray = generateRay(x, y);
            IntersectableObject *obj = traceRay(ray);
            framebuffer[y][x] = colorPixelForObject(obj);
        }
    }
}

void traceScreen() {
    par_for_2D(height, width, [&](int y, int x) {
        Ray ray = generateRay(x, y);
        IntersectableObject *obj = traceRay(ray);
        framebuffer[y][x] = colorPixelForObject(obj);
    });
}
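The slide does not show how par_for_2D is implemented. As a rough sketch only, assuming a CPU fallback built on std::thread (the row-interleaved split and the optional workers parameter are assumptions for illustration, not something from the talk), it could look like this:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical CPU-only stand-in for the par_for_2D used on the slide.
// Rows are interleaved across worker threads; body(y, x) runs once per cell.
template <typename Body>
void par_for_2D(int height, int width, Body body, unsigned workers = 0) {
    if (height <= 0 || width <= 0) return;
    if (workers == 0) workers = std::max(1u, std::thread::hardware_concurrency());
    workers = std::min<unsigned>(workers, static_cast<unsigned>(height));

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned w = 0; w < workers; ++w) {
        // Worker w handles rows w, w + workers, w + 2 * workers, ...
        // body is captured by reference; all threads are joined before return.
        pool.emplace_back([=, &body] {
            for (int y = static_cast<int>(w); y < height; y += static_cast<int>(workers))
                for (int x = 0; x < width; ++x)
                    body(y, x);
        });
    }
    for (auto& t : pool) t.join();
}
```

With this in place the traceScreen body above compiles unchanged, provided each (y, x) call writes only its own pixel. A heterogeneous runtime would replace the thread pool with a dispatch to the accelerator.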

[Figure labels: Mobile, Web, Embedded ~200k, Desktop, ~20M Programmers]

Moving the Code onto OpenCL 1.x

Need to make the following changes:

a) Get rid of all the pointers, both in scene vector and internally in CSGObject

b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals

c) Get rid of the virtual function calls

d) Change the classes to structs

e) Get rid of recursion in CSGObject

f) Avoid accessing the global scene variable in accelerated code

g) Port the code base to OpenCL C
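Points a), c) and d) tend to be handled together. A minimal sketch, assuming a hypothetical scene of spheres and planes (Shape, ShapeKind, intersect and this traceRay signature are illustrative names, not from the talk): the class hierarchy becomes one plain struct with a kind tag, virtual dispatch becomes a switch, and object pointers become indices into a flat array. It is written as plain C++ so it can be checked on the host; an OpenCL C port would follow the same shape.

```cpp
#include <cmath>

// Hypothetical flattened scene representation (not from the talk).
enum ShapeKind { SHAPE_SPHERE = 0, SHAPE_PLANE = 1 };

struct Shape {
    int   kind;        // ShapeKind tag, replaces the vtable pointer
    float px, py, pz;  // sphere centre, or any point on the plane
    float nx, ny, nz;  // plane normal (unused for spheres)
    float radius;      // sphere radius (unused for planes)
};

struct Ray {
    float ox, oy, oz;  // origin
    float dx, dy, dz;  // direction, assumed normalised
};

// Distance along the ray to the nearest hit in front of the origin, or -1.
inline float intersect(const Shape& s, const Ray& r) {
    switch (s.kind) {  // replaces the virtual function call
    case SHAPE_SPHERE: {
        const float lx = r.ox - s.px, ly = r.oy - s.py, lz = r.oz - s.pz;
        const float b = lx * r.dx + ly * r.dy + lz * r.dz;
        const float c = lx * lx + ly * ly + lz * lz - s.radius * s.radius;
        const float disc = b * b - c;
        if (disc < 0.0f) return -1.0f;
        const float t = -b - std::sqrt(disc);  // near root only
        return t >= 0.0f ? t : -1.0f;
    }
    case SHAPE_PLANE: {
        const float denom = s.nx * r.dx + s.ny * r.dy + s.nz * r.dz;
        if (std::fabs(denom) < 1e-6f) return -1.0f;  // ray parallel to plane
        const float t = ((s.px - r.ox) * s.nx + (s.py - r.oy) * s.ny +
                         (s.pz - r.oz) * s.nz) / denom;
        return t >= 0.0f ? t : -1.0f;
    }
    default:
        return -1.0f;
    }
}

// The scene is a flat buffer; the result is an index into it, or -1 on a miss.
inline int traceRay(const Shape* shapes, int count, const Ray& r) {
    int best = -1;
    float bestT = 0.0f;
    for (int i = 0; i < count; ++i) {
        const float t = intersect(shapes[i], r);
        if (t >= 0.0f && (best < 0 || t < bestT)) { best = i; bestT = t; }
    }
    return best;
}
```

The cost is that every new shape type means touching the struct and the switch, which is exactly the kind of rewrite the list above describes.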

Moving the Code onto OpenCL 2

Need to make the following changes:

a) Get rid of all the pointers, both in scene vector and internally in CSGObject

b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals

c) Get rid of the virtual function calls

d) Change the classes to structs

e) Get rid of recursion in CSGObject

f) Avoid accessing the global scene variable in accelerated code

g) Port the code base to OpenCL C

OpenCL 2 solves point a) with shared address space, but not the rest

Moving the Code onto C++ AMP

Need to make the following changes:

a) Get rid of all the pointers, both in scene vector and internally in CSGObject

b) Rewrite the use of std::vector, as C++ AMP cannot call into C++ standard library

c) Get rid of the virtual function calls

d) Change the classes to structs

e) Get rid of recursion in CSGObject

f) Avoid accessing the global scene variable in accelerated code

g) Port the code base to OpenCL C

C++ AMP solves points d), f) and g), but not the rest
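Point e), removing recursion, is still on the list at this stage. A minimal sketch of one way to do it, assuming a hypothetical CSG tree of spheres stored in post-order in a flat array (CsgNode, CsgOp, kMaxStack and csgContains are illustrative names, not the talk's CSGObject): the recursive inside-test becomes a single loop over the nodes with a small fixed-size value stack.

```cpp
// Hypothetical flattened CSG tree (not the CSGObject from the talk).
enum CsgOp { CSG_LEAF, CSG_UNION, CSG_INTERSECT, CSG_DIFFERENCE };

struct CsgNode {
    int   op;             // CsgOp
    float cx, cy, cz, r;  // sphere centre and radius, used only by CSG_LEAF
};

const int kMaxStack = 32;  // assumed upper bound on pending operands

// nodes[] holds the tree in post-order: both children precede their parent.
// Returns true if the point lies inside the solid. Malformed input
// (stack overflow or underflow, unknown op) is reported as false.
inline bool csgContains(const CsgNode* nodes, int count,
                        float px, float py, float pz) {
    bool stack[kMaxStack];
    int top = 0;
    for (int i = 0; i < count; ++i) {
        const CsgNode& n = nodes[i];
        if (n.op == CSG_LEAF) {
            if (top == kMaxStack) return false;
            const float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
            stack[top++] = dx * dx + dy * dy + dz * dz <= n.r * n.r;
        } else {
            if (top < 2) return false;
            const bool rhs = stack[--top];
            const bool lhs = stack[--top];
            bool v;
            switch (n.op) {
            case CSG_UNION:      v = lhs || rhs;  break;
            case CSG_INTERSECT:  v = lhs && rhs;  break;
            case CSG_DIFFERENCE: v = lhs && !rhs; break;
            default:             return false;
            }
            stack[top++] = v;
        }
    }
    return top == 1 && stack[0];
}
```

The loop has a fixed stack bound and no function pointers, which suits kernel languages that forbid recursion, but the tree has to be flattened on the host first.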

Moving the Code onto HSA

Need to make the following changes:

a) Get rid of all the pointers, both in scene vector and internally in CSGObject

b) Rewrite the use of std::vector, as HSAIL does not understand C++ data type internals

c) Get rid of the virtual function calls

d) Change the classes to structs

e) Get rid of recursion in CSGObject

f) Avoid accessing the global scene variable in accelerated code

g) Port the code base to a language on top of HSAIL

HSA solves points a), c), d), e) and soon f)

What Makes GPUs Good For Power Efficient Compute?

Relaxed single-threaded performance:

No dynamic scheduling

No branch prediction

No register renaming, no result forwarding

Longer pipelines

Lower clock frequencies

Multi-threading:

Tolerate long latencies to memory

Increasing the ALU/control ratio:

Short-vectors exposed to programmers

SIMT/Warp/VLIW/Wavefront based execution
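To make the last point concrete, here is a toy model of SIMT execution, assuming a hypothetical 4-lane warp (kLanes, Vec and simtStep are illustrative names, and no real GPU is being described). All lanes share one instruction stream, so a data-dependent branch is handled by issuing each side under a mask of active lanes. That keeps control logic small relative to the ALUs, at the cost of idle lanes when the warp diverges.

```cpp
#include <array>
#include <cstdint>

const int kLanes = 4;  // assumed warp width, for illustration only
using Vec = std::array<int, kLanes>;

// Per lane: out = (in is even) ? in / 2 : 3 * in + 1.
// Both branch sides are issued warp-wide under an active-lane mask;
// a side is skipped only if no lane needs it.
// Returns the number of lane slots that sat idle because of masking.
inline int simtStep(const Vec& in, Vec& out) {
    std::uint32_t evenMask = 0;
    for (int l = 0; l < kLanes; ++l)
        if (in[l] % 2 == 0) evenMask |= 1u << l;
    const std::uint32_t allMask = (1u << kLanes) - 1;
    const std::uint32_t oddMask = allMask & ~evenMask;

    int idle = 0;
    if (evenMask != 0) {  // "then" side, issued once for the whole warp
        for (int l = 0; l < kLanes; ++l) {
            if ((evenMask >> l) & 1u) out[l] = in[l] / 2;
            else ++idle;
        }
    }
    if (oddMask != 0) {   // "else" side, issued once for the whole warp
        for (int l = 0; l < kLanes; ++l) {
            if ((oddMask >> l) & 1u) out[l] = 3 * in[l] + 1;
            else ++idle;
        }
    }
    return idle;
}
```

A warp whose lanes all take the same side wastes nothing; a warp split evenly between the two sides wastes half of its lane slots.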

Heterogeneous Compute, Homogeneous Architecture

[Figure labels: big, LITTLE]

How about a SIMTish ARM?

Familiar programming model, C++ and OpenMP

Fewer seams

Sharing data structures and function pointers/vtables

[Figure: throughput core with SIMT Queue, Integer Pipe, FP Pipe, Load/Store Pipe and Write stage. Marked RESEARCH.]

Moving the Code onto a Warped ARM

Need to make the following changes:

Get rid of all the pointers, both in scene vector and internally in CSGObject

Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals

Get rid of the virtual function calls

Change the classes to structs

Get rid of recursion in CSGObject

Avoid accessing the global scene variable in accelerated code

Port the code base to OpenCL C

Performance vs Effort

We’ve implemented SGEMM, a matrix-matrix multiplication benchmark, in various ways, to investigate the tradeoff between programmer effort and performance payoff

SGEMM version                                Speedup   Effort
ARM in C                                     1x        Low
ARM in C with NEON intrinsics, prefetching   15x       Medium - High
ARM in assembly with NEON, prefetching       26x       High
SIMTish ARM in C                             35x       Low
SIMTish ARM in C, unrolled                   44x       Low - Medium
Mali GPU x 4 way                             136x      High
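The talk does not show the benchmark source. As an assumption about what a low-effort C-style baseline might look like, the sketch below is the textbook triple loop for single-precision matrix multiply with row-major storage (sgemm_naive and its parameter order are illustrative, not a BLAS interface):

```cpp
#include <cstddef>

// C = alpha * A * B + beta * C, all matrices row-major.
// A is m x k, B is k x n, C is m x n.
inline void sgemm_naive(std::size_t m, std::size_t n, std::size_t k,
                        float alpha, const float* A, const float* B,
                        float beta, float* C) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            // Dot product of row i of A with column j of B.
            for (std::size_t p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

The higher-effort rows in the table start from the same arithmetic and change how it is scheduled: vector instructions, prefetching, unrolling, or many threads.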

Scale Needs Standards

IPv4

IPv6 Sonosnet

Works for geeks…

No proper orchestration

Battle for the apps platform

Needs home IT support

Or only a single manufacturer

Imagine that there were 1,000 of these connected devices…

Functional Becomes the Internet of things

Functional Little Data

[Figure labels: Mike, Life Insurance, Gym, Car Insurance, My Data, Their Data]

Rob Curtis, Haymakers, Cambridge

Picture by Keith Jones

Sharing Needs Trust

IOT Medical Devices

First implantable Pacemaker 1958

Can a pacemaker be hacked to kill? Or is that just a plot line in a US TV series?

RF interface for adjusting settings

First hacked in 2008: “Sustained effort by a team of specialists” (The New York Times), range a few cm

Today: MIT grad students, one weekend, range 50 feet

Trust Needs Security

It’s a Heterogeneous Future

[Figure: layers plotted against Reach, from Today towards The future. Labels: Fixed Telephony Networks; Mobile Telephony; Internet / broadband; Mobile internet; Sensors & Actuators Networks; Applications; SaaS; M2M; Smart Everything; Open Data and Objects]

Scale Needs Standards

Sharing Needs Trust

Trust Needs Security
