The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture A Logical Successor to the CPU, GPU and NPU? Anant Agarwal MIT http://www.cag.csail.mit.edu/raw

Mar 29, 2015

Page 1

The Raw All Purpose Unit (APU) based on a Tiled-Processor Architecture

A Logical Successor to the CPU, GPU and NPU?

Anant Agarwal

MIT

http://www.cag.csail.mit.edu/raw

Page 2

A tiled processor architecture prototype: the Raw microprocessor

October 02

Page 3

Embedded system: 1020 Element Microphone Array

Page 4

1020 Node Beamformer Demo

. 2 People moving about and talking

. Track movement of people using camera (vision group) and display on monitor

. Beamformer focuses on speech of one person

. Select another person using mouse

. Beamformer switches focus to the speech of the other person

Page 5

The opportunity

20 MIPS CPU, 1987

Page 6

2007

The billion transistor chip

The opportunity

Page 7

Enables seeking out new application domains

o Redefine our notion of a “general purpose processor”

o Imagine a single-chip handheld that is a speech-driven cellphone, camera, PDA, MP3 player, video engine

o Imagine a single-chip PC that is also a 10G router, wireless access point, graphics engine

o While running the gamut of existing desktop binaries

-- A new versatile processor -- APU*

*Asanovic

Page 8

But, where is general purpose computing today?

Other…

Encryption

Sound

Ethernet

Wireless

Graphics 0.25 TFLOPS

x86 Pentium IV

ASICs

Page 9

How does the ASIC do it?

Lots of ALUs, lots of registers, lots of small memories

Hand-routed, short wires

Lower power (everything close by)

Stream data model for high throughput

But, not general purpose

[Figure: ASIC datapath with several small memories (mem)]

Page 10

Our challenge

How to exploit ASIC-like features
  Lots of resources like ALUs and memories
  Application-specific routing of short wires

While being “general purpose”
  Programmable
  And even running ILP-based sequential programs

One Approach: Tiled Processor Architecture (TPA)

Page 11

Tiled Processor Architecture (TPA)

[Figure: an array of tiles. Each Tile is labeled with IMEM, DMEM, PC, REG, FPU and ALU, plus a SWITCH with its own PC and SMEM.]

Lots of ALUs and regs

Short, programmable wires

Lower power

Programmable. Supports ILP and Streams

Page 12

A Prototype TPA: The Raw Microprocessor

[Figure: The Raw Chip, an array of tiles with I/O at the edges (Disk stream, Video1, RDRAM, Packet stream). Inset: A Raw Tile, labeled with IMEM, DMEM, PC, REG, FPU and ALU, plus the Raw Switch with its own PC and SMEM.]

Software-scheduled interconnects (can use static or dynamic routing, but compiler determines instruction placement and routes)

[Billion transistor IEEE Computer Issue ’97]

Page 13

Tight integration of interconnect

[Figure: The Raw Chip and A Raw Tile as on the previous slide, plus the tile's processor pipeline (stage labels as extracted: IF, D, RF, A, TL, M1, M2, F, P, U, E, TV, F4, WB). Registers r24, r25, r26 and r27 are mapped to the input FIFOs from the static router and to the output FIFOs to the static router, joined to the pipeline by a 0-cycle “local bypass network”.]

Point-to-point, bypass-integrated, compiler-orchestrated on-chip networks

Page 14

How to “program the wires”

[Figure: two neighboring tiles, Tile 10 and Tile 11, each with a software-controlled crossbar.]

Producing tile: fmul r24, r3, r4 with switch instruction route P->E

Consuming tile: fadd r5, r3, r25 with switch instruction route W->P
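A minimal Python sketch of what this slide appears to show, assuming the fmul runs on the producing tile and the fadd on its eastern neighbor. The FIFO names and the run helper are hypothetical. r24 and r25 are treated as network-mapped registers, as the previous slide suggests, so writing r24 pushes a word toward the switch and reading r25 pops one from it.

```python
from collections import deque

# Hypothetical FIFOs standing in for the hardware queues.
proc_to_switch = deque()   # filled when the producer writes r24
wire_east = deque()        # producer's E port, which is the consumer's W port
switch_to_proc = deque()   # drained when the consumer reads r25


def producer_fmul(r3, r4):
    # fmul r24, r3, r4: the product goes to the network instead of a plain register
    proc_to_switch.append(r3 * r4)


def producer_switch():
    # route P->E: move one word from the processor to the east wire
    wire_east.append(proc_to_switch.popleft())


def consumer_switch():
    # route W->P: move one word from the west wire to the processor
    switch_to_proc.append(wire_east.popleft())


def consumer_fadd(r3):
    # fadd r5, r3, r25: the second operand is read straight off the network
    return r3 + switch_to_proc.popleft()


def run(a, b, c):
    """Compute c + a * b across the two tiles."""
    producer_fmul(a, b)
    producer_switch()
    consumer_switch()
    return consumer_fadd(c)


print(run(2.0, 3.0, 4.0))  # 4.0 + 2.0 * 3.0
```

The point of the sketch is that no load, store or explicit message is involved: the operand travels as a side effect of ordinary register reads and writes, and the two route instructions are the only "wiring".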

Page 15

The result of orchestrating the wires

[Figure: the tile array running several workloads at once. Labels: Custom Datapath Pipeline, mem (three tiles), httpd, C program (ILP computation), MPI program, Zzzz.]

Page 16

Perspective

We have replaced
  Bypass paths, ALU-reg bus, FPU-Int. bus, reg-cache-bus, cache-mem bus, etc.

With a general, point-to-point, routed interconnect called:
  Scalar operand network (SON)
  Fundamentally new kind of network optimized for both scalar and stream transport

Page 17

Programming models and software for tiled processor architectures

o Conventional scalar programs (C, C++, Java)
  Or, how to do ILP

o Stream programs

Page 18

Scalar (ILP) program mapping

E.g., start with a C program, and several transformations later:

    v2.4 = v2
    seed.0 = seed
    v1.2 = v1
    pval1 = seed.0 * 3.0
    pval0 = pval1 + 2.0
    tmp0.1 = pval0 / 2.0
    pval2 = seed.0 * v1.2
    tmp1.3 = pval2 + 2.0
    pval3 = seed.0 * v2.4
    tmp2.5 = pval3 + 2.0
    pval5 = seed.0 * 6.0
    pval4 = pval5 + 2.0
    tmp3.6 = pval4 / 3.0
    pval6 = tmp1.3 - tmp2.5
    v2.7 = pval6 * 5.0
    pval7 = tmp1.3 + tmp2.5
    v1.8 = pval7 * 3.0
    v0.9 = tmp0.1 - v1.8
    v3.10 = tmp3.6 - v2.7
    tmp2 = tmp2.5
    v1 = v1.8
    tmp1 = tmp1.3
    v0 = v0.9
    tmp0 = tmp0.1
    v3 = v3.10
    tmp3 = tmp3.6
    v2 = v2.7

Lee, Amarasinghe et al., “Space-time scheduling”, ASPLOS ’98

Existing languages will work
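To see how much ILP this straight-line code exposes, the following Python sketch builds the dependence chains between the statements and measures the longest one. It assumes every statement takes one time step and that communication is free, which is a simplification and not the cost model of the cited compiler; the names PROGRAM, NAME and critical_path are illustrative.

```python
import re

PROGRAM = """
v2.4 = v2
seed.0 = seed
v1.2 = v1
pval1 = seed.0 * 3.0
pval0 = pval1 + 2.0
tmp0.1 = pval0 / 2.0
pval2 = seed.0 * v1.2
tmp1.3 = pval2 + 2.0
pval3 = seed.0 * v2.4
tmp2.5 = pval3 + 2.0
pval5 = seed.0 * 6.0
pval4 = pval5 + 2.0
tmp3.6 = pval4 / 3.0
pval6 = tmp1.3 - tmp2.5
v2.7 = pval6 * 5.0
pval7 = tmp1.3 + tmp2.5
v1.8 = pval7 * 3.0
v0.9 = tmp0.1 - v1.8
v3.10 = tmp3.6 - v2.7
tmp2 = tmp2.5
v1 = v1.8
tmp1 = tmp1.3
v0 = v0.9
tmp0 = tmp0.1
v3 = v3.10
tmp3 = tmp3.6
v2 = v2.7
"""

# A value name starts with a letter, so numeric literals such as 3.0 are skipped.
NAME = re.compile(r"[A-Za-z_][\w.]*")


def critical_path(lines):
    """Length of the longest chain of dependent statements (unit latency assumed)."""
    depth = {}  # value name -> longest chain ending at its definition
    for line in lines:
        dest, expr = (part.strip() for part in line.split("="))
        operands = NAME.findall(expr)
        # Names not defined earlier are program inputs and count as depth 0.
        depth[dest] = 1 + max((depth.get(op, 0) for op in operands), default=0)
    return max(depth.values())


statements = PROGRAM.strip().splitlines()
steps = critical_path(statements)
print(len(statements), steps, round(len(statements) / steps, 2))  # prints 27 7 3.86
```

Under these assumptions the 27 statements need only 7 dependent steps, so roughly four of them can run side by side on average.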

Page 19

Scalar program mapping

[Figure: the program code from the previous slide (“Program code”) shown next to its graph (“Graph”), with one node per statement.]

Page 20

Program graph clustering

[Figure: the same program graph, with its statement nodes grouped into clusters.]

Page 21

Placement

[Figure: the clustered graph with each cluster placed on one of four tiles. The tile labels below follow the per-tile processor code on the routing slide.]

Tile1
    seed.0=seed
    pval1=seed.0*3.0
    pval0=pval1+2.0
    tmp0.1=pval0/2.0
    tmp0=tmp0.1

Tile2
    pval5=seed.0*6.0
    pval4=pval5+2.0
    tmp3.6=pval4/3.0
    tmp3=tmp3.6
    v3.10=tmp3.6-v2.7
    v3=v3.10

Tile3
    v2.4=v2
    pval3=seed.0*v2.4
    tmp2.5=pval3+2.0
    tmp2=tmp2.5
    pval6=tmp1.3-tmp2.5
    v2.7=pval6*5.0
    v2=v2.7

Tile4
    v1.2=v1
    pval2=seed.0*v1.2
    tmp1.3=pval2+2.0
    tmp1=tmp1.3
    pval7=tmp1.3+tmp2.5
    v1.8=pval7*3.0
    v1=v1.8
    v0.9=tmp0.1-v1.8
    v0=v0.9
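A placement decides which values must travel between tiles. The Python sketch below lists those transfers for this example, assuming the tile assignment implied by the per-tile processor code on the routing slide. PLACEMENT and cross_tile_transfers are illustrative names, not part of the Raw compiler.

```python
import re

# Tile assignment assumed from the per-tile processor code on the routing slide.
PLACEMENT = {
    "Tile1": ["seed.0=seed", "pval1=seed.0*3.0", "pval0=pval1+2.0",
              "tmp0.1=pval0/2.0", "tmp0=tmp0.1"],
    "Tile2": ["pval5=seed.0*6.0", "pval4=pval5+2.0", "tmp3.6=pval4/3.0",
              "tmp3=tmp3.6", "v3.10=tmp3.6-v2.7", "v3=v3.10"],
    "Tile3": ["v2.4=v2", "pval3=seed.0*v2.4", "tmp2.5=pval3+2.0", "tmp2=tmp2.5",
              "pval6=tmp1.3-tmp2.5", "v2.7=pval6*5.0", "v2=v2.7"],
    "Tile4": ["v1.2=v1", "pval2=seed.0*v1.2", "tmp1.3=pval2+2.0", "tmp1=tmp1.3",
              "pval7=tmp1.3+tmp2.5", "v1.8=pval7*3.0", "v1=v1.8",
              "v0.9=tmp0.1-v1.8", "v0=v0.9"],
}

# A value name starts with a letter, so numeric literals such as 3.0 are skipped.
NAME = re.compile(r"[A-Za-z_][\w.]*")


def cross_tile_transfers(placement):
    """Return the set of (value, producing tile, consuming tile) triples."""
    home = {}  # value name -> tile that defines it
    for tile, statements in placement.items():
        for statement in statements:
            home[statement.split("=")[0]] = tile
    transfers = set()
    for tile, statements in placement.items():
        for statement in statements:
            for operand in NAME.findall(statement.split("=")[1]):
                source = home.get(operand)  # None for program inputs such as seed
                if source is not None and source != tile:
                    transfers.add((operand, source, tile))
    return transfers


for value, source, target in sorted(cross_tile_transfers(PLACEMENT)):
    print(f"{value}: {source} -> {target}")
```

Five distinct values cross tile boundaries (seed.0, tmp0.1, tmp1.3, tmp2.5 and v2.7), which matches the five send() calls in the processor code on the routing slide; seed.0 alone goes to three consumers.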

Page 22

Routing

Tile 1
  Processor code
    seed.0=seed
    send(seed.0)
    pval1=seed.0*3.0
    pval0=pval1+2.0
    tmp0.1=pval0/2.0
    send(tmp0.1)
    tmp0=tmp0.1
  Switch code
    route(t,E,S)
    route(t,E)

Tile 2
  Processor code
    seed.0=recv()
    pval5=seed.0*6.0
    pval4=pval5+2.0
    tmp3.6=pval4/3.0
    tmp3=tmp3.6
    v2.7=recv()
    v3.10=tmp3.6-v2.7
    v3=v3.10
  Switch code
    route(W,S,t)
    route(W,S)
    route(S,t)

Tile 3
  Processor code
    v2.4=v2
    seed.0=recv()
    pval3=seed.0*v2.4
    tmp2.5=pval3+2.0
    tmp2=tmp2.5
    send(tmp2.5)
    tmp1.3=recv()
    pval6=tmp1.3-tmp2.5
    v2.7=pval6*5.0
    send(v2.7)
    v2=v2.7
  Switch code
    route(N,t)
    route(t,E)
    route(E,t)
    route(t,E)

Tile 4
  Processor code
    v1.2=v1
    seed.0=recv()
    pval2=seed.0*v1.2
    tmp1.3=pval2+2.0
    send(tmp1.3)
    tmp1=tmp1.3
    tmp2.5=recv()
    pval7=tmp1.3+tmp2.5
    v1.8=pval7*3.0
    v1=v1.8
    tmp0.1=recv()
    v0.9=tmp0.1-v1.8
    v0=v0.9
  Switch code
    route(N,t)
    route(t,W)
    route(W,t)
    route(N,t)
    route(W,N)
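The processor and switch code on this slide can be checked with a small simulation. The Python sketch below is a simplified model and not the Raw toolchain. It assumes a 2x2 mesh with Tile 1 in the north-west, Tile 2 north-east, Tile 3 south-west and Tile 4 south-east, which is the arrangement the route directions imply, and it uses unbounded FIFOs. Each processor is written as a generator that yields ("send", x) or ("recv",); all helper names are hypothetical.

```python
from collections import deque

# 2x2 mesh inferred from the route directions: Tile1 NW, Tile2 NE, Tile3 SW, Tile4 SE.
NEIGHBOR = {(1, "E"): 2, (1, "S"): 3, (2, "W"): 1, (2, "S"): 4,
            (3, "N"): 1, (3, "E"): 4, (4, "N"): 2, (4, "W"): 3}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

# Switch code per tile as (source, destinations...); "t" is the tile's own processor.
SWITCH = {
    1: [("t", "E", "S"), ("t", "E")],
    2: [("W", "S", "t"), ("W", "S"), ("S", "t")],
    3: [("N", "t"), ("t", "E"), ("E", "t"), ("t", "E")],
    4: [("N", "t"), ("t", "W"), ("W", "t"), ("N", "t"), ("W", "N")],
}


def tile1(mem):
    seed = mem["seed"]
    yield ("send", seed)
    tmp0 = (seed * 3.0 + 2.0) / 2.0
    yield ("send", tmp0)
    mem["tmp0"] = tmp0


def tile2(mem):
    seed = yield ("recv",)
    mem["tmp3"] = (seed * 6.0 + 2.0) / 3.0
    v2_7 = yield ("recv",)
    mem["v3"] = mem["tmp3"] - v2_7


def tile3(mem):
    v2_4 = mem["v2"]
    seed = yield ("recv",)
    mem["tmp2"] = seed * v2_4 + 2.0
    yield ("send", mem["tmp2"])
    tmp1 = yield ("recv",)
    v2_7 = (tmp1 - mem["tmp2"]) * 5.0
    yield ("send", v2_7)
    mem["v2"] = v2_7


def tile4(mem):
    v1_2 = mem["v1"]
    seed = yield ("recv",)
    mem["tmp1"] = seed * v1_2 + 2.0
    yield ("send", mem["tmp1"])
    tmp2 = yield ("recv",)
    v1_8 = (mem["tmp1"] + tmp2) * 3.0
    mem["v1"] = v1_8
    tmp0 = yield ("recv",)
    mem["v0"] = tmp0 - v1_8


def run(mem):
    to_switch = {t: deque() for t in SWITCH}   # processor -> switch
    to_proc = {t: deque() for t in SWITCH}     # switch -> processor
    link = {key: deque() for key in NEIGHBOR}  # words arriving at (tile, from direction)
    procs = {1: tile1(mem), 2: tile2(mem), 3: tile3(mem), 4: tile4(mem)}
    pending = {t: next(g) for t, g in procs.items()}  # current request of each processor
    pc = {t: 0 for t in SWITCH}                # switch program counters
    progress = True
    while progress:  # stop once no processor or switch can advance
        progress = False
        for t in SWITCH:
            req = pending[t]
            if req is not None and (req[0] == "send" or to_proc[t]):
                try:
                    if req[0] == "send":
                        to_switch[t].append(req[1])
                        pending[t] = next(procs[t])
                    else:
                        pending[t] = procs[t].send(to_proc[t].popleft())
                except StopIteration:
                    pending[t] = None  # this processor has finished its code
                progress = True
            if pc[t] < len(SWITCH[t]):
                src, *dsts = SWITCH[t][pc[t]]
                source = to_switch[t] if src == "t" else link[(t, src)]
                if source:  # a route fires only when its source word has arrived
                    word = source.popleft()
                    for d in dsts:
                        if d == "t":
                            to_proc[t].append(word)
                        else:
                            link[(NEIGHBOR[(t, d)], OPPOSITE[d])].append(word)
                    pc[t] += 1
                    progress = True
    return mem


print(run({"seed": 1.0, "v1": 2.0, "v2": 3.0}))
```

With seed = 1.0, v1 = 2.0 and v2 = 3.0, working the sequential program by hand gives v0 = -24.5, v1 = 27.0 and v2 = -5.0, and the simulated tiles should agree. Note that v2.7 reaches Tile 2 only because Tile 4's switch forwards it north with route(W,N); Tile 4's processor never touches it.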

Page 23

Instruction Scheduling

[Figure: space-time schedule with one column per tile (Tile1, Tile3, Tile4, Tile2) and a vertical time axis. Each column holds that tile's processor code and switch code from the routing step, placed at the time it executes.]

Page 24

StreamIt: Stream Language and Compiler

class BeamFormer extends Pipeline {
  void init(int numChannels, int numBeams) {
    add(new SplitJoin() {
      void init() {
        setSplitter(Duplicate());
        for (int i=0; i<numChannels; i++) {
          add(new FIR1(N1));
          add(new FIR2(N2));
        }
        setJoiner(RoundRobin());
      }
    });
    add(new SplitJoin() {
      void init() {
        setSplitter(Duplicate());
        for (int i=0; i<numBeams; i++) {
          add(new VectorMult());
          add(new FIR3(N3));
          add(new Magnitude());
          add(new Detect());
        }
        setJoiner(Null());
      }
    });
  }
}

[Figure: stream graph of the BeamFormer. A Splitter feeds parallel chains of two FIR filters each (twelve drawn) into a Joiner, followed by a second Splitter feeding four parallel chains of Vector Mult, FIR filter, Magnitude and Detector into a Joiner.]

e.g., BeamFormer, Amarasinghe et al.
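The structure of the BeamFormer program (a pipeline of two split-joins, each with a duplicating splitter) can be mimicked in a few lines of Python. This is only a sketch of the composition pattern and not StreamIt itself: the filter taps, beam weights and threshold are made-up values, and unlike the Null joiner in the slide, the second stage here returns its detections so they can be inspected.

```python
from collections import deque


class FIR:
    """Stateful FIR filter: y[n] = sum over k of taps[k] * x[n-k]."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.history = deque([0.0] * len(self.taps), maxlen=len(self.taps))

    def __call__(self, x):
        self.history.appendleft(x)  # newest sample first; the oldest falls off the end
        return sum(t * h for t, h in zip(self.taps, self.history))


def pipeline(*stages):
    """Run stages back to back, in the spirit of StreamIt's Pipeline."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run


def duplicate_splitjoin(branches):
    """Duplicate splitter: every branch sees the same item.
    Collecting the branch outputs in order stands in for the RoundRobin joiner."""
    def run(x):
        return [branch(x) for branch in branches]
    return run


def make_beamformer(num_channels, beam_weights, threshold):
    # Taps, weights and threshold are illustrative assumptions.
    channels = duplicate_splitjoin(
        [pipeline(FIR([0.5, 0.5]), FIR([1.0])) for _ in range(num_channels)])
    beams = duplicate_splitjoin([
        pipeline(
            lambda v, w=w: sum(a * b for a, b in zip(v, w)),  # VectorMult
            FIR([1.0]),                                       # FIR3
            abs,                                              # Magnitude
            lambda m: m > threshold,                          # Detect
        )
        for w in beam_weights
    ])
    return pipeline(channels, beams)


beamformer = make_beamformer(2, [[1.0, 1.0], [1.0, -1.0]], threshold=0.5)
print(beamformer(1.0))
```

Because every branch is an independent filter chain with its own state, a compiler is free to place branches on different tiles and connect them with routed wires, which is what the hand layout on the next slide does.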

Page 25

Raw Beamformer Layout (by hand)

[Figure: tile layout of the beamformer, showing twelve tiles marked mem and +, and regions labeled Search, FRM, FIR and Control.]

Page 26

Raw die photo

0.18 micron process, 16 tiles, 425 MHz, 18 Watts (vpenta)

Of course, a custom IC designed by an industrial design team could do much better

Page 27

Raw motherboard

Page 28

More Experimental Systems

Systems Online or in Pipeline
  Raw Workstation
  Raw-based 1020 Microphone Array
  Raw 802.11a/g wireless system (collab with Engim)
  Raw Gigabit IP router
  Raw graphics system
  Raw supercomputing fabric

Page 29

Empirical Evaluation

Compare to P3 implemented in similar technology

Parameter       Raw (IBM ASIC)              P3 (Intel)
Litho           180 nm                      180 nm
Process         CMOS 7SF                    P858
Metal Layers    Cu 6                        Al 6
FO1 Delay       23 ps                       11 ps
Dielectric k    4.1                         3.55
Design Style    Standard Cell SA27E ASIC    Full custom
Initial Freq    425 MHz                     500-733 MHz (use 600)
Die Area        331 mm²                     106 mm²

Raw numbers come from a cycle-accurate simulator validated against the real chip
-- FPGA memory controller in Raw
-- Raw software i-caching adds 0-30% overhead
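For scale, the table's frequency and area figures can be turned into ratios. The short Python calculation below uses only numbers from the slides (the 16-tile count comes from the die photo slide); the per-tile area is a rough upper bound, since it divides the whole die by the tile count, I/O and all.

```python
# Figures taken from the comparison table and the die photo slide.
raw = {"freq_mhz": 425, "die_mm2": 331, "tiles": 16}
p3 = {"freq_mhz": 600, "die_mm2": 106}

clock_ratio = p3["freq_mhz"] / raw["freq_mhz"]  # P3 clock relative to Raw
area_ratio = raw["die_mm2"] / p3["die_mm2"]     # Raw die relative to P3
area_per_tile = raw["die_mm2"] / raw["tiles"]   # whole die divided by tile count

print(f"{clock_ratio:.2f} {area_ratio:.2f} {area_per_tile:.1f}")  # prints 1.41 3.12 20.7
```

So any speedup Raw reports in cycles has to overcome a clock that is about 1.4 times slower, while it has roughly three times the P3's silicon to work with.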

Page 30

Performance Results

[Figure: plot labeled “Performance” and “Architecture Space”] [ISCA ’04]

~10x parallelism
~4x ld/store elim
~4x stream mem bw

Page 31

Raw IP Router

[Figure: chart with axis label Gb/sec]

Page 32

VersaBench

Sharing a benchmark set to stress versatility of processors

Categories of programs:
  ILP - Desktop and Scientific
  Streams
  Throughput oriented servers
  Bit-level embedded

www.cag.csail.mit.edu/versabench

Page 33

Summary

Raw: single chip for ILP and streaming

Scalar operand network is key to ILP and streams

Can enable the APU

www.cag.csail.mit.edu/raw
