
NAX: Near-Data Approximate Computing

Amir Yazdanbakhsh, Jacob Sacks, Choungki Song¹, Hadi Esmaeilzadeh, Pejman Lotfi-Kamran², Nam Sung-Kim³

Georgia Institute of Technology
¹ University of Wisconsin-Madison
² The Institute for Research in Fundamental Sciences
³ University of Illinois at Urbana-Champaign

Approximate Computing: Embracing Imprecision

Relax the abstraction of "near-perfect" accuracy in data processing, storage, and communication.

Accept imprecision to improve performance, energy dissipation, and resource utilization efficiency.

Virtual Reality, Data Analytics, Machine Learning, Multimedia Processing

[Figure: GPU and NGPU, each drawn as a 4x4 grid of streaming multiprocessors (SMs)]

Diverse classes of GPU applications are amenable to "approximation".

Neural Transformation for GPUs

[Figure: two blocks labeled Neural Network]

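Neural Network Operations

[Figure: neuron j with inputs x_{j,0}, ..., x_{j,i}, ..., x_{j,n}, weights w_{j,0}, ..., w_{j,i}, ..., w_{j,n}, and output y_j]

$$ y_j = \mathrm{sigmoid}\left( w_{j,0} \times x_{j,0} + \dots + w_{j,i} \times x_{j,i} + \dots + w_{j,n} \times x_{j,n} \right) $$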
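The neuron computation above can be sketched in a few lines of Python. This is a minimal sketch, assuming that "sigmoid" on the slide means the logistic function 1 / (1 + e^-z); the function names and the example values are hypothetical and do not come from the slides.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + e^-z) (assumed meaning of 'sigmoid')."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs):
    """Compute y_j = sigmoid(sum_i w_{j,i} * x_{j,i}) for one neuron j."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    # Multiply-accumulate over all inputs of the neuron.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)


# Hypothetical example values, chosen so that the weighted sum is zero.
example_weights = [0.5, -0.25, 1.0]
example_inputs = [1.0, 2.0, 0.0]
y = neuron_output(example_weights, example_inputs)
print(y)
```

The multiply-accumulate loop and the sigmoid are the two operations per neuron that the rest of the deck maps onto hardware.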

Runtime Breakdown of Baseline GPU

[Figure: normalized runtime per application, 0% to 100%, split into data processing and data communication]

Amir Yazdanbakhsh, et al., "Neural Acceleration for GPU Throughput Processors", MICRO 2015.

Runtime Breakdown of NGPU

[Figure: normalized runtime per application, split into data processing and data communication; callout: 45%]

Amir Yazdanbakhsh, et al., "Neural Acceleration for GPU Throughput Processors", MICRO 2015.

In-DRAM Computing Challenges

DRAM is cost-sensitive!

DRAM is under a power constraint!

GPU is SIMD!

[Figure: GPU memory system with streaming multiprocessors (SMs) and their cores, an interconnection network, and memory partitions (L2 cache, memory controller) connected to DRAM banks A through P; legend: DRAM logic, accelerator logic]

Near-Data Approximate Computing

[Figure: the GPU memory system (SMs, interconnection network, L2 cache, memory controller, memory partition, DRAM banks A through P) extended with an in-DRAM controller (In-DRAM Ctrl). Labels in the DRAM detail view: banks A through D, I/O, S/A, COLDEC, RD, IOCNT, bitline, four half-banks, arithmetic unit, sigmoid LUT, weight register, read data, write data; legend: DRAM logic, accelerator logic]
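The figure lists a sigmoid LUT next to the arithmetic unit. One common way to evaluate a sigmoid by table lookup is sketched below. The table size (64 entries), the input range [-8, 8), and the midpoint sampling are assumptions chosen for illustration; they are not parameters taken from the slides.

```python
import math

# Assumed parameters (not from the slides): 64 entries covering [-8, 8).
LUT_SIZE = 64
Z_MIN, Z_MAX = -8.0, 8.0
STEP = (Z_MAX - Z_MIN) / LUT_SIZE

# Precompute the sigmoid at the midpoint of each input bin.
SIGMOID_LUT = [
    1.0 / (1.0 + math.exp(-(Z_MIN + (i + 0.5) * STEP)))
    for i in range(LUT_SIZE)
]


def sigmoid_lut(z):
    """Approximate sigmoid(z) by table lookup, saturating outside the range."""
    if z < Z_MIN:
        return 0.0
    if z >= Z_MAX:
        return 1.0
    # Map z to its bin; clamp to guard against floating-point rounding.
    index = min(int((z - Z_MIN) / STEP), LUT_SIZE - 1)
    return SIGMOID_LUT[index]


print(sigmoid_lut(0.3), 1.0 / (1.0 + math.exp(-0.3)))
```

A larger table or interpolation between entries would reduce the approximation error at the cost of more storage or logic.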

NAX Execution Flow

[Figure sequence: steps 1 through 6 of the execution flow, drawn over the GPU memory system diagram (SMs, interconnection network, L2 cache, memory controller, memory partition, In-DRAM Ctrl, DRAM banks A through P); legend: DRAM logic, accelerator logic]

NAX Microarchitectures

Variants: Floating Point, Fixed Point

[Figure: arithmetic unit datapath with an input register, shifter, shift register, adder (+), output register, controller, and LUT; input Xi; shift amounts S00 = (00110)2, S01 = (00100)2, S02 = (00011)2, S03 = (00001)2]

Simplification of Integrated Arithmetic

[Figure: arithmetic unit datapath with an input register, adder (+), output register, controller, and LUT; input Xi]

Wi = (01011010)2 = (90)10
Xi = (01111101)2 = (125)10
Yi = Xi × Wi = (11,250)10

Shift amounts, one per set bit of Wi (90 = 2^6 + 2^4 + 2^3 + 2^1):
S00 = (00110)2 = 6
S01 = (00100)2 = 4
S02 = (00011)2 = 3
S03 = (00001)2 = 1

The accumulator starts at 0, and the shift amounts are applied in the order 6, 4, 3, 1.


Shift-and-add iterations for Wi = (90)10, Xi = (125)10, Yi = Xi × Wi = (11,250)10:

Iteration 1 (S00 = 6): T1 = (Xi << 6) + 0 = (8000)10, Error = 28.9%
Iteration 2 (S01 = 4): T2 = (Xi << 4) + T1 = (2000)10 + (8000)10 = (10000)10, Error = 11.1%
Iteration 3 (S02 = 3): T3 = (Xi << 3) + T2 = (1000)10 + (10000)10 = (11000)10, Error = 2.2%
Iteration 4 (S03 = 1): T4 = (Xi << 1) + T3 = (250)10 + (11000)10 = (11250)10, Error = 0.0%
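The iterations above amount to shift-and-add multiplication that can stop early. The following is a minimal sketch, assuming the shift amounts are the positions of the set bits of Wi taken from most to least significant (which matches S00 to S03 = 6, 4, 3, 1 for Wi = 90). The names `shift_amounts` and `approx_multiply` are hypothetical, and the code models only the arithmetic, not the hardware.

```python
def shift_amounts(weight):
    """Positions of the set bits of a non-negative weight, highest first."""
    if weight < 0:
        raise ValueError("this sketch handles non-negative weights only")
    return [
        bit
        for bit in range(weight.bit_length() - 1, -1, -1)
        if (weight >> bit) & 1
    ]


def approx_multiply(x, weight, iterations):
    """Accumulate x << s over the first `iterations` shift amounts.

    Returns the accumulated value and a trace of
    (shift amount, partial sum, relative error in percent).
    """
    exact = x * weight
    total = 0  # the accumulator starts at 0
    trace = []
    for s in shift_amounts(weight)[:iterations]:
        total += x << s
        error_pct = abs(exact - total) / exact * 100.0 if exact else 0.0
        trace.append((s, total, error_pct))
    return total, trace


# Operands from the slides: Xi = 125, Wi = 90.
result, steps = approx_multiply(125, 90, 4)
for s, partial, err in steps:
    print(f"shift {s}: partial sum {partial}, error {err:.1f}%")
```

With all four iterations the result is exact (11250); stopping after the first iteration returns 8000. Fewer iterations trade accuracy for fewer shift-and-add steps.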

Experimental Setup

GPU Simulator
• GPGPU-Sim cycle-level simulator
• Fermi-based GTX 480, shader core frequency 1.4 GHz
• NVCC compiler, -O3

Power Model
• Technology node: 40 nm (3 metal layers)
• Synopsys, Cadence
• GPUWattch, McPAT and CACTI, Verilog

Application domains: Machine Learning, Finance, Vision, 3D Gaming, Medical Imaging, Numerical Analysis, Image Processing

NAX Speedup Compared to NGPU

[Figure: speedup per application (axis 0.0 to 3.5) for NAX-FP, NAX-FxP, and NAX-AFxP; callouts: 2.0x, 2.0x, 1.2x]

NAX-AFxP provides a 1.2x speedup compared to NGPU.

NAX Energy Saving Compared to NGPU

[Figure: energy saving per application (axis 0.0 to 8.0) for NAX-FP, NAX-FxP, and NAX-AFxP; callout: 4.8x]

NAX-AFxP provides a 4.8x energy saving compared to NGPU.

DRAM System Power

[Figure: DRAM system power increase per application; lower is better; callout: 0.7x]

NAX-AFxP yields a 0.7x lower DRAM system power.

Application Quality Loss

[Figure: quality loss per application]

Quality loss is below 10% in all applications except one.

NAX: Near-Data Approximate Computing

Benefits over NGPU
• 4.8X Energy Saving
• 1.2X Speedup

Overhead
• 2% Area Overhead per DRAM Chip
• ≤ 10% Quality Loss
• 0.7X DRAM System Power
