Page 1
NAX: Near-Data Approximate Computing
Amir Yazdanbakhsh, Jacob Sacks, Choungki Song (1), Hadi Esmaeilzadeh, Pejman Lotfi-Kamran (2), Nam Sung Kim (3)
Georgia Institute of Technology
(1) University of Wisconsin-Madison
(2) The Institute for Research in Fundamental Sciences
(3) University of Illinois at Urbana-Champaign
Page 2
Approximate Computing: Embracing Imprecision
Relax the abstraction of "near perfect" accuracy in data processing, storage, and communication.
Accept imprecision to improve performance, energy dissipation, and resource utilization efficiency.
Page 3
Virtual Reality
Data Analytics
Machine Learning
Multimedia Processing
[Figure: 4x4 grid of SM blocks, labeled GPU and NGPU]
Page 4
Diverse classes of GPU applications are amenable to "approximation".
Page 5
Neural Transformation for GPUs
[Figure: two blocks labeled Neural Network]
Page 6
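Neural Network Operations
[Figure: neuron j with inputs x_{j,0} ... x_{j,i} ... x_{j,n}, weights w_{j,0} ... w_{j,i} ... w_{j,n}, and output y_j]
$$y_j = \mathrm{sigmoid}\left(w_{j,0} \times x_{j,0} + \dots + w_{j,i} \times x_{j,i} + \dots + w_{j,n} \times x_{j,n}\right)$$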
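The neuron operation on this slide can be sketched in a few lines of Python. This is only an illustration of the formula, not the accelerator's implementation; the function names `sigmoid` and `neuron_output` are made up for this example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, the activation named on the slide."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, inputs):
    """Compute y_j = sigmoid(sum_i w_{j,i} * x_{j,i}) for one neuron j."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    # Multiply-accumulate over all inputs of the neuron.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)


if __name__ == "__main__":
    # Hypothetical weights and inputs, chosen only to exercise the function.
    print(neuron_output([0.5, -0.25, 1.0], [1.0, 2.0, 0.5]))
```

Each neuron therefore needs one multiply-add per input plus one sigmoid evaluation.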
Page 7
Runtime Breakdown of Baseline GPU
[Figure: stacked bar chart of normalized runtime (0% to 100%), split into Data Processing and Data Communication]
Amir Yazdanbakhsh, et al., "Neural Acceleration for GPU Throughput Processors", MICRO 2015.
Page 8
Runtime Breakdown of NGPU
[Figure: stacked bar chart of normalized runtime (0% to 100%), split into Data Processing and Data Communication; callout: 45%]
Page 9
In-DRAM Computing Challenges
DRAM is cost-sensitive!
Page 10
In-DRAM Computing Challenges
DRAM is under a power constraint!
Page 11
In-DRAM Computing Challenges
[Figure: 8x8 grid of GPU cores]
GPU is SIMD!
Page 12
Near-Data Approximate Computing
[Figure: streaming multiprocessors (SM), interconnection network, memory partition (L2 cache, memory controller), and DRAM with blocks A to P, DRAM logic, accelerator logic, and In-DRAM Ctrl]
Page 13
Near-Data Approximate Computing
[Figure: DRAM internals with blocks A, B, C, D, I/O, S/A, COLDEC, RD, IOCNT, and bitlines; four half-banks; accelerator logic made of Weight Register, Arithmetic Unit, and Sigmoid LUT on the Read Data / Write Data path]
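The figure names a Sigmoid LUT next to each arithmetic unit. As a rough illustration of how a lookup table can stand in for the sigmoid function, here is a minimal Python sketch. The table size (256 entries) and the input range ([-8, 8]) are assumptions made for this example; the slides do not state them.

```python
import math

LUT_SIZE = 256                     # assumed table size, not from the slides
INPUT_MIN, INPUT_MAX = -8.0, 8.0   # assumed input range, not from the slides


def build_sigmoid_lut(size=LUT_SIZE, lo=INPUT_MIN, hi=INPUT_MAX):
    """Precompute sigmoid at evenly spaced points between lo and hi."""
    step = (hi - lo) / (size - 1)
    return [1.0 / (1.0 + math.exp(-(lo + i * step))) for i in range(size)]


def lut_sigmoid(z, lut, lo=INPUT_MIN, hi=INPUT_MAX):
    """Approximate sigmoid(z) by reading the nearest table entry."""
    z = min(max(z, lo), hi)  # clamp to the tabulated range
    index = round((z - lo) / (hi - lo) * (len(lut) - 1))
    return lut[index]


SIGMOID_LUT = build_sigmoid_lut()

if __name__ == "__main__":
    for z in (-2.0, 0.0, 1.0, 20.0):
        print(z, lut_sigmoid(z, SIGMOID_LUT))
```

A table lookup trades a small approximation error for not having to evaluate an exponential, which fits the approximate-computing theme of the talk.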
Page 14
NAX Execution Flow
[Figure: GPU memory-system diagram (SM, interconnection network, memory partition with L2 cache and memory controller, DRAM blocks A to P, DRAM logic, accelerator logic, In-DRAM Ctrl); step 1 marked]
Page 15
NAX Execution Flow
[Figure: GPU memory-system diagram (SM, interconnection network, memory partition with L2 cache and memory controller, DRAM blocks A to P, DRAM logic, accelerator logic, In-DRAM Ctrl); step 2 marked]
Page 16
NAX Execution Flow
[Figure: GPU memory-system diagram (SM, interconnection network, memory partition with L2 cache and memory controller, DRAM blocks A to P, DRAM logic, accelerator logic, In-DRAM Ctrl); step 3 marked]
Page 17
NAX Execution Flow
[Figure: GPU memory-system diagram (SM, interconnection network, memory partition with L2 cache and memory controller, DRAM blocks A to P, DRAM logic, accelerator logic, In-DRAM Ctrl); step 4 marked]
Page 18
NAX Execution Flow
[Figure: GPU memory-system diagram (SM, interconnection network, memory partition with L2 cache and memory controller, DRAM blocks A to P, DRAM logic, accelerator logic, In-DRAM Ctrl); step 5 marked]
Page 19
NAX Execution Flow
[Figure: GPU memory-system diagram (SM, interconnection network, memory partition with L2 cache and memory controller, DRAM blocks A to P, DRAM logic, accelerator logic, In-DRAM Ctrl); step 6 marked]
Page 20
NAX Microarchitectures
Floating Point
Fixed Point
[Figure: datapath with input register, shifter, shift register, adder (+), output register, controller, and LUT; input Xi]
S00 = (00110)_2
S01 = (00100)_2
S02 = (00011)_2
S03 = (00001)_2
Page 21
Simplification of Integrated Arithmetic
[Figure: datapath with input register, shifter, shift register, adder (+), output register, controller, and LUT; input Xi]
S00 = (00110)_2
S01 = (00100)_2
S02 = (00011)_2
S03 = (00001)_2
Wi = (01011010)_2 = (90)_10
Xi = (01111101)_2 = (125)_10
Yi = Xi × Wi = (11,250)_10
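The four LUT entries on this slide, read as binary numbers, are 6, 4, 3, and 1, which are exactly the positions of the set bits of Wi = 90 = 2^6 + 2^4 + 2^3 + 2^1. The sketch below shows one way such shift amounts could be derived from a weight. It is an illustration under that reading of the slide; the function names and the 5-bit entry width (taken from the width of the printed entries) are assumptions for this example.

```python
def weight_to_shift_amounts(weight):
    """Return the positions of the set bits of a non-negative integer weight,
    most significant first."""
    if weight < 0:
        raise ValueError("this sketch handles non-negative weights only")
    return [bit for bit in range(weight.bit_length() - 1, -1, -1)
            if (weight >> bit) & 1]


def encode_shift_amounts(shifts, width=5):
    """Format each shift amount as a fixed-width binary string."""
    return [format(s, f"0{width}b") for s in shifts]


if __name__ == "__main__":
    shifts = weight_to_shift_amounts(0b01011010)   # Wi = 90
    print(shifts)                                   # [6, 4, 3, 1]
    print(encode_shift_amounts(shifts))             # ['00110', '00100', '00011', '00001']
```

Storing the shift amounts most significant first means the largest contribution to the product is applied first.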
Page 22
Simplification of Integrated Arithmetic
[Figure: shift-add datapath; highlighted values: 6 (from S00 = (00110)_2) and 0]
Page 23
Simplification of Integrated Arithmetic
[Figure: shift-add datapath; highlighted value: 4 (from S01 = (00100)_2)]
Page 24
Simplification of Integrated Arithmetic
[Figure: shift-add datapath; highlighted value: 3 (from S02 = (00011)_2)]
Page 25
Simplification of Integrated Arithmetic
[Figure: shift-add datapath; highlighted value: 1 (from S03 = (00001)_2)]
Page 26
Simplification of Integrated Arithmetic
[Figure: shift-add datapath (input register, shifter, shift register, adder, output register, controller, LUT) with input Xi]
Iteration 1
S00 = (00110)_2, S01 = (00100)_2, S02 = (00011)_2, S03 = (00001)_2
Wi = (01011010)_2 = (90)_10
Xi = (01111101)_2 = (125)_10
Yi = Xi × Wi = (11,250)_10
Xi << 6 = (8000)_10
T1 = (Xi << 6) + 0 = (8000)_10
Error = 28.9%
Page 27
Simplification of Integrated Arithmetic
[Figure: shift-add datapath (input register, shifter, shift register, adder, output register, controller, LUT) with input Xi]
Iteration 2
S01 = (00100)_2, S02 = (00011)_2, S03 = (00001)_2
Wi = (01011010)_2 = (90)_10
Xi = (01111101)_2 = (125)_10
Yi = Xi × Wi = (11,250)_10
Xi << 4 = (2000)_10
T2 = (Xi << 4) + T1 = (10000)_10
Error = 11.1%
Page 28
Simplification of Integrated Arithmetic
[Figure: shift-add datapath (input register, shifter, shift register, adder, output register, controller, LUT) with input Xi]
Iteration 3
S02 = (00011)_2, S03 = (00001)_2
Wi = (01011010)_2 = (90)_10
Xi = (01111101)_2 = (125)_10
Yi = Xi × Wi = (11,250)_10
Xi << 3 = (1000)_10
T3 = (Xi << 3) + T2 = (11000)_10
Error = 2.2%
Page 29
Simplification of Integrated Arithmetic
[Figure: shift-add datapath (input register, shifter, shift register, adder, output register, controller, LUT) with input Xi]
Iteration 4
S03 = (00001)_2
Wi = (01011010)_2 = (90)_10
Xi = (01111101)_2 = (125)_10
Yi = Xi × Wi = (11,250)_10
Xi << 1 = (250)_10
T4 = (Xi << 1) + T3 = (11250)_10
Error = 0.0%
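The four iterations can be replayed in software. The sketch below accumulates Xi shifted by each stored amount and reports the partial sum and its relative error against the exact product after every step. It is a plain Python illustration of the arithmetic on these slides, not a model of the hardware; `shift_add_trace` is a name made up for this example.

```python
def shift_add_trace(x, shift_amounts):
    """Accumulate x << s for each shift amount in order and return a list of
    (partial_sum, relative_error_percent) pairs, one per iteration."""
    exact = sum(x << s for s in shift_amounts)
    if exact == 0:
        raise ValueError("exact product is zero; relative error is undefined")
    partial = 0
    trace = []
    for s in shift_amounts:
        partial += x << s                      # one shift and one add per iteration
        error_pct = 100.0 * (exact - partial) / exact
        trace.append((partial, round(error_pct, 1)))
    return trace


if __name__ == "__main__":
    # Xi = 125 and shift amounts 6, 4, 3, 1 (Wi = 90), as on the slides.
    for step, (partial, err) in enumerate(shift_add_trace(125, [6, 4, 3, 1]), start=1):
        print(f"T{step} = {partial}, error = {err}%")
```

The partial sums come out as 8000, 10000, 11000, and 11250. Stopping after fewer iterations gives a cheaper but less accurate product, which is the trade-off these slides walk through.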
Page 30
Experimental Setup
Power Model
• Technology node: 40 nm (3 metal layers)
• Synopsys, Cadence
• GPUWattch, McPAT and CACTI, Verilog
GPU Simulator
• GPGPU-Sim cycle-level simulator
• Fermi-based GTX 480, shader core frequency 1.4 GHz
• NVCC compiler, -O3
Machine Learning, Finance, Vision, 3D Gaming, Medical Imaging, Numerical Analysis, Image Processing
Page 31
NAX Speedup Compared to NGPU
[Figure: bar chart of speedup (0.0 to 3.5) for NAX-AFxP, NAX-FxP, and NAX-FP; callouts: 2.0x, 2.0x, 1.2x]
NAX-AFxP provides 1.2x speedup compared to NGPU.
Page 32
NAX Energy Saving Compared to NGPU
[Figure: bar chart of energy saving (0.0 to 8.0) for NAX-AFxP, NAX-FxP, and NAX-FP; callout: 4.8x]
NAX-AFxP provides 4.8x energy saving compared to NGPU.
Page 33
DRAM System Power
[Figure: bar chart of DRAM system power increase (0.0 to 3.0, lower is better) for NAX-AFxP, NAX-FxP, and NAX-FP; callout: 0.7x]
NAX-AFxP yields 0.7x lower DRAM system power.
Page 34
Application Quality Loss
[Figure: bar chart of quality loss (0% to 100%) for NAX-AFxP, NAX-FxP, and NAX-FP]
Quality loss is below 10% in all applications except one.
Page 35
NAX: Near-Data Approximate Computing
Benefits over NGPU
• 4.8x energy saving
• 1.2x speedup
Overhead
• 2% area overhead per DRAM chip
• ≤ 10% quality loss
• 0.7x DRAM system power