Energy-Precision Tradeoffs in the Graphics Pipeline
Jeff Pool
March 19th, 2012
2
Motivation
Why energy?
It matters everywhere:
- Mobile devices
- Desktop computers
- Servers, data centers
It’s a bottleneck to performance!
http://img717.imageshack.us/img717/3936/1101771coolitomni.jpg
http://www.ornl.gov/ornlhome/images/casl/TVA%20Watts%20Bar.jpg
3
Motivation
Why precision?
Sign Exponent Mantissa
IEEE 754-2008 Single-Precision Floating-Point Representation
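The three fields named on this slide can be pulled out of a value directly. The snippet below is a minimal sketch using Python's standard `struct` module; the function name `fp32_fields` is made up for illustration.

```python
import struct

def fp32_fields(x: float):
    """Return (sign, biased exponent, stored mantissa) of x rounded to FP32."""
    # Reinterpret the 4 bytes of the single-precision value as an unsigned int.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 stored bits; the leading 1 is implicit
    return sign, exponent, mantissa

# -6.5 = -1.625 * 2^2, so the biased exponent is 127 + 2 = 129.
print(fp32_fields(-6.5))
```

The 23 stored mantissa bits plus the implicit leading 1 are what give FP32 its 24 bits of precision.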
4
Don’t do Unnecessary Work
• Max precision isn’t needed:
– 8-10 bit color buffers
– FP32 => 24 bits of precision
– Potentially lots of wasted effort!
• It’s certainly more complicated, but worth exploring
5
My Approach
Variable-precision computations
- Reduce the precision when possible: 12.5 mantissa bits used
- Save energy in arithmetic: 70% less energy
- Low errors: 0.086% difference
[Images: Full-Precision Arithmetic vs. Reduced-Precision Arithmetic]
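To get a feel for what a reduced mantissa does to a value, precision reduction can be emulated in software. This is a rough sketch only: it assumes simple truncation of the low mantissa bits (the hardware in this work may round instead), and `reduce_precision` is a hypothetical helper name. NaN and infinity are not handled specially.

```python
import struct

def reduce_precision(x: float, mantissa_bits: int) -> float:
    """Keep `mantissa_bits` significant bits (counting the implicit 1) of x as FP32."""
    if not 1 <= mantissa_bits <= 24:
        raise ValueError("mantissa_bits must be in 1..24")
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 24 - mantissa_bits                   # low-order stored bits to clear
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF    # sign and exponent are untouched
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

full = reduce_precision(3.14159, 24)
low = reduce_precision(3.14159, 8)
print(full, low, abs(full - low) / full)        # value, truncated value, relative error
```

Truncating to p bits bounds the relative error of a single value by 2^-(p-1); errors over a whole shader depend on how they propagate through the arithmetic.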
6
My Approach
Communicate fewer bits
- Since fewer bits are used in computation
- Most DRAM traffic is already compressed
[Image: Crysis, 2007]
Variable-precision compression (on sample frame):
- Geometry improved by 12%
- Depth improved by 83%
7
Background
The Graphics Pipeline
[Figure: pipeline stages Vertex Shader, Rasterization, Pixel Shader; GPU Global Memory holds Texture Data and the Frame-Buffer]
8
GPUs: A Brief History
[Figure: GPU capability over time (NOT to scale!), from Fixed-Function to Programmability to GPGPU (CUDA, Stream, OpenCL). Labels: Shader Program, Compute Program, “1.53, 32.8, …”]
10
Thesis Statement
Reducing the work done in the modern graphics pipeline through novel communication and variable-precision computation techniques can enable a tradeoff between energy savings and image fidelity, leading to significant energy savings without perceptible loss of image quality.
11
How?
Proving this thesis:
– Show that induced errors are imperceptible
– Show significant energy savings
• Find energy consumed by entire pipeline
• Find energy savings possible in each stage
12
Roadmap
• My work
– Energy model
– Energy savings in computation
– Energy savings in communication
• Conclusions
• Future work
14
Why an Energy Model?
So I’ll know how much difference saving energy in different stages actually makes, and know where to focus
• Provides researchers/developers a tool to predict energy usage
Past work, compared on three criteria (Validated? Graphics? Simple?):
– Brooks et al., 2000
– Eisley et al., 2006
– Shaeffer et al., 2004
– Ramani et al., 2007
– Nagasaka et al., 2010
– Hong and Kim, 2010
– Zhang et al., 2011
15
Strategy
• Model construction
– Experimentally measure energy for each operation
• Energy prediction
– Profile a scene for operations performed
– Predict total energy consumption (dot product)
• Validation
– Compare prediction with measured energy
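The prediction step is a dot product between per-operation energies and the operation counts extracted from a profiled frame. The sketch below uses a few per-operation energies from the backup results slide; the operation counts are invented for illustration, and `predict_energy_nj` is a hypothetical name.

```python
# Per-operation energies in nJ, taken from the "Results - Programmable Units" slide.
ENERGY_NJ = {"add": 0.443, "mul": 0.357, "mad": 0.455, "local_load": 1.490}

def predict_energy_nj(op_counts, energy_per_op=ENERGY_NJ):
    """Dot product of operation counts with per-operation energy costs."""
    return sum(count * energy_per_op[op] for op, count in op_counts.items())

# Hypothetical profile of one draw call (counts are made up).
profile = {"add": 1000, "mul": 2000}
print(predict_energy_nj(profile))  # total in nJ
```

Validation then amounts to comparing this predicted total against the energy measured for the same frame.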
16
What Operations?
• Arithmetic
– ADD, MUL, SIN/COS, POW, LOG, …
• Memory
– Local/Global Load/Store
• Programmable
– Vertex/Pixel Shaders
• Fixed-function
– Rasterization, Texture filtering
Explicit
Implicit
17
Measuring Energy in the GPU
Explicit
• GPGPU
– Runs on same hardware as graphics
– No ambiguity in operations
• Simple microkernels
– Little/no overhead
– 10s runtime
– Directed tests per operation
Implicit
• OpenGL
• Enable/Disable operation in question
– Difference in energy is the operation’s contribution
– Not as straightforward
• Ex.: Texture filtering
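The implicit method comes down to a subtraction: run the same workload with the operation enabled and disabled, and attribute the difference to the operation. A minimal sketch, using the rasterization figures from the backup fixed-function results slide (the function name is hypothetical):

```python
def implicit_operation_cost(energy_enabled, energy_disabled):
    """Energy attributed to an operation: per-pixel energy with it on minus with it off."""
    return energy_enabled - energy_disabled

# Rasterization, in pJ per pixel (On = 404.6, Off = 166.4 on the backup slide).
cost_pj = implicit_operation_cost(404.6, 166.4)
print(round(cost_pj, 1))  # matches the 238.2 pJ/pixel rasterization cost reported there
```

This only isolates the operation cleanly if enabling it changes nothing else in the pipeline, which is why the slide calls the approach less straightforward.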
18
Experimental Setup
• NVIDIA 8300GS graphics card
• Adex Electronics’ PEX16LX PCI riser to interrupt power from motherboard
• Supply metered power to the card
– 12V
– 3.3V
– 12V (fan, not counted in energy)
• Log runtimes/framerates, measure current as tests run
http://www.pretaktovanie.sk/obr/spotreba/eng/PICTURES/P1010283_ENG.jpg
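Turning logged currents into energy is a matter of summing V x I over the two counted rails and integrating over time. The sketch below assumes evenly spaced current samples; the sampling interval and readings are invented, and `energy_joules` is a hypothetical helper, not the tooling used in this work.

```python
RAIL_VOLTS = (12.0, 3.3)  # the two counted supply rails; the fan's 12V rail is excluded

def energy_joules(current_samples, dt_s):
    """Integrate power over time.

    current_samples: iterable of (amps_on_12v, amps_on_3v3) readings,
    taken every dt_s seconds while a test runs.
    """
    return sum(
        (RAIL_VOLTS[0] * i12 + RAIL_VOLTS[1] * i33) * dt_s
        for i12, i33 in current_samples
    )

# Made-up readings: 1.0 A on 12V and 2.0 A on 3.3V for one second at 10 Hz.
print(energy_joules([(1.0, 2.0)] * 10, 0.1))
```

Dividing the energy of a directed microkernel by the number of operations it executes gives a per-operation figure of the kind reported on the results slides.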
19
Results
Operation                        Energy (nJ)
Arithmetic                       0.4 – 22.9
Memory
  Local load                     1.49
  Local store                    1.49
  Global load*                   8.39 – 67.40
  Global store*                  5.19 – 42.70
Rasterization (per pixel)        0.24
Texture filtering (per pixel)    7.0 – 13.8
*Depending on type of access
20
Profiling Operations Performed
• Use Microsoft’s PIX to log a frame of a running application:
– Framebuffer contents
– Vertex data
– Render states
– Vertex shaders
– Pixel shaders
– Per draw call (100-1000s per frame)
• From all this data, extract operations
21
Validation
• Three different applications, four scenes
– Real-world games to test the developed model
• Harvested data, predict energy usage
• Measured real energy usage, compare
Scenes:
– Half Life 2: Lost Coast (High/Low Rendering Qualities)
– Batman: Arkham Asylum
– Mass Effect
22
Validation Results
[Bar chart: Measured vs. Predicted energy (mJ, axis 0–700) per test scene: Batman, HL2_low, HL2_high, Mass Effect; overheads annotated]
23
What Uses the Energy?
[Stacked bar chart: estimated energy (mJ, axis 0–700) per test scene (Batman, HL2_low, HL2_high, Mass Effect), broken down into FB-Write, FB-Read, Z-Write, Z-Read, PS-Memory, PS-Arithmetic, Rasterization, VS, Read Geometry]
24
Roadmap
• My work
– Energy model
– Energy savings in computation
– Energy savings in communication
• Conclusions
• Future work
25
Where Does the Power Go?
P_total = P_dynamic + P_static
[Figure: CMOS inverter between Power and Ground]
26
Energy-Saving Techniques
Clock gating (Park et al., 2010)
Signal gating (Huang and Ercegovac, 2003)
Power gating
– Coarse (Usami et al., 2009, Sjalander et al., 2005)
– Fine (My work)
P_total = P_dynamic + P_static
27
Example: 1-Bit Adder
[Figure: 1-bit adder with inputs A, B, Cin, outputs S, Cout, and gating signal !Enable]
28
HW Results
SPICE simulations of:
– Adders: linear savings
– Multipliers: quadratic savings
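One way to see why the savings scale this way is to count active cells. The sketch below is a first-order model, not the SPICE results: it assumes energy is proportional to the number of powered 1-bit cells, that an adder uses one cell per bit, and that an array multiplier uses roughly bits x bits cells. The function names and the 24-bit full width are assumptions made for illustration.

```python
FULL_BITS = 24  # full FP32 mantissa width, including the implicit leading 1

def adder_savings(active_bits: int) -> float:
    """Fraction of adder cells gated off: linear in the bits removed."""
    return 1 - active_bits / FULL_BITS

def multiplier_savings(active_bits: int) -> float:
    """Fraction of array-multiplier cells gated off: quadratic in active width."""
    return 1 - (active_bits / FULL_BITS) ** 2

for bits in (24, 16, 12, 8):
    print(bits, f"{adder_savings(bits):.0%}", f"{multiplier_savings(bits):.0%}")
```

Under these assumptions, halving the precision gates off half of an adder but three quarters of a multiplier.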
29
Precision in Rendering
Variable-precision fixed-function CPU rendering
– Hao and Varshney, 2001
– 3 key differences: GPU, FP32, programmability
Depth buffer comparator
– Hensley, Singh, and Lastra, 2005
Triangle separation for correct occlusion
– Akeley and Su, 2006
30
So, we have hardware, let’s see what happens in
VARIABLE-PRECISION PIXEL SHADERS
31
A Pixel Shader
32
Exaggerated Texture Coordinate Errors
Blocky textures (8 mantissa bits)
Original frame (24 mantissa bits)
33
Arithmetic Errors
… Different? (8 mantissa bits)
Original frame (24 mantissa bits)
34
Exaggerated Arithmetic Errors
Clearly different (4 mantissa bits)
Original frame (24 mantissa bits)
35
Different Errors, Different Tolerances
• Colors can be pushed far lower
– 12, 10, 8 bits for color components (plus one for rounding)
• Texture coordinates may need to be fully precise!
36
So, Treat Them Separately
A: Could contribute to texture coordinates
B: Will NOT contribute to texture coordinates
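Splitting a shader into these two classes can be done with a backward dependency pass. The sketch below is one possible way to do it, not necessarily the analysis used in this work: it assumes straight-line code, a made-up three-field instruction format `(dest, op, sources)`, and that `"tex"` marks a texture fetch whose sources are its coordinates.

```python
def could_affect_texcoords(instructions):
    """Return one flag per instruction: True if its result may reach a texture coordinate."""
    needed = set()                       # registers that feed some later texture fetch
    flags = [False] * len(instructions)
    for idx in range(len(instructions) - 1, -1, -1):
        dest, op, sources = instructions[idx]
        if dest in needed:               # class A: keep this instruction precise
            flags[idx] = True
            needed.discard(dest)         # earlier writes to dest are overwritten here
            needed.update(sources)
        if op == "tex":                  # coordinates of every fetch must be tracked
            needed.update(sources)
    return flags

shader = [
    ("uv2", "mul", ["uv", "scale"]),      # feeds a fetch
    ("offset", "tex", ["uv2"]),           # fetched value is used as a coordinate offset
    ("uv3", "add", ["uv", "offset"]),     # feeds a fetch
    ("albedo", "tex", ["uv3"]),           # result only used for color
    ("lit", "mul", ["albedo", "light"]),  # pure color arithmetic
]
print(could_affect_texcoords(shader))
```

Instructions flagged `True` correspond to class A and would keep a higher precision; the rest (class B) are candidates for aggressive reduction.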
39
Precision Selection Strategies
• Statically
• Artist-directed
• Automatic closed-loop
40
Static Program Analysis
[Figure: shader code annotated with statically determined precisions: 9 bits, 10 bits, 12 bits, 9 bits, 10 bits, 11 bits, and so on…]
41
Artist-Directed Precisions
Precisions are chosen as the effect is designed
42
Automatic Closed-Loop Precision Selection
Run time feedback control
Per-shader error detection and precision control
[Figure: feedback loop among Renderer, Error Detection, Controller, and Display. Signals: Reduced Pixel, Full Pixel (sparsely sampled), Error, Precision]
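The loop in the figure can be sketched as two small steps: estimate the error from sparsely sampled full-precision pixels, then nudge the precision. The controller below is a simple threshold scheme chosen purely for illustration; the thresholds, step size, bit limits, and function names are all assumptions and are not the feedback schemes evaluated in this work.

```python
def sampled_error(reduced_pixels, full_pixels):
    """Mean absolute difference over the sparsely sampled pixel pairs."""
    pairs = list(zip(reduced_pixels, full_pixels))
    return sum(abs(r - f) for r, f in pairs) / len(pairs)

def update_precision(precision, error, high=0.01, low=0.001, min_bits=4, max_bits=24):
    """One controller step for a single shader."""
    if error > high:                     # visible error risk: add a bit
        return min(max_bits, precision + 1)
    if error < low:                      # plenty of headroom: drop a bit
        return max(min_bits, precision - 1)
    return precision                     # inside the dead band: hold

precision = 12
error = sampled_error([0.50, 0.25], [0.50, 0.75])  # made-up pixel values
precision = update_precision(precision, error)
print(error, precision)
```

The dead band between `low` and `high` keeps the precision from oscillating every frame when the error sits near the target.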
43
Experimental Setup
Static analysis
– Analyze shaders to find minimum safe operating precision
Artist-directed
– Modify several demo applications
– Allow the artist to choose precisions
Automatic closed-loop
– Modify the ATTILA GPU simulator
– Apply several feedback control schemes
– Several test scenes
44
Data Sets
Each data set is evaluated with some of the Static, Directed, and Automatic Closed-Loop strategies:
– Depth of Field
– Parallax Mapping
– SSAO
– Half Life 2: Lost Coast
– Doom 3
– Need for Speed: Undercover
– Metaballs
45
Data Sets
46
Results: Precisions
Data Set                      Static   Directed   Automatic Closed-Loop
Depth of Field                18.5     12.0       -
Parallax Mapping              23.3     15.2       -
SSAO                          20.1     13.0       -
Half Life 2: Lost Coast       19.1     -          13.2
Doom 3                        19.7     -          14.7
Need for Speed: Undercover    21.8     -          16.5
Metaballs                     9.7      -          8.9
Lower is Better!
47
Results: Closed-Loop Errors
Unnoticeable in practice
48
Results: % Energy Savings
Data Set                      Static   Directed   Automatic Closed-Loop
Depth of Field                33%      79%        -
Parallax Mapping              -2%      61%        -
SSAO                          49%      71%        -
Half Life 2: Lost Coast       33%      -          75%
Doom 3                        15%      -          69%
Need for Speed: Undercover    2%       -          50%
Metaballs                     87%      -          90%
Higher is Better!
Overall Energy: 2/3 1/5
49
Which Precision Selection Method?
Approach                 Savings   HW Complexity   Artist Effort
Static                   Low       Low             Low
Directed                 High      Low             Medium
Automatic Closed-Loop    High      High            Low
50
Directed Approach
• High savings
– 70-80% in arithmetic
– 10-20% overall GPU energy (by arithmetic alone!)
• Low errors
– Acceptable by design
– Quantitatively low (PSNR, % error)
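PSNR, one of the two error metrics named above, compares a reduced-precision frame against the full-precision reference. A minimal sketch over flat lists of pixel values, assuming 8-bit channels (peak value 255); the function name is made up:

```python
import math

def psnr_db(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf                  # identical images
    return 10 * math.log10(peak ** 2 / mse)

reference = [10, 200, 30, 255]           # made-up pixel values
reduced = [10, 201, 29, 255]
print(psnr_db(reference, reduced))
```

Higher is better: each halving of the root-mean-square error raises the PSNR by about 6 dB.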
51
Variable Precision Geometry
• Vertex shaders
• Similarly high savings (55-80%)
• Different types of errors
– XY Screen-space
– Depth
52
XY Screen-Space Errors
8 bits of precision
53
Depth Errors
16 bits of precision
54
Depth Matters (Some)
• Depth errors appear far before XY errors
• Even in unmodified commercial games
http://underpop.free.fr/j/java/developing-games-in-java/1592730051_ch10lev1sec5.html
https://encrypted-tbn3.google.com/images?q=tbn:ANd9GcRWhmviKHKMGVAU1ooXrAzJxa_2IlknTI6cRT4MGfJyTpaZNYw-MA
55
Variable-precision computation works. Let’s look at
COMMUNICATING LESS DATA
56
Energy Savings in Communication
• Off-chip: compression (most data!)
– Ström et al. (2008), Rasmussen et al. (2007, 2009)
• 16 bit positive color/depth values
• I adapt their approach to my needs
– Negative numbers
– 32 bits
– General values
• On-chip: bus encoding, caching
– Reduced-precision data: freeze unused lines
57
Unified Compressor
Key idea behind compression:
• Encode numbers as differences between them
• Similar numbers lead to smaller representations
What’s tricky about adding geometry, GPGPU data?
• Negative values
• Arbitrary data/attribute layout
• Random access
• Each will limit how complicated the compressor’s design can be
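The key idea can be shown with the simplest possible predictor: each value is predicted by the one before it, and only the difference is stored. This is an illustrative sketch on integers, not the predictor used by the compressors cited here; function names are hypothetical.

```python
def delta_encode(values):
    """Store the first value, then each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

data = [1000, 1002, 1001, 1005]
deltas = delta_encode(data)
print(deltas)                            # similar inputs give small differences
print(delta_decode(deltas) == data)      # lossless round trip
```

Small differences need few bits, which is what the variable-length encoder after the prediction stage exploits.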
58
Handling Negative Values
Color and depth data is all positive – sign bit unused! Not so for general data.
• Overflow can occur during prediction and difference encoding
• My approach
– Generalized past work to handle negative values
– Drastically simplified processing of differences
• Limitation: can’t do any processing
[Figure labels: 33 Bit Adder, Encoder, Processing]
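The overflow concern, and the 33-bit adder in the figure, follow from simple range arithmetic: the difference of two signed 32-bit values can need one extra bit. The check below assumes values are treated as signed 32-bit integers (for example, raw bit patterns), which is an assumption about the datapath rather than something stated on the slide.

```python
INT32_MIN, INT32_MAX = -2 ** 31, 2 ** 31 - 1

def signed_bits(value: int) -> int:
    """Bits needed to hold `value` in two's complement."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1

print(signed_bits(INT32_MAX))               # a 32-bit operand
print(signed_bits(INT32_MAX - INT32_MIN))   # the largest possible difference
print(signed_bits(INT32_MIN - INT32_MAX))   # the most negative difference
```

Both extreme differences need 33 bits, so a 33-bit subtractor never overflows on 32-bit inputs.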
59
Arbitrary Data Layout
Color is simple
Geometry?

X   Y   Z
x1  y1  z1
x2  y2  z2
…   …   …

X   Y   Z   U   V   Nx  Ny  Nz  …
x1  y1  z1  u1  v1  …
x2  y2  z2  u2  v2  …
…   …   …   …   …
60
Arbitrary Data Layout
Our approach: encode vectors of data (rather than blocks)
• Color
– Alpha channel for free!
• Geometry
– Intra-attribute coherence!
Limitation: no 2D coherence

X   Y   Z
x1  y1  z1
x2  y2  z2
…   …   …
xN  yN  zN
61
Random Access
Random access is necessary for graphics
• Color data maps well – 4x4, 8x8 tiles
• Geometry?
– Simply encode a subset, C, of the vertices at a time

X   x1 x2 … … xC   xC+1 … … x2C ...
Y   y1 y2 … … yC   yC+1 … … y2C ...
Z   z1 z2 … … zC   zC+1 … … z2C ...
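Encoding C vertices at a time means any vertex can be recovered by decoding only its own chunk. The sketch below applies this to one attribute stream (say, all X values) using plain previous-value differences on integers; the real compressor's predictor and bit packing are not modeled, and the names are hypothetical.

```python
def encode_chunks(stream, chunk_size):
    """Delta-encode one attribute stream independently in chunks of `chunk_size`."""
    chunks = []
    for start in range(0, len(stream), chunk_size):
        block = stream[start:start + chunk_size]
        chunks.append([block[0]] + [b - a for a, b in zip(block, block[1:])])
    return chunks

def read_value(chunks, chunk_size, index):
    """Random access: touch only the chunk holding `index`."""
    chunk = chunks[index // chunk_size]
    return sum(chunk[: index % chunk_size + 1])   # base plus the deltas up to index

xs = [10, 12, 11, 15, 40, 41, 39]                 # made-up X coordinates
chunks = encode_chunks(xs, chunk_size=4)
print(chunks)
print(read_value(chunks, 4, 5))                   # decodes only the second chunk
```

A smaller C gives cheaper random access but restarts prediction more often, which costs compression ratio.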
62
Unified Compressor
Compared to (Ström 2008) (“Tiled”)
Smaller is better!
[Bar chart: compressed size (%, axis 0–100) for Tiled vs. Unified on Color and Depth buffers; data sets: DoF, HDR_1, HDR_2, Parallax Map, Smoke, Crysis, NFS:U Car, Caravan, Soldier]
Color channel coherence!
63
Unified Compressor
Geometric data sets – uncompressed in past work!
Data Set                             Compressed Bandwidth (%)
Crysis                               30.3
Crysis: Warhead                      55.6
Need For Speed: Undercover           37.0
Half Life 2: Lost Coast (Scene 1)    28.6
Half Life 2: Lost Coast (Scene 2)    23.8
64
Improvements to Existing Compressors
Just a brief mention of my other work:
• Dynamic Bucket Selection
– Average of 1.25x improvement
• Fibonacci Encoding
– Up to 1.7x improvement
– Average of 1.12x for unified compressor
• Dynamic Range Reduction
– Extra 5-20%, depending on application
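For reference, standard Fibonacci coding writes a positive integer as a sum of non-consecutive Fibonacci numbers and terminates the codeword with an extra 1, so "11" appears only at the end. The sketch below implements that textbook code; how the compressor maps its residuals onto such codes is not described on the slide and is not modeled here.

```python
def fibonacci_encode(n: int) -> str:
    """Fibonacci codeword for n >= 1, least significant Fibonacci term first."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for positive integers")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                            # drop the first term larger than n
    bits = []
    for f in reversed(fibs):              # greedy Zeckendorf decomposition
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    return "".join(reversed(bits)) + "1"  # the trailing 1 marks the end of the codeword

def fibonacci_decode(code: str) -> int:
    """Decode a single codeword produced by fibonacci_encode."""
    fibs = [1, 2]
    while len(fibs) < len(code) - 1:
        fibs.append(fibs[-1] + fibs[-2])
    return sum(f for f, bit in zip(fibs, code[:-1]) if bit == "1")

for value in (1, 2, 3, 4, 11):
    print(value, fibonacci_encode(value))
```

Because codewords are self-delimiting, small values get short codes without a separate length field.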
65
On-Chip Communication
Freeze (signal gate) unused bus lines from register file to L1 cache
Application                     Average Precision   Energy (%)
Half-Life 2: Lost Coast (1)     10.9                63.7
Half-Life 2: Lost Coast (2)     10.2                55.9
Doom 3                          9.8                 62.5
Need For Speed: Undercover      19.4                86.9
Metaballs                       8.2                 52.7
66
SUMMARY
67
Thesis Statement
Reducing the work done in the modern graphics pipeline through novel communication and variable-precision computation techniques can enable a tradeoff between energy savings and image fidelity, leading to significant energy savings without perceptible loss of image quality.
68
How Did I Do?
• Show that induced errors are imperceptible
– Vertex and pixel shader precisions can be reduced significantly without loss of quality
• Show significant energy savings
– Find energy consumed by entire pipeline
• Energy model accurate to within 5% for tested applications
– Find energy savings possible in each stage
• Designed hardware that saves energy
• Used this hardware and the reduced precisions to find energy savings in computation
• Used precision information to enable further savings in on- and off-chip communication
69
Estimated Energy Savings
[Bar chart: estimated energy (mJ, axis 0–700) per test scene (Batman, HL2_low, HL2_high, Mass Effect); savings labels on the chart: 54%, 49%, 46%, 57%]
70
Future Work
Along the same lines…
– Variable-precision FPU
– Other sections of the memory hierarchy
– Recently-introduced stages (geometry, tessellation, compute shaders)
– GPGPU applications
Larger scale…
– 2-bit granularity precision control
– Scheduling for dynamic voltage/frequency scaling (DVFS)
– Architectural studies
71
List of Papers
• Jeff Pool, Anselmo Lastra, and Montek Singh, “Lossless Compression of Variable-Precision Floating-Point Buffers on GPUs,” ACM Interactive 3D Graphics and Games (I3D), 9-11 March 2012.
• Jeff Pool, Anselmo Lastra, and Montek Singh, “Precision Selection for Energy-Efficient Pixel Shaders,” High Performance Graphics, 5-7 Aug. 2011.
• Jeff Pool, Anselmo Lastra, and Montek Singh, “Power-Gated Arithmetic Circuits for Energy-Precision Tradeoffs in Mobile Graphics Processing Units,” Journal of Low Power Electronics, Vol. 7, No. 2, 2011.
• Jeff Pool, Anselmo Lastra, and Montek Singh, “An Energy Model for Graphics Processing Units,” IEEE International Conference on Computer Design, 3-6 Oct. 2010.
• Jeff Pool, Anselmo Lastra, and Montek Singh, “Energy-Precision Tradeoffs in Mobile Graphics Processing Units,” IEEE International Conference on Computer Design, 12-15 Oct. 2008.
72
Acknowledgments
Advisers: Anselmo and Montek
Committee members: Dinesh Manocha, Steve Molnar, John Poulton
Justin Hensley for starting the variable-precision work
Various folks around the department for their feedback
Family and friends for their support and encouragement
The NSF for funding
73
THANKS, QUESTIONS?
74
BACKUP
75
Programmer-Directed
            Directed                        Static
Scene       Precision   PSNR   Savings     Precision   Savings
SSAO        13.0        53.4   71%         20.1        49%
Parallax    15.2        39.7   61%         23.3        -2%
DoF         12.0        45.6   79%         18.5        33%
76
Results – Programmable Units
Operation                        Energy (nJ)
Arithmetic
  add                            0.443
  mul                            0.357
  mad                            0.455
  rcp                            2.440
  exp                            1.512
  log                            5.177
  sin/cos                        22.997
  pow                            16.366
Memory
  Local load                     1.490
  Local store                    1.490
  Global load (coalesced)        8.390
  Global store (coalesced)       5.190
  Global load (uncoalesced)      67.400
  Global store (uncoalesced)     42.700
77
Results – Fixed-Function Units

Rasterization                 Off      On
Energy/Pixel (pJ/P)           166.4    404.6
Rasterization Cost (pJ/P)     -        238.2

Texturing   Mipmapping   Energy/pixel (nJ/p)
Nearest     -            13.3
Bilinear    -            13.8
Nearest     Nearest      7.07
Bilinear    Nearest      7.76
Bilinear    Linear       10.6
Quality axis: Low to High
78
Typical Cell Phone Energy Consumption
http://www.androidcentral.com/android-quick-app-juice-defender-ultimate
Varies drastically depending on workload
More efficient GPU == more time watching movies, playing games, HTML5 …
Advertised talk times dwarf video playback/game times!
http://www.howtogeek.com/wp-content/uploads/2010/08/image207.png
http://tapatalk.com/mu/5adc833a-beea-1bf3.jpg