Energy-Precision Tradeoffs in the Graphics Pipeline

Energy-Precision Tradeoffs in the Graphics Pipeline

Jeff PoolMarch 19th, 2012

2

MotivationWhy energy?

It matters everywhere:- Mobile devices- Desktop computers- Servers, data centers

It’s a bottleneck to performance!

http://img717.imageshack.us/img717/3936/1101771coolitomni.jpg

http://www.ornl.gov/ornlhome/images/casl/TVA%20Watts%20Bar.jpg





3

MotivationWhy precision?

Sign Exponent Mantissa

IEEE 754-2008 Single-Precision Floating-Point Representation

4

Don’t do Unnecessary Work• Max precision isn’t needed:

– 8-10 bit color buffers– FP32 => 24 bits of precision– Potentially lots of wasted effort!

• It’s certainly more complicated, but worth exploring

5

My ApproachVariable-precision computations- Reduce the precision when possible: 12.5 mantissa

bits used- Save energy in arithmetic: 70% less

energy- Low errors: 0.086%

differenceFull-Precision Arithmetic Reduced-Precision Arithmetic

6

My ApproachCommunicate fewer bits - Since fewer bits are used in computation - Most DRAM traffic is already compressed

Crysis, 2007

Variable-precision compression:(on sample frame)

- Geometry improved by 12%- Depth improved by 83%

7

GPU Global Memory Texture Frame-

Data Buffer

The Graphics Pipeline

Vertex Shader

Rasterization

Pixel Shader

Background

8

GPUs: A Brief History

Time

Cap

abili

ty

GPU

(NOT to scale!)

Fixe

d-Fu

nctio

n

Prog

ram

mab

ility

GPG

PU

CU

DA,

Str

eam

, O

penC

L

Shader Program

Compute Program

1.53, 32.8, …

10

Thesis StatementReducing the work done in the modern

graphics pipeline through novel communication and variable-precision computation techniques can enable a tradeoff between energy savings and image fidelity, leading to significant energy savings without perceptible loss of image quality.

11

How?Proving this thesis:

– Show that induced errors are imperceptible– Show significant energy savings

• Find energy consumed by entire pipeline• Find energy savings possible in each stage

12

Roadmap• My work

– Energy model– Energy savings in computation– Energy savings in communication

• Conclusions• Future work

13

Roadmap• My work



14

Why an Energy Model?So I’ll know how much difference saving energy in different stages actually makes, know where to focus• Provides researchers/developers a tool to

predict energy usagePast Work Validated? Graphics? Simple

?Brooks et al., 2000Eisley et al., 2006Shaeffer et al., 2004Ramani et al., 2007Nagasaka et al., 2010Hong and Kim, 2010Zhang et al., 2011

15

Strategy• Model construction

– Experimentally measure energy for each operation

• Energy prediction– Profile a scene for operations performed– Predict total energy consumption (dot

product)• Validation

– Compare prediction with measured energy

16

What Operations?• Arithmetic

– ADD, MUL, SIN/COS, POW, LOG, …• Memory

– Local/Global Load/Store• Programmable

– Vertex/Pixel Shaders• Fixed-function

– Rasterization, Texture filtering

Explicit

Implicit

17

Measuring Energy in the GPU

Explicit• GPGPU

– Runs on same hardware as graphics

– No ambiguity in operations

• Simple microkernels– Little/no overhead– 10s runtime– Directed tests per

operation

Implicit• OpenGL• Enable/Disable

operation in question– Difference in energy is

the operation’s contribution

– Not as straightforward• Ex.: Texture filtering

18

Experimental Setup• NVIDIA 8300GS graphics card• Adex Electronics’ PEX16LX PCI riser to interrupt

power from motherboard

• Supply metered power to the card– 12V– 3.3V– 12V (fan, not counted in energy)

• Log runtimes/framerates, measure current as tests run

http://www.pretaktovanie.sk/obr/spotreba/eng/PICTURES/P1010283_ENG.jpg

http://www.pretaktovanie.sk/obr/spotreba/eng/PICTURES/P1010283_ENG.jpg

19

ResultsOperation Energy (nJ)

Arithmetic 0.4 – 22.9

MemoryLocal load 1.49Local store 1.49Global load* 8.39 – 67.40Global store* 5.19 – 42.70 *Depending on type of access

Rasterization (per pixel) 0.24

Texture filtering (per pixel) 7.0 – 13.8

20

Profiling Operations Performed

• Use Microsoft’s PIX to log a frame of a running application:– Framebuffer contents– Vertex data– Render states– Vertex shaders– Pixel shaders– Per draw call

(100-1000s per frame)

• From all this data, extract operations

21

Validation• Three different applications, four scenes

– Real-world games to test the developed model• Harvested data, predict energy usage• Measured real energy usage, compare

Half Life 2: Lost Coast(High/Low Rendering

Qualities)

Batman: Arkham Asylum

Mass Effect

22

Validation Results

Batman HL2_low HL2_high Mass Effect0

100

200

300

400

500

600

700MeasuredPredicted

Test Scene

Ener

gy (

mJ)

Overheads

23

What Uses the Energy?

Batman

HL2_lo

w

HL2_h

igh

Mas

s Effec

t0

100200300400500600700

FB-WriteFB-ReadZ-WriteZ-ReadPS-MemoryPS-ArithmeticRasterizationVSRead Geometry

Test Scene

Esti

mat

ed E

nerg

y (m

J)

24

Roadmap• My work



25

Where Does the Power Go?

Ptotal = Pdynamic + Pstatic

Power

Ground

CMOS Inverter

26

Energy-Saving Techniques

Clock gating (Park et al., 2010)Signal gating (Huang and Ercegovac, 2003)Power gating

– Coarse (Usami et al., 2009, Sjalander et al., 2005)

– Fine (My work)

Ptotal = Pdynamic + Pstatic

27

!Enable

Example: 1-Bit Adder

Cin

A

BCout

S

28

HW Results

SPICE simulations of:Adders: linear savings

Multipliers: quadratic savings

29

Precision in RenderingVariable-Precision fixed-function CPU

rendering– Hao and Varshney, 2001– 3 key differences: GPU, FP32,

programmabilityDepth buffer comparator

– Hensley, Singh, and Lastra, 2005Triangle separation for correct occlusion

– Akeley and Su, 2006

30

VARIABLE-PRECISION PIXEL SHADERS

So, we have hardware, let’s see what happens in

31

A Pixel Shader

32

Exaggerated Texture Coordinate Errors

Blocky textures(8 mantissa bits)

Original frame(24 mantissa bits)

33

Arithmetic Errors

… Different?(8 mantissa bits)


34

Exaggerated Arithmetic Errors

Clearly different(4 mantissa bits)


35

Different Errors,Different Tolerances

• Colors can be pushed far lower– 12, 10, 8 bits for color components (plus

one for rounding)

• Texture coordinates may need to be fully precise!

36

So, Treat Them Separately

37


ACould contribute to texture coordinates

38


A

B

Could contribute to texture coordinates

Will NOT contribute to texture coordinates

39

Precision Selection Strategies

• Statically• Artist-directed• Automatic closed-loop

40

Static Program Analysis

9 bits10 bits

12 bits

9 bits10 bits And so on…

11 bits

41

Artist-Directed PrecisionsPrecisions are chosen as the effect is designed

42

Automatic Closed-Loop Precision Selection

Run time feedback controlPer-shader error detection and precision

control

Error DetectionRenderer

Controller

Display

Prec

isio

n

Reduced Pixel

Full Pixel(sparsely sampled)

Error

Reduced Pixel

43

Experimental SetupStatic analysis

– Analyze shaders to find minimum safe operating precision

Artist-directed– Modify several demo applications– Allow the artist to choose precisions

Automatic closed-loop– Modify the ATTILA GPU simulator– Apply several feedback control schemes– Several test scenes

44

Data SetsData Set Static Directed Automatic

Closed-Loop

Depth of Field

Parallax Mapping

SSAO

Half Life 2: Lost Coast

Doom 3

Need for Speed: UndercoverMetaballs

45

Data Sets

46

Results: PrecisionsData Set Static Directed Automatic

Closed-Loop

Depth of Field 18.5 12.0 -Parallax Mapping 23.3 15.2 -SSAO 20.1 13.0 -Half Life 2: Lost Coast 19.1 - 13.2Doom 3 19.7 - 14.7Need for Speed: Undercover

21.8 - 16.5Metaballs 9.7 - 8.9

Lower is Better!

47

Results: Closed-Loop ErrorsUnnoticeable in practice

48

Results: % Energy SavingsData Set Static Directed Automatic

Closed-Loop

Depth of Field 33% 79% -

Parallax Mapping -2% 61% -

SSAO 49% 71% -

Half Life 2: Lost Coast 33% - 75%

Doom 3 15% - 69%Need for Speed: Undercover

2% - 50%

Metaballs 87% - 90%

Higher is Better!

Overall Energy: 2/3 1/5

49

Which Precision Selection Method?

Approach Savings HW Complexity

Artist Effort

Static Low Low Low

Directed High Low Medium

Automatic Closed-Loop High High Low

50

Directed Approach• High savings

– 70-80% in arithmetic– 10-20% overall GPU energy

• (by arithmetic alone!)• Low errors

– Acceptable by design– Quantitatively low (PSNR, % error)

51

Variable Precision Geometry• Vertex shaders• Similarly high savings (55-80%)• Different types of errors

– XY Screen-space– Depth

52

XY Screen-Space Errors8 bits of precision

53

Depth Errors16 bits of precision

54

Depth Matters (Some)• Far before XY errors• Even in unmodified commercial games

http://underpop.free.fr/j/java/developing-games-in-java/1592730051_ch10lev1sec5.htmlhttps://encrypted-tbn3.google.com/images?q=tbn:ANd9GcRWhmviKHKMGVAU1ooXrAzJxa_2IlknTI6cRT4MGfJyTpaZNYw-MA

http://underpop.free.fr/j/java/developing-games-in-java/1592730051_ch10lev1sec5.html

55

COMMUNICATING LESS DATA

Variable-precision computation works. Let’s look at

56

Energy Savings in Communication

• Off-chip: compression (most data!)– Strom et al. (2008), Rasmussen et al. (2007,

2009)• 16 bit positive color/depth values• I adapt their approach to my needs

– Negative numbers– 32 bits– General values

• On-chip: bus encoding, caching– Reduced precision data freeze unused

lines

57

Unified CompressorKey idea behind compression:• Encode numbers as differences between them• Similar numbers lead to smaller representations

What’s tricky about adding geometry, GPGPU data?

• Negative values• Arbitrary data/attribute layout• Random access

• Each will limit how complicated the compressor’s design can be

58

Handling Negative ValuesColor and depth data is all positive – sign bit unused!

Not so for general data.• Overflow can occur during prediction and difference encoding• My approach

– Generalized past work to handle negative values– Drastically simplified processing of differences

• Limitation: can’t do any processing

33 Bit Adder

EncoderProcessing

59

Arbitrary Data Layout

X Y Zx1 y1 z1x2 y2 z2… … …

Geometry?

Color is simple

X Y Z U V Nx Ny Nz …

x1 y1 z1 u1 v1 …

x2 y2 z2 u2 v2 …

… … … … …

60

Arbitrary Data LayoutOur approach: encode vectors of data (rather than

blocks)• Color

– Alpha channel for free!• Geometry

– Intra-attribute coherence!

Limitation: no 2D coherence

X Y Zx1 y1 z1x2 y2 z2… … …… … …… … …xN yN zN

61

Random AccessRandom access is necessary for graphics• Color data maps well – 4x4, 8x8 tiles• Geometry?

– Simply encode a subset, C, of the vertices at a time

X x1 x2 … … xC xC+1 … … x2C ...Y y1 y2 … … yC yC+1 … … y2C ...Z z1 z2 … … zC zC+1 … … z2C ...

62

DoF

HDR_1

HDR_2

Parallax

Map

Smoke

Crysis

NFS:U Car

Carava

n

Soldier

0

20

40

60

80

100TiledUnified

Color Depth

Com

pres

sed

Siz

e (%

)Unified Compressor

Compared to (Ström 2008) (“Tiled”)Smaller is better!

Color channel coherence!

63

Unified CompressorGeometric data sets – uncompressed in

past work!

Data Set Compressed Bandwidth (%)Crysis 30.3Crysis: Warhead 55.6Need For Speed: Undercover 37.0Half Life 2: Lost Coast (Scene 1)

28.6

Half Life 2: Lost Coast (Scene 2)

23.8

64

Improvements to Existing Compressors

Just a brief mention of my other work:• Dynamic Bucket Selection

– Average of 1.25x improvement• Fibonacci Encoding

– Up to 1.7x improvement– Average of 1.12x for unified compressor

• Dynamic Range Reduction– Extra 5-20%, depending on application

65

On-Chip CommunicationFreeze (signal gate) unused bus lines

from register file to L1 cacheApplication Average

PrecisionEnergy (%)

Half-Life 2: Lost Coast (1) 10.9 63.7Half-Life 2: Lost Coast (2) 10.2 55.9Doom 3 9.8 62.5Need For Speed: Undercover

19.4 86.9

Metaballs 8.2 52.7

66

SUMMARY

67

Thesis StatementReducing the work done in the modern

graphics pipeline through novel communication and variable-precision computation techniques can enable a tradeoff between energy savings and image fidelity, leading to significant energy savings without perceptible loss of image quality.

68

How Did I Do?• Show that induced errors are imperceptible

– Vertex and pixel shader precisions can be reduced significantly without loss of quality

• Show significant energy savings– Find energy consumed by entire pipeline

• Energy model accurate to within 5% for tested applications– Find energy savings possible in each stage

• Designed hardware that saves energy• Used this hardware and the reduced precisions to find energy

savings in computation• Used precision information to enable further savings in on-

and off-chip communication

69

Batman HL2_low HL2_high Mass Effect0

100

200

300

400

500

600

700

Test Scene

Esti

mat

ed E

nerg

y (m

J)Estimated Energy Savings

54%

49%

46%57%

70

Future WorkAlong the same lines…

– Variable-precision FPU– Other sections of the memory hierarchy– Recently-introduced stages (geometry,

tessellation, compute shaders)– GPGPU applications

Larger scale…– 2-bit granularity precision control– Scheduling for dynamic voltage/frequency scaling

(DVFS)– Architectural studies

71

List of Papers• Jeff Pool, Anselmo Lastra, and Montek Singh, “Lossless Compression of

Variable-Precision Floating-Point Buffers on GPUs,” ACM Interactive 3D Graphics and Games (I3D), 9-11 March 2012.

• Jeff Pool, Anselmo Lastra, and Montek Singh, “Precision Selection for Energy-Efficient Pixel Shaders,” High Performance Graphics, 5-7 Aug. 2011.

• Jeff Pool, Anselmo Lastra, and Montek Singh, “Power-Gated Arithmetic

Circuits for Energy-Precision Tradeoffs in Mobile Graphics Processing Units,” Journal of Low Power Electronics, Vol. 7, No. 2, 2011.

• Jeff Pool, Anselmo Lastra, and Montek Singh, “An Energy Model for Graphics Processing Units,” IEEE International Conference on Computer Design, 3-6 Oct. 2010.

• Jeff Pool, Anselmo Lastra, and Montek Singh, “Energy-Precision Tradeoffs in Mobile Graphics Processing Units,” IEEE International Conference on Computer Design, 12-15 Oct. 2008.

72

AcknowledgmentsAdvisers: Anselmo and MontekCommittee members: Dinesh Manocha, Steve

Molnar, John PoultonJustin Hensley for starting the variable-precision

work

Various folks around the department for their feedback

Family and friends for their support and encouragement

The NSF for funding

73

THANKS, QUESTIONS?

74

BACKUP

75

Programmer-Directed

Directed StaticScene Precision PSNR Savings Precision SavingsSSAO 13.0 53.4 71% 20.1 49%Parallax 15.2 39.7 61% 23.3 -2%DoF 12.0 45.6 79% 18.5 33%

76

Results – Programmable UnitsOperation Energy (nJ)

add 0.443mul 0.357mad 0.455rcp 2.440exp 1.512log 5.177sin/cos 22.997pow 16.366

Local load 1.490Local store 1.490Global load (coalesced) 8.390Global store (coalesced) 5.190Global load (uncoalesced) 67.400Global store (uncoalesced) 42.700

M

emor

y

Arith

met

ic

77

Results – Fixed-Function Units

Rasterization Off OnEnergy/Pixel (pJ/P) 166.4 404.6Rasterization Cost (pJ/P)

- 238.2

Texturing Mipmapping Energy/pixel (nJ/p)Nearest - 13.3Bilinear - 13.8Nearest Nearest 7.07Bilinear Nearest 7.76Bilinear Linear 10.6

Qua

lity

Low

High

78

Typical Cell Phone Energy Consumption

http://www.androidcentral.com/android-quick-app-juice-defender-ultimate

Varies drastically depending on workloadMore efficient GPU == more time watching movies, playing games,

HTML5 …

Advertised talk times dwarf video playback/game times!

http://www.howtogeek.com/wp-content/uploads/2010/08/image207.pnghttp://tapatalk.com/mu/5adc833a-beea-1bf3.jpg

http://www.androidcentral.com/android-quick-app-juice-defender-ultimate

Energy-Precision Tradeoffs in the Graphics Pipeline

Documents

fixed energy

energy usage

motivationwhy energy

energy low errors

computationenergy savings

variableprecision compression

bits of precisionpotentially

mantissa bits