Page 1
University of Michigan, Electrical Engineering and Computer Science
Increasing the Number of Effective Registers in a Low-Power Processor
Using a Windowed Register File
Rajiv A. Ravindran, Robert M. Senger, Eric D. Marsman
Ganesh S. Dasika, Matthew R. Guthaus,
Scott A. Mahlke, Richard B. Brown
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
Page 2
Architected Registers: More or Less?
• Fewer registers:
  – Smaller hardware structures: more power efficient
  – Tighter instruction encoding, small memory footprint
  – However, more loads/stores to memory
    • Reduced performance, increased power
• More registers:
  – Larger hardware structures: less power efficient
  – Increased code size, larger memory footprint
  – However,
    • Map more variables from memory to registers: reduced power
    • Enable ILP optimizations
Page 3
Objective of this Work
• Provide a large number of architected registers
• But, maintain instruction encoding and thus code size
• Use a windowed register file architecture
  – But, in an unconventional way
• Traditional register window
  – Reduce function save/restore overhead
• Our approach
  – Large register file partitioned into multiple windows
  – Appearance of a large register file
Page 4
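Windowed Register File Architecture
[Figure: a 16-register file split into two 8-register windows, connected to the functional unit (FU). The Machine Status Register (MSR) holds the window status bit; instructions use a 3-bit operand field.]
Example code:
    add r1, r2, r3
    iw-mov r9, r1
    win-swap #2
    sub r2, r1, r3
Operand mapping (win-swap toggles the active window):
    0 0 10    r1: register 1 in register file
    1 0 10    r1: register 9 in register file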
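The mapping on this slide can be sketched in a few lines of Python. This is only an illustration under one assumption: that the physical index is the window status bit times the window size plus the 3-bit operand value, which is consistent with r1 resolving to register 1 or register 9. The function name `physical_register` is hypothetical.

```python
WINDOW_SIZE = 8  # registers per window, as on the slide (two 8-register windows)


def physical_register(window_bit: int, operand: int) -> int:
    """Map a 3-bit operand field to an index in the 16-entry register file.

    Assumption: index = window_bit * WINDOW_SIZE + operand.
    """
    if window_bit not in (0, 1):
        raise ValueError("window status bit must be 0 or 1")
    if not 0 <= operand < WINDOW_SIZE:
        raise ValueError("operand must fit in 3 bits")
    return window_bit * WINDOW_SIZE + operand


# r1 seen through each window
print(physical_register(0, 1))
print(physical_register(1, 1))
```

With the window status bit clear, r1 names register 1; after a win-swap sets the bit, the same 3-bit encoding names register 9, so the instruction encoding stays unchanged.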
Page 5
Wireless Integrated Microsystems (WIMS)
Developed at the University of Michigan (Robert Senger et al., DAC 2003)
Page 6
Related Work
• Traditional use of register windows
  – Reduce save/restore, context switch overhead
  – SPARC, IA-64, ADSP-219x, Tensilica
• Procedure call overhead small in embedded domain
  – Procedure inlining reduces call/return overhead
  – ~2% increase in performance using infinite register windows in WIMS
• Register connects [Kiyohara:93], register queues [Smelyanskiy:01]
  – Fixed ISA, provide more registers than allowed
  – Layer of indirection to access every operand
Page 7
Motivating Example
1 window of 4 registers:
loop:
    LOAD R1-1, [SP, #24]
    ADD R1-0, R1-3, R1-1
    LOAD R1-0, [R1-0]
    LOAD R1-1, [SP, #32]
    STORE [SP, #72], R1-0
    ADD R1-0, R1-3, R1-1
    LOAD R1-0, [R1-0]
    LOAD R1-1, [SP, #72]
    MPY R1-0, R1-1, R1-0
    STORE [SP, #40], R1-0
    LOAD R1-0, [SP, #16]
    ADD R1-1, R1-3, R1-0
    LOAD R1-0, [R1-1]
    LOAD R1-1, [SP, #40]
    ADD R1-0, R1-1, R1-0
    LOAD R1-1, [SP, #80]
    STORE [R1-3], R1-0
    ADD R1-0, R1-1, #1
    ADD R1-3, R1-3, #4
    CMP R1-0, #100
    BRCT loop

2 windows of 4 registers each:
loop:
    IW-MOV R1-0, R2-1
    WIN-SWAP #1
    ADD R1-3, R1-2, R1-0
    IW-MOV R1-0, R2-2
    LOAD R1-1, [R1-3]
    ADD R1-3, R1-2, R1-0
    LOAD R1-0, [R1-3]
    MPY R1-3, R1-1, R1-0
    IW-MOV R1-1, R2-3
    ADD R1-1, R1-2, R1-1
    LOAD R1-1, [R1-0]
    ADD R1-0, R1-3, R1-1
    STORE [R1-2], R1-0
    WIN-SWAP #2
    ADD R2-0, R2-0, #1
    WIN-SWAP #1
    ADD R1-2, R1-2, #4
    WIN-SWAP #2
    CMP R2-0, #100
    BRCT loop

1 window of 8 registers:
loop:
    ADD R1-3, R1-0, R1-6
    LOAD R1-2, [R1-3]
    ADD R1-3, R1-0, R1-7
    LOAD R1-4, [R1-3]
    MPY R1-3, R1-2, R1-4
    ADD R1-2, R1-0, R1-5
    ADD R1-1, R1-1, #1
    LOAD R1-4, [R1-2]
    ADD R1-2, R1-3, R1-4
    STORE [R1-0], R1-2
    ADD R1-0, R1-0, #4
    CMP R1-1, #100
    BRCT loop
Page 8
Tradeoffs for the Compiler
Register Utilization
• Move variables from memory to register
• Reduces spill code
• Distribute program variables and temporaries to all available registers in multiple windows
VS
Register Management
• Reduce overhead due to window management instructions
  • Activate windows (swaps)
  • Data transfer (iw-moves)
• Bundle accesses to same window
• Fewer transitions between windows
Balance these issues in an intelligent manner
Page 9
Register Window Partitioning
[Figure: graph of virtual registers VR1 to VR6 divided between Partition-1 and Partition-2]
• Weight Calculation
  • Partition weight: Over-commitment of register resources
  • Edge weight: Penalty of separating VRs
• Partitioning algorithm:
  • Move nodes between partitions to minimize partition + edge weights
  • Modified FM graph partitioning algorithm
Page 10
Edge Weight Calculation: Move Cost
loop:
    ADD VR34, VR27, VR32
    LOAD VR6, [VR34]
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
    ADD VR20, VR20, VR10
    ADD VR2, VR2, #1
    ADD VR27, VR27, #4
    CMP VR2, 32
    BRCT loop
Profile weights shown beside the loop: 3104, 1
If VR6 and VR9 are separated,
    MPY VR10, VR6, VR9
becomes
    IW-MOVE VR100, VR9    (x 3104)
    MPY VR10, VR6, VR100
Edge between VR6 and VR9: edge weight = move-cost + swap-cost
Computed once before partitioning
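The move cost for a pair of VRs can be sketched as follows. This is a minimal illustration, assuming one inter-window move is charged per execution of every instruction that references both VRs; the helper names and that counting rule are assumptions, and only the loop body and the weight 3104 come from the slide.

```python
import re

LOOP_FREQUENCY = 3104  # profile weight shown on the slide

LOOP_BODY = [
    "ADD VR34, VR27, VR32",
    "LOAD VR6, [VR34]",
    "LOAD VR9, [VR27]",
    "MPY VR10, VR6, VR9",
    "ADD VR20, VR20, VR10",
    "ADD VR2, VR2, #1",
    "ADD VR27, VR27, #4",
    "CMP VR2, 32",
    "BRCT loop",
]


def virtual_registers(instruction):
    """Return the set of VR names referenced by one instruction."""
    return set(re.findall(r"VR\d+", instruction))


def move_cost(vr_a, vr_b, body=LOOP_BODY, frequency=LOOP_FREQUENCY):
    """Assumed model: one IW-MOVE per execution of each instruction using both VRs."""
    shared = [ins for ins in body if {vr_a, vr_b} <= virtual_registers(ins)]
    return len(shared) * frequency


print(move_cost("VR6", "VR9"))
```

Only the MPY references both VR6 and VR9, so the sketch charges a single move per loop iteration, matching the "x 3104" annotation.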
Page 11
Edge Weight Calculation: Swap Cost
loop:
    ADD VR34, VR27, VR32
    LOAD VR6, [VR34]
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
    ADD VR20, VR20, VR10
    ADD VR2, VR2, #1
    ADD VR27, VR27, #4
    CMP VR2, 32
    BRCT loop
Profile weights shown beside the loop: 3104, 1
If VR6 and VR9 are separated, the active window changes twice:
    LOAD VR6, [VR34]
    SWAP
    LOAD VR9, [VR27]
    SWAP
    MPY VR10, VR6, VR9
• swap cost: 2 x 3104 = 6208
Edge between VR6 and VR9: edge weight = move-cost + swap-cost
edge weight = 3104 + 6208 = 9312
Computed once before partitioning
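The arithmetic on this slide can be checked with a short sketch. It assumes the three accesses execute in windows 1, 2, 1 (LOAD VR6, LOAD VR9, MPY), which is one reading of the figure; `count_swaps` is a hypothetical helper that counts changes of the active window along a straight-line sequence.

```python
LOOP_FREQUENCY = 3104  # profile weight shown on the slide


def count_swaps(window_sequence):
    """Count how often the active window changes between consecutive accesses."""
    return sum(
        1 for prev, cur in zip(window_sequence, window_sequence[1:]) if prev != cur
    )


# Assumed placement: LOAD VR6 in window 1, LOAD VR9 in window 2, MPY back in window 1.
windows = [1, 2, 1]

swap_cost = count_swaps(windows) * LOOP_FREQUENCY
move_cost = 1 * LOOP_FREQUENCY  # one IW-MOVE per iteration, from the move-cost slide
edge_weight = move_cost + swap_cost

print(swap_cost, edge_weight)
```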
Page 12
Partition Weight Calculation
loop:
    ADD VR34, VR27, VR32
    LOAD VR6, [VR34]
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
    ADD VR20, VR20, VR10
    ADD VR2, VR2, #1
    ADD VR27, VR27, #4
    CMP VR2, 32
    BRCT loop
Profile weights shown beside the loop: 3104, 1
[Figure: two diagrams of VRs 2, 6, 9, 10, 20, 27, 32, 34]
• Estimates the spill pressure using crude register allocation
• Partition weight = sum of the cost of all the spilled VRs
• Computed dynamically during node assignment process
Page 13
Partition Weight Calculation: Example
• Assume 3 registers per window/partition, and all VRs are assigned to one partition
Spill Cost
    VR32: 3104
    VRs 10, 6, 20, 34: 6208
    VR2: 9312
    VR27: 12416
loop:
    ADD VR34, VR27, VR32     (highlighted: VRs 32, 20 are spilled)
    LOAD VR6, [VR34]
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
    ADD VR20, VR20, VR10
    ADD VR2, VR2, #1
    ADD VR27, VR27, #4
    CMP VR2, 32
    BRCT loop
Profile weights shown beside the loop: 3104, 1
Spilled VRs = {32, 20}
[Figure: diagram of VRs 10, 20, 2, 34, 6, 9, 27, 32]
Page 14
Partition Weight Calculation: Example
• Assume 3 registers per window/partition, and all VRs are assigned to one partition
Spill Cost
    VR32: 3104
    VRs 10, 6, 20, 34: 6208
    VR2: 9312
    VR27: 12416
loop:
    ADD VR34, VR27, VR32     (VRs 32, 20 are spilled)
    LOAD VR6, [VR34]         (highlighted: VR 6 is spilled)
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
    ADD VR20, VR20, VR10
    ADD VR2, VR2, #1
    ADD VR27, VR27, #4
    CMP VR2, 32
    BRCT loop
Profile weights shown beside the loop: 3104, 1
Spilled VRs = {32, 20, 6}
[Figure: diagram of VRs 10, 20, 2, 34, 6, 9, 27, 32]
Continuing further, partition weight = spill cost of VRs 32, 20, 6, 10 = 21728
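The spill costs and the final partition weight can be reproduced with a small sketch. It assumes a VR's spill cost is its number of operand references in the loop times the loop weight 3104; that rule is inferred from the slide's numbers rather than stated on it. The set of spilled VRs is copied from the slide, since the crude register allocation that selects them is not modeled here.

```python
import re
from collections import Counter

LOOP_FREQUENCY = 3104  # profile weight shown on the slide

LOOP_BODY = [
    "ADD VR34, VR27, VR32",
    "LOAD VR6, [VR34]",
    "LOAD VR9, [VR27]",
    "MPY VR10, VR6, VR9",
    "ADD VR20, VR20, VR10",
    "ADD VR2, VR2, #1",
    "ADD VR27, VR27, #4",
    "CMP VR2, 32",
    "BRCT loop",
]


def spill_costs(body, frequency):
    """Assumed model: cost = (number of references to the VR) * frequency."""
    references = Counter()
    for instruction in body:
        references.update(re.findall(r"VR\d+", instruction))
    return {vr: count * frequency for vr, count in references.items()}


costs = spill_costs(LOOP_BODY, LOOP_FREQUENCY)

# Spilled set taken directly from the slide, not computed.
spilled = ["VR32", "VR20", "VR6", "VR10"]
partition_weight = sum(costs[vr] for vr in spilled)

print(costs["VR32"], costs["VR2"], costs["VR27"], partition_weight)
```

Under this assumption the sketch yields the slide's values for VR32, VR2, and VR27, and the four spilled VRs sum to the stated partition weight.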
Page 15
Node Partitioning: Example
Partition weight of P1 = sum of spill cost of VRs 32, 20, 6, 10 = 21728
Partition weight of P2 = 0
[Figure: VRs 9, 10, 32, 27, 34, 6, 2, 20 drawn across partitions P1 and P2, with VR2 called out]
Total Gain = Partition Weight + Edge Weight
    VR    Partition    Edge      Total gain
    2     6208         -2276     3932
    6     6208         -11669    -5461
    9     6208         -10723    -4515
    10    6208         -10675    -4467
    20    6208         -4234     1974
    27    6208         -13008    -6800
    32    3104         -7332     -4228
    34    0            -10436    -10436
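The selection step implied by the table can be sketched as follows: compute the total gain for each VR and pick the largest. The numbers are copied from the slide; picking the single highest-gain node is a simplification of one step of an FM-style pass, not the full modified algorithm, and the names are hypothetical.

```python
# (partition gain, edge gain) per VR, copied from the table on the slide
GAINS = {
    2: (6208, -2276),
    6: (6208, -11669),
    9: (6208, -10723),
    10: (6208, -10675),
    20: (6208, -4234),
    27: (6208, -13008),
    32: (3104, -7332),
    34: (0, -10436),
}


def total_gain(vr):
    """Total gain = partition weight gain + edge weight gain."""
    partition_gain, edge_gain = GAINS[vr]
    return partition_gain + edge_gain


best = max(GAINS, key=total_gain)
print(best, total_gain(best))
```

VR2 has the largest total gain in the table, which agrees with VR2 being the node called out on the slide.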
Page 16
Node Partitioning: Final Example
Partition weight of P1 = spill cost of VR 32 = 3104
Partition weight of P2 = 0
[Figure: VRs 9, 10, 32, 27, 34, 6, 2, 20 divided between P1 and P2]
loop:
    WIN_SWAP #1
    LOAD 32:R1-0, [SP, #0]
    ADD 34:R1-3, 27:R1-1, 32:R1-0
    LOAD 9:R1-3, [27:R1-1]
    LOAD 6:R1-2, [34:R1-3]
    MPY 39:R1-0, 6:R1-2, 9:R1-3
    IW_MOV 10:R2-2, 39:R1-0
    WIN_SWAP #2
    ADD 20:R2-1, 20:R2-1, 10:R2-2
    ADD 2:R2-0, 2:R2-0, #1
    WIN_SWAP #1
    ADD 27:R1-1, 27:R1-1, #4
    WIN_SWAP #2
    CMP 2:R2-0, #32
    BRCT loop
Profile weights shown beside the loop: 3104, 1
• Reduced from 6 spill operations to 1
• Added 5 window management instructions
• Performance remains the same, but power decreases
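The counts in the bullets can be checked against the final listing with a short sketch. It assumes that SP-relative loads and stores are the spill operations and that WIN_SWAP and IW_MOV are the window management instructions; the variable names are illustrative only.

```python
FINAL_LOOP = [
    "WIN_SWAP #1",
    "LOAD 32:R1-0, [SP, #0]",
    "ADD 34:R1-3, 27:R1-1, 32:R1-0",
    "LOAD 9:R1-3, [27:R1-1]",
    "LOAD 6:R1-2, [34:R1-3]",
    "MPY 39:R1-0, 6:R1-2, 9:R1-3",
    "IW_MOV 10:R2-2, 39:R1-0",
    "WIN_SWAP #2",
    "ADD 20:R2-1, 20:R2-1, 10:R2-2",
    "ADD 2:R2-0, 2:R2-0, #1",
    "WIN_SWAP #1",
    "ADD 27:R1-1, 27:R1-1, #4",
    "WIN_SWAP #2",
    "CMP 2:R2-0, #32",
    "BRCT loop",
]


def opcode(instruction):
    """First whitespace-separated token of the instruction."""
    return instruction.split()[0]


# Assumption: window management = swaps plus inter-window moves.
window_ops = [i for i in FINAL_LOOP if opcode(i) in {"WIN_SWAP", "IW_MOV"}]

# Assumption: a spill operation is a LOAD or STORE addressed through SP.
spill_ops = [
    i for i in FINAL_LOOP if opcode(i) in {"LOAD", "STORE"} and "[SP" in i
]

print(len(window_ops), len(spill_ops))
```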
Page 17
Performance of WIMS: 8 registers/window, 1 window vs 2 and 4 windows
[Figure: bar chart of % cycles (axis from -50 to 50) for fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpg2dec, and average. Series: Performance, Spill benefit, Swap and move overhead.]
Page 18
Performance of VLIW: 8 registers/window, 1 window vs 2 and 4 windows
[Figure: bar chart of % cycles (axis from -50 to 50) for fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpg2dec, and average. Series: Performance, Spill benefit, Swap and move overhead.]
Page 19
Power savings on the 8-register WIMS: 1-window vs 2- and 4-window machine
[Figure: bar chart of % power savings (axis from 0 to 20) for fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpeg2dec, and average. Series: 2-window, 4-window.]
Page 20
Conclusion
• A novel graph-partitioning-based compiler algorithm to exploit windowed register files within a single procedure
• Hardware/software solution that reduces code size while maintaining an effectively large number of registers
• 7% reduction in power for the 8-register case on WIMS
Average improvement in performance
            w2.r4    w4.r4    w8.r4    w2.r16   w2.r8    w4.r8
    WIMS    2.96     12.7     16.18    2.55     7        10
    VLIW    10.58    26.38    33.83    5.03     18       22
Page 21
Swap Cost Over-Counting
loop:
    ADD VR34, VR27, VR32
    LOAD VR6, [VR34]
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
    ADD VR20, VR20, VR10
    ADD VR2, VR2, #1
    ADD VR27, VR27, #4
    CMP VR2, 32
    BRCT loop
Profile weights shown beside the loop: 3104, 1
Between
    LOAD VR9, [VR27]
    MPY VR10, VR6, VR9
a swap is counted on each of four edges:
    SWAP - VR9, VR6
    SWAP - VR9, VR10
    SWAP - VR27, VR10
    SWAP - VR27, VR6
4 swaps! In reality, only 1 swap is required
Solution: normalize swap cost
Swap cost between VRs 6 and 9 = 1/4 of cost of single swap = 1/4 * 3104 = 776
[Figure: vr9 and vr27 on one side, vr10 and vr6 on the other]
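The normalization can be written out as a small calculation. The pairs and the weight 3104 come from the slide; dividing the cost of the one real swap evenly across the edges that would each claim it is the rule stated there, and the variable names are illustrative.

```python
from fractions import Fraction

LOOP_FREQUENCY = 3104  # profile weight shown on the slide

# VR pairs that would each be charged a swap between the LOAD and the MPY
cross_window_pairs = [
    ("VR9", "VR6"),
    ("VR9", "VR10"),
    ("VR27", "VR10"),
    ("VR27", "VR6"),
]

swaps_required = 1  # only one swap is actually needed between the two instructions

# Share the cost of the single real swap equally among the competing edges.
normalized_swap_cost = Fraction(
    swaps_required * LOOP_FREQUENCY, len(cross_window_pairs)
)

print(normalized_swap_cost)
```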
Page 22
Swap Insertion & Optimization
• Remove redundant swaps
• Hoist swaps to less frequently executed region
• Combine swaps with other instructions
• BRL/RTS optimization
    WIN_SWAP #1
    add r1, r2, r3
    sub r4, r1, r2
    load r1, [r4]
    IW_MOV r9, r1
    WIN_SWAP #2
    shl r3, r4, r5
    add r3, r9, #2
    load r2, [r3]
    Brl _foo()
    WIN_SWAP #1
    load r4, [r5]
    add r4, r4, #4

    WIN_SWAP #1
    mov r1, #0
    load r4, [_a]
    mul r2, r3, r4
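The first optimization, removing redundant swaps, can be sketched as a peephole pass over straight-line code. This is an illustration only: it assumes a WIN_SWAP is redundant when its target window is already active, and it conservatively treats the active window as unknown after a Brl call. The function name is hypothetical, and the hoisting, combining, and BRL/RTS optimizations are not modeled.

```python
def remove_redundant_swaps(instructions, active_window=None):
    """Drop WIN_SWAPs whose target window is already the active one."""
    result = []
    for instruction in instructions:
        parts = instruction.split()
        if parts and parts[0] == "WIN_SWAP":
            target = parts[1]
            if target == active_window:
                continue  # already in this window, so the swap does nothing
            active_window = target
        elif parts and parts[0].lower() == "brl":
            # Assumption: the callee may leave a different window active.
            active_window = None
        result.append(instruction)
    return result


code = [
    "WIN_SWAP #1",
    "add r1, r2, r3",
    "WIN_SWAP #1",
    "sub r4, r1, r2",
    "WIN_SWAP #2",
    "shl r3, r4, r5",
]
optimized = remove_redundant_swaps(code)
print(optimized)
```

In this example the second `WIN_SWAP #1` is dropped because window 1 is still active, while the swap to window 2 is kept.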
Page 23
Performance of WIMS: 2-window 8-register vs 1-window 16-register
[Figure: bar chart; vertical axis from 0 to 25]
Page 24
Overall Compilation System
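[Figure: flowchart of compiler phases]
Phases shown: FRONTEND, PREPASS SCHEDULING, CODE GENERATION, REGISTER ALLOCATION, SWAP INSERTION, POSTPASS SCHEDULING, REGISTER PARTITIONING
• REGISTER PARTITIONING
  • CALCULATE EDGE WEIGHTS
  • CALCULATE PARTITION WEIGHTS
  • MOVE NODES
• SWAP INSERTION
  • NAIVE SWAP INSERTION
  • SWAP OPTIMIZATION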