Rajiv A. Ravindran, Robert M. Senger, Eric D. Marsman Ganesh S. Dasika, Matthew R. Guthaus,

University of Michigan, Electrical Engineering and Computer Science

Increasing the Number of Effective Registers in a Low-Power Processor Using a Windowed Register File

Rajiv A. Ravindran, Robert M. Senger, Eric D. Marsman

Ganesh S. Dasika, Matthew R. Guthaus,

Scott A. Mahlke, Richard B. Brown

Department of Electrical Engineering and Computer Science

University of Michigan, Ann Arbor


Architected Registers: More or Less?

• Fewer registers:
  – Smaller hardware structures: more power efficient
  – Tighter instruction encoding, small memory footprint
  – However, more loads/stores to memory
    • Reduce performance, increase in power

• More registers:
  – Larger hardware structures: less power efficient
  – Increase in code size, larger memory footprint
  – However,
    • Map more variables from memory to registers: reduce power
    • Enable ILP optimizations


Objective of this Work

• Provide a large number of architected registers
• But, maintain instruction encoding and thus code size

• Use a windowed register file architecture
  – But, in an unconventional way

• Traditional register window
  – Reduce function save/restore overhead

• Our approach
  – Large register file partitioned into multiple windows
  – Appearance of a large register file


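Windowed Register File Architecture

[Figure: a 16-register file split into two 8-register windows, feeding the FU. Instructions carry a 3-bit operand field; the window status bit in the Machine Status Register (MSR) selects the active window.]

Example sequence:

    add r1, r2, r3
    iw-mov r9, r1
    win-swap #2
    sub r2, r1, r3

• With the first window active, r1 is register 1 in the register file
• With the second window active, r1 is register 9 in the register file
• win-swap toggles the active window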
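The operand mapping in the diagram can be sketched in a few lines. This is only an illustration, assuming two windows of 8 registers and a 3-bit operand field as shown on the slide; the function name `physical_register` and the constant `REGS_PER_WINDOW` are hypothetical and not taken from the WIMS design.

```python
REGS_PER_WINDOW = 8  # assumption: 8 registers per window, addressed by a 3-bit field


def physical_register(active_window: int, operand_field: int) -> int:
    """Map a 3-bit operand field to an index in the full register file."""
    if not 0 <= operand_field < REGS_PER_WINDOW:
        raise ValueError("operand field must fit in 3 bits")
    # The active window supplies the upper part of the index.
    return active_window * REGS_PER_WINDOW + operand_field


print(physical_register(0, 1))  # r1 with the first window active
print(physical_register(1, 1))  # r1 with the second window active
```

The instruction encoding never changes: only the window number contributed by the status bit differs between the two calls.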


Wireless Integrated Microsystems (WIMS)

Developed at the University of Michigan (Robert Senger et al., DAC 2003)


Related Work

• Traditional use of register windows
  – Reduce save/restore, context switch overhead
  – SPARC, IA-64, ADSP-219x, Tensilica

• Procedure call overhead small in embedded domain
  – Procedure inlining reduces call/return overhead
  – ~2% increase in performance using infinite register windows in WIMS

• Register connects [Kiyohara:93], register queues [Smelyanskiy:01]
  – Fixed ISA, provide more registers than allowed
  – Layer of indirection to access every operand


Motivating Example

1-window of 4-registers

    loop:
        LOAD R1-1, [SP, #24]
        ADD R1-0, R1-3, R1-1
        LOAD R1-0, [R1-0]
        LOAD R1-1, [SP, #32]
        STORE [SP, #72], R1-0
        ADD R1-0, R1-3, R1-1
        LOAD R1-0, [R1-0]
        LOAD R1-1, [SP, #72]
        MPY R1-0, R1-1, R1-0
        STORE [SP, #40], R1-0
        LOAD R1-0, [SP, #16]
        ADD R1-1, R1-3, R1-0
        LOAD R1-0, [R1-1]
        LOAD R1-1, [SP, #40]
        ADD R1-0, R1-1, R1-0
        LOAD R1-1, [SP, #80]
        STORE [R1-3], R1-0
        ADD R1-0, R1-1, #1
        ADD R1-3, R1-3, #4
        CMP R1-0, #100
        BRCT loop

2-window of 4-registers each

    loop:
        IW-MOV R1-0, R2-1
        WIN-SWAP #1
        ADD R1-3, R1-2, R1-0
        IW-MOV R1-0, R2-2
        LOAD R1-1, [R1-3]
        ADD R1-3, R1-2, R1-0
        LOAD R1-0, [R1-3]
        MPY R1-3, R1-1, R1-0
        IW-MOV R1-1, R2-3
        ADD R1-1, R1-2, R1-1
        LOAD R1-1, [R1-0]
        ADD R1-0, R1-3, R1-1
        STORE [R1-2], R1-0
        WIN-SWAP #2
        ADD R2-0, R2-0, #1
        WIN-SWAP #1
        ADD R1-2, R1-2, #4
        WIN-SWAP #2
        CMP R2-0, #100
        BRCT loop

1-window of 8-registers

    loop:
        ADD R1-3, R1-0, R1-6
        LOAD R1-2, [R1-3]
        ADD R1-3, R1-0, R1-7
        LOAD R1-4, [R1-3]
        MPY R1-3, R1-2, R1-4
        ADD R1-2, R1-0, R1-5
        ADD R1-1, R1-1, #1
        LOAD R1-4, [R1-2]
        ADD R1-2, R1-3, R1-4
        STORE [R1-0], R1-2
        ADD R1-0, R1-0, #4
        CMP R1-1, #100
        BRCT loop


Tradeoffs for the Compiler

Register Utilization

• Move variables from memory to register
• Reduces spill code
• Distribute program variables and temporaries to all available registers in multiple windows

VS

Register Management

• Reduce overhead due to window management instructions
  • Activate windows (swaps)
  • Data transfer (iw-moves)
• Bundle accesses to same window
• Fewer transitions between windows

Balance these issues in an intelligent manner


Register Window Partitioning

[Figure: graph of virtual registers VR1 to VR6 divided between Partition-1 and Partition-2]

• Weight Calculation
  • Partition weight: Over-commitment of register resources
  • Edge weight: Penalty of separating VRs

• Partitioning algorithm:
  • Move nodes between partitions to minimize partition + edge weights
  • Modified FM graph partitioning algorithm
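The quantity being minimized can be sketched as follows. This is a rough illustration of "partition + edge weights", not the authors' implementation: the function `total_cost`, its arguments, and the placeholder `no_spills` callback are hypothetical, and the only number taken from the slides is the VR6/VR9 edge weight of 9312 from the edge-weight example.

```python
def total_cost(assignment, edge_weights, partition_weight):
    """Sum of per-partition weights plus the weights of edges cut by the assignment."""
    partitions = {}
    for vr, part in assignment.items():
        partitions.setdefault(part, set()).add(vr)
    # Partition weight: penalty for over-committing the registers of one window.
    spill = sum(partition_weight(vrs) for vrs in partitions.values())
    # Edge weight: penalty paid only when the two VRs land in different windows.
    cut = sum(
        weight
        for (a, b), weight in edge_weights.items()
        if assignment[a] != assignment[b]
    )
    return spill + cut


# Toy inputs: the zero partition weight is a placeholder for the spill estimate.
edges = {("VR6", "VR9"): 9312}
no_spills = lambda vrs: 0

print(total_cost({"VR6": 1, "VR9": 1}, edges, no_spills))  # same window
print(total_cost({"VR6": 1, "VR9": 2}, edges, no_spills))  # split across windows
```

An FM-style (Fiduccia-Mattheyses) pass would repeatedly move the node whose move lowers this cost the most.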


Edge Weight Calculation: Move Cost

    loop:
        ADD VR34, VR27, VR32
        LOAD VR6, [VR34]
        LOAD VR9, [VR27]
        MPY VR10, VR6, VR9
        ADD VR20, VR20, VR10
        ADD VR2, VR2, #1
        ADD VR27, VR27, #4
        CMP VR2, 32
        BRCT loop

Profile weights on the control-flow graph: 3104 (loop), 1

If VR6 and VR9 are placed in different windows,

    MPY VR10, VR6, VR9

becomes

    IW-MOVE VR100, VR9    (x 3104)
    MPY VR10, VR6, VR100

edge weight = move-cost + swap-cost

Computed once before partitioning


Edge Weight Calculation: Swap Cost

[Figure: LOAD VR6, [VR34], LOAD VR9, [VR27] and MPY VR10, VR6, VR9 in the loop (profile weights 3104 and 1), with two SWAPs inserted to change the active window; edge between VR6 and VR9]

edge weight = move-cost + swap-cost

• swap cost: 2 x 3104 = 6208

edge weight = 3104 + 6208 = 9312

Computed once before partitioning
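The arithmetic for the VR6/VR9 edge can be reproduced with a small sketch. It assumes, as the slide's numbers suggest, that each inter-window move and each swap is charged once per execution of the loop body; the function name `edge_weight` and its parameters are hypothetical.

```python
LOOP_FREQUENCY = 3104  # profile weight of the loop body, from the slide


def edge_weight(moves_per_iteration, swaps_per_iteration, frequency):
    """Penalty of separating two VRs: move-cost + swap-cost."""
    move_cost = moves_per_iteration * frequency
    swap_cost = swaps_per_iteration * frequency
    return move_cost + swap_cost


# One IW-MOVE and two SWAPs per iteration for the VR6/VR9 pair.
print(edge_weight(1, 2, LOOP_FREQUENCY))
```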


Partition Weight Calculation


[Figure: the loop with profile weights 3104 and 1, and the virtual registers VR2, VR6, VR9, VR10, VR20, VR27, VR32, VR34]

• Estimates the spill pressure using crude register allocation
• Partition weight = sum of the cost of all the spilled VRs
• Computed dynamically during node assignment process


Partition Weight Calculation: Example

• Assume 3 registers per window/partition, and all VRs are assigned to one partition

Spill Cost

    VR32: 3104
    VR2: 9312
    VRs 10, 6, 20, 34: 6208
    VR27: 12416

    loop:
        ADD VR34, VR27, VR32      (VRs 32, 20 are spilled)
        LOAD VR6, [VR34]
        LOAD VR9, [VR27]
        MPY VR10, VR6, VR9
        ADD VR20, VR20, VR10
        ADD VR2, VR2, #1
        ADD VR27, VR27, #4
        CMP VR2, 32
        BRCT loop

Spilled VRs = {32, 20}


Partition Weight Calculation: Example

• Assume 3 registers per window/partition, and all VRs are assigned to one partition

Spill Cost

    VR32: 3104
    VR2: 9312
    VRs 10, 6, 20, 34: 6208
    VR27: 12416

    loop:
        ADD VR34, VR27, VR32      (VRs 32, 20 are spilled)
        LOAD VR6, [VR34]          (VR 6 is spilled)
        LOAD VR9, [VR27]
        MPY VR10, VR6, VR9
        ADD VR20, VR20, VR10
        ADD VR2, VR2, #1
        ADD VR27, VR27, #4
        CMP VR2, 32
        BRCT loop

Spilled VRs = {32, 20, 6}

Continuing further, partition weight = spill cost of VRs 32, 20, 6, 10 = 21728
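The final figure can be checked by summing the listed spill costs. The sketch below only reproduces that sum; the dictionary `SPILL_COST` holds the values from the slide, the function name `partition_weight` is hypothetical, and the crude register allocation that decides which VRs are spilled is not modelled.

```python
# Spill costs as listed on the slide (the cost of VR9 is not given there).
SPILL_COST = {32: 3104, 2: 9312, 10: 6208, 6: 6208, 20: 6208, 34: 6208, 27: 12416}


def partition_weight(spilled_vrs):
    """Partition weight = sum of the spill costs of the spilled VRs."""
    return sum(SPILL_COST[vr] for vr in spilled_vrs)


print(partition_weight([32, 20]))         # after the first instruction
print(partition_weight([32, 20, 6, 10]))  # after the whole loop body
```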


Node Partitioning: Example

[Figure: all eight VRs (VR2, VR6, VR9, VR10, VR20, VR27, VR32, VR34) in P1; P2 empty]

Partition weight of P1 = sum of spill cost of VRs 32, 20, 6, 10 = 21728

Partition weight of P2 = 0

    VR    Partition    Edge      Total gain
    2     6208         -2276     3932
    6     6208         -11669    -5461
    9     6208         -10723    -4515
    10    6208         -10675    -4467
    20    6208         -4234     1974
    27    6208         -13008    -6800
    32    3104         -7332     -4228
    34    0            -10436    -10436

Total Gain = Partition Weight + Edge Weight

Node moved: VR2 (largest total gain)
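The gain table can be recomputed with a short sketch. The partition and edge columns are copied from the slide; choosing the node with the largest total gain is the usual FM-style selection rule and is assumed here. The names `GAINS`, `total_gain` and `best_move` are hypothetical.

```python
# VR -> (partition gain, edge gain), copied from the table on the slide.
GAINS = {
    2: (6208, -2276),
    6: (6208, -11669),
    9: (6208, -10723),
    10: (6208, -10675),
    20: (6208, -4234),
    27: (6208, -13008),
    32: (3104, -7332),
    34: (0, -10436),
}


def total_gain(vr):
    """Total gain = partition weight gain + edge weight gain."""
    partition_gain, edge_gain = GAINS[vr]
    return partition_gain + edge_gain


def best_move():
    """Pick the VR whose move to the other partition gives the largest gain."""
    return max(GAINS, key=total_gain)


print(best_move(), total_gain(best_move()))
```

Only VR2 and VR20 have a positive total gain in this state, so they are the only moves that would lower the overall cost.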


Node Partitioning: Final Example

[Figure: P1 holds VR6, VR9, VR27, VR32, VR34; P2 holds VR2, VR10, VR20; profile weights 3104 and 1]

Partition weight of P1 = spill cost of VR 32 = 3104

Partition weight of P2 = 0

    loop:
        WIN_SWAP #1
        LOAD 32:R1-0, [SP, #0]
        ADD 34:R1-3, 27:R1-1, 32:R1-0
        LOAD 9:R1-3, [27:R1-1]
        LOAD 6:R1-2, [34:R1-3]
        MPY 39:R1-0, 6:R1-2, 9:R1-3
        IW_MOV 10:R2-2, 39:R1-0
        WIN_SWAP #2
        ADD 20:R2-1, 20:R2-1, 10:R2-2
        ADD 2:R2-0, 2:R2-0, #1
        WIN_SWAP #1
        ADD 27:R1-1, 27:R1-1, #4
        WIN_SWAP #2
        CMP 2:R2-0, #32
        BRCT loop

• Reduced from 6 spill operations to 1
• Added 5 window management instructions
• Performance remains the same, but power decreases


Performance of WIMS: 8 registers/window, 1 window vs 2 and 4 windows

[Figure: bar chart, y-axis "% cycles" from -50 to 50. Series: Performance, Spill benefit, Swap and move overhead. Benchmarks: fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpg2dec, average.]


Performance of VLIW: 8 registers/window, 1 window vs 2 and 4 windows

[Figure: bar chart, y-axis "% cycles" from -50 to 50. Series: Performance, Spill benefit, Swap and move overhead. Benchmarks: fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpg2dec, average.]


Power savings on the 8-register WIMS: 1-window vs 2- and 4-window machine

[Figure: bar chart, y-axis "% power savings" from 0 to 20. Series: 2-window, 4-window. Benchmarks: fir, rawc, rawd, g721enc, g721dec, compress, sha, yacc, cjpeg, djpeg, gsmenc, gsmdec, unepic, mpeg2dec, average.]


Conclusion

• A novel graph-partitioning-based compiler algorithm to exploit windowed register files within a single procedure

• Hardware/software solution for reducing code size while maintaining an effectively large number of registers

Average improvement in performance

            w2.r4    w4.r4    w8.r4    w2.r16   w2.r8    w4.r8
    WIMS    2.96     12.7     16.18    2.55     7        10
    VLIW    10.58    26.38    33.83    5.03     18       22

• 7% reduction in power for the 8-register case on WIMS


Swap Cost Over-Counting


[Figure: MPY VR10, VR6, VR9 and LOAD VR9, [VR27] in the loop (profile weights 3104 and 1); graph nodes VR9, VR27, VR10, VR6]

Swaps counted, one per edge:

    SWAP - VR9, VR6
    SWAP - VR9, VR10
    SWAP - VR27, VR10
    SWAP - VR27, VR6

4 swaps!

In reality, only 1 swap is required

Solution: normalize swap cost

Swap cost between VRs 6 and 9 = 1/4 of cost of single swap = 1/4 * 3104 = 776
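The normalization can be written out as a small sketch. It assumes the cost of the single real swap is divided evenly over the edges that would each have counted it; the names `SWAP_EDGES` and `normalized_swap_cost` are hypothetical, while the four edges and the 3104 frequency come from the slide.

```python
# The four VR pairs that each count the same swap.
SWAP_EDGES = [("VR9", "VR6"), ("VR9", "VR10"), ("VR27", "VR10"), ("VR27", "VR6")]
SINGLE_SWAP_COST = 3104  # one swap per loop iteration


def normalized_swap_cost(single_swap_cost, edges):
    """Share the cost of one real swap equally among the edges that count it."""
    return single_swap_cost / len(edges)


per_edge = normalized_swap_cost(SINGLE_SWAP_COST, SWAP_EDGES)
print(per_edge)                    # cost charged to the VR6/VR9 edge
print(per_edge * len(SWAP_EDGES))  # the four shares add back up to one swap
```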


Swap Insertion & Optimization

• Remove redundant swaps

• Hoist swaps to less frequently executed region

• Combine swaps with other instructions

• BRL/RTS optimization

    WIN_SWAP #1
    add r1, r2, r3
    sub r4, r1, r2
    load r1, [r4]
    IW_MOV r9, r1
    WIN_SWAP #2
    shl r3, r4, r5
    add r3, r9, #2
    load r2, [r3]
    Brl _foo()
    WIN_SWAP #1
    load r4, [r5]
    add r4, r4, #4

    WIN_SWAP #1
    mov r1, #0
    load r4, [_a]
    mul r2, r3, r4
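The first optimization, removing redundant swaps, can be illustrated with a simple peephole pass over straight-line code. This is a sketch under stated assumptions, not the authors' pass: instructions are plain strings, a swap to the window that is already active is treated as redundant, and a `Brl` call conservatively makes the active window unknown. The function name `remove_redundant_swaps` is hypothetical.

```python
def remove_redundant_swaps(instructions):
    """Drop WIN_SWAP instructions that select the window already active."""
    active = None  # window known to be active; None means unknown
    kept = []
    for ins in instructions:
        op = ins.split()[0].upper()
        if op == "WIN_SWAP":
            target = ins.split("#", 1)[1].strip()
            if target == active:
                continue  # already in this window: the swap does nothing
            active = target
        kept.append(ins)
        if op == "BRL":
            active = None  # assumption: the callee may change the window
    return kept


code = [
    "WIN_SWAP #1",
    "add r1, r2, r3",
    "WIN_SWAP #1",
    "sub r4, r1, r2",
    "Brl _foo()",
    "WIN_SWAP #1",
    "load r4, [r5]",
]
for ins in remove_redundant_swaps(code):
    print(ins)
```

The swap after the call is kept here because nothing is known about the callee; the BRL/RTS optimization listed above presumably uses that knowledge to remove more.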


Performance of WIMS: 2-window 8-register vs 1-window 16-register

[Figure: bar chart, axis from 0 to 25]


Overall Compilation System

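[Diagram: compilation flow with the stages FRONTEND, PREPASS SCHEDULING, CODE GENERATION, REGISTER ALLOCATION, SWAP INSERTION, POSTPASS SCHEDULING, and REGISTER PARTITIONING]

• REGISTER PARTITIONING: CALCULATE EDGE WEIGHTS, CALCULATE PARTITION WEIGHTS, MOVE NODES
• SWAP INSERTION: NAIVE SWAP INSERTION, SWAP OPTIMIZATION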
