CMPUT 680 - Compiler Design and Optimization
CMPUT680 - Winter 2006
Topic F: IA-64 Hardware Support for Software Pipelining
José Nelson Amaral
http://www.cs.ualberta.ca/~amaral/courses/680


Suggested Reading

Intel IA-64 Architecture Software Developer’s Manual, Chapters 8, 9

Instruction Group

An instruction group is a set of instructions that have no read after write (RAW) or write after write (WAW) register dependencies. Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

    ld8 r1=[r5]        // First group
    sub r6=r8, r9 ;;   // First group
    add r3=r1,r4       // Second group (reads r1, written by the ld8)
    st8 [r6]=r12       // Second group
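The grouping rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names `parse` and `split_into_groups` are hypothetical, only general registers written as `rN` are tracked, and side effects such as post-increment address updates are ignored. A stop is placed before the first instruction that reads (RAW) or writes (WAW) a register already written in the current group.

```python
import re

def parse(instr):
    """Return (written_regs, read_regs) for a simple 'op dst=src,...' instruction."""
    body = instr.split("//")[0].replace(";;", "").strip()
    lhs, rhs = body.split("=", 1)
    regs = lambda s: set(re.findall(r"\br\d+\b", s))
    if "[" in lhs:
        # Store: the register inside [...] is an address that is read, not written.
        return set(), regs(lhs) | regs(rhs)
    return regs(lhs), regs(rhs)

def split_into_groups(instrs):
    """Greedily start a new group whenever a RAW or WAW dependence appears."""
    groups, current, written = [], [], set()
    for ins in instrs:
        dst, src = parse(ins)
        if (src & written) or (dst & written):
            groups.append(current)
            current, written = [], set()
        current.append(ins)
        written |= dst
    if current:
        groups.append(current)
    return groups

code = ["ld8 r1=[r5]", "sub r6=r8,r9", "add r3=r1,r4", "st8 [r6]=r12"]
groups = split_into_groups(code)
for number, group in enumerate(groups, start=1):
    print(f"group {number}: {group}")
```

On this four-instruction sequence the sketch closes the first group before the add, because the add reads r1, which the ld8 writes.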

Instruction Bundles

Instructions are organized in bundles of three instructions, with the following format:

    Field:   instruction slot 2   instruction slot 1   instruction slot 0   template
    Bits:    127-87               86-46                45-5                 4-0
    Width:   41                   41                   41                   5

    Instruction Type   Description       Execution Unit Type
    A                  Integer ALU       I-unit or M-unit
    I                  Non-ALU integer   I-unit
    M                  Memory            M-unit
    F                  Floating-Point    F-unit
    B                  Branch            B-unit
    L+X                Extended          I-unit/B-unit
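As a minimal sketch of the bundle layout (5 template bits followed by three 41-bit slots), the following Python treats a bundle as a 128-bit integer and splits it into its fields. The function names `decode_bundle` and `encode_bundle` are illustrative only, and the field values in the example are arbitrary numbers, not real IA-64 encodings.

```python
MASK41 = (1 << 41) - 1

def decode_bundle(bundle):
    """Split a 128-bit integer into (template, slot0, slot1, slot2)."""
    assert 0 <= bundle < (1 << 128)
    template = bundle & 0x1F          # bits 4..0
    slot0 = (bundle >> 5) & MASK41    # bits 45..5
    slot1 = (bundle >> 46) & MASK41   # bits 86..46
    slot2 = (bundle >> 87) & MASK41   # bits 127..87
    return template, slot0, slot1, slot2

def encode_bundle(template, slot0, slot1, slot2):
    """Inverse of decode_bundle; fields must fit in 5 and 41 bits."""
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)

# Arbitrary field values, chosen only to exercise the bit positions.
example = encode_bundle(0x10, 1, 2, 3)
decoded = decode_bundle(example)
print(decoded)
```

The three slot widths plus the template width add up to exactly 128 bits, which is why no padding appears in the layout.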

Bundles

In assembly, each 128-bit bundle is enclosed in curly braces and contains a template specification.

    { .mii
        ld4 r28=[r8]    // Load a 4-byte value
        add r9=2,r1     // 2+r1 and put in r9
        add r30=1,r1    // 1+r1 and put in r30
    }

An instruction group can extend over an arbitrary number of bundles.

Templates

There are restrictions on the type of instructions that can be bundled together. The IA-64 has five slot types (M, I, F, B, and L), six instruction types (M, I, A, F, B, L), and twelve basic template types (MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB).

The underscore in the bundle acronym indicates a stop.

Every basic bundle type has two versions: one with a stop at the end of the bundle and one without.

Control Dependency Preventing Code Motion

In the code below, the ld4 is control dependent on the branch, and thus cannot be safely moved up in conventional processor architectures.

         add r7=r6,1               // cycle 0
         add r13=r25, r27
         cmp.eq p1, p2=r12, r23
    (p1) br.cond some_label ;;

         ld4 r2=[r3] ;;            // cycle 1
         sub r4=r2, r11            // cycle 3

[Figure: blocks A and B, with the br and the ld.]

Control Speculation

In the following code, suppose a load latency of two cycles:

    (p1) br.cond.dptk L1        // cycle 0
         ld8 r3=[r5] ;;         // cycle 1
         shr r7=r3,r87          // cycle 3

However, if we execute the load before we know that we actually have to do it (control speculation), we get:

         ld8.s r3=[r5]          // earlier cycle
         // other, unrelated instructions
    (p1) br.cond.dptk L1 ;;     // cycle 0
         chk.s r3, recovery     // cycle 1
         shr r7=r3,r87          // cycle 1

Control Speculation

The ld8.s instruction is a speculative load. The chk.s instruction is a check that verifies whether the loaded value is still good, that is, whether the speculative load completed without deferring an exception. If the check fails, control transfers to the recovery code.

Ambiguous Memory Dependencies

An ambiguous memory dependency is a dependence between a load and a store, or between two stores, where it cannot be determined if the instructions involved access overlapping memory locations.

Two or more memory references are independent if it is known that they access non-overlapping memory locations.

Data Speculation

An advanced load allows a load to be moved above a store even if it is not known whether the load and the store may reference overlapping memory locations.

Original code:

    st8 [r55]=r45        // cycle 0
    ld8 r3=[r5] ;;       // cycle 0
    shr r7=r3,r87        // cycle 2

With an advanced load:

    ld8.a r3=[r5] ;;     // Advanced Load
    // other, unrelated instructions
    st8 [r55]=r45        // cycle 0
    ld8.c r3=[r5] ;;     // cycle 0 - check
    shr r7=r3,r87        // cycle 0

Moving Up Loads + Uses: Recovery Code

Original Code

    st8 [r4] = r12       // cycle 0: ambiguous store
    ld8 r6 = [r8] ;;     // cycle 0: load to advance
    add r5 = r6,r7       // cycle 2
    st8 [r18] = r5       // cycle 3

Speculative Code

             ld8.a r6 = [r8] ;;    // cycle -3
             // other, unrelated instructions
             add r5 = r6,r7        // cycle -1; add that uses r6
             // other, unrelated instructions
             st8 [r4]=r12          // cycle 0
             chk.a r6, recover     // cycle 0: check
    back:                          // Return point from jump to recover
             st8 [r18] = r5        // cycle 0

    recover:
             ld8 r6 = [r8] ;;      // Reload r6 from [r8]
             add r5 = r6,r7        // Re-execute the add
             br back               // Jump back to main code

ld.c, chk.a and the ALAT

The execution of an advanced load, ld.a, creates an entry in a hardware structure, the Advanced Load Address Table (ALAT). This table is indexed by the register number. Each entry records the load address, the load type, and the size of the load.

When a check is executed, the entry for the register is checked to verify that a valid entry with the type specified is there.

ld.c, chk.a and the ALAT

Entries are removed from the ALAT when:

(1) A store overlaps with the memory locations specified in the ALAT entry;
(2) Another advanced load to the same register is executed;
(3) There is a context switch caused by the operating system (or hardware);
(4) Capacity limitation of the ALAT implementation requires reuse of the entry.
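The four removal rules can be mimicked with a small Python model. This is a sketch under stated assumptions, not the hardware design: the class name `ALAT`, the default capacity of 32 entries, and the oldest-first replacement policy are all hypothetical, and the load type recorded by a real entry is omitted.

```python
class ALAT:
    """Toy model of the Advanced Load Address Table, indexed by register number."""

    def __init__(self, capacity=32):          # capacity is an assumed value
        self.capacity = capacity
        self.entries = {}                     # register number -> (address, size)

    def advanced_load(self, reg, addr, size):
        # Rule (2): a new advanced load to the same register replaces the entry.
        self.entries.pop(reg, None)
        # Rule (4): when full, reuse the oldest entry (assumed policy).
        if len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        # Rule (1): drop every entry whose byte range overlaps the store.
        for reg, (a, s) in list(self.entries.items()):
            if addr < a + s and a < addr + size:
                del self.entries[reg]

    def context_switch(self):
        # Rule (3): all entries are invalidated.
        self.entries.clear()

    def check(self, reg):
        """True if a valid entry exists, i.e. the advanced load is still good."""
        return reg in self.entries

alat = ALAT()
alat.advanced_load(3, 0x1000, 8)      # like: ld8.a r3=[r5] with r5 = 0x1000
alat.store(0x2000, 8)                 # store to a disjoint location
valid_after_disjoint_store = alat.check(3)
alat.store(0x1004, 8)                 # store that overlaps bytes 0x1004..0x1007
valid_after_overlapping_store = alat.check(3)
print(valid_after_disjoint_store, valid_after_overlapping_store)
```

A failed check corresponds to the case where ld.c reloads the value or chk.a branches to recovery code.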

Not a Thing (NaT)

The IA-64 has 128 general purpose registers, each with 64+1 bits, and 128 floating point registers, each with 82 bits.

The extra bit in the GPRs is the NaT bit that is used to indicate that the content of the register is not valid.

NaT=1 indicates that an instruction that generated an exception wrote to the register. It is a way to defer exceptions caused by speculative loads.

Any operation that uses NaT as an operand results in NaT.
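NaT propagation and its use by a speculative load and check can be sketched as follows. The names `Reg`, `speculative_load`, `add`, and `chk_s` are illustrative, and a missing key in a Python dictionary stands in for a memory access that would raise an exception.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reg:
    """A general register: a value plus its NaT bit."""
    value: int = 0
    nat: bool = False

def speculative_load(memory, addr):
    """Model of ld.s: a faulting access sets NaT instead of raising."""
    if addr not in memory:
        return Reg(0, nat=True)
    return Reg(memory[addr])

def add(a, b):
    """Any operation with a NaT operand produces NaT."""
    if a.nat or b.nat:
        return Reg(0, nat=True)
    return Reg(a.value + b.value)

def chk_s(reg):
    """Model of chk.s: True means control should go to the recovery code."""
    return reg.nat

memory = {0x100: 41}
good = add(speculative_load(memory, 0x100), Reg(1))
bad = add(speculative_load(memory, 0x999), Reg(1))
print(good, chk_s(good))
print(bad, chk_s(bad))
```

Because NaT flows through dependent operations, a single check on the final result is enough to detect that an earlier speculative load failed.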

If-conversion

If-conversion uses predicates to transform conditional code into single control stream code.

    if (r4) {
        add r1 = r2, r3
        ld8 r6 = [r5]
    }

becomes

         cmp.ne p1, p0 = r4, 0 ;;   // Set predicate reg
    (p1) add r1 = r2, r3
    (p1) ld8 r6 = [r5]

and

    if (r1)
        r2 = r3 + r4
    else
        r7 = r6 - r5

becomes

         cmp.ne p1, p2 = r1, 0 ;;   // Set predicate reg
    (p1) add r2 = r3, r4
    (p2) sub r7 = r6, r5
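To illustrate why the predicated form is equivalent to the branching form, the sketch below writes the if/else example both ways in Python and compares the results. The function names are hypothetical, the arithmetic follows the assembly version (add r2 = r3, r4 and sub r7 = r6, r5), and `None` stands for a register the code leaves untouched.

```python
def branchy(r1, r3, r4, r5, r6):
    """Original control flow: only one side of the branch executes."""
    r2 = r7 = None
    if r1 != 0:
        r2 = r3 + r4
    else:
        r7 = r6 - r5
    return r2, r7

def if_converted(r1, r3, r4, r5, r6):
    """Single control stream: both instructions issue, predicates gate the writes."""
    r2 = r7 = None
    p1 = (r1 != 0)                      # cmp.ne p1, p2 = r1, 0
    p2 = not p1
    r2 = (r3 + r4) if p1 else r2        # (p1) add r2 = r3, r4
    r7 = (r6 - r5) if p2 else r7        # (p2) sub r7 = r6, r5
    return r2, r7

inputs = [(0, 1, 2, 3, 4), (1, 1, 2, 3, 4), (-5, 10, 20, 30, 40)]
results = [(branchy(*args), if_converted(*args)) for args in inputs]
for (expected, actual) in results:
    print(expected, actual)
```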

Code Generation for Software Pipelining

          z_0 ← &Z(1)
          x_0 ← &X(1)
          q_0 ← 0.0
          DO k=1,N                          Lat.
    (a)     u_k ← load z_{k-1}              (6)
    (b)     v_k ← load x_{k-1}              (6)
    (c)     w_k ← u_k * v_k                 (2)
    (d)     q_k ← q_{k-1} + w_k             (2)
    (e)     z_k ← z_{k-1} + 4               (1)
    (f)     x_k ← x_{k-1} + 4               (1)
          END DO

    cycle   MEM1   MEM2   ADD1   ADD2   FMLT   FADD
      1     a1     b1     e1     f1
      2
      3     a2     b2     e2     f2
      4
      5     a3     b3     e3     f3
      6
      7     a4     b4     e4     f4     c1
      8
      9     a5     b5     e5     f5     c2     d1
     10
     11     a6     b6     e6     f6     c3     d2
      …     …      …      …      …      …      …
    201     a100   b100   e100   f100   c97    d96
    202
    203                                 c98    d97
    204
    205                                 c99    d98
    206
    207                                 c100   d99
    208
    209                                        d100

Code Generation for Software Pipelining

           z_0 ← &Z(1)
           x_0 ← &X(1)
           q_0 ← 0.0
    (a1)   u_1 ← load z_0
    (b1)   v_1 ← load x_0
    (e1)   z_1 ← z_0 + 4
    (f1)   x_1 ← x_0 + 4
    (a2)   u_2 ← load z_1
    (b2)   v_2 ← load x_1
    (e2)   z_2 ← z_1 + 4
    (f2)   x_2 ← x_1 + 4
    (a3)   u_3 ← load z_2
    (b3)   v_3 ← load x_2
    (e3)   z_3 ← z_2 + 4
    (f3)   x_3 ← x_2 + 4
    (a4)   u_4 ← load z_3
    (b4)   v_4 ← load x_3
    (c1)   w_1 ← u_1 * v_1
    (e4)   z_4 ← z_3 + 4
    (f4)   x_4 ← x_3 + 4

Code Generation for Software Pipelining

                DO k=1,N-4
    (a_{k+4})     u_{k+4} ← load z_{k+3}
    (b_{k+4})     v_{k+4} ← load x_{k+3}
    (c_{k+1})     w_{k+1} ← u_{k+1} * v_{k+1}
    (d_k)         q_k ← q_{k-1} + w_k
    (e_{k+4})     z_{k+4} ← z_{k+3} + 4
    (f_{k+4})     x_{k+4} ← x_{k+3} + 4
                END DO
    (c98)       w_98 ← u_98 * v_98
    (d97)       q_97 ← q_96 + w_97
    (c99)       w_99 ← u_99 * v_99
    (d98)       q_98 ← q_97 + w_98
    (c100)      w_100 ← u_100 * v_100
    (d99)       q_99 ← q_98 + w_99
    (d100)      q_100 ← q_99 + w_100

Code Generation for Software Pipelining

          z_0 ← &Z(1)
          x_0 ← &X(1)
          q_0 ← 0.0
          DO k=1,4                          (prolog counter)
    (a)     u_k ← load z_{k-1}
    (b)     v_k ← load x_{k-1}
    (e)     z_k ← z_{k-1} + 4
    (f)     x_k ← x_{k-1} + 4
          END DO
    (c)   w_1 ← u_1 * v_1
          DO k=1,N-4                        (loop counter)
    (a)     u_{k+4} ← load z_{k+3}
    (b)     v_{k+4} ← load x_{k+3}
    (c)     w_{k+1} ← u_{k+1} * v_{k+1}
    (d)     q_k ← q_{k-1} + w_k
    (e)     z_{k+4} ← z_{k+3} + 4
    (f)     x_{k+4} ← x_{k+3} + 4
          END DO

Code Generation for Software Pipelining

          DO k=N-3,N-1                      (epilog counter)
    (c)     w_{k+1} ← u_{k+1} * v_{k+1}
    (d)     q_k ← q_{k-1} + w_k
          END DO
    (d)   q_100 ← q_99 + w_100

Code Generation for Software Pipelining (try 3)

But we still have not solved the register allocation problem.

The software-pipelined code needs a large number of registers. What can we do about it?

Without software pipelining the following code could be generated:

          R0 ← &Z(1)
          R1 ← &X(1)
          F0 ← 0.0
          R2 ← 1
    loop: F1 ← load [R0]
          F2 ← load [R1]
          F3 ← mult F1, F2
          F0 ← add F0, F3
          R0 ← add R0, 4
          R1 ← add R1, 4
          R2 ← add R2, 1
          brne R2, N loop

Optimization of Loops

    L1: ld4 r4 = [r5], 4 ;;    // Cycle 0 load postinc 4
        add r7 = r4, r9 ;;     // Cycle 2
        st4 [r6] = r7, 4       // Cycle 3 store postinc 4
        br.cloop L1 ;;         // Cycle 3

Instructions Description:

    ld4 r4 = [r5], 4 ;;     r4 ← MEM[r5];  r5 ← r5 + 4
    st4 [r6] = r7, 4        MEM[r6] ← r7;  r6 ← r6 + 4
    br.cloop L1             if LC ≠ 0 then LC ← LC - 1; goto L1

Optimization of Loops

    (a) L1: ld4 r4 = [r5], 4 ;;
    (b)     add r7 = r4, r9 ;;
    (c)     st4 [r6] = r7, 4
    (d)     br.cloop L1 ;;

    Cycle   Iter 1   Iter 2   Iter 3   Iter 4
      0     a
      1
      2     b
      3     c/d
      4              a
      5
      6              b
      7              c/d
      8                       a
      9
     10                       b
     11                       c/d
     12                                a
     13
     14                                b

If LC=1000, how long does it take for this loop to execute?

It takes 4000 cycles.

Optimization of Loops: Loop Unrolling

    (a) L1: ld4 r4 = [r5], 4 ;;
    (b)     ld4 r14 = [r5], 4 ;;
    (c)     add r7 = r4, r9 ;;
    (d)     add r17 = r14, r9
    (e)     st4 [r6] = r7,4 ;;
    (f)     st4 [r6] = r17,4
    (g)     br.cloop L1 ;;

    Cycle   Iter 1   Iter 2   Iter 3
      0     a
      1     b
      2     c
      3     d/e
      4     f/g
      5              a
      6              b
      7              c
      8              d/e
      9              f/g
     10                       a
     11                       b
     12                       c
     13                       d/e
     14                       f/g

For simplicity we assume that N is a multiple of 2.

Because the loads (a) and (b) both update r5, they have to be serialized.

Optimization of Loops: Loop Unrolling

If LC=1000 for the original loop, how long does it take for this loop to execute?

It takes 2500 cycles. Thus the loop is 4000/2500 = 1.6 times faster.

Optimization of Loops: Expanding the Induction Variable

            add r15 = 4, r5
            add r16 = 4, r6 ;;
    (a) L1: ld4 r4 = [r5], 8
    (b)     ld4 r14 = [r15], 8 ;;
    (c)     add r7 = r4, r9
    (d)     add r17 = r14, r9 ;;
    (e)     st4 [r6] = r7,8
    (f)     st4 [r16] = r17,8
    (g)     br.cloop L1 ;;

    Cycle   Iter 1   Iter 2   Iter 3   Iter 4
      0     a/b
      1
      2     c/d
      3     e/f/g
      4              a/b
      5
      6              c/d
      7              e/f/g
      8                       a/b
      9
     10                       c/d
     11                       e/f/g
     12                                a/b
     13
     14                                c/d

We use twice as many functional units as the original code.

But no instruction is issued in cycle 1, and functional units are still under-utilized.

Optimization of Loops: Expanding the Induction Variable

If LC=1000 for the original loop, how long does it take for this loop to execute?

It takes 2000 cycles. Thus the loop is 4000/2000 = 2.0 times faster.

Optimization of Loops: Further Loop Unrolling

            add r15 = 4, r5
            add r25 = 8, r5
            add r35 = 12, r5
            add r16 = 4, r6
            add r26 = 8, r6
            add r36 = 12, r6 ;;
    (a) L1: ld4 r4 = [r5], 16
    (b)     ld4 r14 = [r15], 16 ;;
    (c)     ld4 r24 = [r25], 16
    (d)     ld4 r34 = [r35], 16 ;;
    (e)     add r7 = r4, r9
    (f)     add r17 = r14, r9 ;;
    (g)     st4 [r6] = r7,16
    (h)     st4 [r16] = r17,16
    (i)     add r27 = r24, r9
    (j)     add r37 = r34, r9 ;;
    (k)     st4 [r26] = r27, 16
    (l)     st4 [r36] = r37, 16
    (m)     br.cloop L1 ;;

    Cycle   Iter 1    Iter 2    Iter 3
      0     a/b
      1     c/d
      2     e/f
      3     g/h/i/j
      4     k/l/m
      5               a/b
      6               c/d
      7               e/f
      8               g/h/i/j
      9               k/l/m
     10                         a/b
     11                         c/d
     12                         e/f
     13                         g/h/i/j
     14                         k/l/m

Optimization of Loops: Further Loop Unrolling

If LC=1000 for the original loop, how long does it take for this loop (unrolled 4 times) to execute?

It takes 250*5 = 1250 cycles. Thus the loop is 4000/1250 = 3.2 times faster.
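The cycle counts quoted for the four loop versions follow from one formula: the number of kernel iterations (source iterations divided by the unroll factor) times the cycles per kernel iteration read off each schedule. A small Python check, where the helper name `total_cycles` and the dictionary keys are illustrative and the source iteration count is taken as 1000 as in the slides:

```python
def total_cycles(source_iterations, unroll_factor, cycles_per_kernel_iteration):
    """Cycles for the whole loop, assuming the count divides evenly by the unroll factor."""
    kernel_iterations = source_iterations // unroll_factor
    return kernel_iterations * cycles_per_kernel_iteration

N = 1000
cycles = {
    "original": total_cycles(N, 1, 4),
    "unrolled x2": total_cycles(N, 2, 5),
    "unrolled x2, expanded induction variable": total_cycles(N, 2, 4),
    "unrolled x4": total_cycles(N, 4, 5),
}
speedups = {name: cycles["original"] / c for name, c in cycles.items()}

for name in cycles:
    print(f"{name}: {cycles[name]} cycles, {speedups[name]:.1f}x")
```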

Loop Optimization: Loop Unrolling

In the previous example we obtained a good utilization of the functional units through loop unrolling.

But at the cost of code expansion and higher register pressure.

Software Pipelining offers an alternative by overlapping the execution of operations from multiple iterations of the loop.

Loop Optimization: Software Pipelining

    (S1) ld4 r4 = [r5], 4
    (S2) - - -
    (S3) add r7 = r4, r9
    (S4) st4 [r6] = r7, 4

    * This is not real code

    Cycle   It 1   It 2   It 3   It 4   It 5   It 6   It 7
      0     S1                                               prologue
      1            S1
      2     S3            S1
      3     S4     S3            S1                          kernel
      4            S4     S3            S1
      5                   S4     S3            S1
      6                          S4     S3            S1
      7                                 S4     S3            epilogue
      8                                        S4     S3
      9                                               S4
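The overlapped schedule can be generated mechanically. The sketch below assumes one new source iteration starts every cycle and that stages S1, S3, and S4 execute 0, 2, and 3 cycles after an iteration starts (S2 being the empty stage that covers the load latency). The function name `pipeline_schedule` is illustrative.

```python
def pipeline_schedule(num_iterations, stage_offsets):
    """Map each cycle to the list of (iteration, stage) pairs issued in it."""
    schedule = {}
    for iteration in range(1, num_iterations + 1):
        start = iteration - 1                  # one new iteration per cycle
        for stage, offset in stage_offsets.items():
            schedule.setdefault(start + offset, []).append((iteration, stage))
    return schedule

offsets = {"S1": 0, "S3": 2, "S4": 3}          # S2 is an empty stage
sched = pipeline_schedule(7, offsets)

# Cycles in which all three stages are active form the kernel.
kernel_cycles = [c for c in sorted(sched) if len(sched[c]) == len(offsets)]

for cycle in sorted(sched):
    ops = " ".join(f"{stage}(it{it})" for it, stage in sched[cycle])
    print(f"{cycle}: {ops}")
```

With seven iterations the cycles before the kernel fill the pipeline (prologue) and the cycles after it drain the pipeline (epilogue).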

Loop Optimization: Software Pipelining Code

prologue:

        ld4 r4 = [r5], 4 ;;    // load x[1]
        ld4 r4 = [r5], 4 ;;    // load x[2]
        add r7 = r4, r9        // y[1] = x[1] + k
        ld4 r4 = [r5], 4 ;;    // load x[3]

kernel:

    L1: ld4 r4 = [r5], 4       // load x[i+3]
        add r7 = r4, r9        // y[i+1] = x[i+1] + k
        st4 [r6] = r7, 4       // store y[i]
        br.cloop L1 ;;

epilogue:

        st4 [r6] = r7, 4       // store y[n-2]
        add r7 = r4, r9 ;;     // y[n-1] = x[n-1] + k
        st4 [r6] = r7, 4       // store y[n-1]
        add r7 = r4, r9 ;;     // y[n] = x[n] + k
        st4 [r6] = r7, 4       // store y[n]

Support for Software Pipelining in the IA-64

After a loop is converted into a software pipeline, it looks quite different from the original loop. Intel adopts the following terminology:

source loop and source iteration: refer to the original source code.

kernel loop and kernel iteration: refer to the code that implements the software pipeline.

Loop Support in the IA-64: Register Rotation

The IA-64 has a rotating register base (rrb) register that is decremented by special software pipelined loop branches.

When the rrb is decremented, the value stored in register X appears to move to register X+1, and the value of the highest numbered rotating register appears to move to the lowest numbered rotating register.
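One way to picture rotation is as a renaming from logical to physical register numbers. The sketch below assumes a rotating region of 8 general registers starting at r32 and models the mapping as modular arithmetic on the rrb; the function name `physical_register` and the region size are assumptions made for illustration.

```python
def physical_register(logical, rrb, first=32, size=8):
    """Physical register that a logical rotating register name refers to."""
    return first + (logical - first + rrb) % size

# A value written through logical r35 while rrb = 0 lives in physical r35.
home = physical_register(35, rrb=0)

# After each decrement of the rrb, the same physical register is reached
# through the next higher logical name: the value "moves" from X to X+1.
names_over_time = [(rrb, logical) for rrb, logical in [(0, 35), (-1, 36), (-2, 37)]
                   if physical_register(logical, rrb) == home]

# Wrap-around: after one rotation, logical r32 names what was the highest register.
wrapped = physical_register(32, rrb=-1)
print(home, names_over_time, wrapped)
```

No data is copied when the rrb changes; only the mapping from names to physical registers shifts.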

Loop Support in the IA-64: Register Rotation

What registers can rotate?

    The predicate registers p16-p63;
    The floating-point registers f32-f127;
    A programmable portion of the general registers:
        The alloc instruction can allocate 0, 8, 16, 24, …, 96 general registers as rotating registers.
        The lowest numbered rotating register is r32.

There are three rrbs: rrb.gr, rrb.fr, and rrb.pr.

How Register Rotation Helps Software Pipeline

The concept of a software pipelining branch:

    L1: ld4 r35 = [r4], 4    // post-increment by 4
        st4 [r5] = r37, 4    // post-increment by 4
        swp_branch L1 ;;

The pseudo-instruction swp_branch in the example rotates the general registers.

Therefore the value stored into r35 is read in r37 two kernel iterations (and two rotations) later.

The register rotation eliminated a dependence between the load and the store instructions, and allowed each kernel iteration of the loop to execute in one cycle.

How Register Rotation Helps Software Pipeline

[Figure: physical and logical numbering of registers r32-r39 for RRB = 0, -1, and -2, for the ld4/st4 loop with swp_branch. The loaded values shown are 7, then 8, then 9; each loaded value appears one logical register higher after every rotation, so the value loaded into r35 is read from r37 two rotations later.]

The stage predicate

    (S1): (p16) ld4 r4 = [r5], 4
    (S2): (p17) - - -
    (S3): (p18) add r7 = r4, r9
    (S4): (p19) st4 [r6] = r7, 4

When assembling a software pipeline the programmer can assign a stage predicate to each stage of the pipeline to control the execution of the instructions in that stage.

p16 is architecturally defined as the predicate for the first stage, p17 for the second, and so on.

The software pipeline branch rotates the predicate registers and injects a 1 in p16, thus enabling one additional stage of the pipeline at a time during the execution of the prolog.

The stage predicate

When the loop counter (LC) reaches zero, the software pipeline branch starts to decrement the epilog counter (EC) and injects 0 in p16 at every rotation to execute the epilogue of the software pipelined loop.

Anatomy of a Software Pipelining Branch

    LC?
      != 0 (prolog/kernel):   LC--;           PR[16]=1;   RRB--;   branch
      == 0 (epilog):  EC?
          > 1:                EC--;           PR[16]=0;   RRB--;   branch
          = 1:                EC--;           PR[16]=0;   RRB--;   fall-thru
          = 0:                EC unchanged;   PR[16]=0;            fall-thru
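The decision structure of the software pipelining branch can be written as a small state function. This is a sketch of the flowchart only: the names `br_ctop` and `run` are illustrative, and the function returns the value injected into p16 and whether rotation happens rather than modelling the register files.

```python
def br_ctop(lc, ec):
    """One execution of the branch.

    Returns (new_lc, new_ec, p16_injected, rotate, branch_taken).
    """
    if lc != 0:                        # prolog / kernel
        return lc - 1, ec, 1, True, True
    if ec > 1:                         # epilog: stages still draining
        return lc, ec - 1, 0, True, True
    if ec == 1:                        # last epilog iteration
        return lc, ec - 1, 0, True, False
    return lc, ec, 0, False, False     # ec == 0: fall through, no rotation

def run(lc, ec):
    """Execute the branch repeatedly until it falls through."""
    trace = []
    while True:
        lc, ec, p16, rotate, taken = br_ctop(lc, ec)
        trace.append((lc, ec, p16, taken))
        if not taken:
            return trace

trace = run(lc=4, ec=3)
for step in trace:
    print(step)
```

Starting from LC=4 and EC=3 the branch executes seven times: a 1 is injected while LC counts down, then zeros are injected while EC drains the remaining stages.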

Software Pipelining Example in the IA-64

          mov pr.rot = 0              // Clear all rotating predicate registers
          cmp.eq p16,p0 = r0,r0       // Set p16=1
          mov ar.lc = 4               // Set loop counter to n-1
          mov ar.ec = 3               // Set epilog counter to 3
          …
    loop:
    (p16) ld1 r32 = [r12], 1          // Stage 1: load x
    (p17) add r34 = 1, r33            // Stage 2: y=x+1
    (p18) st1 [r13] = r35,1           // Stage 3: store y
          br.ctop loop                // Branch back
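The whole example can be traced with a short Python model that combines stage predicates, register rotation, and the LC/EC counters. It is a sketch under simplifying assumptions: the function name `run_pipelined_loop` is hypothetical, Python lists stand in for the rotating registers (index 0 is r32 or p16), rotation is modelled by shifting the lists, the input is assumed non-empty, and memory is modelled as plain lists rather than addresses.

```python
def run_pipelined_loop(x):
    """Compute y[i] = x[i] + 1 with a three-stage software pipeline."""
    n = len(x)
    y = []                      # memory written by the stage-3 store
    gr = [None] * 8             # logical r32..r39
    pr = [0] * 48               # logical p16..p63
    pr[0] = 1                   # cmp.eq p16,p0 = r0,r0
    lc, ec = n - 1, 3           # mov ar.lc = n-1 ; mov ar.ec = 3
    load_index = 0              # stands in for the post-incremented r12

    while True:
        if pr[0]:               # (p16) stage 1: load x into r32
            gr[0] = x[load_index]
            load_index += 1
        if pr[1]:               # (p17) stage 2: r34 = 1 + r33
            gr[2] = 1 + gr[1]
        if pr[2]:               # (p18) stage 3: store r35
            y.append(gr[3])

        # br.ctop: update counters and choose the value injected into p16
        if lc != 0:
            lc, inject, taken = lc - 1, 1, True
        elif ec > 1:
            ec, inject, taken = ec - 1, 0, True
        elif ec == 1:
            ec, inject, taken = 0, 0, False
        else:
            break               # ec == 0: fall through without rotating

        gr = [gr[-1]] + gr[:-1]     # value in rX now appears in rX+1
        pr = [inject] + pr[:-1]     # p16 receives the injected value
        if not taken:
            break
    return y

result = run_pipelined_loop([10, 20, 30, 40, 50])
print(result)
```

For five elements the kernel runs seven times: five iterations with a load issued, plus two more that only drain the add and store stages.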

Page 43: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

43

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

x132 33 34 35 36 37 38

General Registers (Physical)

0 0116 17 18

Predicate Registers

4

LC

3

EC

x4x5

x1x2x3

Memory

39

32 33 34 35 36 37 38 39

General Registers (Logical)

0

RRB

Page 44: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

44

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

0 0116 17 18

Predicate Registers

4

LC

3

EC

x4x5

x1x2x3

Memory

x132 33 34 35 36 37 38

General Registers (Physical)

39

32 33 34 35 36 37 38 39

General Registers (Logical)

0

RRB

Page 45: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

45

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

0 0116 17 18

Predicate Registers

4

LC

3

EC

x4x5

x1x2x3

Memory

x132 33 34 35 36 37 38

General Registers (Physical)

39

32 33 34 35 36 37 38 39

General Registers (Logical)

0

RRB

Page 46: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

46

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

0 0116 17 18

Predicate Registers

4

LC

3

EC

1

x4x5

x1x2x3

Memory

x133 34 35 36 37 38 39

General Registers (Physical)

32

32 33 34 35 36 37 38 39

General Registers (Logical)

-1

RRB

Page 47: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

47

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

1 0116 17 18

Predicate Registers

3

LC

3

EC

x4x5

x1x2x3

Memory

x133 34 35 36 37 38 39

General Registers (Physical)

32

32 33 34 35 36 37 38 39

General Registers (Logical)

-1

RRB

Page 48: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

48

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

1 0116 17 18

Predicate Registers

3

LC

3

EC

x4x5

x1x2x3

Memory

x133 34 35 36 37 38 39

General Registers (Physical)

32

32 33 34 35 36 37 38 39

General Registers (Logical)

x2

-1

RRB

Page 49: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

49

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

1 0116 17 18

Predicate Registers

3

LC

3

EC

x4x5

x1x2x3

Memory

x133 34 35 36 37 38 39

General Registers (Physical)

32

32 33 34 35 36 37 38 39

General Registers (Logical)

x2y1

-1

RRB

Page 50: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

50

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

1 0116 17 18

Predicate Registers

3

LC

3

EC

x4x5

x1x2x3

Memory

x133 34 35 36 37 38 39

General Registers (Physical)

32

32 33 34 35 36 37 38 39

General Registers (Logical)

x2y1

-1

RRB

Page 51: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

51

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

1 0116 17 18

Predicate Registers

3

LC

3

EC

x4x5

x1x2x3

Memory

x133 34 35 36 37 38 39

General Registers (Physical)

32

32 33 34 35 36 37 38 39

General Registers (Logical)

x2y1

-1

RRB

Page 52: CMPUT 680 - Compiler Design and Optimization1 CMPUT680 - Winter 2006 Topic F: IA-64 Hardware Support for Software Pipelining José Nelson Amaral amaral/courses/680.

CMPUT 680 - Compiler Design and Optimization

52

Software Pipelining Example in the IA-64

loop:(p16) ldl r32 = [r12], 1(p17) add r34 = 1, r33(p18) stl [r13] = r35,1

br.ctop loop

1 1116 17 18

Predicate Registers

2

LC

3

EC

1

x4x5

x1x2x3

Memory

x134 35 36 37 38 39 32

General Registers (Physical)

33

32 33 34 35 36 37 38 39

General Registers (Logical)

x2y1

-2

RRB

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 2, EC = 3, RRB = -2
Memory: x1 x2 x3 x4 x5
General Registers (r32-r39): x1, x2, x3, y1

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 2, EC = 3, RRB = -2
Memory: x1 x2 x3 x4 x5
General Registers (r32-r39): x2, x3, y1
In transit: y2

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 2, EC = 3, RRB = -2
Memory: x1 x2 x3 x4 x5, y1
General Registers (r32-r39): x2, x3, y1, y2

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
New p16 value: 1
LC = 1, EC = 3, RRB = -3
Memory: x1 x2 x3 x4 x5, y1
General Registers (r32-r39): x2, x3, y1, y2

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 1, EC = 3, RRB = -3
Memory: x1 x2 x3 x4 x5, y1
General Registers (r32-r39): x2, x3, x4, y1, y2

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 1, EC = 3, RRB = -3
Memory: x1 x2 x3 x4 x5, y1
General Registers (r32-r39): x3, x4, y1, y2, y3

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 1, EC = 3, RRB = -3
Memory: x1 x2 x3 x4 x5, y1 y2
General Registers (r32-r39): x3, x4, y1, y2, y3

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
New p16 value: 1
LC = 0, EC = 3, RRB = -4
Memory: x1 x2 x3 x4 x5, y1 y2
General Registers (r32-r39): x3, x4, y1, y2, y3

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 0, EC = 3, RRB = -4
Memory: x1 x2 x3 x4 x5, y1 y2
General Registers (r32-r39): x3, x4, x5, y1, y2, y3

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 0, EC = 3, RRB = -4
Memory: x1 x2 x3 x4 x5, y1 y2
General Registers (r32-r39): x4, x5, y1, y2, y3, y4

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 1, p17 = 1, p18 = 1
LC = 0, EC = 3, RRB = -4
Memory: x1 x2 x3 x4 x5, y1 y2 y3
General Registers (r32-r39): x4, x5, y1, y2, y3, y4

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 1, p18 = 1
New p16 value: 0
LC = 0, EC = 2, RRB = -5
Memory: x1 x2 x3 x4 x5, y1 y2 y3
General Registers (r32-r39): x4, x5, y1, y2, y3, y4
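The LC, EC, and RRB values across these slides follow one update rule applied at each br.ctop. Below is a minimal Python sketch of that rule, assuming the usual counted-loop behaviour: while LC is nonzero the branch decrements LC and feeds 1 into p16; after that it decrements EC and feeds 0 into p16 until EC reaches 0. The function names `br_ctop` and `trace` and the starting values LC = 4, EC = 3 are assumptions chosen for illustration.

```python
def br_ctop(lc, ec, rrb):
    """Return (lc, ec, rrb, new_p16, taken) after one br.ctop (sketch)."""
    if lc > 0:                       # more iterations still have to start
        return lc - 1, ec, rrb - 1, 1, True
    if ec > 1:                       # epilog: drain the remaining stages
        return lc, ec - 1, rrb - 1, 0, True
    if ec == 1:                      # last epilog stage: rotate, then fall through
        return lc, 0, rrb - 1, 0, False
    return lc, ec, rrb, 0, False     # EC == 0: no rotation, fall through


def trace(lc, ec, rrb=0):
    """Apply br_ctop repeatedly until the branch is not taken."""
    steps = []
    taken = True
    while taken:
        lc, ec, rrb, p16, taken = br_ctop(lc, ec, rrb)
        steps.append((lc, ec, rrb, p16, taken))
    return steps


# Assumed starting state: LC = 4, EC = 3, RRB = 0.
for step in trace(4, 3):
    print(step)
```

Under these assumptions the trace passes through LC = 3, 2, 1, 0 with EC = 3, then EC = 2, 1, 0 with LC = 0, while RRB falls from -1 to -7.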

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 1, p18 = 1
LC = 0, EC = 2, RRB = -5
Memory: x1 x2 x3 x4 x5, y1 y2 y3
General Registers (r32-r39): x4, x5, y1, y2, y3, y4

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 1, p18 = 1
LC = 0, EC = 2, RRB = -5
Memory: x1 x2 x3 x4 x5, y1 y2 y3
General Registers (r32-r39): x5, y1, y2, y3, y4, y5

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 1, p18 = 1
LC = 0, EC = 2, RRB = -5
Memory: x1 x2 x3 x4 x5, y1 y2 y3 y4
General Registers (r32-r39): x5, y1, y2, y3, y4, y5

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 0, p18 = 1
New p16 value: 0
LC = 0, EC = 1, RRB = -6
Memory: x1 x2 x3 x4 x5, y1 y2 y3 y4
General Registers (r32-r39): x5, y1, y2, y3, y4, y5

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 0, p18 = 1
LC = 0, EC = 1, RRB = -6
Memory: x1 x2 x3 x4 x5, y1 y2 y3 y4
General Registers (r32-r39): x5, y1, y2, y3, y4, y5

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 0, p18 = 1
LC = 0, EC = 1, RRB = -6
Memory: x1 x2 x3 x4 x5, y1 y2 y3 y4 y5
General Registers (r32-r39): x5, y1, y2, y3, y4, y5

Software Pipelining Example in the IA-64

loop:
(p16) ldl r32 = [r12], 1
(p17) add r34 = 1, r33
(p18) stl [r13] = r35, 1
      br.ctop loop

Predicate Registers: p16 = 0, p17 = 0, p18 = 0
New p16 value: 0
LC = 0, EC = 0, RRB = -7
Memory: x1 x2 x3 x4 x5, y1 y2 y3 y4 y5
General Registers (r32-r39): x5, y1, y2, y3, y4, y5
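The whole walkthrough can be replayed with a short simulation. The sketch below is an illustrative Python model, not IA-64 code: it assumes eight rotating registers (r32 to r39), starting values LC = 4, EC = 3, p16 = 1, p17 = p18 = 0, and it models the memory written through r13 as a Python list. The function name `run_pipelined_loop` is hypothetical.

```python
ROT_SIZE = 8  # assumption: eight rotating registers, r32..r39, as drawn on the slides


def run_pipelined_loop(x, lc, ec):
    """Simulate the three-stage loop; return (y, lc, ec, rrb, predicates)."""
    gr = [None] * ROT_SIZE          # physical registers r32..r39
    pr = {16: 1, 17: 0, 18: 0}      # assumed initial stage predicates
    rrb = 0
    next_load = 0                   # stands in for the post-incremented r12
    y = []                          # stands in for memory written through r13

    def phys(logical):
        # Logical register number -> index into gr under the current RRB.
        return (logical - 32 + rrb) % ROT_SIZE

    while True:
        if pr[16]:                  # (p16) ldl r32 = [r12], 1
            gr[phys(32)] = x[next_load]
            next_load += 1
        if pr[17]:                  # (p17) add r34 = 1, r33
            gr[phys(34)] = 1 + gr[phys(33)]
        if pr[18]:                  # (p18) stl [r13] = r35, 1
            y.append(gr[phys(35)])

        # br.ctop loop
        if lc > 0:
            lc, new_p16, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_p16, taken = ec - 1, 0, True
        elif ec == 1:
            ec, new_p16, taken = 0, 0, False
        else:
            break                   # EC == 0: no rotation, fall through
        rrb -= 1                    # rotate the general registers
        pr = {16: new_p16, 17: pr[16], 18: pr[17]}  # rotate the predicates
        if not taken:
            break

    return y, lc, ec, rrb, pr


y, lc, ec, rrb, pr = run_pipelined_loop([10, 20, 30, 40, 50], lc=4, ec=3)
print(y, lc, ec, rrb, pr)
```

With five inputs the model stores x + 1 for each element and finishes with LC = 0, EC = 0, RRB = -7, and all three stage predicates cleared, which is the state shown on the last slide.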