Compiling for VLIWs and ILP

Outline: profiling, region formation, acyclic scheduling, cyclic scheduling

Profiling

Many crucial ILP optimizations require good profile information

ILP optimizations try to maximize performance/price by increasing the IPC

Compiler techniques are needed to expose and enhance ILP

Two types of profiles: point profiles and path profiles

Compiling with Profiling

Point Profiles

“Point profiles” collect statistics about points in call graphs and control flow graphs

gprof produces call graph profiles, statistics on how many times a function was called, who called it, and (sometimes) how much time was spent in that function

Control flow graph profiles give statistics on nodes (node profiles) and edges (edge profiles)

Path Profiles

“Path profiles” measure the execution frequency of a sequence of basic blocks on a path in a CFG

A “hot path” is a path that is (very) frequently executed

Types include forward paths (no backedges), bounded-length paths (start/stop points), and whole-program paths (interprocedural)

The choice of path type is a tradeoff between the accuracy of the profile and the cost of collecting it

[Figure: CFG with basic blocks B1 to B7]

Path 1 {B1, B2, B3, B5, B7} count = 7
Path 2 {B1, B2, B3, B6, B7} count = 9
Path 3 {B1, B2, B4, B6, B7} count = 123
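As an illustration of how the two profile types relate, the sketch below projects the three path counts above onto node and edge counts. The function name `node_and_edge_profiles` is made up for this example. The reverse direction is not possible in general, because edge counts do not record which branches were taken together.

```python
from collections import Counter

# Path counts copied from the example above.
path_profile = {
    ("B1", "B2", "B3", "B5", "B7"): 7,
    ("B1", "B2", "B3", "B6", "B7"): 9,
    ("B1", "B2", "B4", "B6", "B7"): 123,
}

def node_and_edge_profiles(paths):
    """Project a path profile onto node counts and edge counts."""
    nodes, edges = Counter(), Counter()
    for path, count in paths.items():
        for block in path:
            nodes[block] += count
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return nodes, edges

nodes, edges = node_and_edge_profiles(path_profile)

# The hot path is simply the path with the largest count.
hot_path = max(path_profile, key=path_profile.get)
```

Here `hot_path` is Path 3, and the edge B2 to B4 inherits its count of 123.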

Profile Collection

Data collected through code instrumentation is very detailed, but instrumentation overhead affects execution

Hardware counters have very low overhead but information is not exhaustive

Interrupt-based sampling examines machine state in intervals

Collecting path profiles requires enumerating the set of paths encountered at run time

[Figure: instrumentation inserts instructions to record edge profiling events]

Profile Bookkeeping

Problem: compiler optimization modifies (instrumented) code in ways that change the use and applicability of profile information for later compilation stages

Apply profiling right before profile data is needed

Axiom of profile uniformity: “When one copies a chunk of a program, one should equally divide the profile frequency of the original chunk among the copies.”

Use this axiom for point profiles as a simple heuristic

Path profiles correlate branches and thus path-based compiler optimizations preserve these profiles
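A minimal sketch of the uniformity axiom, assuming a point profile stored as a plain number per block. The helper `split_profile` and the values 90 and 3 are hypothetical and only illustrate the arithmetic.

```python
def split_profile(frequency, copies):
    """Divide a chunk's profile frequency equally among its copies."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return [frequency / copies] * copies

# Hypothetical block executed 90 times and duplicated into three copies.
shares = split_profile(90, 3)
```

Each copy is assigned a frequency of 30. This is only a heuristic: the real copies may execute with very different frequencies.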

Instruction Scheduling

Instruction scheduling is the most fundamental ILP-oriented compilation phase

Responsible for identifying and grouping operations that can be executed in parallel

Two approaches:

Cyclic schedulers operate on loops to exploit ILP in (tight) loop nests, usually without control flow

Acyclic schedulers consider loop-free regions

[Figure: region shapes, divided into acyclic and cyclic; the shapes named are basic block, superblock, trace, and DAG]

Acyclic Scheduling of Basic Block Region Shapes

Region is restricted to single basic block

Local scheduling of instructions in a single basic block is simple

ILP is exposed by bundling operations into VLIW instructions (instruction formation or instruction compaction)

Before bundling:

add $r13 = $r3, $r0
shl $r13 = $r13, 3
ld.w $r14 = 0[$r4]
sub $r16 = $r6, 3
shr $r15 = $r15, 9

After bundling (each ;; closes one bundle):

add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.

Intermezzo: VLIW Encoding

A VLIW schedule can be encoded compactly using horizontal and vertical nops

Start bits, stop bits, or instruction templates are used to compress the VLIW instructions into variable-width instruction bundles


Intermezzo: VLIW Execution Model Subtleties

Horizontal issues within an instruction:

A read sees the original value of a register

A read sees the value written by the write

Read and write to same register is illegal

Also exception issues

mov $r1 = 2
;;
mov $r0 = $r1
mov $r1 = 3
;;

Vertical issues across pipelined instructions: EQ model versus LEQ model

ld.w $r0 = 0[$r1]
;;
add $r0 = $r1, $r2
;;
sub $r3 = $r0, $r4
…
# load completed:
add $r3 = $r3, $r0

EQ model allows $r0 to be reused between issue of 1st instruction and its completion when latency expires

Acyclic Region Scheduling for Loops

To enlarge the region size of a loop body and expose more ILP, apply loop fusion, loop peeling, or loop unrolling

Original loops:

DO I = 1, N
  A(I) = C*A(I)
ENDDO
DO I = 1, N
  D(I) = A(I)*B(I)
ENDDO

After loop fusion:

DO I = 1, N
  A(I) = C*A(I)
  D(I) = A(I)*B(I)
ENDDO

After loop unrolling (assuming 2 divides N):

DO I = 1, N, 2
  A(I) = C*A(I)
  D(I) = A(I)*B(I)
  A(I+1) = C*A(I+1)
  D(I+1) = A(I+1)*B(I+1)
ENDDO
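The three loop forms above can be checked against each other with a small Python sketch. It uses 0-based lists instead of Fortran arrays, and the function names are made up for this example.

```python
def original(a, b, c):
    """Two separate loops over the same index range."""
    a = list(a)
    d = [0] * len(a)
    for i in range(len(a)):
        a[i] = c * a[i]
    for i in range(len(a)):
        d[i] = a[i] * b[i]
    return a, d

def fused(a, b, c):
    """Loop fusion: one loop whose body holds both statements."""
    a = list(a)
    d = [0] * len(a)
    for i in range(len(a)):
        a[i] = c * a[i]
        d[i] = a[i] * b[i]
    return a, d

def fused_unrolled(a, b, c):
    """Fusion plus unrolling by 2; like the slide, assumes 2 divides N."""
    a = list(a)
    n = len(a)
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even trip count")
    d = [0] * n
    for i in range(0, n, 2):
        a[i] = c * a[i]
        d[i] = a[i] * b[i]
        a[i + 1] = c * a[i + 1]
        d[i + 1] = a[i + 1] * b[i + 1]
    return a, d

result = fused_unrolled([1, 2, 3, 4], [5, 6, 7, 8], 2)
```

All three versions compute the same arrays. The unrolled body is the largest loop-free region, which is what gives an acyclic scheduler more operations to work with.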

Region Scheduling Across Basic Blocks

Region scheduling schedules operations across basic blocks, usually on hot paths

Increases the region size by merging operations from multiple blocks to expose more ILP

But problem with conditional flow: how to move operations from one block to another for instruction scheduling?

[Figure: CFG fragment with blocks B3, B6, and B4. An operation is moved from one block to another, but the operation is now missing on another path.]

Region Scheduling Across Basic Blocks

Problem: how to move operations from one block to another for instruction scheduling?

Affected branches need to be compensated

[Figure: CFG fragment with blocks B3, B6, and B4. An operation is moved from one block to another, but the operation is now inserted on another path.]

Trace Scheduling

Trace scheduling, the earliest region scheduling approach, has restrictions

A trace consists of the operations from a list of basic blocks B0, B1, …, Bn

1. Each Bi is a predecessor of (falls through or branches to) the next block Bi+1 on the list

2. For any i and k there is no path Bi → Bk → Bi except for i=0, i.e. the code is cycle-free except that the entire region can be part of a loop

[Figure: CFG with blocks B1 to B6 and edge profile counts (10, 70, 30, 20, 80, 90), drawn twice]

Superblocks

Superblocks are single-entry multiple-exit traces

Superblock formation uses tail duplication to eliminate side entrances

1. Each Bi is a predecessor of the next block Bi+1 on the list (fall through)

2. For any i and k there is no path Bi → Bk → Bi except for i=0

3. There are no branches into a block in the region (no side entrances), except to B0

[Figure: CFG with blocks B1 to B6 and edge profile counts before and after superblock formation; tail duplication adds the copies B3’ and B4’, and the edge counts are redistributed over the originals and the copies]

Hyperblocks

Hyperblocks are single-entry multiple-exit traces with internal control flow effectuated via instruction predication

If-conversion folds control flow into a single block using instruction predication

[Figure: CFG with blocks B1 to B6 and edge profile counts before and after hyperblock formation; B2 and B5 are merged into one predicated block, and B4’ is a copy of B4]

Intermezzo: Predication

If-conversion translates control dependences into data dependences by predicating instructions so that they execute conditionally

Predication requires hardware support

Full predication adds a boolean operand to (all or selected) instructions

Partial predication executes all instructions, but selects the final result based on a condition

Original:

cmpgt $b1 = $r5, 0
;;
br $b1, L1
;;
mpy $r3 = $r1, $r2
;;
L1: stw 0[$r10] = $r3
;;

After full predication:

cmpgt $p1 = $r5, 0
;;
($p1) mpy $r3 = $r1, $r2
;;
stw 0[$r10] = $r3
;;

After partial predication:

mpy $r4 = $r1, $r2
;;
cmpgt $b1 = $r5, 0
;;
slct $r3 = $b1, $r4, $r3
;;
stw 0[$r10] = $r3
;;
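The two predicated sequences can be compared with a toy straight-line interpreter. This is a sketch under stated assumptions: the reference semantics is taken to be "if $r5 > 0 then $r3 = $r1 * $r2", which is what both predicated sequences compute, and the branch sense of `br` in the original sequence is not modeled. The interpreter, the tuple encoding, and the names `FULL`, `PARTIAL`, and `reference` are made up for this example.

```python
def run_straight_line(ops, regs):
    """Run ops in order with no branches.

    Each op is (guard, dest, fn). An op whose guard register holds False
    is nullified, which models full predication.
    """
    regs = dict(regs)
    for guard, dest, fn in ops:
        if guard is None or regs[guard]:
            regs[dest] = fn(regs)
    return regs

# Full predication: the multiply is guarded by predicate p1.
FULL = [
    (None, "p1", lambda r: r["r5"] > 0),        # cmpgt $p1 = $r5, 0
    ("p1", "r3", lambda r: r["r1"] * r["r2"]),  # ($p1) mpy $r3 = $r1, $r2
]

# Partial predication: the multiply always runs, a select picks the result.
PARTIAL = [
    (None, "r4", lambda r: r["r1"] * r["r2"]),               # mpy
    (None, "b1", lambda r: r["r5"] > 0),                     # cmpgt
    (None, "r3", lambda r: r["r4"] if r["b1"] else r["r3"]), # slct
]

def reference(regs):
    """Assumed source semantics: if r5 > 0 then r3 = r1 * r2."""
    r3 = regs["r3"]
    if regs["r5"] > 0:
        r3 = regs["r1"] * regs["r2"]
    return r3

example = {"r1": 6, "r2": 7, "r3": -1, "r5": 5}
full_result = run_straight_line(FULL, example)["r3"]
partial_result = run_straight_line(PARTIAL, example)["r3"]
```

Both sequences leave the same value in $r3 for the store. Note that partial predication needs the extra register $r4 and always executes the multiply.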

Treegions

Treegions are regions containing a tree of blocks such that no block in a treegion has side entrances

Any path through a treegion is a superblock

[Figure: CFG partitioned into Treegion 1, Treegion 2, and Treegion 3]

Region Formation

The scheduler constructs schedules for a single region at a time

Need to select which region to optimize (within the limits of the region shape), i.e. group traces of frequently executed blocks into regions

May need to enlarge regions to expose enough ILP for scheduler

[Figure: phases of region scheduling: region selection, region enlargement, schedule construction]

Region Selection by Trace Growing

Trace growing uses the mutual most likely heuristic:

Suppose A is the last block in the trace

Add block B to the trace if B is the most likely successor of A and A is B’s most likely predecessor

Also works to grow backward

Requires edge profiling, but the result can be poor because edge profiling does not correlate branch probabilities

[Figure: blocks A and B with edge profile counts]
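The mutual most likely heuristic can be sketched in a few lines of Python. The edge profile below is hypothetical, not taken from the slides, and the sketch only grows the trace forward from a seed block.

```python
# Hypothetical edge profile: (source, destination) -> execution count.
edges = {
    ("B1", "B2"): 70, ("B1", "B5"): 30,
    ("B2", "B3"): 70, ("B5", "B3"): 30,
    ("B3", "B4"): 80, ("B3", "B6"): 20,
    ("B6", "B4"): 20,
}

def most_likely_successor(edges, block):
    succ = {dst: n for (src, dst), n in edges.items() if src == block}
    return max(succ, key=succ.get) if succ else None

def most_likely_predecessor(edges, block):
    pred = {src: n for (src, dst), n in edges.items() if dst == block}
    return max(pred, key=pred.get) if pred else None

def grow_trace(edges, seed):
    """Grow a trace forward from seed using the mutual most likely rule."""
    trace = [seed]
    while True:
        a = trace[-1]
        b = most_likely_successor(edges, a)
        # Stop at exits, at blocks already in the trace (no cycles),
        # and when A is not B's most likely predecessor.
        if b is None or b in trace or most_likely_predecessor(edges, b) != a:
            return trace
        trace.append(b)

main_trace = grow_trace(edges, "B1")
```

With this profile the trace grown from B1 is B1, B2, B3, B4. A trace seeded at B5 stops immediately, because B3's most likely predecessor is B2.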

Region Selection by Path Profiling

Treat trace as a path and consider its execution frequency by path profiling

Correlations are preserved in the region formation process

[Figure: CFG with blocks B1 to B6 before and after region formation, with the copies B3’ and B4’ added]

path 1: {B1, B2, B3, B4} count = 44
path 2: {B1, B2, B3, B6, B4} count = 0
path 3: {B1, B5, B3, B4} count = 16
path 4: {B1, B5, B3, B6, B4} count = 12

Superblock Enlargement by Target Expansion

Target expansion is useful when the branch at the end of a superblock has a high probability but the superblock cannot be enlarged due to a side entrance

Duplicate the sequence of target blocks to create a larger superblock

[Figure: CFG with blocks B1 to B4 and edge profile counts before and after target expansion; the target blocks are duplicated as B3’ and B4’]

Superblock Enlargement by Loop Peeling

Peel a number of iterations of a small loop body to create a larger superblock that branches into the loop

Useful when the profiled number of loop iterations is bounded by a small constant (two iterations in the example)

[Figure: loop with blocks B1 and B2 before and after peeling, which adds the copies B1’, B2’ and B1”, B2”]

Superblock Enlargement by Loop Unrolling

Loops with a superblock body and a backedge with high probability are called superblock loops

When a superblock loop is small we can unroll the loop

[Figure: superblock loop with blocks B1 and B2 before and after unrolling, which adds the copies B1’, B2’ and B1”, B2”]

Exposing ILP After Loop Unrolling

Loop unrolling by itself exposes only a limited amount of ILP

Cross-iteration dependences on the loop counter’s updates prevent parallel execution of the copies of the loop body

Cannot generally move instructions across split points

Note: can use speculative execution to hoist instructions above split points

[Figure: unrolled loop body with blocks B1, B2, and B1’, with a split point marked]

Exposing ILP with Renaming and Copy Propagation

Schedule Construction

The schedule constructor (scheduler) uses compaction techniques to produce a schedule for a region after region formation

The goal is to minimize an objective cost function while maintaining program correctness and obeying resource limitations:

Increase speed by reducing completion time

Reduce code size

Increase energy efficiency

[Figure: phases of region scheduling: region selection, region enlargement, schedule construction]

Schedule Construction and Explicitly Parallel Architectures

A scheduler for an explicitly parallel architecture such as VLIW and EPIC uses the exposed ILP to statically schedule instructions in parallel

Instruction compaction must obey data dependences (RAW, WAR, and WAW) and control dependences to ensure correctness

add $r13 = $r3, $r0
shl $r13 = $r13, 3
ld.w $r14 = 0[$r4]
sub $r16 = $r6, 3
shr $r15 = $r15, 9

[Figure: the operations are grouped into two bundles]

Schedule Construction and Instruction Latencies

Instruction latencies must be taken into account by the scheduler, but they’re not always fixed or the same for all ops

A scheduler can assume average or worst-case instruction latencies

Hide instruction latencies by ensuring that there is sufficient height between instruction issue and when result is needed to avoid pipeline stalls

Also recall the difference between the EQ versus the LEQ model

mul $r3 = $r3, $r1
add $r13 = $r2, $r3
ld.w $r14 = 0[$r5]
add $r13 = $r13, $r14
ld.w $r15 = 0[$r6]

[Figure annotations: one operation takes 2 cycles to complete, one takes >3 cycles (4 cycles on average), and one takes 1 cycle to complete; the RAW hazards between the operations are marked]

Linear Scheduling Techniques

Instruction compaction using linear-time scans over region:

As-soon-as-possible (ASAP) scheduling places ops in the earliest possible cycle using top-down scan

As-late-as-possible (ALAP) scheduling places ops in the latest possible cycle using bottom-up scan

Critical-path (CP) scheduling uses ASAP followed by ALAP

Resource hazard detection is local

Operations and their assigned cycles:

mul $r3 = $r3, $r1        cycle 0
add $r13 = $r2, $r3       cycle 2
ld.w $r14 = 0[$r5]        cycle 0
add $r13 = $r13, $r14     cycle 3
ld.w $r15 = 0[$r6]        cycle 1

Resulting schedule (at most one load per instruction):

mul $r3 = $r3, $r1
ld.w $r14 = 0[$r5]
;;
ld.w $r15 = 0[$r6]
;;
add $r13 = $r2, $r3
;;
add $r13 = $r13, $r14
;;
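The cycle assignment in the example above can be reproduced with a short top-down ASAP sketch. The latencies (mul 2, add 1, ld.w 3) are assumptions chosen to match the example, the WAW rule is deliberately conservative, and the only resource constraint modeled is one load per cycle.

```python
# Assumed latencies in cycles; chosen to match the example, not given as a table.
LATENCY = {"mul": 2, "add": 1, "ld.w": 3}

# (opcode, destination register, source registers), in program order.
ops = [
    ("mul", "r3", ["r3", "r1"]),
    ("add", "r13", ["r2", "r3"]),
    ("ld.w", "r14", ["r5"]),
    ("add", "r13", ["r13", "r14"]),
    ("ld.w", "r15", ["r6"]),
]

def asap_schedule(ops, latency, loads_per_cycle=1):
    """Place each op in the earliest cycle allowed by earlier ops."""
    cycles = []
    loads_in_cycle = {}
    for i, (opcode, dest, srcs) in enumerate(ops):
        earliest = 0
        for j in range(i):
            p_opcode, p_dest, p_srcs = ops[j]
            # RAW or WAW: wait until the earlier result is complete.
            if p_dest in srcs or p_dest == dest:
                earliest = max(earliest, cycles[j] + latency[p_opcode])
            # WAR: do not issue before the earlier read.
            if dest in p_srcs:
                earliest = max(earliest, cycles[j])
        if opcode == "ld.w":
            # Local resource check: find a cycle with a free load slot.
            while loads_in_cycle.get(earliest, 0) >= loads_per_cycle:
                earliest += 1
            loads_in_cycle[earliest] = loads_in_cycle.get(earliest, 0) + 1
        cycles.append(earliest)
    return cycles

cycles = asap_schedule(ops, LATENCY)
```

The second load has no dependences and would go in cycle 0, but the load slot is already taken there, so it moves to cycle 1.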

List Scheduling

List scheduling schedules operations from the global region based on a data dependence graph (DDG) or program dependence graph (PDG), both of which have O(n^2) complexity

Repeatedly selects an operation from a data-ready queue (DRQ), where an operation is ready when all of its DDG predecessors have been scheduled

for each root r in the PDG sorted by priority do
  enqueue(r)
while DRQ is non-empty do
  h = dequeue()
  schedule(h)
  for each DAG successor s of h do
    if all predecessors of s have been scheduled then
      enqueue(s)
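The pseudocode above can be made concrete as follows. This is a minimal sketch: the DRQ is a priority heap, `schedule(h)` is reduced to appending h to an output order, and the example graph and priorities are hypothetical (the priorities stand in for something like a latency-weighted height).

```python
import heapq

def list_schedule(preds, priority):
    """Return ops in list-scheduling order.

    preds maps each op to the set of its dependence predecessors;
    priority maps each op to a number, higher means schedule first.
    """
    succs = {op: [] for op in preds}
    remaining = {op: len(ps) for op, ps in preds.items()}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)

    # Data-ready queue: roots first; heapq is a min-heap, so negate priority.
    drq = [(-priority[op], op) for op, n in remaining.items() if n == 0]
    heapq.heapify(drq)

    order = []
    while drq:
        _, h = heapq.heappop(drq)
        order.append(h)  # stands in for schedule(h)
        for s in succs[h]:
            remaining[s] -= 1
            if remaining[s] == 0:  # all predecessors scheduled
                heapq.heappush(drq, (-priority[s], s))
    return order

# Hypothetical dependence graph and priorities.
preds = {
    "mul": set(),
    "add1": {"mul"},
    "ld14": set(),
    "add2": {"add1", "ld14"},
    "ld15": set(),
}
priority = {"mul": 4, "add1": 2, "ld14": 4, "add2": 1, "ld15": 3}

order = list_schedule(preds, priority)
```

Ties in priority are broken by operation name here; a real scheduler would use a secondary heuristic and would also assign cycles and check resources inside `schedule(h)`.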

Data Dependence Graph

The data dependence graph (DDG):

Nodes are operations

Edges are RAW, WAR, and WAW dependences

Control Flow Dependence

Compensation Code

Compensation code is needed when operations are scheduled across basic blocks in a region

Compensation code corrects scheduling changes by duplicating code on entries and exits from a scheduled region

[Figure: blocks A, B, C, X, and Y with an entry into and an exit from the region. The scheduler interchanges A with B, so the entry and/or exit must be compensated.]

No Compensation

No compensation code is needed when block B has neither an entry nor an exit

[Figure: blocks A, B, C, X, and Y before and after the scheduler interchanges A and B]

Join Compensation

Join compensation is applied when block B has an entry

Duplicate block B

[Figure: blocks A, B, C, X, and Y before and after interchanging A and B; the entry from Z is compensated with the copy B’]

Split Compensation

Split compensation is applied when block B has an exit

Duplicate block A

[Figure: blocks A, B, C, X, and Y before and after interchanging A and B; the exit to W is compensated with the copy A’]

Join-Split Compensation

Join-split compensation is applied when block B has an entry and an exit

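Duplicate blocks A and B

[Figure: blocks A, B, C, X, and Y before and after interchanging A and B; the entry from Z and the exit to W are compensated with the copies B’ and A’]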

Resource Management with Reservation Tables

A resource reservation table records which resources are busy per cycle

Reservation tables allow easy scheduling of operations by matching the operation’s required resources to empty slots

The reservation table at a join point in the CFG is constructed by merging the busy slots from both branches

Cycle   Integer ALU   FP ALU   MEM   Branch

0 busy busy

1 busy busy

2 busy

3 busy busy
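A reservation table can be sketched as a set of busy (cycle, resource) slots. The class below is an illustration only: the resource names and method names are made up, and the merge at a join point is modeled as the union of the busy slots of both incoming tables.

```python
class ReservationTable:
    """Records which resources are busy in which cycle."""

    def __init__(self, resources):
        self.resources = tuple(resources)
        self.busy = set()  # set of (cycle, resource) pairs

    def is_free(self, cycle, resource):
        return (cycle, resource) not in self.busy

    def reserve(self, cycle, resource):
        if resource not in self.resources:
            raise KeyError(resource)
        if not self.is_free(cycle, resource):
            raise ValueError(f"{resource} is busy in cycle {cycle}")
        self.busy.add((cycle, resource))

    def earliest_free(self, resource, start=0):
        """First cycle at or after start in which resource is free."""
        cycle = start
        while not self.is_free(cycle, resource):
            cycle += 1
        return cycle

    def merge(self, other):
        """Table at a join point: a slot is busy if busy on either branch."""
        merged = ReservationTable(self.resources)
        merged.busy = self.busy | other.busy
        return merged

UNITS = ["IALU", "FALU", "MEM", "BRANCH"]  # hypothetical unit names

left = ReservationTable(UNITS)
left.reserve(0, "MEM")
left.reserve(1, "MEM")

right = ReservationTable(UNITS)
right.reserve(2, "MEM")

joined = left.merge(right)
next_mem_slot = joined.earliest_free("MEM")
```

After the merge, the memory unit is busy in cycles 0 to 2, so the next memory operation is placed in cycle 3.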

Software Pipelining

DO i = 0, 6
  A B C D E F G H
ENDDO

Assuming that the initiation interval (II) is 3 cycles

[Figure: software-pipelined schedule of the loop, divided into prologue, kernel, and epilogue]

Software Pipelining Example

> 3 cycles

> 2 cycles

>1 cycle

Modulo Scheduling

DDG

MRT

Constructing Kernel-Only Code by Predicate Register Rotation

BRT branches to the top and rotates the predicate registers: p1 = p0, p2 = p1, p3 = p2, p0 = p3
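The rotation is a simultaneous assignment, so every register receives the old value of its neighbor. A minimal sketch, with the predicate file modeled as a Python list `[p0, p1, p2, p3]` and a made-up function name:

```python
def rotate_predicates(p):
    """Simultaneous p1 = p0, p2 = p1, p3 = p2, p0 = p3 on a list [p0, p1, p2, p3]."""
    return [p[-1]] + p[:-1]

# Hypothetical start state: only the first stage predicate is set.
state = [True, False, False, False]
history = [state]
for _ in range(3):
    state = rotate_predicates(state)
    history.append(state)
```

Each rotation moves the set predicate one position further, which is how successive pipeline stages are switched on in kernel-only code.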

Modulo Variable Expansion (1)

Modulo Variable Expansion (2)
