Compiling for VLIWs and ILP
Profiling, Region formation, Acyclic scheduling, Cyclic scheduling
Profiling
Many crucial ILP optimizations require good profile information
ILP optimizations try to maximize performance/price by increasing the IPC
Compiler techniques are needed to expose and enhance ILP
Two types of profiles: point profiles and path profiles
Compiling with Profiling
Point Profiles
“Point profiles” collect statistics about points in call graphs and control flow graphs
gprof produces call graph profiles, statistics on how many times a function was called, who called it, and (sometimes) how much time was spent in that function
Control flow graph profiles give statistics on nodes (node profiles) and edges (edge profiles)
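As a minimal sketch of how node and edge profiles can be collected (the traced block sequence is an illustrative assumption, not gprof's actual mechanism): each executed block bumps a node counter, and each taken CFG edge bumps an edge counter.

```python
from collections import Counter

# Node and edge profiling via instrumentation: the instrumented program
# reports each basic block it enters; we derive node and edge counts.
node_counts, edge_counts = Counter(), Counter()

def run_profiled(block_sequence):
    prev = None
    for b in block_sequence:
        node_counts[b] += 1                  # node profile: block executions
        if prev is not None:
            edge_counts[(prev, b)] += 1      # edge profile: taken CFG edges
        prev = b

# An assumed execution trace of basic blocks:
run_profiled(["B1", "B2", "B1", "B2", "B3"])
```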
Path Profiles
“Path profiles” measure the execution frequency of a sequence of basic blocks along a path in a CFG
A “hot path” is a path that is (very) frequently executed
Types include forward paths (no backedges), bounded-length paths (start/stop points), and whole-program paths (interprocedural)
The choice among them is a tradeoff between accuracy and the efficiency of collecting the profile
[Figure: example CFG with blocks B1–B7 and three profiled paths]
Path 1 {B1, B2, B3, B5, B7} count = 7
Path 2 {B1, B2, B3, B6, B7} count = 9
Path 3 {B1, B2, B4, B6, B7} count = 123
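Path counts like these can be collected by recording the sequence of blocks between designated start and stop points; a minimal sketch, where the execution trace is an illustrative assumption:

```python
from collections import Counter

# Bounded-length path profiling: a path runs from a start point to a
# stop point; each completed path increments its counter.
def collect_path_profile(block_trace, path_starts, path_ends):
    counts, current = Counter(), []
    for block in block_trace:
        if block in path_starts:
            current = [block]              # begin a new path at a start point
        elif current:
            current.append(block)
            if block in path_ends:         # complete path: count and reset
                counts[tuple(current)] += 1
                current = []
    return counts

# An assumed trace that follows {B1,B2,B3,B5,B7} twice and {B1,B2,B4,B6,B7} once:
trace = ["B1", "B2", "B3", "B5", "B7",
         "B1", "B2", "B4", "B6", "B7",
         "B1", "B2", "B3", "B5", "B7"]
profile = collect_path_profile(trace, path_starts={"B1"}, path_ends={"B7"})
```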
Profile Collection
Data collected through code instrumentation is very detailed, but instrumentation overhead affects execution
Hardware counters have very low overhead but information is not exhaustive
Interrupt-based sampling examines the machine state at regular intervals
Collecting path profiles requires enumerating the set of paths encountered during runtime
Instrumentation inserts instructions to record edge profiling events
Profile Bookkeeping
Problem: compiler optimization modifies (instrumented) code in ways that change the use and applicability of profile information for later compilation stages
Apply profiling right before profile data is needed
Axiom of profile uniformity: “When one copies a chunk of a program, one should equally divide the profile frequency of the original chunk among the copies.”
Use this axiom for point profiles as a simple heuristic
Path profiles correlate branches, and thus path-based compiler optimizations preserve these profiles
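A minimal sketch of the uniformity axiom in action (the block name and count are illustrative): when a transformation duplicates a block, each copy receives an equal share of the original block's profiled count.

```python
# Profile bookkeeping under the axiom of profile uniformity: dividing a
# duplicated block's execution count equally among its copies.
def duplicate_block(profile, block, n_copies):
    """Replace `block` by `n_copies` copies, dividing its count equally."""
    share = profile.pop(block) / n_copies
    for i in range(n_copies):
        profile[f"{block}_copy{i}"] = share
    return profile

profile = duplicate_block({"B3": 90.0}, "B3", 2)
# each copy is now credited with 45.0 executions
```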
Instruction Scheduling
Instruction scheduling is the most fundamental ILP-oriented compilation phase
Responsible for identifying and grouping operations that can be executed in parallel
Two approaches:
Cyclic schedulers operate on loops to exploit ILP in (tight) loop nests, usually without control flow
Acyclic schedulers consider loop-free regions
[Figure: taxonomy of region shapes, acyclic versus cyclic; acyclic shapes include the basic block, superblock, trace, and DAG]
Acyclic Scheduling of Basic Block Region Shapes
Region is restricted to single basic block
Local scheduling of instructions in a single basic block is simple
ILP is exposed by bundling operations into VLIW instructions (instruction formation or instruction compaction)
Sequential code:
add $r13 = $r3, $r0
shl $r13 = $r13, 3
ld.w $r14 = 0[$r4]
sub $r16 = $r6, 3
shr $r15 = $r15, 9

After compaction into two VLIW bundles:
add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.
Intermezzo: VLIW Encoding
A VLIW schedule can be encoded compactly using horizontal and vertical nops
Start bits, stop bits, or instruction templates are used to compress the VLIW instructions into variable-width instruction bundles
add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.
Intermezzo: VLIW Execution Model Subtleties
Horizontal issues within an instruction: does a read see the original value of a register, or the value written by a parallel write, or is a read and a write to the same register illegal? There are also exception issues
Vertical issues across pipelined instructions: the EQ model versus the LEQ model

mov $r1 = 2
;;
mov $r0 = $r1
mov $r1 = 3
;;

ld.w $r0 = 0[$r1]
;;
add $r0 = $r1, $r2
;;
sub $r3 = $r0, $r4
…
# load completed:
add $r3 = $r3, $r0

The EQ model allows $r0 to be reused between the issue of the 1st instruction and its completion when the latency expires
Acyclic Region Scheduling for Loops
To enlarge the region size of a loop body and expose more ILP, apply:
Loop fusion
Loop peeling
Loop unrolling

Original loops:
DO I = 1, N
  A(I) = C*A(I)
ENDDO
DO I = 1, N
  D(I) = A(I)*B(I)
ENDDO

After loop fusion:
DO I = 1, N
  A(I) = C*A(I)
  D(I) = A(I)*B(I)
ENDDO

After unrolling by 2 (assuming 2 divides N):
DO I = 1, N, 2
  A(I) = C*A(I)
  D(I) = A(I)*B(I)
  A(I+1) = C*A(I+1)
  D(I+1) = A(I+1)*B(I+1)
ENDDO
Region Scheduling Across Basic Blocks
Region scheduling schedules operations across basic blocks, usually on hot paths
Fulfills the need to increase the region size by merging operations from multiple blocks to expose more ILP
But problem with conditional flow: how to move operations from one block to another for instruction scheduling?
[Figure: moving an operation from block B3 to B6 (“from here to there”) leaves the operation missing on the path through B4]
Region Scheduling Across Basic Blocks
Problem: how to move operations from one block to another for instruction scheduling?
Affected branches need to be compensated
[Figure: after moving an operation between blocks, the operation is now inserted on a path that did not originally execute it]
Trace Scheduling
Trace scheduling is the earliest region scheduling approach; it has restrictions
A trace consists of the operations from a list of basic blocks B0, B1, …, Bn such that:
1. Each Bi is a predecessor of (falls through or branches to) the next Bi+1 on the list
2. For any i and k there is no path Bi → Bk → Bi except for i=0, i.e. the code is cycle-free except that the entire region can be part of a loop
[Figure: example CFG with blocks B1–B6 and profiled edge frequencies; the hot path through the CFG is selected as a trace]
Superblocks
Superblocks are single-entry multiple-exit traces
Superblock formation uses tail duplication to eliminate side entrances
1. Each Bi is a predecessor of the next Bi+1 on the list (fall through)
2. For any i and k there is no path Bi → Bk → Bi except for i=0
3. There are no branches into a block in the region (no side entrances), except to B0
[Figure: the trace {B1, B2, B3, B4} has side entrances; tail duplication creates copies B3’ and B4’ and rescales the edge frequencies, yielding a superblock]
Hyperblocks
Hyperblocks are single-entry multiple-exit traces with internal control flow effectuated via instruction predication
If-conversion folds control flow into a single block using instruction predication
[Figure: if-conversion merges B2 and B5 into a single predicated block B2,B5; tail duplication of B4 (B4’) removes the side entrance, yielding a hyperblock]
Intermezzo: Predication
If-conversion translates control dependences into data dependences by instruction predication to conditionally execute them
Predication requires hardware support
Full predication adds a boolean operand to (all or selected) instructions
Partial predication executes all instructions, but selects the final result based on a condition
Original:
    cmpgt $b1 = $r5, 0 ;;
    br $b1, L1 ;;
    mpy $r3 = $r1, $r2 ;;
L1: stw 0[$r10] = $r3 ;;

After full predication:
    cmpgt $p1 = $r5, 0 ;;
    ($p1) mpy $r3 = $r1, $r2 ;;
    stw 0[$r10] = $r3 ;;

After partial predication:
    mpy $r4 = $r1, $r2 ;;
    cmpgt $b1 = $r5, 0 ;;
    slct $r3 = $b1, $r4, $r3 ;;
    stw 0[$r10] = $r3 ;;
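A hedged sketch of if-conversion on a tiny assumed IR: the diamond "if r5 > 0 then r3 = r1 * r2" becomes straight-line code in which the multiply is guarded by a predicate. The instruction encoding and register names are illustrative.

```python
# If-conversion sketch: control dependence becomes a data dependence on
# predicate p1; the guarded op only commits when the predicate holds.
def if_convert(cond_reg, then_ops):
    """Return a predicated instruction list guarded by predicate p1."""
    insts = [("cmpgt", "p1", cond_reg)]               # compute the predicate
    insts += [("pred", "p1", op) for op in then_ops]  # guard each then-op
    return insts

def execute(insts, env):
    for inst in insts:
        if inst[0] == "cmpgt":
            _, p, src = inst
            env[p] = env[src] > 0
        else:  # ("pred", p, (dst, fn)): commits only when the predicate is true
            _, p, (dst, fn) = inst
            if env[p]:
                env[dst] = fn(env)
    return env

env = execute(if_convert("r5", [("r3", lambda e: e["r1"] * e["r2"])]),
              {"r5": 1, "r1": 6, "r2": 7, "r3": 0})   # env["r3"] becomes 42
```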
Treegions
Treegions are regions containing a tree of basic blocks such that no block in a treegion has side entrances
Any path through a treegion is a superblock
[Figure: a CFG partitioned into three treegions (Treegion 1, Treegion 2, Treegion 3)]
Region Formation
The scheduler constructs schedules for a single region at a time
Need to select which region to optimize (within the limits of the region shape), i.e. group traces of frequently executed blocks into regions
May need to enlarge regions to expose enough ILP for scheduler
[Figure: phases of the scheduler: region selection → region enlargement → schedule construction]
Region Selection by Trace Growing
Trace growing uses the mutual-most-likely heuristic:
Suppose A is the last block in the trace; add block B to the trace if B is the most likely successor of A and A is B’s most likely predecessor
Also works to grow backward
Requires edge profiling, but the result can be poor because edge profiling does not correlate branch probabilities
[Figure: blocks A and B with profiled edge frequencies illustrating the mutual-most-likely test]
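The heuristic can be sketched executably; the CFG, edge counts, and function names below are illustrative assumptions.

```python
# Trace growing with the mutual-most-likely heuristic. succ_counts[b] and
# pred_counts[b] map each CFG neighbor of b to its profiled edge count.
def most_likely(counts):
    return max(counts, key=counts.get) if counts else None

def grow_trace(seed, succ_counts, pred_counts):
    """Grow a trace forward while successor and predecessor agree."""
    trace, a = [seed], seed
    while True:
        b = most_likely(succ_counts.get(a, {}))
        # add B only if B is A's likeliest successor AND A is B's
        # likeliest predecessor (and B is not already in the trace)
        if b is None or b in trace or most_likely(pred_counts.get(b, {})) != a:
            break
        trace.append(b)
        a = b
    return trace

succ = {"B1": {"B2": 70, "B5": 30}, "B2": {"B3": 70}, "B3": {"B4": 80, "B6": 20}}
pred = {"B2": {"B1": 70}, "B3": {"B2": 70, "B5": 30}, "B4": {"B3": 80}, "B6": {"B3": 20}}
trace = grow_trace("B1", succ, pred)   # ["B1", "B2", "B3", "B4"]
```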
Region Selection by Path Profiling
Treat trace as a path and consider its execution frequency by path profiling
Correlations are preserved in the region formation process
[Figure: CFG before and after region formation; tail duplication creates B3’ and B4’]
path 1: {B1, B2, B3, B4} count = 44
path 2: {B1, B2, B3, B6, B4} count = 0
path 3: {B1, B5, B3, B4} count = 16
path 4: {B1, B5, B3, B6, B4} count = 12
Superblock Enlargement by Target Expansion
Target expansion is useful when the branch at the end of a superblock has a high probability but the superblock cannot be enlarged due to a side entrance
Duplicate the sequence of target blocks to create a larger superblock
[Figure: the target blocks B3 and B4 are duplicated as B3’ and B4’, extending the superblock B1–B2; edge frequencies are rescaled accordingly]
Superblock Enlargement by Loop Peeling
Peel a number of iterations of a small loop body to create a larger superblock that branches into the loop
Useful when the profiled number of loop iterations is bounded by a small constant (two iterations in the example)
[Figure: two iterations of the loop body B1–B2 are peeled as B1’–B2’ and B1”–B2”, forming a larger superblock that branches into the remaining loop]
Superblock Enlargement by Loop Unrolling
Loops with a superblock body and a backedge with high probability are called superblock loops
When a superblock loop is small we can unroll the loop
[Figure: the superblock loop B1–B2 with a high-probability backedge is unrolled three times (B1–B2, B1’–B2’, B1”–B2”); the backedge frequency is divided among the copies]
Exposing ILP After Loop Unrolling
Loop unrolling by itself exposes only a limited amount of ILP
Cross-iteration dependences on the loop counter’s updates prevent parallel execution of the copies of the loop body
Cannot generally move instructions across split points
Note: can use speculative execution to hoist instructions above split points
[Figure: unrolled loop copies B1 and B1’ with split points at the copied loop-exit branches]
Exposing ILP with Renaming and Copy Propagation
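As a hedged sketch of the renaming idea on an assumed mini-IR for a 2x-unrolled accumulation loop: both unrolled copies write r2 and i, creating WAW/WAR dependences; giving each redefinition a fresh register removes them so the two loads can be scheduled in parallel. (Copy propagation, which would additionally fold the two i = i + 1 updates into one i = i + 2, is not shown.)

```python
# Register renaming after unrolling: each redefinition of a destination
# gets a fresh versioned name, eliminating WAW/WAR dependences between
# the unrolled copies. Instructions are (dest, opcode, sources...).
original = [
    ("r2", "load", "a[i]"), ("acc", "add", "acc", "r2"), ("i", "add", "i", 1),
    ("r2", "load", "a[i]"), ("acc", "add", "acc", "r2"), ("i", "add", "i", 1),
]

def rename(insts):
    """Give every redefinition of a destination a fresh versioned name."""
    version, out = {}, []
    def use(r):  # a source reads the latest version of the register
        return f"{r}{version[r]}" if r in version else r
    for dst, op, *srcs in insts:
        new_srcs = [use(s) if isinstance(s, str) else s for s in srcs]
        version[dst] = version.get(dst, 0) + 1
        out.append((f"{dst}{version[dst]}", op, *new_srcs))
    return out

renamed = rename(original)   # the two loads now target distinct registers
```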
Schedule Construction
The schedule constructor (scheduler) uses compaction techniques to produce a schedule for a region after region formation
The goal is to minimize an objective cost function while maintaining program correctness and obeying resource limitations:
Increase speed by reducing completion time
Reduce code size
Increase energy efficiency
[Figure: phases of the scheduler: region selection → region enlargement → schedule construction]
Schedule Construction and Explicitly Parallel Architectures
A scheduler for an explicitly parallel architecture such as VLIW and EPIC uses the exposed ILP to statically schedule instructions in parallel
Instruction compaction must obey data dependences (RAW, WAR, and WAW) and control dependences to ensure correctness
Sequential code:
add $r13 = $r3, $r0
shl $r13 = $r13, 3
ld.w $r14 = 0[$r4]
sub $r16 = $r6, 3
shr $r15 = $r15, 9

After compaction into two VLIW bundles:
add $r13 = $r3, $r0
sub $r16 = $r6, 3
;; ## end of 1st instr.
shl $r13 = $r13, 3
shr $r15 = $r15, 9
ld.w $r14 = 0[$r4]
;; ## end of 2nd instr.
Schedule Construction and Instruction Latencies
Instruction latencies must be taken into account by the scheduler, but they are not always fixed or the same for all operations
A scheduler can assume average or worst-case instruction latencies
Hide instruction latencies by ensuring that there is sufficient height between instruction issue and when result is needed to avoid pipeline stalls
Also recall the difference between the EQ versus the LEQ model
mul $r3 = $r3, $r1          ## takes 2 cycles to complete
add $r13 = $r2, $r3         ## RAW hazard on $r3
ld.w $r14 = 0[$r5]          ## takes >3 cycles (4 cycles on average)
add $r13 = $r13, $r14       ## RAW hazard on $r14; add takes 1 cycle to complete
ld.w $r15 = 0[$r6]
Linear Scheduling Techniques
Instruction compaction using linear-time scans over the region:
As-soon-as-possible (ASAP) scheduling places operations in the earliest possible cycle using a top-down scan
As-late-as-possible (ALAP) scheduling places operations in the latest possible cycle using a bottom-up scan
Critical-path (CP) scheduling uses ASAP followed by ALAP
Resource hazard detection is local
Unscheduled code (ASAP cycle per op: 0, 2, 0, 3, 1):
mul $r3 = $r3, $r1
add $r13 = $r2, $r3
ld.w $r14 = 0[$r5]
add $r13 = $r13, $r14
ld.w $r15 = 0[$r6]

Scheduled code (at most one load per instruction):
mul $r3 = $r3, $r1
ld.w $r14 = 0[$r5]
;;
ld.w $r15 = 0[$r6]
;;
add $r13 = $r2, $r3
;;
add $r13 = $r13, $r14
;;
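A minimal ASAP scheduling sketch for this code, under assumed latencies (mul: 2 cycles, ld.w: 3, add: 1) and one local resource constraint: at most one load may issue per cycle.

```python
# ASAP scheduling: place each op in the earliest cycle at which all of
# its source registers are available, deferring loads when the single
# load slot for that cycle is already taken. Ops are (opcode, dests, srcs).
def asap(ops, latency):
    ready_at, load_cycles, schedule = {}, set(), []
    for opcode, dests, srcs in ops:
        # earliest cycle at which all source registers are available
        cycle = max([ready_at.get(s, 0) for s in srcs], default=0)
        if opcode == "ld.w":
            while cycle in load_cycles:      # local resource hazard detection
                cycle += 1
            load_cycles.add(cycle)
        for d in dests:
            ready_at[d] = cycle + latency[opcode]
        schedule.append(cycle)
    return schedule

ops = [("mul",  ["r3"],  ["r3", "r1"]),
       ("add",  ["r13"], ["r2", "r3"]),
       ("ld.w", ["r14"], ["r5"]),
       ("add",  ["r13"], ["r13", "r14"]),
       ("ld.w", ["r15"], ["r6"])]
cycles = asap(ops, {"mul": 2, "add": 1, "ld.w": 3})   # [0, 2, 0, 3, 1]
```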
List Scheduling
List scheduling schedules operations from the global region based on a data dependence graph (DDG) or program dependence graph (PDG), both of which have O(n²) construction complexity
Repeatedly selects an operation from a data-ready queue (DRQ), where an operation is ready when all of its DDG predecessors have been scheduled

for each root r in the PDG sorted by priority do
    enqueue(r)
while DRQ is non-empty do
    h = dequeue()
    schedule(h)
    for each DAG successor s of h do
        if all predecessors of s have been scheduled then
            enqueue(s)
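The list-scheduling loop can be sketched executably; the priority function, the three-node DDG, and the names below are illustrative assumptions.

```python
import heapq

# List scheduling: operations enter the data-ready queue (DRQ) once all
# their DDG predecessors are scheduled; the highest-priority ready op is
# scheduled next. The DDG is given as predecessor/successor sets.
def list_schedule(succs, preds, priority):
    scheduled_preds = {v: 0 for v in succs}
    drq = [(-priority[v], v) for v in succs if not preds[v]]   # the roots
    heapq.heapify(drq)                       # DRQ ordered by priority
    order = []
    while drq:
        _, h = heapq.heappop(drq)
        order.append(h)                      # "schedule" h
        for s in succs[h]:
            scheduled_preds[s] += 1
            if scheduled_preds[s] == len(preds[s]):   # s becomes data-ready
                heapq.heappush(drq, (-priority[s], s))
    return order

succs = {"a": {"c"}, "b": {"c"}, "c": set()}
preds = {"a": set(), "b": set(), "c": {"a", "b"}}
order = list_schedule(succs, preds, {"a": 2, "b": 1, "c": 1})   # ['a', 'b', 'c']
```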
Data Dependence Graph
The data dependence graph (DDG):
Nodes are operations
Edges are RAW, WAR, and WAW dependences
Control Flow Dependence
Compensation Code
Compensation code is needed when operations are scheduled across basic blocks in a region
Compensation code corrects scheduling changes by duplicating code on entries and exits from a scheduled region
[Figure: a region with an entry edge into block B and an exit edge out of B; the scheduler interchanges A with B, so the entry and/or exit must be compensated]
No Compensation
No compensation code is needed when block B has neither an entry nor an exit
[Figure: when B has no entry and no exit, interchanging A and B needs no compensation]
Join Compensation
Join compensation is applied when block B has an entry
Duplicate block B
[Figure: the side entrance (edge Z) into B is redirected to a duplicate block B’]
Split Compensation
Split compensation is applied when block B has an exit
Duplicate block A
[Figure: the exit path (edge W) from B receives a duplicate A’ of the block moved below the exit]
Join-Split Compensation
Join-split compensation is applied when block B has both an entry and an exit
Duplicate blocks A and B
[Figure: the entry (edge Z) is redirected to a duplicate B’ and the exit path (edge W) receives a duplicate A’]
Resource Management with Reservation Tables
A resource reservation table records which resources are busy per cycle
Reservation tables allow easy scheduling of operations by matching the operation’s required resources to empty slots
The reservation table at a join point in the CFG is constructed by merging the busy slots from both branches

[Example reservation table over resources Integer ALU, FP ALU, MEM, and Branch: cycles 0, 1, and 3 each have two busy slots; cycle 2 has one]
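A minimal reservation-table sketch (resource names follow the table above; the ops placed are illustrative): an operation is placed at the first cycle whose required slots are all free, and tables from two branches are merged at a join point.

```python
# Resource management with a reservation table: keys are cycles, values
# are the sets of resources busy in that cycle.
def first_free_cycle(table, needed, start=0):
    """Earliest cycle >= start where every resource in `needed` is free."""
    cycle = start
    while any(r in table.get(cycle, set()) for r in needed):
        cycle += 1
    return cycle

def reserve(table, cycle, needed):
    table.setdefault(cycle, set()).update(needed)

def merge_at_join(t1, t2):
    """At a CFG join point, a slot is busy if it is busy on either branch."""
    return {c: t1.get(c, set()) | t2.get(c, set()) for c in set(t1) | set(t2)}

table = {}
reserve(table, 0, {"IntALU", "MEM"})    # e.g. an add and a load at cycle 0
c = first_free_cycle(table, {"MEM"})    # a second load must wait: cycle 1
reserve(table, c, {"MEM"})
```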
SoftwarePipelining
DO i = 0, 6
  A
  B
  C
  D
  E
  F
  G
  H
ENDDO

Assuming that the initiation interval (II) is 3 cycles
[Figure: the resulting schedule overlaps iterations and consists of a prologue, a kernel, and an epilogue]
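The overlap structure can be sketched numerically; the assumption that op j of an iteration issues j cycles after the iteration starts is purely for illustrating the prologue/kernel/epilogue shape.

```python
# Software-pipelined schedule shape for the slide's loop: II = 3 cycles,
# 8 ops A..H per iteration, 7 iterations (i = 0..6); op j of iteration i
# issues at cycle 3*i + j under the simplifying assumption above.
II, OPS, N_ITER = 3, "ABCDEFGH", 7

schedule = {}                                   # cycle -> [(op, iteration)]
for i in range(N_ITER):
    for j, op in enumerate(OPS):
        schedule.setdefault(II * i + j, []).append((op, i))

# Prologue: the pipeline fills; kernel: the same op pattern repeats every
# II cycles across overlapping iterations; epilogue: the pipeline drains.
pattern = lambda c: sorted(op for op, _ in schedule[c])
```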
Software Pipelining Example
[Figure: software pipelining example; dependence edges have latencies of more than 3, 2, and 1 cycles]
Modulo Scheduling
[Figure: the loop’s data dependence graph (DDG) and modulo reservation table (MRT)]
Constructing Kernel-Only Code by Predicate Register Rotation
BRT branches to the top and rotates the predicate registers: p1 = p0, p2 = p1, p3 = p2, p0 = p3
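A minimal sketch of the rotation step, where a Python list [p0, p1, p2, p3] stands in for the rotating predicate register file and the mapping follows the slide:

```python
# Predicate register rotation for kernel-only code: each BRT (branch to
# top) applies p1 = p0, p2 = p1, p3 = p2, p0 = p3, so a pipeline stage's
# ops become enabled one iteration later.
def brt_rotate(p):
    """One BRT: rotate the predicates one position."""
    return [p[3], p[0], p[1], p[2]]   # [new p0, new p1, new p2, new p3]

p = [1, 0, 0, 0]       # only ops guarded by p0 execute in the first iteration
p = brt_rotate(p)      # [0, 1, 0, 0]: the stage guarded by p1 runs next
```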
Modulo Variable Expansion (1)
Modulo Variable Expansion (2)