ILP: VLIW Architectures
POLITECNICO DI MILANO
Parallelism in wonderland: are you ready to see how deep the rabbit hole goes?
ILP: VLIW Architectures
Marco D. Santambrogio: marco.santambrogio@polimi.it
Simone Campanoni: xan@eecs.harvard.edu
Outline
Introduction to ILP (a short reminder)
VLIW architecture
Beyond CPI = 1
The initial goal was to achieve CPI = 1. Can we improve beyond this?
Two approaches: superscalar and VLIW.
Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo), e.g. IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000. The most successful approach (to date) for general-purpose computing.
Beyond CPI = 1
(Very) Long Instruction Word, (V)LIW: a fixed number of instructions (4-16), scheduled by the compiler and packed into wide templates. Currently more successful in DSP and multimedia applications. Joint HP/Intel agreement in 1999/2000: Intel Architecture-64 (Merced/IA-64), 64-bit addresses, in the style of an "Explicitly Parallel Instruction Computer" (EPIC).
But first, a little context…
Definition of ILP
ILP = Potential overlap of execution among unrelated instructions
Overlapping is possible if there are:
No structural hazards
No RAW, WAR, or WAW stalls
No control stalls
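As a quick illustration (a C sketch of ours, not from the original slides), the three data dependences behind RAW, WAR, and WAW stalls look like this:

  /* Data dependences between statements; the variable names are
     illustrative. A pipeline stalls when executing these out of
     order would change the result. */
  int a, b, c, d;

  void dependences(void) {
      a = b + c;   /* writes a, reads b                         */
      d = a + 1;   /* RAW: reads a after the write of a above   */
      b = 7;       /* WAR: writes b after the read of b above   */
      a = 9;       /* WAW: writes a after the write of a above  */
  }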
Instruction Level Parallelism
Two strategies to support ILP:
Dynamic Scheduling: Depend on the hardware to locate parallelism
Static Scheduling: Rely on software for identifying potential parallelism
Hardware intensive approaches dominate desktop and server markets
Dynamic Scheduling
The hardware reorders instruction execution to reduce pipeline stalls while maintaining data flow and exception behavior. Main advantages:
It handles some cases where dependences are unknown at compile time.
It reduces compiler complexity.
It allows compiled code to run efficiently on a different pipeline.
These advantages come at the cost of a significant increase in hardware complexity and power consumption.
Static Scheduling
Compilers can use sophisticated algorithms for code scheduling to exploit ILP (Instruction Level Parallelism).
The amount of parallelism available within a basic block – a straight-line code sequence with no branches in except to the entry and no branches out except at the exit – is quite small.
Data dependence can further limit the amount of ILP we can exploit within a basic block to much less than the average basic block size.
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (i.e. across branches).
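As an illustration (our C fragment, not from the slides): inside one basic block a dependence chain serializes the operations, while across loop iterations independent work is available, but only if the compiler can schedule across the branch:

  void example(double *A, double *B, double C, int N,
               double a, double b, double c, double d) {
      /* Within one basic block, a dependence chain serializes execution:
         each statement needs the previous result, so the three
         operations cannot overlap. */
      double t = a + b;
      double u = t * c;
      double v = u - d;
      (void)v;

      /* Across iterations, each element computation is independent,
         so ILP exists, but only across the loop-closing branch. */
      for (int i = 0; i < N; i++)
          B[i] = A[i] + C;
  }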
Detection and resolution of dependences: Static Scheduling
Static detection and resolution of dependences (static scheduling) is accomplished by the compiler: dependences are avoided by code reordering. The compiler's output is code reordered into dependency-free code.
Typical example: VLIW (Very Long Instruction Word) processors expect dependency-free code.
Very Long Instruction Word Architectures
The processor can initiate multiple operations per cycle.
Scheduling is specified completely by the compiler (unlike superscalar).
Low hardware complexity (no scheduling hardware, reduced support for variable-latency instructions).
No instruction reordering is performed by the hardware.
Explicit parallelism.
Single control flow.
VLIW: Very Long Instruction Word
Multiple operations are packed into one instruction.
Each operation slot is for a fixed function.
Constant operation latencies are specified.
The architecture requires a guarantee of:
Parallelism within an instruction => no cross-operation RAW check
No data use before data is ready => no data interlocks
Example machine:
Two integer units, single-cycle latency
Two load/store units, three-cycle latency
Two floating-point units, four-cycle latency
Instruction format: Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
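One way to picture this instruction format is as a record with one slot per functional unit. This is only a sketch of ours, not a real encoding; the field names and widths are assumptions:

  #include <stdint.h>

  /* Hypothetical encoding of one instruction for the machine above:
     two integer slots (1-cycle), two memory slots (3-cycle), and two
     floating-point slots (4-cycle). Slots the compiler cannot fill
     are set to an explicit NOP opcode. */
  typedef struct {
      uint32_t int_op[2];   /* Int Op 1, Int Op 2 */
      uint32_t mem_op[2];   /* Mem Op 1, Mem Op 2 */
      uint32_t fp_op[2];    /* FP Op 1,  FP Op 2  */
  } vliw_bundle;

Since every slot is fetched whether or not it is used, NOP padding contributes directly to the large code size listed among the cons below.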
VLIW Compiler Responsibilities
The compiler schedules operations to maximize parallel execution.
It exploits ILP and loop-level parallelism (LLP): the instructions must be mapped onto the machine's functional units, and this mapping must account for time constraints and dependencies among the tasks.
It guarantees intra-instruction parallelism.
It schedules to avoid data hazards (no interlocks), typically separating dependent operations with explicit NOPs.
The goal is to minimize the total execution time of the program.
A VLIW Machine Configuration
VLIW: Pros and Cons
Pros:
Simple hardware
Easy to extend the number of FUs
Good compilers can effectively detect parallelism
Cons:
A huge number of registers is needed to keep the FUs active (to store operands and results)
Large data transport capacity is needed between FUs and register files, and between register files and memory
High bandwidth is needed between the i-cache and the fetch unit
Large code size
Loop Execution

for (i=0; i<N; i++)
    B[i] = A[i] + C;

Compile:

loop: ld   f1, 0(r1)
      add  r1, 8
      fadd f2, f0, f1
      sd   f2, 0(r2)
      add  r2, 8
      bne  r1, r3, loop

Schedule (slots: Int1 | Int2 | M1 | M2 | FP+ | FPx):

Cycle  Int1    Int2   M1     M2     FP+      FPx
1                     ld f1
2      add r1
3
4                                   fadd f2
5
6
7
8      add r2  bne    sd f2

How many FP ops/cycle? 1 fadd / 8 cycles = 0.125
Loop Unrolling
for (i=0; i<N; i++)
B[i] = A[i] + C;
for (i=0; i<N; i+=4)
{
B[i] = A[i] + C;
B[i+1] = A[i+1] + C;
B[i+2] = A[i+2] + C;
B[i+3] = A[i+3] + C;
}
Unroll the inner loop to perform 4 iterations at once.
Values of N that are not a multiple of the unrolling factor must be handled with a final cleanup loop, as sketched below.
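In C, the complete transformation looks roughly like this (same loop as above; the guard in the main loop and the trailing cleanup loop are the standard pattern):

  /* Main unrolled loop: runs while at least 4 elements remain. */
  int i;
  for (i = 0; i + 3 < N; i += 4) {
      B[i]   = A[i]   + C;
      B[i+1] = A[i+1] + C;
      B[i+2] = A[i+2] + C;
      B[i+3] = A[i+3] + C;
  }
  /* Cleanup loop: handles the last N % 4 elements. */
  for (; i < N; i++)
      B[i] = A[i] + C;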
Scheduling Loop Unrolled Code

loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      sd   f8, 24(r2)
      add  r2, 32
      bne  r1, r3, loop

Schedule (unroll 4 ways; slots: Int1 | Int2 | M1 | M2 | FP+ | FPx):

Cycle  Int1    Int2   M1     M2     FP+      FPx
1                     ld f1  ld f2
2      add r1         ld f3  ld f4
3
4                                   fadd f5
5                                   fadd f6
6                                   fadd f7
7                                   fadd f8
8                     sd f5
9                     sd f6
10                    sd f7
11     add r2  bne    sd f8

How many FLOPS/cycle? 4 fadds / 11 cycles = 0.36
Software Pipelining

loop: ld   f1, 0(r1)
      ld   f2, 8(r1)
      ld   f3, 16(r1)
      ld   f4, 24(r1)
      add  r1, 32
      fadd f5, f0, f1
      fadd f6, f0, f2
      fadd f7, f0, f3
      fadd f8, f0, f4
      sd   f5, 0(r2)
      sd   f6, 8(r2)
      sd   f7, 16(r2)
      add  r2, 32
      sd   f8, -8(r2)
      bne  r1, r3, loop

Schedule (unroll 4 ways first; slots: Int1 | Int2 | M1 | M2 | FP+ | FPx). Operations from three consecutive unrolled iterations are in flight at once: the loads of one iteration overlap the fadds of the previous iteration and the stores of the iteration before that. A prolog fills the pipeline, the kernel ("iterate") runs in the steady state, and an epilog drains it:

prolog:  loads (and then fadds) of the first iterations, no stores yet

iterate (steady state, one 4-cycle kernel per unrolled iteration):

Cycle  Int1    Int2   M1     M2     FP+      FPx
1                     ld f1  ld f2  fadd f6
2      add r1         ld f3  ld f4  fadd f7
3                     sd f6  sd f7  fadd f8
4      add r2  bne    sd f5  sd f8  fadd f5

epilog:  remaining fadds and stores of the last iterations

How many FLOPS/cycle? 4 fadds / 4 cycles = 1
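At the source level, the transformation can be sketched as follows (our illustrative C, ignoring the 4-way unrolling; it assumes N >= 2). The kernel stores the result of iteration i, computes the add for iteration i+1, and loads for iteration i+2, mirroring the overlapped schedule above:

  void sw_pipelined(double *A, double *B, double C, int N) {
      /* prolog: fill the pipeline */
      double t1 = A[1];
      double s0 = A[0] + C;

      /* kernel: steady state, three iterations in flight */
      for (int i = 0; i < N - 2; i++) {
          B[i] = s0;          /* store for iteration i    */
          s0   = t1 + C;      /* add for iteration i+1    */
          t1   = A[i + 2];    /* load for iteration i+2   */
      }

      /* epilog: drain the pipeline */
      B[N - 2] = s0;
      B[N - 1] = t1 + C;
  }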
Software Pipelining vs. Loop Unrolling

Figure: performance over time for a loop-unrolled loop vs. a software-pipelined loop. The unrolled loop shows startup and wind-down overhead within every loop iteration; the software-pipelined loop shows it only at the very beginning and end.

Software pipelining pays startup/wind-down costs only once per loop, not once per iteration.
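To make the figure concrete with the cycle counts derived earlier: for N elements, the unrolled schedule costs about 11 cycles per 4 elements, i.e. 11N/4 cycles in total, while the software-pipelined loop costs about 4 cycles per 4 elements, i.e. N cycles, plus a one-time prolog and epilog. The startup/wind-down cost is therefore amortized over the whole loop rather than paid in every iteration.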
What follows