Lecture 4 Instruction Level Parallelism (2) EEC 171 Parallel Architectures John Owens UC Davis

Page 1

Lecture 4: Instruction Level Parallelism (2)

EEC 171 Parallel Architectures
John Owens

UC Davis

Page 2

Credits

• © John Owens / UC Davis 2007–9.

• Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Wen-Mei Hwu/David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–6, © John Lazzaro / UCB 2006, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asanovic/Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998.

Page 3

Today’s Goals

• Out-of-order execution

• An alternate approach to machine parallelism: software scheduling & VLIW

• How do we ensure we have ample instruction-level parallelism?

• Branch prediction

Page 4

Pentium Retrospective

• Limited in performance by “front end”

• Has to support variable-length instructions and segments

• Supporting all x86 features is tough!

• 30% of transistors are for legacy support

• Up to 40% in Pentium Pro!

• Down to 10% in P4

• Microcode ROM is huge

[Figure: block diagram of the two-pipe Pentium. Front End: Fetch, Decode/Dispatch, Issue. Execution Core: ALU 1 and ALU 2, each with Execute and Complete stages. Commit Unit: Write-back]

Page 5

Pentium Retrospective

• Pentium is in-order issue, in-order complete

• “Static scheduling” by the dispatch logic:

• Fetch/dispatch/execute/retire: all in order

• Drawbacks:

• Adapts poorly to dynamic code stream

• Adapts poorly to future hardware

• What if we had 3 pipes not 2?

[Figure: block diagram of the two-pipe Pentium. Front End: Fetch, Decode/Dispatch, Issue. Execution Core: ALU 1 and ALU 2, each with Execute and Complete stages. Commit Unit: Write-back]

Page 6

Multiple-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software fixes, the fundamental limitations of

• Storage (data) dependencies—aka data hazards

• Most instruction streams do not have huge ILP so …

• ... this limits performance in a superscalar processor

Page 7

Multiple-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software fixes, the fundamental limitations of

• Procedural dependencies—aka control hazards

• Ditto, but even more severe

• Use dynamic branch prediction to help resolve the ILP issue

• Future lecture

Page 8

Multiple-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software fixes, the fundamental limitations of

• Resource conflicts—aka structural hazards

• A SS/VLIW processor has a much larger number of potential resource conflicts

• Functional units may have to arbitrate for result buses and register-file write ports

• Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource

Page 9

Instruction Issue and Completion Policies

• Instruction-issue—initiate execution

• Instruction lookahead capability—fetch, decode and issue instructions beyond the current instruction

• Instruction-completion—complete execution

• Processor lookahead capability—complete issued instructions beyond the current instruction

• Instruction-commit—write back results to the RegFile or D$ (i.e., change the machine state)

• In-order issue with in-order completion

• In-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion and in-order commit

Page 10

In-Order Issue with In-Order Completion

• Simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order)

Page 11

In-Order Issue with In-Order Completion (Ex.)

• Assume a pipelined processor that can fetch and decode two instructions per cycle, that has three functional units (a single cycle adder, a single cycle shifter, and a two cycle multiplier), and that can complete (and write back) two results per cycle

• Instruction sequence:
I1: needs two execute cycles (a multiply)
I2
I3
I4: needs the same function unit as I3
I5: needs data value produced by I4
I6: needs the same function unit as I5

Page 12

In-Order Issue, In-Order Completion Example

[Pipeline diagram: I1–I6 each pass through IF, ID, EX, WB in strict program order; two instructions can be fetched/decoded and two committed per cycle. I1 spends two cycles in EX; I4 stalls for I3's function unit; I5 stalls for I4's result; I6 stalls for I5's function unit]

Page 13

In-Order Issue with Out-of-Order Completion

• With out-of-order completion, a later instruction may complete before a previous instruction

• Out-of-order completion is used in single-issue pipelined processors to improve the performance of long-latency operations such as divide

• When using out-of-order completion, instruction issue is stalled when there is a resource conflict (e.g., for a functional unit) or when the instructions ready to issue need a result that has not yet been computed

Page 14

IOI-OOC Example

[Pipeline diagram: the same six instructions with in-order issue but out-of-order completion; shorter instructions write back before the two-cycle I1, so completion order no longer matches program order, though issue still stalls on resource conflicts and unavailable results. I1: two execute cycles; I4: same function unit as I3; I5: data value produced by I4; I6: same function unit as I5]

Page 15

Handling Output Dependencies

• There is one more situation that stalls instruction issuing with IOI-OOC. Assume:

• I1 – writes to R3
I2 – writes to R3
I5 – reads R3

• If the I1 write occurs after the I2 write, then I5 reads an incorrect value for R3

• I2 has an output dependency on I1—write before write

• The issuing of I2 would have to be stalled if its result might later be overwritten by a previous instruction (i.e., I1) that takes longer to complete—the stall happens before instruction issue

• While IOI-OOC yields higher performance, it requires more dependency checking hardware (both read-before-write and write-before-write)

Page 16

Out-of-Order Issue with Out-of-Order Completion

• With in-order issue the processor stops decoding instructions whenever a decoded instruction has a resource conflict or a data dependency on an issued, but uncompleted instruction

• The processor is not able to look beyond the conflicted instruction even though more downstream instructions might have no conflicts and thus be issueable

• Fetch and decode instructions beyond the conflicted one (“instruction window”: Tetris), store them in an instruction buffer (as long as there’s room), and flag those instructions in the buffer that don’t have resource conflicts or data dependencies

• Flagged instructions are then issued from the buffer without regard to their program order

Page 17

OOI-OOC Example

[Pipeline diagram: the same six instructions with out-of-order issue and out-of-order completion; I6 issues from the instruction buffer ahead of the stalled I5, filling an otherwise idle function-unit slot, so the sequence finishes sooner than in the in-order cases. I1: two execute cycles; I4: same function unit as I3; I5: data value produced by I4; I6: same function unit as I5]

Page 18

Dependency Examples

• R3 := R3 * R5    True data dependency (RAW)
  R4 := R3 + 1     Output dependency (WAW)
  R3 := R5 + 1     Antidependency (WAR)

Page 19

Antidependencies

• With OOI we also have to deal with data antidependencies – when a later instruction (that completes earlier) produces a data value that destroys a data value used as a source in an earlier instruction (that issues later)

• The constraint is similar to that of true data dependencies, except reversed

• Instead of the later instruction using a value (not yet) produced by an earlier instruction (read before write), the later instruction produces a value that destroys a value that the earlier instruction (has not yet) used (write before read)

Page 20

Dependencies Review

• Each of the three data dependencies …

• True data dependencies (read before write)

• Antidependencies (write before read)

• Output dependencies (write before write)

• … manifests itself through the use of registers (or other storage locations)

• True dependencies represent the flow of data and information through a program

• Anti- and output dependencies arise because the limited number of registers means that programmers reuse registers for different computations

• When instructions are issued out-of-order, the correspondence between registers and values breaks down and the values conflict for registers: storage conflicts
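The three dependency types above reduce to simple checks on the register sets each instruction reads and writes. A minimal sketch (the tuple encoding and function name are ours, not from the lecture):

```python
# Classify data dependencies between two instructions, each encoded as
# (destination register, set of source registers).

def classify(earlier, later):
    """Return the dependency types of `later` on `earlier`."""
    dst1, srcs1 = earlier
    dst2, srcs2 = later
    deps = []
    if dst1 in srcs2:
        deps.append("RAW")   # true dependency: read after write
    if dst1 == dst2:
        deps.append("WAW")   # output dependency: write after write
    if dst2 in srcs1:
        deps.append("WAR")   # antidependency: write after read
    return deps

# The slide's example: R3 := R3 * R5; R4 := R3 + 1; R3 := R5 + 1
i1 = ("R3", {"R3", "R5"})
i2 = ("R4", {"R3"})
i3 = ("R3", {"R5"})

print(classify(i1, i2))  # ['RAW']        RAW on R3
print(classify(i1, i3))  # ['WAW', 'WAR'] WAW (and WAR) on R3
print(classify(i2, i3))  # ['WAR']        WAR on R3
```

Only the RAW case reflects real dataflow; the other two are storage conflicts that register renaming can remove.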

Page 21

Storage Conflicts and Register Renaming

• Storage conflicts can be reduced (or eliminated) by increasing or duplicating the troublesome resource

• Provide additional registers that are used to reestablish the correspondence between registers and values

• Allocated dynamically by the hardware in SS processors

• Register renaming — the processor renames the original register identifier in the instruction to a new register (one not in the visible register set)

• R3 := R3 * R5  →  R3b := R3a * R5a
  R4 := R3 + 1   →  R4a := R3b + 1
  R3 := R5 + 1   →  R3c := R5a + 1

• The hardware that does renaming assigns a “replacement” register from a pool of free registers and releases it back to the pool when its value is superseded and there are no outstanding references to it [future lecture!]
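The renaming rule on this slide (every write gets a fresh register from the pool; every read sees the latest mapping) can be sketched in a few lines. The physical-register names p0, p1, … and the tuple encoding are our own illustration; real hardware also tracks busy/free state:

```python
# Minimal register-renaming sketch: map each architectural register to its
# most recent physical ("replacement") register, allocating a new one on
# every write so WAW and WAR conflicts disappear.

from itertools import count

def rename(instrs):
    phys = count()                     # endless supply of physical regs
    latest = {}                        # arch reg -> current physical name
    def read(r):
        if r not in latest:            # first read of a live-in value
            latest[r] = f"p{next(phys)}"
        return latest[r]
    out = []
    for dst, srcs in instrs:
        new_srcs = [read(r) for r in srcs]   # read current mappings first
        latest[dst] = f"p{next(phys)}"       # every write gets a new reg
        out.append((latest[dst], new_srcs))
    return out

# Slide example: R3 := R3 * R5; R4 := R3 + 1; R3 := R5 + 1
renamed = rename([("R3", ["R3", "R5"]),
                  ("R4", ["R3"]),
                  ("R3", ["R5"])])
for dst, srcs in renamed:
    print(dst, ":=", *srcs)
```

The third instruction now writes p4 rather than overwriting the first result (p2), so only the true RAW dependency remains.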

Page 22

Pentium Pro

[Figure: Pentium Pro block diagram. Front End: Instruction Fetch, BPU, Translate x86/Decode, Branch Unit; uops flow into the Reservation Station (RS). Execution Core: scalar ALUs (SIU, CIU), Floating-Point Unit (FPU), Branch Unit (BU), and the Load-Store Unit's memory access units (Load Addr., Store Addr., Store Data). Commitment Unit: Re-order Buffer (ROB) and Commit]

Page 23

Pentium Pro

1. Fetch In order

2. Decode/dispatch In order

3. Issue Reorder

4. Execute Out of order

5. Complete Reorder

6. Writeback (commit) In order

[Figure: block diagram of the two-pipe machine. Front End: Fetch, Decode/Dispatch, Issue. Execution Core: ALU 1 and ALU 2, each with Execute and Complete stages. Commit Unit: Write-back]

Page 24

P6 Pipeline

• Instruction fetch, BTB access (3.5 stages)

• 2 cycles for instruction fetch

• Decode, x86->uops (2.5 stages)

• Register rename (1 stage)

• Write to reservation station (1 stage)

• Read from reservation station (1 stage)

• Execute (1+ stages)

• Commit (2 stages)

Page 25

Pentium Pro backends

• Pentium Pro

• Pentium 2

• Pentium 3

[Figure: three P6-family execution cores, each a Reservation Station (RS) dispatching over ports 0–4 to scalar ALUs (SIU, CIU), an FPU, a Branch Unit (BU), and memory access units (Load Addr., Store Addr., Store Data). Pentium Pro: the base core. Pentium 2: adds an MMX unit (MMX 0, MMX 1). Pentium 3: adds vector ALUs, an MMX/SSE unit and an FP/SSE unit with VFADD, VFMUL, VSHUFF, VRECIP]

Page 26

Where Do We Get ILP?

• All of these techniques require that we have ample instruction level parallelism

• Original P4 has 20 stages, 6 µops per cycle

• Lots of instructions in flight!
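A rough upper bound on "instructions in flight" is pipeline depth times issue width, using the slide's Pentium 4 numbers:

```python
# Back-of-envelope bound on in-flight work: pipeline depth x issue width.
# Numbers are from the slide (original Pentium 4: 20 stages, 6 uops/cycle).
stages, uops_per_cycle = 20, 6
in_flight_bound = stages * uops_per_cycle
print(in_flight_bound)  # 120 uops if every stage of every pipe were full
```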

Page 27

Hardware limits to superpipelining?

[Chart: CPU clock periods in FO4 delays, 1985–2005 (courtesy François Labonte, Stanford). Historical limit: about 12 FO4 per stage. MIPS R2000: 5 stages; Pentium Pro: 10 stages; Pentium 4: 20 stages. Power wall: Intel Core Duo has 14 stages]

Page 28

VLIW Beginnings

• VLIW: Very Long Instruction Word

• Josh Fisher: the idea grew out of his Ph.D. work (1979) in compilers

• Led to a startup (MultiFlow) whose computers worked, but which went out of business ... the ideas remain influential.

Page 29

History of VLIW Processors

• Started with (horizontal) microprogramming

• Very wide microinstructions used to directly generate control signals in single-issue processors (e.g., IBM 360 series)

• VLIW for multi-issue processors first appeared in machines from Multiflow and Cydrome (in the early 1980s)

• Current commercial VLIW processors

• Intel i860 RISC (dual mode: scalar and VLIW)

• Intel IA-64 (EPIC: Itanium and Itanium 2) [future lecture]

• Transmeta Crusoe

• Lucent/Motorola StarCore, ADI TigerSHARC, Infineon (Siemens) Carmel

Page 30

Static Multiple Issue Machines (VLIW)

• Static multiple-issue processors (aka VLIW) use the compiler to decide which instructions to issue and execute simultaneously

• Issue packet—the set of instructions that are bundled together and issued in one clock cycle—think of it as one large instruction with multiple operations

• The mix of instructions in the packet (bundle) is usually restricted—a single “instruction” with several predefined fields

• The compiler does static branch prediction and code scheduling to reduce (ctrl) or eliminate (data) hazards

Page 31

Static Multiple Issue Machines (VLIW)

• VLIW’s have

• Multiple functional units (like SS processors)

• Multi-ported register files (again like SS processors)

• Wide program bus

Page 32

An Example: A VLIW MIPS

• Consider a 2-issue MIPS with a two-instruction bundle

• Instructions are always fetched, decoded, and issued in pairs

• If one instruction of the pair cannot be used, it is replaced with a nop

• Need 4 read ports and 2 write ports and a separate memory address adder

64-bit bundle: one ALU op (R format) or branch (I format) slot, plus one load or store (I format) slot

Page 33

A MIPS VLIW (2-issue) Datapath

[Figure: 2-issue VLIW MIPS datapath. The PC and instruction memory fetch a 64-bit bundle; the register file has 4 read and 2 write ports; one ALU serves the ALU/branch slot, while a separate address adder and sign-extend unit serve the load/store slot and data memory]

No hazard hardware (so no load use allowed)

Let's say we wanted more functional units. What would need to change?

Page 34

Code Scheduling Example

• Consider the following loop code:

lp: lw   $t0,0($s1)   # $t0 = array element
    addu $t0,$t0,$s2  # add scalar in $s2
    sw   $t0,0($s1)   # store result
    addi $s1,$s1,-4   # decrement pointer
    bne  $s1,$0,lp    # branch if $s1 != 0

• Must “schedule” the instructions to avoid pipeline stalls

• Instructions in one bundle must be independent

• Must separate load use instructions from their loads by one cycle

• Notice that the first two instructions have a load use dependency, the next two and last two have data dependencies

• Assume branches are perfectly predicted by the hardware
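The two constraints above (instructions in a bundle must be independent; a load's consumer must sit at least one cycle later) can be machine-checked. The instruction encoding below is our own illustration, not the lecture's:

```python
# Check a candidate 2-issue schedule against the slide's two rules:
# (1) no RAW dependency inside a bundle, (2) no load-use in the same or
# the immediately following cycle. schedule = list of bundles; each
# bundle = list of (op, dst, srcs) tuples.

def violations(schedule):
    probs = []
    loads = {}                                   # dst -> cycle of the load
    for cc, bundle in enumerate(schedule, 1):
        for i, (op, dst, srcs) in enumerate(bundle):
            for (op2, dst2, srcs2) in bundle[:i]:    # intra-bundle RAW
                if dst2 in srcs:
                    probs.append((cc, f"bundle RAW on {dst2}"))
            for r in srcs:                       # load-use too close?
                if r in loads and cc - loads[r] < 2:
                    probs.append((cc, f"load-use on {r}"))
            if op == "lw":
                loads[dst] = cc
    return probs

# Bundling the first two loop instructions together breaks both rules:
bad = [[("lw", "$t0", ["$s1"]), ("addu", "$t0", ["$t0", "$s2"])]]
print(violations(bad))
```

An empty result means the schedule obeys both constraints; the `bad` schedule reports an intra-bundle RAW and a load-use violation.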

Page 35

The Scheduled Code (Not Unrolled)

• How many clock cycles?

• How many instructions?

• CPI? Best case?

• IPC? Best case?

(Table to fill in: columns "ALU or branch", "Data transfer", "CC", with rows for cycles 1–5 starting at lp:)

Page 36

Loop Unrolling

• Loop unrolling—multiple copies of the loop body are made and instructions from different iterations are scheduled together as a way to increase ILP

• Apply loop unrolling (4 times for our example) and then schedule the resulting code

• Eliminate unnecessary loop overhead instructions

• Schedule so as to avoid load use hazards

• During unrolling the compiler applies register renaming to eliminate all data dependencies that are not true dependencies
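The transformation itself can be sketched in scalar Python (assuming, as in the slides' example, a trip count that is a multiple of 4):

```python
# 4x loop unrolling: the unrolled body does the same work with one index
# update and one loop "branch" per four elements instead of per element.

def scale(a, s):
    for i in range(len(a)):
        a[i] += s

def scale_unrolled4(a, s):
    i = 0
    while i < len(a):            # one branch per 4 elements
        a[i]     += s            # the compiler renames the temporaries
        a[i + 1] += s            # ($t0..$t3), so these four statements
        a[i + 2] += s            # are independent of each other
        a[i + 3] += s
        i += 4                   # one pointer/index update per iteration

x, y = list(range(8)), list(range(8))
scale(x, 10)
scale_unrolled4(y, 10)
print(x == y)  # True
```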

Page 37

Unrolled Code Example

lp: lw   $t0,0($s1)    # $t0 = array element
    lw   $t1,-4($s1)   # $t1 = array element
    lw   $t2,-8($s1)   # $t2 = array element
    lw   $t3,-12($s1)  # $t3 = array element
    addu $t0,$t0,$s2   # add scalar in $s2
    addu $t1,$t1,$s2   # add scalar in $s2
    addu $t2,$t2,$s2   # add scalar in $s2
    addu $t3,$t3,$s2   # add scalar in $s2
    sw   $t0,0($s1)    # store result
    sw   $t1,-4($s1)   # store result
    sw   $t2,-8($s1)   # store result
    sw   $t3,-12($s1)  # store result
    addi $s1,$s1,-16   # decrement pointer
    bne  $s1,$0,lp     # branch if $s1 != 0

Page 38

The Scheduled Code (Unrolled)

• Eight clock cycles to execute 14 instructions:

• CPI of 0.57 (versus the best case of 0.5)

• IPC of 1.8 (versus the best case of 2.0)

ALU or branch          Data transfer     CC
lp: addi $s1,$s1,-16   lw $t0,0($s1)     1
                       lw $t1,12($s1)    2
addu $t0,$t0,$s2       lw $t2,8($s1)     3
addu $t1,$t1,$s2       lw $t3,4($s1)     4
addu $t2,$t2,$s2       sw $t0,16($s1)    5
addu $t3,$t3,$s2       sw $t1,12($s1)    6
                       sw $t2,8($s1)     7
bne $s1,$0,lp          sw $t3,4($s1)     8
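The CPI/IPC arithmetic for this schedule as a quick check (14/8 = 1.75, which the slide rounds to 1.8):

```python
# CPI/IPC for the unrolled, scheduled loop: 14 instructions retire in
# 8 cycles on the 2-issue machine (best case would be 2 per cycle).
cycles, instructions, issue_width = 8, 14, 2
cpi = cycles / instructions
ipc = instructions / cycles
print(round(cpi, 2), round(ipc, 2))  # 0.57 1.75 (best case: 0.5 and 2.0)
```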

Page 39

What does N = 14 assembly look like?

• Two instructions from a scientific benchmark (Linpack) for a MultiFlow CPU with 14 operations per instruction.

Page 40

Defining Attributes of VLIW

• Compiler:

• 1. MultiOp: instruction containing multiple independent operations

• 2. Specified number of resources of specified types

• 3. Exposed, architectural latencies

[Figure: one VLIW instruction = 5 independent operations (add, nop, nop, load, store) fetched from the Icache and issued to five functional units (Add, Add, Mpy, Mem, Mem) sharing one register file]

Page 41

Compiler Support for VLIW Processors

• The compiler packs groups of independent instructions into the bundle

• Because static branch prediction is not perfect, this is done by code re-ordering (trace scheduling)

• We’ll cover this in a future lecture

• The compiler uses loop unrolling to expose more ILP

• The compiler uses register renaming to solve name dependencies and ensures no load use hazards occur

Page 42

Compiler Support for VLIW Processors

• While superscalars use dynamic prediction, VLIW’s primarily depend on the compiler for extracting ILP

• Loop unrolling reduces the number of conditional branches

• Predication eliminates if-then-else branch structures by replacing them with predicated instructions

• We’ll cover this in a future lecture as well

• The compiler predicts memory bank references to help minimize memory bank conflicts

Page 43

VLIW Advantages

• Advantages

• Simpler hardware (potentially less power hungry)

• Potentially more scalable

• Allow more instructions per VLIW bundle and add more FUs

Page 44

VLIW Disadvantages

• Programmer/compiler complexity and longer compilation times

• Deep pipelines and long latencies can be confusing (making peak performance elusive)

• Lock-step operation, i.e., on a hazard all future issues stall until the hazard is resolved (hence the need for predication)

• Object (binary) code incompatibility

• Needs lots of program memory bandwidth

• Code bloat

• Noops are a waste of program memory space

• Loop unrolling to expose more ILP uses more program memory space

Page 45

Review: Multi-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software

• Data dependencies – aka data hazards

• True data dependencies (read after write)

• Use data forwarding hardware

• Use compiler scheduling

• Storage dependence (aka name dependence)

• Use register renaming to solve both

• Antidependencies (write after read)

• Output dependencies (write after write)

Page 46

Review: Multi-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software

• Procedural dependencies – aka control hazards

• Use aggressive branch prediction (speculation)

• Use predication

• Future lecture

Page 47

Review: Multi-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software

• Resource conflicts—aka structural hazards

• Use resource duplication or resource pipelining to reduce (or eliminate) resource conflicts

• Use arbitration for result and commit buses and register file read and write ports

Page 48

Review: Multiple-Issue Processor Styles

• Dynamic multiple-issue processors (aka superscalar)

• Decisions on which instructions to execute simultaneously (in the range of 2 to 8 in 2005) are being made dynamically (at run time by the hardware)

• E.g., IBM Power 2, Pentium 4, MIPS R10K, HP PA 8500

• Static multiple-issue processors (aka VLIW)

• Decisions on which instructions to execute simultaneously are being made statically (at compile time by the compiler)

• E.g., Intel Itanium and Itanium 2 for the IA-64 ISA – EPIC (Explicitly Parallel Instruction Computing)

• 128-bit “bundles” containing three 41-bit instructions plus a 5-bit template field (specifies which FU each instruction needs)

• Five functional units (IntALU, MMedia, DMem, FPALU, Branch)

• Extensive support for speculation and predication

Page 49

CISC vs RISC vs SS vs VLIW

(Comparison table to fill in: rows for instr size, instr format, registers, memory reference, key issues, and instruction flow, with columns CISC, RISC, superscalar, and VLIW; the instruction-flow row sketches IF ID EX M WB pipelines: a single stream for CISC/RISC, two parallel streams per cycle for superscalar, and one fetch/decode feeding multiple EX M WB lanes for VLIW)

Page 50

What is a basic block?

• “Experiments and experience indicated that only a factor of 2 to 3 speedup from parallelism was available within basic blocks. (A basic block of code has no jumps in except at the beginning and no jumps out except at the end.)” — “Very Long Instruction Word Architectures and the ELI-512”, Joseph A. Fisher
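Fisher's definition translates directly into the standard "leader" algorithm for splitting straight-line code into basic blocks. The toy instruction format below is our own illustration:

```python
# Leader-based basic-block splitting: a new block starts at the entry
# point, at any jump target, and just after any jump. Each instruction
# is a tuple ("op", ...); branches/jumps carry a target index.

def basic_blocks(code):
    leaders = {0}
    for i, instr in enumerate(code):
        if instr[0] in ("br", "jmp"):
            leaders.add(instr[1])        # a jump target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)       # the fall-through starts one too
    cuts = sorted(leaders) + [len(code)]
    return [code[a:b] for a, b in zip(cuts, cuts[1:])]

prog = [("add",), ("mul",), ("br", 0), ("sub",), ("add",), ("jmp", 3)]
for block in basic_blocks(prog):
    print(block)
```

Each resulting block has no jumps in except at its beginning and no jumps out except at its end, exactly as in Fisher's definition.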

Page 51

Branches Limit ILP

• Programs average about 5 instructions between branches

• Can’t issue instructions if you don’t know where the program is going

• Current processors issue 4–6 operations/cycle

• Conclusion: Must exploit parallelism across multiple basic blocks

Page 52

Branch Prediction Matters

• Alpha 21264:

• (From Ranganathan and Jouppi, via Dan Connors)

Benchmark   Misprediction Rate   Performance Penalty
go          16.5%                40%
compress    9%                   30%
gcc         7%                   20%

Page 53

[Chart: Branch Prediction Impact. IPC (y-axis 0–70) for gcc, espresso, li, fpppp, doducd, and tomcatv under five predictors: Perfect, Selective predictor, Standard 2-bit, Static, None. FP programs reach 15–45 IPC; integer programs 6–12]

Page 54

Compiler: Static Prediction

• Predict at compile time whether branches will be taken before execution

• Schemes

• Predict taken

• Would be hard to squeeze into our pipeline

• Can’t compute target until ID

Page 55

Compiler: Static Prediction

• Predict at compile time whether branches will be taken before execution

• Schemes

• Backwards taken, forwards not taken

• Why is this a good idea?

Page 56

Compiler: Static Prediction

• Predict at compile time whether branches will be taken before execution

• Schemes

• Predict taken

• Backwards taken, forwards not taken (good performance for loops)

• No run-time adaptation: bad performance for data-dependent branches

• if (a == 0) b = 3; else b = 4;

Page 57:

Hardware-based Dynamic Branch Prediction

• Single level (Simple counters) – predict outcome based on past branch behavior

• FSM (Finite State Machine)

• Global Branch Correlation – track relations between branches

• GAs

• Gshare

• Local Correlation – predict outcome based on the branch’s own past behavior pattern

• PAs

• Hybrid predictors (combination of local and global)

• Miscellaneous

• Return Address Stack (RAS)

• Indirect jump prediction

Page 58:

Misprediction Detection and Feedback

• Detections:

• At the end of decoding

• Target address known at decoding, and does not match

• Flush fetch stage

• At commit (most cases)

• Wrong branch direction or target address does not match

• Flush the whole pipeline

• Feedback:

• Any time a mis-prediction is detected

• At a branch’s commit

[Pipeline diagram: FETCH → DECODE → SCHD (REB/ROB) → EXE → WB → COMMIT, with feedback paths from the later stages back to the predictors at fetch]

Page 59:

1-bit “Self Correlating” Predictor

• Let’s consider a simple model. Store a bit per branch: “last time, was the branch taken or not”.

• Consider a loop of 10 iterations before exit:

• for (i = 0; i < 10; i++) a[i] = a[i] * 2.0;

• What’s the accuracy of this predictor?

Page 60:

Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)

• Branch History Table: Lower bits of PC address index table of 1-bit values

• Says whether or not branch taken last time

• No address check

• Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):

• End of loop case, when it exits instead of looping as before

• First time through loop on next time through code, when it predicts exit instead of looping

Page 61:

Predictor for a Single Branch

[Diagram, general form: 1. Access the predictor state by PC; 2. Predict output T/NT; 3. Feedback T/NT once the branch resolves]

[Diagram, 1-bit prediction: two states, Predict Taken (1) and Predict Not Taken (0); T feedback moves to state 1, NT feedback moves to state 0]

Page 62:

Dynamic Branch Prediction (Jim Smith, 1981)

• Solution: 2-bit scheme where we change the prediction only after mispredicting twice:

• Red: stop, not taken

• Green: go, taken

• Adds hysteresis to the decision-making process

[Diagram: four-state machine with two Predict Taken states and two Predict Not Taken states; each T/NT outcome moves one state toward taken/not taken, so a single misprediction shifts from the strong to the weak state and only a second consecutive misprediction flips the prediction]

Page 63:

Simple (“2-bit”) Branch History Table Entry

[Diagram: two D flip-flops per entry]

• Flip-flop 1 — prediction for the next branch (1 = take, 0 = not take); initialize to 0

• Flip-flop 2 — was the last prediction correct? (1 = yes, 0 = no); initialize to 1

After we “check” the prediction:

• Set the “last prediction correct” bit to 1 if the prediction bit was correct, to 0 if it was incorrect; set it to 1 if the prediction bit flips

• Flip the prediction bit only if the prediction is not correct and the “last prediction correct” bit is 0

Page 64:

Branch prediction hardware

• Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken)

• Note: must check for a branch match now, since we can’t use the wrong branch’s address

[Diagram: at FETCH, the PC of the instruction indexes the BTB; each entry holds a Branch PC and a Predicted PC. An “=?” compare of the stored Branch PC against the fetch PC validates the entry (BTB knows about the branch); on a hit, the taken-or-untaken prediction selects the address of the next instruction fetched]

Page 65:

Some Interesting Patterns

• Format: Not-taken (N) = 0, Taken (T) = 1

• TTTTTTTTTT

• 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 … : Should give perfect prediction

• NNTTNNTTNNTT

• 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 … : Will mispredict 1/2 of the time

• N*N[TNTN]

• 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 … : Should alternate incorrectly

• N*T[TNTN]

• 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 … : Should alternate incorrectly

Page 66:

Pentium 4 Branch Prediction

• Critical to performance

• 20-cycle penalty for misprediction

• Branch Target Buffer

• 2048 entries

• 12 bits of history

• Adaptive algorithm

• Can recognize repeated patterns, e.g., alternating taken–not taken

• Handling BTB misses

• Detect in cycle 6

• Predict taken for negative offset, not taken for positive (why?)

Page 67:

Branch Prediction Summary

• For each branch prediction scheme, consider:

• Hardware cost

• Prediction accuracy

• Warm-up time

• Correlation

• Interference

• Time to generate prediction

• Application behavior determines number of branches

• The more control-intensive the program, the more opportunities there are to mispredict

• What if a compiler/architecture could eliminate branches?