Lecture 4 Instruction Level Parallelism (2) EEC 171 Parallel Architectures John Owens UC Davis

Page 1

Lecture 4: Instruction Level Parallelism (2)

EEC 171 Parallel Architectures
John Owens

UC Davis

Page 2

Credits

• © John Owens / UC Davis 2007–9.

• Thanks to many sources for slide material: Computer Organization and Design (Patterson & Hennessy) © 2005, Computer Architecture (Hennessy & Patterson) © 2007, Inside the Machine (Jon Stokes) © 2007, © Dan Connors / University of Colorado 2007, © Wen-Mei Hwu/David Kirk, University of Illinois 2007, © David Patterson / UCB 2003–6, © John Lazzaro / UCB 2006, © Mary Jane Irwin / Penn State 2005, © John Kubiatowicz / UCB 2002, © Krste Asanovic/Arvind / MIT 2002, © Morgan Kaufmann Publishers 1998.

Page 3

Today’s Goals

• Out-of-order execution

• An alternate approach to machine parallelism: software scheduling & VLIW

• How do we ensure we have ample instruction-level parallelism?

• Branch prediction

Page 4

Pentium Retrospective

• Limited in performance by “front end”

• Has to support variable-length instructions and segments

• Supporting all x86 features is tough!

• 30% of transistors are for legacy support

• Up to 40% in Pentium Pro!

• Down to 10% in P4

• Microcode ROM is huge

[Figure: block diagram of the two-pipe Pentium. Front End: Fetch, Decode/Dispatch, Issue. Execution Core: ALU 1 and ALU 2, each with Execute and Complete stages. Commit Unit: Write-back]

Page 5

Pentium Retrospective

• Pentium is in-order issue, in-order complete

• “Static scheduling” by the dispatch logic:

• Fetch/dispatch/execute/retire: all in order

• Drawbacks:

• Adapts poorly to dynamic code stream

• Adapts poorly to future hardware

• What if we had 3 pipes not 2?

[Figure: block diagram of the two-pipe Pentium. Front End: Fetch, Decode/Dispatch, Issue. Execution Core: ALU 1 and ALU 2, each with Execute and Complete stages. Commit Unit: Write-back]

Page 6

Multiple-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software fixes, the fundamental limitations of

• Storage (data) dependencies—aka data hazards

• Most instruction streams do not have huge ILP so …

• ... this limits performance in a superscalar processor

Page 7

Multiple-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software fixes, the fundamental limitations of

• Procedural dependencies—aka control hazards

• Ditto, but even more severe

• Use dynamic branch prediction to help resolve the ILP issue

• Future lecture

Page 8

Multiple-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software fixes, the fundamental limitations of

• Resource conflicts—aka structural hazards

• A SS/VLIW processor has a much larger number of potential resource conflicts

• Functional units may have to arbitrate for result buses and register-file write ports

• Resource conflicts can be eliminated by duplicating the resource or by pipelining the resource

Page 9

Instruction Issue and Completion Policies

• Instruction-issue—initiate execution

• Instruction lookahead capability—fetch, decode and issue instructions beyond the current instruction

• Instruction-completion—complete execution

• Processor lookahead capability—complete issued instructions beyond the current instruction

• Instruction-commit—write back results to the RegFile or D$ (i.e., change the machine state)

• In-order issue with in-order completion

• In-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion and in-order commit

Page 10

In-Order Issue with In-Order Completion

• Simplest policy is to issue instructions in exact program order and to complete them in the same order they were fetched (i.e., in program order)

Page 11

In-Order Issue with In-Order Completion (Ex.)

• Assume a pipelined processor that can fetch and decode two instructions per cycle, that has three functional units (a single cycle adder, a single cycle shifter, and a two cycle multiplier), and that can complete (and write back) two results per cycle

• Instruction sequence:
I1: needs two execute cycles (a multiply)
I2
I3
I4: needs the same function unit as I3
I5: needs data value produced by I4
I6: needs the same function unit as I5

Page 12

In-Order Issue, In-Order Completion Example

[Pipeline diagram: I1–I6 each pass through IF, ID, EX, WB in strict program order; two instructions can be fetched/decoded and two committed per cycle. I1 spends two cycles in EX; I4 stalls for I3's function unit; I5 stalls for I4's result; I6 stalls for I5's function unit]

Page 13

In-Order Issue with Out-of-Order Completion

• With out-of-order completion, a later instruction may complete before a previous instruction

• Out-of-order completion is used in single-issue pipelined processors to improve the performance of long-latency operations such as divide

• When using out-of-order completion, instruction issue is stalled when there is a resource conflict (e.g., for a functional unit) or when the instructions ready to issue need a result that has not yet been computed

Page 14

IOI-OOC Example

[Pipeline diagram: the same six instructions with in-order issue but out-of-order completion; shorter instructions write back before the two-cycle I1, so completion order no longer matches program order, though issue still stalls on resource conflicts and unavailable results. I1: two execute cycles; I4: same function unit as I3; I5: data value produced by I4; I6: same function unit as I5]

Page 15

Handling Output Dependencies

• There is one more situation that stalls instruction issuing with IOI-OOC. Assume:

• I1 – writes to R3
I2 – writes to R3
I5 – reads R3

• If the I1 write occurs after the I2 write, then I5 reads an incorrect value for R3

• I2 has an output dependency on I1—write before write

• The issuing of I2 would have to be stalled if its result might later be overwritten by a previous instruction (i.e., I1) that takes longer to complete—the stall happens before instruction issue

• While IOI-OOC yields higher performance, it requires more dependency checking hardware (both read-before-write and write-before-write)

Page 16

Out-of-Order Issue with Out-of-Order Completion

• With in-order issue the processor stops decoding instructions whenever a decoded instruction has a resource conflict or a data dependency on an issued, but uncompleted instruction

• The processor is not able to look beyond the conflicted instruction even though more downstream instructions might have no conflicts and thus be issueable

• Fetch and decode instructions beyond the conflicted one (“instruction window”: Tetris), store them in an instruction buffer (as long as there’s room), and flag those instructions in the buffer that don’t have resource conflicts or data dependencies

• Flagged instructions are then issued from the buffer without regard to their program order

Page 17

OOI-OOC Example

[Pipeline diagram: the same six instructions with out-of-order issue and out-of-order completion; I6 issues from the instruction buffer ahead of the stalled I5, filling an otherwise idle function-unit slot, so the sequence finishes sooner than in the in-order cases. I1: two execute cycles; I4: same function unit as I3; I5: data value produced by I4; I6: same function unit as I5]

Page 18

Dependency Examples

• R3 := R3 * R5    True data dependency (RAW)
  R4 := R3 + 1     Output dependency (WAW)
  R3 := R5 + 1     Antidependency (WAR)

Page 19

Antidependencies

• With OOI we also have to deal with data antidependencies – when a later instruction (that completes earlier) produces a data value that destroys a data value used as a source in an earlier instruction (that issues later)

• The constraint is similar to that of true data dependencies, except reversed

• Instead of the later instruction using a value (not yet) produced by an earlier instruction (read before write), the later instruction produces a value that destroys a value that the earlier instruction (has not yet) used (write before read)

Page 20

Dependencies Review

• Each of the three data dependencies …

• True data dependencies (read before write)

• Antidependencies (write before read)

• Output dependencies (write before write)

• … manifests itself through the use of registers (or other storage locations)

• True dependencies represent the flow of data and information through a program

• Anti- and output dependencies arise because the limited number of registers means that programmers reuse registers for different computations

• When instructions are issued out-of-order, the correspondence between registers and values breaks down and the values conflict for registers: storage conflicts
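The three dependency types above reduce to simple checks on the register sets each instruction reads and writes. A minimal sketch (the tuple encoding and function name are ours, not from the lecture):

```python
# Classify data dependencies between two instructions, each encoded as
# (destination register, set of source registers).

def classify(earlier, later):
    """Return the dependency types of `later` on `earlier`."""
    dst1, srcs1 = earlier
    dst2, srcs2 = later
    deps = []
    if dst1 in srcs2:
        deps.append("RAW")   # true dependency: read after write
    if dst1 == dst2:
        deps.append("WAW")   # output dependency: write after write
    if dst2 in srcs1:
        deps.append("WAR")   # antidependency: write after read
    return deps

# The slide's example: R3 := R3 * R5; R4 := R3 + 1; R3 := R5 + 1
i1 = ("R3", {"R3", "R5"})
i2 = ("R4", {"R3"})
i3 = ("R3", {"R5"})

print(classify(i1, i2))  # ['RAW']        RAW on R3
print(classify(i1, i3))  # ['WAW', 'WAR'] WAW (and WAR) on R3
print(classify(i2, i3))  # ['WAR']        WAR on R3
```

Only the RAW case reflects real dataflow; the other two are storage conflicts that register renaming can remove.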

Page 21

Storage Conflicts and Register Renaming

• Storage conflicts can be reduced (or eliminated) by increasing or duplicating the troublesome resource

• Provide additional registers that are used to reestablish the correspondence between registers and values

• Allocated dynamically by the hardware in SS processors

• Register renaming — the processor renames the original register identifier in the instruction to a new register (one not in the visible register set)

• R3 := R3 * R5  →  R3b := R3a * R5a
  R4 := R3 + 1   →  R4a := R3b + 1
  R3 := R5 + 1   →  R3c := R5a + 1

• The hardware that does renaming assigns a “replacement” register from a pool of free registers and releases it back to the pool when its value is superseded and there are no outstanding references to it [future lecture!]
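The renaming rule on this slide (every write gets a fresh register from the pool; every read sees the latest mapping) can be sketched in a few lines. The physical-register names p0, p1, … and the tuple encoding are our own illustration; real hardware also tracks busy/free state:

```python
# Minimal register-renaming sketch: map each architectural register to its
# most recent physical ("replacement") register, allocating a new one on
# every write so WAW and WAR conflicts disappear.

from itertools import count

def rename(instrs):
    phys = count()                     # endless supply of physical regs
    latest = {}                        # arch reg -> current physical name
    def read(r):
        if r not in latest:            # first read of a live-in value
            latest[r] = f"p{next(phys)}"
        return latest[r]
    out = []
    for dst, srcs in instrs:
        new_srcs = [read(r) for r in srcs]   # read current mappings first
        latest[dst] = f"p{next(phys)}"       # every write gets a new reg
        out.append((latest[dst], new_srcs))
    return out

# Slide example: R3 := R3 * R5; R4 := R3 + 1; R3 := R5 + 1
renamed = rename([("R3", ["R3", "R5"]),
                  ("R4", ["R3"]),
                  ("R3", ["R5"])])
for dst, srcs in renamed:
    print(dst, ":=", *srcs)
```

The third instruction now writes p4 rather than overwriting the first result (p2), so only the true RAW dependency remains.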

Page 22

Pentium Pro

[Figure: Pentium Pro block diagram. Front End: Instruction Fetch, BPU, Translate x86/Decode, Branch Unit; uops flow into the Reservation Station (RS). Execution Core: scalar ALUs (SIU, CIU), Floating-Point Unit (FPU), Branch Unit (BU), and the Load-Store Unit's memory access units (Load Addr., Store Addr., Store Data). Commitment Unit: Re-order Buffer (ROB) and Commit]

Page 23

Pentium Pro

1. Fetch In order

2. Decode/dispatch In order

3. Issue Reorder

4. Execute Out of order

5. Complete Reorder

6. Writeback (commit) In order

[Figure: block diagram of the two-pipe machine. Front End: Fetch, Decode/Dispatch, Issue. Execution Core: ALU 1 and ALU 2, each with Execute and Complete stages. Commit Unit: Write-back]

Page 24

P6 Pipeline

• Instruction fetch, BTB access (3.5 stages)

• 2 cycles for instruction fetch

• Decode, x86->uops (2.5 stages)

• Register rename (1 stage)

• Write to reservation station (1 stage)

• Read from reservation station (1 stage)

• Execute (1+ stages)

• Commit (2 stages)

Page 25

Pentium Pro backends

• Pentium Pro

• Pentium 2

• Pentium 3

[Figure: three P6-family execution cores, each a Reservation Station (RS) dispatching over ports 0–4 to scalar ALUs (SIU, CIU), an FPU, a Branch Unit (BU), and memory access units (Load Addr., Store Addr., Store Data). Pentium Pro: the base core. Pentium 2: adds an MMX unit (MMX 0, MMX 1). Pentium 3: adds vector ALUs, an MMX/SSE unit and an FP/SSE unit with VFADD, VFMUL, VSHUFF, VRECIP]

Page 26

Where Do We Get ILP?

• All of these techniques require that we have ample instruction level parallelism

• Original P4 has 20 stages, 6 µops per cycle

• Lots of instructions in flight!
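A rough upper bound on "instructions in flight" is pipeline depth times issue width, using the slide's Pentium 4 numbers:

```python
# Back-of-envelope bound on in-flight work: pipeline depth x issue width.
# Numbers are from the slide (original Pentium 4: 20 stages, 6 uops/cycle).
stages, uops_per_cycle = 20, 6
in_flight_bound = stages * uops_per_cycle
print(in_flight_bound)  # 120 uops if every stage of every pipe were full
```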

Page 27

Hardware limits to superpipelining?

[Chart: CPU clock periods in FO4 delays, 1985–2005 (courtesy François Labonte, Stanford). Historical limit: about 12 FO4 per stage. MIPS R2000: 5 stages; Pentium Pro: 10 stages; Pentium 4: 20 stages. Power wall: Intel Core Duo has 14 stages]

Page 28

VLIW Beginnings

• VLIW: Very Long Instruction Word

• Josh Fisher: the idea grew out of his Ph.D. work (1979) in compilers

• Led to a startup (MultiFlow) whose computers worked, but which went out of business ... the ideas remain influential.

Page 29

History of VLIW Processors

• Started with (horizontal) microprogramming

• Very wide microinstructions used to directly generate control signals in single-issue processors (e.g., IBM 360 series)

• VLIW for multi-issue processors first appeared in machines from Multiflow and Cydrome (in the early 1980s)

• Current commercial VLIW processors

• Intel i860 RISC (dual mode: scalar and VLIW)

• Intel IA-64 (EPIC: Itanium and Itanium 2) [future lecture]

• Transmeta Crusoe

• Lucent/Motorola StarCore, ADI TigerSHARC, Infineon (Siemens) Carmel

Page 30

Static Multiple Issue Machines (VLIW)

• Static multiple-issue processors (aka VLIW) use the compiler to decide which instructions to issue and execute simultaneously

• Issue packet—the set of instructions that are bundled together and issued in one clock cycle—think of it as one large instruction with multiple operations

• The mix of instructions in the packet (bundle) is usually restricted—a single “instruction” with several predefined fields

• The compiler does static branch prediction and code scheduling to reduce (ctrl) or eliminate (data) hazards

Page 31

Static Multiple Issue Machines (VLIW)

• VLIW’s have

• Multiple functional units (like SS processors)

• Multi-ported register files (again like SS processors)

• Wide program bus

Page 32

An Example: A VLIW MIPS

• Consider a 2-issue MIPS with a two-instruction bundle

• Instructions are always fetched, decoded, and issued in pairs

• If one instruction of the pair cannot be used, it is replaced with a nop

• Need 4 read ports and 2 write ports and a separate memory address adder

64-bit bundle: one ALU op (R format) or branch (I format) slot, plus one load or store (I format) slot

Page 33

A MIPS VLIW (2-issue) Datapath

[Figure: 2-issue VLIW MIPS datapath. The PC and instruction memory fetch a 64-bit bundle; the register file has 4 read and 2 write ports; one ALU serves the ALU/branch slot, while a separate address adder and sign-extend unit serve the load/store slot and data memory]

No hazard hardware (so no load use allowed)

Let's say we wanted more functional units. What would need to change?

Page 34

Code Scheduling Example

• Consider the following loop code:

lp: lw   $t0,0($s1)   # $t0 = array element
    addu $t0,$t0,$s2  # add scalar in $s2
    sw   $t0,0($s1)   # store result
    addi $s1,$s1,-4   # decrement pointer
    bne  $s1,$0,lp    # branch if $s1 != 0

• Must “schedule” the instructions to avoid pipeline stalls

• Instructions in one bundle must be independent

• Must separate load use instructions from their loads by one cycle

• Notice that the first two instructions have a load use dependency, the next two and last two have data dependencies

• Assume branches are perfectly predicted by the hardware
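The two constraints above (instructions in a bundle must be independent; a load's consumer must sit at least one cycle later) can be machine-checked. The instruction encoding below is our own illustration, not the lecture's:

```python
# Check a candidate 2-issue schedule against the slide's two rules:
# (1) no RAW dependency inside a bundle, (2) no load-use in the same or
# the immediately following cycle. schedule = list of bundles; each
# bundle = list of (op, dst, srcs) tuples.

def violations(schedule):
    probs = []
    loads = {}                                   # dst -> cycle of the load
    for cc, bundle in enumerate(schedule, 1):
        for i, (op, dst, srcs) in enumerate(bundle):
            for (op2, dst2, srcs2) in bundle[:i]:    # intra-bundle RAW
                if dst2 in srcs:
                    probs.append((cc, f"bundle RAW on {dst2}"))
            for r in srcs:                       # load-use too close?
                if r in loads and cc - loads[r] < 2:
                    probs.append((cc, f"load-use on {r}"))
            if op == "lw":
                loads[dst] = cc
    return probs

# Bundling the first two loop instructions together breaks both rules:
bad = [[("lw", "$t0", ["$s1"]), ("addu", "$t0", ["$t0", "$s2"])]]
print(violations(bad))
```

An empty result means the schedule obeys both constraints; the `bad` schedule reports an intra-bundle RAW and a load-use violation.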

Page 35

The Scheduled Code (Not Unrolled)

• How many clock cycles?

• How many instructions?

• CPI? Best case?

• IPC? Best case?

(Table to fill in: columns "ALU or branch", "Data transfer", "CC", with rows for cycles 1–5 starting at lp:)

Page 36

Loop Unrolling

• Loop unrolling—multiple copies of the loop body are made and instructions from different iterations are scheduled together as a way to increase ILP

• Apply loop unrolling (4 times for our example) and then schedule the resulting code

• Eliminate unnecessary loop overhead instructions

• Schedule so as to avoid load use hazards

• During unrolling the compiler applies register renaming to eliminate all data dependencies that are not true dependencies
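The transformation itself can be sketched in scalar Python (assuming, as in the slides' example, a trip count that is a multiple of 4):

```python
# 4x loop unrolling: the unrolled body does the same work with one index
# update and one loop "branch" per four elements instead of per element.

def scale(a, s):
    for i in range(len(a)):
        a[i] += s

def scale_unrolled4(a, s):
    i = 0
    while i < len(a):            # one branch per 4 elements
        a[i]     += s            # the compiler renames the temporaries
        a[i + 1] += s            # ($t0..$t3), so these four statements
        a[i + 2] += s            # are independent of each other
        a[i + 3] += s
        i += 4                   # one pointer/index update per iteration

x, y = list(range(8)), list(range(8))
scale(x, 10)
scale_unrolled4(y, 10)
print(x == y)  # True
```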

Page 37

Unrolled Code Example

lp: lw   $t0,0($s1)    # $t0 = array element
    lw   $t1,-4($s1)   # $t1 = array element
    lw   $t2,-8($s1)   # $t2 = array element
    lw   $t3,-12($s1)  # $t3 = array element
    addu $t0,$t0,$s2   # add scalar in $s2
    addu $t1,$t1,$s2   # add scalar in $s2
    addu $t2,$t2,$s2   # add scalar in $s2
    addu $t3,$t3,$s2   # add scalar in $s2
    sw   $t0,0($s1)    # store result
    sw   $t1,-4($s1)   # store result
    sw   $t2,-8($s1)   # store result
    sw   $t3,-12($s1)  # store result
    addi $s1,$s1,-16   # decrement pointer
    bne  $s1,$0,lp     # branch if $s1 != 0

Page 38

The Scheduled Code (Unrolled)

• Eight clock cycles to execute 14 instructions:

• CPI of 0.57 (versus the best case of 0.5)

• IPC of 1.8 (versus the best case of 2.0)

ALU or branch          Data transfer     CC
lp: addi $s1,$s1,-16   lw $t0,0($s1)     1
                       lw $t1,12($s1)    2
addu $t0,$t0,$s2       lw $t2,8($s1)     3
addu $t1,$t1,$s2       lw $t3,4($s1)     4
addu $t2,$t2,$s2       sw $t0,16($s1)    5
addu $t3,$t3,$s2       sw $t1,12($s1)    6
                       sw $t2,8($s1)     7
bne $s1,$0,lp          sw $t3,4($s1)     8
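The CPI/IPC arithmetic for this schedule as a quick check (14/8 = 1.75, which the slide rounds to 1.8):

```python
# CPI/IPC for the unrolled, scheduled loop: 14 instructions retire in
# 8 cycles on the 2-issue machine (best case would be 2 per cycle).
cycles, instructions, issue_width = 8, 14, 2
cpi = cycles / instructions
ipc = instructions / cycles
print(round(cpi, 2), round(ipc, 2))  # 0.57 1.75 (best case: 0.5 and 2.0)
```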

Page 39

What does N = 14 assembly look like?

• Two instructions from a scientific benchmark (Linpack) for a MultiFlow CPU with 14 operations per instruction.

Page 40

Defining Attributes of VLIW

• Compiler:

• 1. MultiOp: instruction containing multiple independent operations

• 2. Specified number of resources of specified types

• 3. Exposed, architectural latencies

[Figure: one VLIW instruction = 5 independent operations (add, nop, nop, load, store) fetched from the Icache and issued to five functional units (Add, Add, Mpy, Mem, Mem) sharing one register file]

Page 41

Compiler Support for VLIW Processors

• The compiler packs groups of independent instructions into the bundle

• Because static branch prediction is not perfect, this is done by code re-ordering (trace scheduling)

• We’ll cover this in a future lecture

• The compiler uses loop unrolling to expose more ILP

• The compiler uses register renaming to solve name dependencies and ensures no load use hazards occur

Page 42

Compiler Support for VLIW Processors

• While superscalars use dynamic prediction, VLIW’s primarily depend on the compiler for extracting ILP

• Loop unrolling reduces the number of conditional branches

• Predication eliminates if-then-else branch structures by replacing them with predicated instructions

• We’ll cover this in a future lecture as well

• The compiler predicts memory bank references to help minimize memory bank conflicts

Page 43

VLIW Advantages

• Advantages

• Simpler hardware (potentially less power hungry)

• Potentially more scalable

• Allow more instructions per VLIW bundle and add more FUs

Page 44

VLIW Disadvantages

• Programmer/compiler complexity and longer compilation times

• Deep pipelines and long latencies can be confusing (making peak performance elusive)

• Lock-step operation, i.e., on a hazard all future issues stall until the hazard is resolved (hence the need for predication)

• Object (binary) code incompatibility

• Needs lots of program memory bandwidth

• Code bloat

• Noops are a waste of program memory space

• Loop unrolling to expose more ILP uses more program memory space

Page 45

Review: Multi-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software

• Data dependencies – aka data hazards

• True data dependencies (read after write)

• Use data forwarding hardware

• Use compiler scheduling

• Storage dependence (aka name dependence)

• Use register renaming to solve both

• Antidependencies (write after read)

• Output dependencies (write after write)

Page 46

Review: Multi-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software

• Procedural dependencies – aka control hazards

• Use aggressive branch prediction (speculation)

• Use predication

• Future lecture

Page 47

Review: Multi-Issue Datapath Responsibilities

• Must handle, with a combination of hardware and software

• Resource conflicts—aka structural hazards

• Use resource duplication or resource pipelining to reduce (or eliminate) resource conflicts

• Use arbitration for result and commit buses and register file read and write ports

Page 48

Review: Multiple-Issue Processor Styles

• Dynamic multiple-issue processors (aka superscalar)

• Decisions on which instructions to execute simultaneously (in the range of 2 to 8 in 2005) are being made dynamically (at run time by the hardware)

• E.g., IBM Power 2, Pentium 4, MIPS R10K, HP PA 8500

• Static multiple-issue processors (aka VLIW)

• Decisions on which instructions to execute simultaneously are being made statically (at compile time by the compiler)

• E.g., Intel Itanium and Itanium 2 for the IA-64 ISA – EPIC (Explicitly Parallel Instruction Computing)

• 128-bit “bundles” containing three 41-bit instructions plus a 5-bit template field (specifies which FU each instruction needs)

• Five functional units (IntALU, MMedia, DMem, FPALU, Branch)

• Extensive support for speculation and predication

Page 49

CISC vs RISC vs SS vs VLIW

(Comparison table to fill in: rows for instr size, instr format, registers, memory reference, key issues, and instruction flow, with columns CISC, RISC, superscalar, and VLIW; the instruction-flow row sketches IF ID EX M WB pipelines: a single stream for CISC/RISC, two parallel streams per cycle for superscalar, and one fetch/decode feeding multiple EX M WB lanes for VLIW)

Page 50

What is a basic block?

• “Experiments and experience indicated that only a factor of 2 to 3 speedup from parallelism was available within basic blocks. (A basic block of code has no jumps in except at the beginning and no jumps out except at the end.)” — “Very Long Instruction Word Architectures and the ELI-512”, Joseph A. Fisher
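Fisher's definition translates directly into the standard "leader" algorithm for splitting straight-line code into basic blocks. The toy instruction format below is our own illustration:

```python
# Leader-based basic-block splitting: a new block starts at the entry
# point, at any jump target, and just after any jump. Each instruction
# is a tuple ("op", ...); branches/jumps carry a target index.

def basic_blocks(code):
    leaders = {0}
    for i, instr in enumerate(code):
        if instr[0] in ("br", "jmp"):
            leaders.add(instr[1])        # a jump target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)       # the fall-through starts one too
    cuts = sorted(leaders) + [len(code)]
    return [code[a:b] for a, b in zip(cuts, cuts[1:])]

prog = [("add",), ("mul",), ("br", 0), ("sub",), ("add",), ("jmp", 3)]
for block in basic_blocks(prog):
    print(block)
```

Each resulting block has no jumps in except at its beginning and no jumps out except at its end, exactly as in Fisher's definition.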

Page 51

Branches Limit ILP

• Programs average about 5 instructions between branches

• Can’t issue instructions if you don’t know where the program is going

• Current processors issue 4–6 operations/cycle

• Conclusion: Must exploit parallelism across multiple basic blocks

Page 52

Branch Prediction Matters

• Alpha 21264:

• (From Ranganathan and Jouppi, via Dan Connors)

Benchmark   Misprediction Rate   Performance Penalty
go          16.5%                40%
compress    9%                   30%
gcc         7%                   20%

Page 53

[Chart: Branch Prediction Impact. IPC (y-axis 0–70) for gcc, espresso, li, fpppp, doducd, and tomcatv under five predictors: Perfect, Selective predictor, Standard 2-bit, Static, None. FP programs reach 15–45 IPC; integer programs 6–12]

Page 54

Compiler: Static Prediction

• Predict at compile time whether branches will be taken before execution

• Schemes

• Predict taken

• Would be hard to squeeze into our pipeline

• Can’t compute target until ID

Page 55

Compiler: Static Prediction

• Predict at compile time whether branches will be taken before execution

• Schemes

• Backwards taken, forwards not taken

• Why is this a good idea?

Page 56

Compiler: Static Prediction

• Predict at compile time whether branches will be taken before execution

• Schemes

• Predict taken

• Backwards taken, forwards not taken (good performance for loops)

• No run-time adaptation: bad performance for data-dependent branches

• if (a == 0) b = 3; else b = 4;

Page 57:

Hardware-based Dynamic Branch Prediction

• Single level (Simple counters) – predict outcome based on past branch behavior

• FSM (Finite State Machine)

• Global Branch Correlation – track relations between branches

• GAs

• Gshare

• Local Correlation – predict outcome based on the branch’s own past behavior pattern

• PAs

• Hybrid predictors (combination of local and global)

• Miscellaneous

• Return Address Stack (RAS)

• Indirect jump prediction

Page 58:

Misprediction Detection and Feedback

• Detections:

• At the end of decoding

• Target address known at decoding, and does not match

• Flush fetch stage

• At commit (most cases)

• Wrong branch direction or target address does not match

• Flush the whole pipeline

• Feedback:

• Any time a mis-prediction is detected

• At a branch’s commit

[Pipeline diagram: FETCH → DECODE → SCHD (REB/ROB) → EXE → WB → COMMIT, with feedback paths from the later stages back to the predictors at fetch]

Page 59:

1-bit “Self Correlating” Predictor

• Let’s consider a simple model. Store a bit per branch: “last time, was the branch taken or not”.

• Consider a loop of 10 iterations before exit:

• for (i = 0; i < 10; i++) a[i] = a[i] * 2.0;

• What’s the accuracy of this predictor?

Page 60:

Dynamic Branch Prediction

• Performance = ƒ(accuracy, cost of misprediction)

• Branch History Table: Lower bits of PC address index table of 1-bit values

• Says whether or not branch taken last time

• No address check

• Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):

• End of loop case, when it exits instead of looping as before

• First time through loop on next time through code, when it predicts exit instead of looping

Page 61:

Predictor for a Single Branch

[Diagram, general form: 1. Access the predictor state by PC; 2. Predict output T/NT; 3. Feedback T/NT once the branch resolves]

[Diagram, 1-bit prediction: two states, Predict Taken (1) and Predict Not Taken (0); T feedback moves to state 1, NT feedback moves to state 0]

Page 62:

Dynamic Branch Prediction (Jim Smith, 1981)

• Solution: 2-bit scheme where we change the prediction only after mispredicting twice:

• Red: stop, not taken

• Green: go, taken

• Adds hysteresis to the decision-making process

[Diagram: four-state machine with two Predict Taken states and two Predict Not Taken states; each T/NT outcome moves one state toward taken/not taken, so a single misprediction shifts from the strong to the weak state and only a second consecutive misprediction flips the prediction]

Page 63:

Simple (“2-bit”) Branch History Table Entry

[Diagram: two D flip-flops per entry]

• Flip-flop 1 — prediction for the next branch (1 = take, 0 = not take); initialize to 0

• Flip-flop 2 — was the last prediction correct? (1 = yes, 0 = no); initialize to 1

After we “check” the prediction:

• Set the “last prediction correct” bit to 1 if the prediction bit was correct, to 0 if it was incorrect; set it to 1 if the prediction bit flips

• Flip the prediction bit only if the prediction is not correct and the “last prediction correct” bit is 0

Page 64:

Branch prediction hardware

• Branch Target Buffer (BTB): the address of the branch indexes the table to get the prediction AND the branch target address (if taken)

• Note: must check for a branch match now, since we can’t use the wrong branch’s address

[Diagram: at FETCH, the PC of the instruction indexes the BTB; each entry holds a Branch PC and a Predicted PC. An “=?” compare of the stored Branch PC against the fetch PC validates the entry (BTB knows about the branch); on a hit, the taken-or-untaken prediction selects the address of the next instruction fetched]

Page 65:

Some Interesting Patterns

• Format: Not-taken (N) = 0, Taken (T) = 1

• TTTTTTTTTT

• 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 … : Should give perfect prediction

• NNTTNNTTNNTT

• 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 … : Will mispredict 1/2 of the time

• N*N[TNTN]

• 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 … : Should alternate incorrectly

• N*T[TNTN]

• 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 … : Should alternate incorrectly

Page 66:

Pentium 4 Branch Prediction

• Critical to performance

• 20-cycle penalty for misprediction

• Branch Target Buffer

• 2048 entries

• 12 bits of history

• Adaptive algorithm

• Can recognize repeated patterns, e.g., alternating taken–not taken

• Handling BTB misses

• Detect in cycle 6

• Predict taken for negative offset, not taken for positive (why?)

Page 67:

Branch Prediction Summary

• For each branch prediction scheme, consider:

• Hardware cost

• Prediction accuracy

• Warm-up time

• Correlation

• Interference

• Time to generate prediction

• Application behavior determines number of branches

• The more control-intensive the program, the more opportunities there are to mispredict

• What if a compiler/architecture could eliminate branches?