Embedded Computer Architectures Hennessy & Patterson Chapter 3 Instruction-Level Parallelism and Its Dynamic Exploitation Gerard Smit (Zilverling 4102), smit @ cs.utwente.nl André Kokkeler (Zilverling 4096), kokkeler @ utwente.nl

Embedded Computer Architectures

Jan 05, 2016

Transcript
Page 1: Embedded Computer Architectures

Embedded Computer Architectures

Hennessy & Patterson, Chapter 3

Instruction-Level Parallelism and Its Dynamic Exploitation

Gerard Smit (Zilverling 4102), smit@cs.utwente.nl

André Kokkeler (Zilverling 4096), kokkeler@utwente.nl

Page 2: Embedded Computer Architectures

Contents

• Introduction
• Hazards <= dependencies
• Instruction Level Parallelism; Tomasulo’s approach
• Branch prediction

Page 3: Embedded Computer Architectures

Dependencies

• True data dependency
• Name dependency
  —Antidependency
  —Output dependency
• Control dependency

Page 4: Embedded Computer Architectures

Data Dependency

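[Figure: Inst i, Inst i+1 and Inst i+2 in sequence. The result of Inst i is used by Inst i+1 and the result of Inst i+1 is used by Inst i+2; each arrow is labelled DataDep, including the transitive dependence of Inst i+2 on Inst i.]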

Two instructions are data dependent => risk of RAW hazard

Page 5: Embedded Computer Architectures

Name Dependency

• Antidependence

[Figure: Inst i reads a register or memory location; the later Inst j writes it.]

Two instructions are antidependent => risk of WAR hazard

• Output dependence

[Figure: Inst i writes a register or memory location; the later Inst j writes the same location.]

Two instructions are output dependent => risk of WAW hazard
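The three dependence types can be told apart by comparing what each instruction reads and writes. The sketch below is only an illustration: the `Instr` model, the `hazards` helper and the example instructions are assumptions, not part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction model: a name plus the locations it reads and writes."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def hazards(first: Instr, second: Instr) -> set:
    """Hazard risks when `second` follows `first` in program order."""
    risks = set()
    if first.writes & second.reads:
        risks.add("RAW")  # true data dependence
    if first.reads & second.writes:
        risks.add("WAR")  # antidependence
    if first.writes & second.writes:
        risks.add("WAW")  # output dependence
    return risks


# Illustrative instructions (assumed, MIPS-like register names)
i = Instr("ADD F0, F2, F4", reads=frozenset({"F2", "F4"}), writes=frozenset({"F0"}))
j = Instr("MUL F6, F0, F8", reads=frozenset({"F0", "F8"}), writes=frozenset({"F6"}))
k = Instr("SUB F2, F10, F12", reads=frozenset({"F10", "F12"}), writes=frozenset({"F2"}))

print(hazards(i, j))  # {'RAW'}
print(hazards(i, k))  # {'WAR'}
```

The same check applies to memory locations if addresses are used in place of register names.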

Page 6: Embedded Computer Architectures

Control Dependency

• Branch condition determines whether instruction i is executed => i is control dependent on the branch

Page 7: Embedded Computer Architectures

Instruction Level Parallelism

• Pipelining = ILP
• Other approach: Dynamic scheduling => out of order execution
• Instruction Decode stage split into
  —Issue (decode, check for structural hazards)
  —Read Operands

Page 8: Embedded Computer Architectures

Instruction Level Parallelism

• Scoreboard:
  —Sufficient resources
  —No data dependencies
• Tomasulo’s approach
  —Minimize RAW hazards
  —Register renaming to minimize WAW and WAR hazards

[Figure: issue => Reservation Station (park instructions while waiting for operands) => read operands]

Page 9: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

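[Figure: timeline of the accesses to Register F0 by instructions 1 to 4; for each instruction the start of the instruction and its register use are marked along the Time axis.]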

Page 10: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

Register F0

Time

Problems if arrows cross

Page 11: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

Register F0

Time

Instr 2, 3, … will be stalled. Note that Instr 2 and 3 are stalled only because Instr 1 is not ready. If not for Instr 1, they could be executed earlier.

Page 12: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

[Figure: Instr 1 and Instr 3 each have their own copy of the register: Instr 1.Register F0 and Instr 3.Register F0.]

How is it arranged that the value is written into Instr 3.Register F0 and not into Instr 1.Register F0?

Page 13: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

Instr 3.Register F0 | Instr 3.F0Source = Instr. 2
Instr 1.Register F0 | Instr 1.F0Source = Instr. k

The result of Instr 2 is labelled with ‘Instr. 2’. Hardware checks whether there is an instruction waiting for the result (by checking the F0Source fields of the instructions) and places the result in the correct place.
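The matching of a labelled result against the F0Source fields can be sketched as follows. This is a simplified model under stated assumptions: the dictionary layout and the `broadcast` function are hypothetical, and real hardware compares all waiting entries in parallel rather than in a loop.

```python
def broadcast(waiting, tag, value):
    """Deliver a result labelled `tag` to every entry whose F0Source matches it."""
    for entry in waiting:
        if entry["F0Source"] == tag:
            entry["F0Data"] = value    # the operand has arrived
            entry["F0Source"] = None   # the entry no longer waits for a producer
    return waiting


# Hypothetical waiting entries, following the slide: Instr 3 waits for Instr. 2
entries = [
    {"instr": "Instr 1", "F0Source": "Instr. k", "F0Data": None},
    {"instr": "Instr 3", "F0Source": "Instr. 2", "F0Data": None},
]

broadcast(entries, "Instr. 2", 3.5)
print(entries[1]["F0Data"])  # 3.5
print(entries[0]["F0Data"])  # None
```

Because the result is addressed by the label of its producer rather than by the register name, the value ends up only in the copy belonging to Instr 3.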

Page 14: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

Instr 3.Register F0 | Instr 3.F0Source = Instr. 2

Entry format: F0Data | F0Source | operation (read)

Page 15: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

F0Data | F0Source | operation (read)
F0Data | F0Source | operation (read)

Page 16: Embedded Computer Architectures

Tomasulo’s approach

• Register Renaming

1. Read F0

2. Write F0

3. Read F0

4. Write F0

Reservation Station:

F0Data | F0Source | operation (read)
F0Data | F0Source | operation (read)
operation (write)
operation (write)

F0Data: filled during execution. F0Source and operation: filled during Issue.

Page 17: Embedded Computer Architectures

Tomasulo’s approach

• Effects
  —Register Renaming: prevents WAW and WAR hazards
  —Execution starts when operands are available (datafields are filled): prevents RAW hazards

Page 18: Embedded Computer Architectures

Tomasulo’s approach

• Issue in more detail (issue is done sequentially)

1. Read F0

2. Write F0

3. Read F0

4. Write F0

Reservation Station (format: label | operation | data | source)

read1  | read  | Empty | ?????
write1 | write |       |
read2  | read  | Empty |

This is the only information you have: during issue, you have to keep track of which instruction changed F0 last!

Page 19: Embedded Computer Architectures

Tomasulo’s approach

• Issue in more detail

1. Read F0

2. Write F0

3. Read F0

4. Write F0

Reservation Station (format: label | operation | data | source)

read1  | read  | Empty | ?????
write1 | write |       |
read2  | read  | Empty | write1
write2 | write |       |

Status of F0 after issuing each of the four instructions: ????, write1, write1, write2

Keeping track of register status during issue is done for every register
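The bookkeeping during sequential issue can be sketched as below. The `issue` function, the tuple encoding of the program and the dictionary fields are assumptions made for illustration; the labels (read1, write1, …) follow the slide.

```python
def issue(program):
    """Issue instructions in order; each read records which write produces its operand."""
    last_writer = {}                      # register -> label of the last issued write
    counters = {"read": 0, "write": 0}
    station = []
    for op, reg in program:
        counters[op] += 1
        label = f"{op}{counters[op]}"
        entry = {"label": label, "operation": op, "data": None, "source": None}
        if op == "read":
            # "????" stands for a producer issued before this trace started
            entry["source"] = last_writer.get(reg, "????")
        else:
            last_writer[reg] = label      # this write is now the newest version of reg
        station.append(entry)
    return station, last_writer


program = [("read", "F0"), ("write", "F0"), ("read", "F0"), ("write", "F0")]
station, status = issue(program)
for entry in station:
    print(entry["label"], entry["source"])
print(status)  # {'F0': 'write2'}
```

The register status table is the only state needed at issue time: a read simply copies the current label for its register, which is what makes the issue step cheap.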

Page 20: Embedded Computer Architectures

Tomasulo’s approach

• Definitions for the MIPS

—For each reservation station:

Name Busy Operation Vj Vk Qj Qk A

Name = label
Busy = in execution or not
Operation = instruction
V = operand value
Q = operand source
A = memory address (Load, Store)
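As a data structure, one reservation station entry with these fields might look like the sketch below. The Python types, the defaults and the `ready` helper are assumptions; the sketch also treats Busy as "the entry is occupied", which is one possible reading of the definition above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                     # Name: label of this station
    busy: bool = False            # Busy: entry currently holds an instruction (assumed meaning)
    op: Optional[str] = None      # Operation
    vj: Optional[float] = None    # Vj, Vk: operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None      # Qj, Qk: labels of the stations producing the operands
    qk: Optional[str] = None
    a: Optional[int] = None       # A: memory address (Load, Store)

    def ready(self) -> bool:
        """Execution may start once no operand source is still pending."""
        return self.busy and self.qj is None and self.qk is None


# Hypothetical entry: first operand known, second still to come from station "Mult1"
rs = ReservationStation("Add1", busy=True, op="ADD.D", vj=1.0, qk="Mult1")
print(rs.ready())  # False
```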

Page 21: Embedded Computer Architectures

Tomasulo’s approach; hardware view

[Figure: hardware view, from top to bottom]

• Issue hardware (input: from instruction queue): register renaming, fill reservation stations
• Reservation Station
• “Execution Control Hardware”: of which instructions are the operands and the corresponding execution units available? => transport operands to execution unit
• Execution Units
• Common Data Bus: results + identification of the instruction producing the result
• “Reservation Fill Hardware”: puts data in the correct place in the reservation station

Page 22: Embedded Computer Architectures

Branch prediction

• Data Hazards => Tomasulo’s approach
• Branch (control) hazards => Branch prediction
  —Goal: Resolve outcome of branch early => prevent stalls because of control hazards

Page 23: Embedded Computer Architectures

Branch prediction; 1 history bit

• Example:

Outerloop: …
           R=10
Innerloop: …
           R=R-1
           BNZ R, Innerloop    (history bit kept for this branch)
           …
           …
           Branch Outerloop

History bit: was the branch taken previously or not?
- predict taken: fetch from ‘Innerloop’
- predict not taken: fetch next instr

Actual outcome of branch:
- taken: set history bit to ‘taken’
- not taken: set history bit to ‘not taken’

In this situation: correct prediction in 80 % of branch evaluations (in each pass through the inner loop, only the first and the last of the 10 evaluations are mispredicted)
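The 80 % figure can be checked with a small simulation of the loop nest. This is a sketch: the function name, the number of outer iterations and the initial value of the history bit are assumptions.

```python
def one_bit_accuracy(outer_iterations=100, inner_count=10):
    """Fraction of correct predictions for 'BNZ R, Innerloop' with one history bit."""
    history_taken = False   # assumption: the loop was last left via a not-taken branch
    correct = total = 0
    for _ in range(outer_iterations):
        r = inner_count                        # R=10
        while True:
            r -= 1                             # R=R-1
            taken = r != 0                     # BNZ R, Innerloop
            correct += (history_taken == taken)
            total += 1
            history_taken = taken              # history bit records the actual outcome
            if not taken:
                break
    return correct / total


print(one_bit_accuracy())  # 0.8
```

Each pass mispredicts twice: on entry (the bit still says "not taken" from the previous exit) and on exit.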

Page 24: Embedded Computer Architectures

Branch prediction; 2 history bits

• Example:

Outerloop: …
           R=10
Innerloop: …
           R=R-1
           BNZ R, Innerloop    (2 history bits kept for this branch)
           …
           …
           Branch Outerloop

[Figure: state machine with four states, two labelled ‘Predict taken’ and two labelled ‘Predict not taken’; the transitions between the states are labelled ‘taken’ and ‘not taken’.]

In this application: correct prediction in 90 % of branch evaluations
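The 90 % figure can be reproduced the same way. The sketch below assumes a 2-bit saturating counter as the state machine; the slide's exact transition diagram is not recoverable from the transcript, so this particular choice of transitions and the starting state are assumptions.

```python
def two_bit_accuracy(outer_iterations=100, inner_count=10):
    """Fraction of correct predictions with a 2-bit saturating counter (assumed scheme)."""
    counter = 3   # states 0,1: predict not taken; states 2,3: predict taken
    correct = total = 0
    for _ in range(outer_iterations):
        r = inner_count                        # R=10
        while True:
            r -= 1                             # R=R-1
            taken = r != 0                     # BNZ R, Innerloop
            predicted_taken = counter >= 2
            correct += (predicted_taken == taken)
            total += 1
            # Move one state towards the actual outcome, saturating at 0 and 3
            counter = min(3, counter + 1) if taken else max(0, counter - 1)
            if not taken:
                break
    return correct / total


print(two_bit_accuracy())  # 0.9
```

A single not-taken outcome at loop exit no longer flips the prediction, so only the exit itself is mispredicted.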

Page 25: Embedded Computer Architectures

Branch prediction; Correlating branch predictors

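if (aa == 2)
    aa = 0;
if (bb == 2)
    bb = 0;
if (aa != bb)

The results of the first two branches are used in the prediction of the third branch.

Example: suppose aa == 2 and bb == 2. Both are then reset to 0, so the condition of the last ‘if’ (aa != bb) is always false => if the previous two branches are not taken, the last branch is taken. (Each branch skips the body of its ‘if’, so ‘not taken’ means the condition was true.)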

Page 26: Embedded Computer Architectures

Branch prediction; Correlating branch predictors

• Mechanism: Suppose result of 3 previous branches is used to influence decision.

• 8 possible sequences:

  br-3  br-2  br-1  br
  NT    NT    NT    T
  NT    NT    T     NT
  ….    ….    ….    ….
  T     T     T     T

  (br = branch under consideration)

• Depending on the outcome of the branch under consideration, the prediction is changed:
  —1 bit history: (3,1) predictor

For the sequence (NT NT NT) the prediction is that the branch will be taken => fetches from branch destination

Page 27: Embedded Computer Architectures

Branch prediction; Correlating branch predictors

• Mechanism: Suppose result of 3 previous branches is used to influence decision.

• 8 possible sequences:

  br-3  br-2  br-1  br
  NT    NT    NT    T
  NT    NT    T     NT
  ….    ….    ….    ….
  T     T     T     T

  (br = branch under consideration)

• Depending on the outcome of the branch under consideration, the prediction is changed:
  —1 bit history: (3,1) predictor
  —2 bit history: (3,2) predictor

For the sequence (NT NT NT) the prediction is that the branch will be taken => fetches from branch destination

With the (3,2) predictor, each prediction is represented by 2 bits:
- 2 combinations indicate: predict taken
- 2 combinations indicate: predict not taken
Updated by means of a state machine
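A (3,2) predictor for a single branch can be sketched as eight 2-bit counters selected by the outcomes of the last three branches. The class below is a sketch under assumptions: the names, the saturating-counter update and the initial counter values are illustrative, and the training pattern (third branch taken only when the two preceding branches were not taken) is a simplification of the aa/bb example.

```python
class CorrelatingPredictor:
    """(m, 2) predictor for one branch: 2**m two-bit counters indexed by global history."""

    def __init__(self, m=3):
        self.m = m
        self.history = 0                  # last m branch outcomes, newest in bit 0
        self.counters = [1] * (1 << m)    # assumed start: weakly "not taken"

    def record(self, taken):
        """Shift the outcome of any branch into the global history."""
        mask = (1 << self.m) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

    def predict(self):
        return self.counters[self.history] >= 2

    def train(self, taken):
        """Update the counter selected by the current history, then record the outcome."""
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        self.record(taken)


p = CorrelatingPredictor()
misses_after_warmup = 0
for round_no in range(5):
    for b1, b2 in [(False, False), (False, True), (True, False), (True, True)]:
        p.record(b1)                      # outcomes of the two preceding branches
        p.record(b2)
        b3 = (not b1) and (not b2)        # assumed behaviour of the branch under consideration
        if round_no > 0 and p.predict() != b3:
            misses_after_warmup += 1
        p.train(b3)

print(misses_after_warmup)  # 0
```

A single 2-bit counter without history would keep predicting "not taken" for this branch and miss every fourth evaluation; selecting the counter by history removes those misses after the first round.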

Page 28: Embedded Computer Architectures

Branch Target Buffer

• Solutions:
  —Delayed Branch
  —Branch Target buffer

BNEZ R1, Loop    IF  ID  EX  MEM WB
next instr           IF  ID  EX  MEM WB
branch target            IF  ID  EX  MEM WB

Even with a good prediction, we don’t know where to branch to until the branch instruction has been decoded, and by then we’ve already fetched the next instruction

Page 29: Embedded Computer Architectures

Branch Target Buffer

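[Figure: the Program Counter addresses both the Memory (Instruction cache) and the branch target buffer. The buffer has two columns: Address (addresses of branch instructions) and Branch Target (corresponding branch targets). On a hit, a Select input, also driven from the Instruction Decode hardware, loads the branch target into the Program Counter.]

After IF stage, branch address already in PC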
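The lookup can be sketched as a table from branch-instruction addresses to targets, consulted during instruction fetch. The class and method names, the 4-byte instruction size and the policy of storing only taken branches are assumptions made for this sketch.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.targets = {}   # address of branch instruction -> branch target

    def next_pc(self, pc, instr_size=4):
        """Chosen during IF: the stored target on a hit, otherwise the sequential address."""
        return self.targets.get(pc, pc + instr_size)

    def resolve(self, pc, taken, target):
        """Called once the branch outcome is known, to keep the buffer up to date."""
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))   # 0x104 (miss: fall through)
btb.resolve(0x100, taken=True, target=0x40)
print(hex(btb.next_pc(0x100)))   # 0x40 (hit: target available at the end of IF)
```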

Page 30: Embedded Computer Architectures

Branch Folding

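[Figure: as for the branch target buffer, but the buffer columns are Address (addresses of branch instructions) and Instruction at target (corresponding instructions at the branch targets); the Program Counter addresses the Memory (Instruction cache) and the buffer, which signals a hit.]

Unconditional Branches: Effectively removing Branch instruction (penalty of -1)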

Page 31: Embedded Computer Architectures

Return Address Predictors

• Indirect branches: branch address only known at run time.
• 80% of time: return instructions.
• Small fast stack:

[Figure: stack of return addresses (RET); a Procedure Call pushes its return address, a Procedure Return pops it.]
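The small fast stack can be sketched as a bounded stack of return addresses. The class name, the depth of 8 and the overflow policy (drop the oldest entry) are assumptions for illustration.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, return_address):
        """Procedure call: push the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: forget the oldest entry
        self.stack.append(return_address)

    def predict_return(self):
        """Procedure return: pop the predicted target, or None if nothing is stored."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.call(0x104)                          # outer call
ras.call(0x208)                          # nested call
print(hex(ras.predict_return()))         # 0x208
print(hex(ras.predict_return()))         # 0x104
```

Predictions are exact as long as the call depth stays within the stack depth; deeper nesting loses the oldest return addresses.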

Page 32: Embedded Computer Architectures

Multiple Issue Processors

Goal: Issue multiple instructions in a clockcycle

• Superscalar: issue varying number of instructions per clock
  —Statically scheduled
  —Dynamically scheduled
• VLIW: issue fixed number of instructions per clock
  —Statically scheduled

Page 33: Embedded Computer Architectures

Multiple Issue Processors

• Example

Instruction type   Pipe stages

Integer            IF  ID  EX  MEM WB
FP                 IF  ID  EX  EX  EX  WB
Integer                IF  ID  EX  MEM WB
FP                     IF  ID  EX  EX  EX  WB
Integer                    IF  ID  EX  MEM WB
FP                         IF  ID  EX  EX  EX  WB
Integer                        IF  ID  EX  MEM WB
FP                             IF  ID  EX  EX  EX

Page 34: Embedded Computer Architectures

Hardware Based Speculation

• Multiple Issue Processors => nearly 1 branch every clock cycle

• Dynamic scheduling + branch prediction: fetch + issue

• Dynamic scheduling + branch speculation: fetch + issue + execution

• KEY: Do not perform updates that cannot be undone until you’re sure the corresponding operation really should be executed.

Page 35: Embedded Computer Architectures

Hardware Based Speculation

• Tomasulo:

[Figure: instruction sequence from Operation i, past a Branch (Predict Not Taken), to Operation k, next to the Register File. Operations beyond a marked point are finished; the later ones are issued.]

Operation k:
- Operand available
- Execution postponed until clear whether branch is taken

Page 36: Embedded Computer Architectures

Hardware Based Speculation

• Tomasulo:

[Figure: the same sequence (Operation i, Branch (Predict Not Taken), Operation k) next to the Register File, with operations marked Issued or Finished.]

Depending on the outcome of the branch:
- Flush reservation stations, or
- Start execution

Page 37: Embedded Computer Architectures

Hardware Based Speculation

• Speculation:

[Figure: the same sequence (Operation i, Branch (Predict Not Taken), Operation k), now with a Reorder Buffer in front of the Register File. Later operations are issued.]

Results of operations beyond this point are committed (from reorder buffer to register file)

Operation k:
- Operand available and executed

Commit: sequentially

Page 38: Embedded Computer Architectures

Hardware Based Speculation

• Speculation:

[Figure: same sequence with Reorder Buffer and Register File; operations are marked Issued or Committed.]

Operation k:
- Operand available and executed

Commit: sequentially

Page 39: Embedded Computer Architectures

Hardware Based Speculation

• Speculation:

[Figure: same sequence with Reorder Buffer and Register File; the Committed marker has advanced.]

Operation k:
- Operand available and executed

Commit: sequentially

Page 40: Embedded Computer Architectures

Hardware Based Speculation

• Speculation:

[Figure: same sequence with Reorder Buffer and Register File; the Committed marker has advanced further.]

Operation k:
- Operand available and executed

Commit: sequentially
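The difference from plain Tomasulo is that Operation k may execute before the branch is resolved, because its result waits in the reorder buffer until it is certain that it should be kept. The sketch below illustrates in-order commit and squashing after a misprediction; the class, its methods and the dictionary fields are assumptions, not the slides' design.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()     # program order, oldest entry on the left
        self.register_file = {}

    def issue(self, dest, is_branch=False):
        entry = {"dest": dest, "value": None, "done": False,
                 "is_branch": is_branch, "mispredicted": False}
        self.entries.append(entry)
        return entry

    def finish(self, entry, value=None, mispredicted=False):
        """Execution finished: the result is held in the buffer, not yet in the register file."""
        entry.update(value=value, done=True, mispredicted=mispredicted)

    def commit(self):
        """Commit finished entries sequentially; stop at the first unfinished one."""
        while self.entries and self.entries[0]["done"]:
            head = self.entries.popleft()
            if head["is_branch"]:
                if head["mispredicted"]:
                    self.entries.clear()   # squash everything issued after the branch
            else:
                self.register_file[head["dest"]] = head["value"]


rob = ReorderBuffer()
op_i = rob.issue("F2")
branch = rob.issue(None, is_branch=True)
op_k = rob.issue("F4")

rob.finish(op_k, 7.0)            # executed speculatively, ahead of the branch
rob.commit()
print(rob.register_file)         # {} (nothing can commit before Operation i)

rob.finish(op_i, 1.0)
rob.finish(branch, mispredicted=True)
rob.commit()
print(rob.register_file)         # {'F2': 1.0} (Operation k was squashed)
```

Only the commit step performs updates that cannot be undone, which is why a wrong prediction costs discarded work but never a wrong register value.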

Page 41: Embedded Computer Architectures

Hardware Based Speculation

• Some aspects
  —Instructions causing a lot of work should not have been executed => restrict allowed actions in speculative mode
  —ILP of a program is limited
  —Realistic branch predictions: easier to implement => less efficient

Page 42: Embedded Computer Architectures

Pentium Pro Implementation

• Pentium Family

Processor        | Year | Clock Rate (MHz) | L1 Cache (instr, data) | L2 Cache
Pentium Pro      | 1995 | 100-200          | 8 KB, 8 KB             | 256 KB, 1024 KB
Pentium II       | 1998 | 233-450          | 16 KB, 16 KB           | 256 KB, 512 KB
Pentium II Xeon  | 1999 | 400-450          | 16 KB, 16 KB           | 512 KB, 2 MB
Celeron          | 1999 | 500-900          | 16 KB, 16 KB           | 128 KB
Pentium III      | 1999 | 450-1100         | 16 KB, 16 KB           | 256 KB, 512 KB
Pentium III Xeon | 2000 | 700-900          | 16 KB, 16 KB           | 1 MB, 2 MB

Page 43: Embedded Computer Architectures

Pentium Pro Implementation

• I486: CISC => problems with pipelining
• 2 observations
  —Translation of CISC instructions into a sequence of microinstructions
  —Microinstructions are of equal length
• Solution: pipelining microinstructions

Page 44: Embedded Computer Architectures

Pentium Pro Implementation

[Figure: layout of the micro-program memory]

Fetch cycle routine:      ... Jump to Indirect or Execute
Indirect cycle routine:   ... Jump to Execute
Interrupt cycle routine:  ... Jump to Fetch
Execute cycle begin:      Jump to Op code routine
AND routine:              ... Jump to Fetch or Interrupt
ADD routine:              ... Jump to Fetch or Interrupt

Note: each micro-program ends with a branch to the Fetch, Interrupt, Indirect or Execute micro-program

Page 45: Embedded Computer Architectures

Pentium Pro Implementation

Page 46: Embedded Computer Architectures

Pentium Pro Implementation

• All RISC features are implemented on the execution of microinstructions instead of machine instructions
  —Microinstruction-level pipeline with dynamically scheduled microoperations
    – Fetch machine instruction (3 stages)
    – Decode machine instruction into microinstructions (2 stages)
    – Issue microinstructions (2 stages; register renaming and reorder buffer allocation performed here)
    – Execution of microinstructions (1 stage; floating point units pipelined, execution takes between 1 and 32 cycles)
    – Write back (3 stages)
    – Commit (3 stages)
  —Superscalar: can issue up to 3 microoperations per clock cycle
  —Reservation stations (20 of them) and multiple functional units (5 of them)
  —Reorder buffer (40 entries) and speculation used

Page 47: Embedded Computer Architectures

Pentium Pro Implementation

• Execution Units have the following stages
  • Integer ALU: 1
  • Integer Load: 3
  • Integer Multiply: 4
  • FP add: 3
  • FP multiply: 5 (partially pipelined; multiplies can start every other cycle)
  • FP divide: 32 (not pipelined)

Page 48: Embedded Computer Architectures

Thread-Level Parallelism

• ILP: on instruction level
• Thread-Level Parallelism: on a higher level
  —Server applications
  —Database queries
• Thread: has all information (instructions, data, PC, register state etc.) to allow it to execute
  —On a separate processor
  —As a process on a single processor

Page 49: Embedded Computer Architectures

Thread-Level Parallelism

• Potentially high efficiency
• Desktop applications:
  —Costly to switch to ‘thread-level reprogrammed’ applications.
  —Thread level parallelism often hard to find

=> ILP continues to be the focus for desktop-oriented processors (for embedded processors, the situation is different)