Topics Left • Superscalar machines • IA64 / EPIC architecture • Multithreading (explicit and implicit) • Multicore Machines • Clusters • Parallel Processors • Hardware implementation vs microprogramming

Apr 02, 2015

Jaylin Collie
Page 1: Topics Left Superscalar machines IA64 / EPIC architecture Multithreading (explicit and implicit) Multicore Machines Clusters Parallel Processors Hardware.

Topics Left

• Superscalar machines

• IA64 / EPIC architecture

• Multithreading (explicit and implicit)

• Multicore Machines

• Clusters

• Parallel Processors

• Hardware implementation vs microprogramming

Chapter 14

Superscalar Processors

• Definition of Superscalar

• Design Issues:

- Instruction Issue Policy

- Register renaming

- Machine parallelism

- Branch Prediction

- Execution

• Pentium 4 example

What is Superscalar?

• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.

• Equally applicable to RISC & CISC, but more straightforward in RISC machines.

• The order of execution is usually assisted by the compiler.

A Superscalar machine executes multiple independent instructions in parallel.

They are pipelined as well.

Example of Superscalar Organization

• 2 Integer ALU pipelines,

• 2 FP ALU pipelines,

• 1 memory pipeline (?)

Superscalar v Superpipelined

Limitations of Superscalar

• Dependent upon:

- Instruction-level parallelism possible

- Compiler-based optimization

- Hardware support

• Limited by:

- Data dependency

- Procedural dependency

- Resource conflicts

(Recall) True Data Dependency (Must W before R)

ADD r1, r2 ; r1 + r2 → r1

MOVE r3, r1 ; r1 → r3

• Can fetch and decode second instruction in parallel with first

LOAD r1, X ; X (memory) → r1

MOVE r3, r1 ; r1 → r3

• Can NOT execute second instruction until first is finished

Second instruction is dependent on first (R after W)

(Recall) Antidependency (Must R before W)

ADD R4, R3, 1 ; R3 + 1 → R4

ADD R3, R5, 1 ; R5 + 1 → R3

• Cannot complete the second instruction before the first has read R3
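
The two kinds of register dependency recalled so far can be checked mechanically by comparing the registers each instruction reads and writes. The following is a minimal sketch, not part of the original slides: the `Instr` record and `classify` function are hypothetical names, and the read/write sets are entered by hand from the slide examples (assuming the two-operand `ADD r1, r2` reads both registers and writes `r1`).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset  # registers written by the instruction
    reads: frozenset   # registers read by the instruction


def classify(first: Instr, second: Instr) -> set:
    """Return the register dependencies of `second` on the earlier `first`."""
    found = set()
    if first.writes & second.reads:
        found.add("true (RAW)")    # second reads what first writes
    if first.reads & second.writes:
        found.add("anti (WAR)")    # second overwrites what first still reads
    if first.writes & second.writes:
        found.add("output (WAW)")  # both write the same register
    return found


# True data dependency example: ADD r1, r2 followed by MOVE r3, r1
add = Instr("ADD r1, r2", frozenset({"r1"}), frozenset({"r1", "r2"}))
move = Instr("MOVE r3, r1", frozenset({"r3"}), frozenset({"r1"}))

# Antidependency example: ADD R4, R3, 1 followed by ADD R3, R5, 1
i1 = Instr("ADD R4, R3, 1", frozenset({"R4"}), frozenset({"R3"}))
i2 = Instr("ADD R3, R5, 1", frozenset({"R3"}), frozenset({"R5"}))

raw_example = classify(add, move)
war_example = classify(i1, i2)
print(raw_example)  # {'true (RAW)'}
print(war_example)  # {'anti (WAR)'}
```

Only register names are compared here; dependencies through memory would need address information that this sketch does not model.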

(Recall) Procedural Dependency

• Can’t execute instructions after a branch in parallel with instructions before a branch, because?

Note: Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed

(Recall) Resource Conflict

• Two or more instructions requiring access to the same resource at the same time

- e.g. two arithmetic instructions need the ALU

• Solution: can possibly duplicate resources

- e.g. have two arithmetic units

Effect of Dependencies on Superscalar Operation

Notes:

1) Superscalar operation is doubly impacted by a stall.

2) CISC machines typically have variable-length instructions, so each instruction must be at least partially decoded before the next can be fetched, which is not good for superscalar operation.

Instruction-level Parallelism – degree of

• Consider:

LOAD R1, R2

ADD R3, 1

ADD R4, R2

These can be handled in parallel.

• Consider:

ADD R3, 1

ADD R4, R3

STO (R4), R0

These cannot be handled in parallel.

The “degree” of instruction-level parallelism is determined by the number of instructions that can be executed in parallel without stalling for dependencies
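
As a rough illustration of the two sequences above, the sketch below computes the earliest cycle at which each instruction could execute if only true (read-after-write) dependencies held it back. It is a simplification under stated assumptions: unlimited functional units, one-cycle latency, `LOAD R1, R2` taken to read `R2` and write `R1`, and the two-operand `ADD` taken to read and write its first operand. The function name `issue_levels` is hypothetical.

```python
def issue_levels(program):
    """program: list of (writes, reads) register-name sets in program order.

    Returns, for each instruction, the earliest cycle (0-based) in which it
    could execute when only true (RAW) dependencies are considered.
    """
    levels = []
    for i, (_, reads) in enumerate(program):
        level = 0
        for j in range(i):
            writes_j = program[j][0]
            if writes_j & reads:  # instruction i reads a result of instruction j
                level = max(level, levels[j] + 1)
        levels.append(level)
    return levels


# LOAD R1, R2 / ADD R3, 1 / ADD R4, R2
parallel = [({"R1"}, {"R2"}), ({"R3"}, {"R3"}), ({"R4"}, {"R4", "R2"})]

# ADD R3, 1 / ADD R4, R3 / STO (R4), R0  (the store writes memory, no register)
serial = [({"R3"}, {"R3"}), ({"R4"}, {"R4", "R3"}), (set(), {"R4", "R0"})]

parallel_levels = issue_levels(parallel)
serial_levels = issue_levels(serial)
print(parallel_levels)  # [0, 0, 0]
print(serial_levels)    # [0, 1, 2]
```

All three instructions of the first sequence land in cycle 0, while the second sequence forms a chain that needs three cycles, matching the slide's "can" and "cannot be handled in parallel".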

Instruction Issue Policies

• Order in which instructions are fetched

• Order in which instructions are executed

• Order in which instructions update registers and memory values (order of completion)

Standard Categories:

• In-order issue with in-order completion

• In-order issue with out-of-order completion

• Out-of-order issue with out-of-order completion

In-Order Issue -- In-Order Completion

Issue instructions in the order they occur:

• Not very efficient

• Instructions must stall if necessary (and stalling in superpipelining is expensive)

In-Order Issue -- In-Order Completion (Example)

Assume:

• I1 requires 2 cycles to execute

• I3 & I4 conflict for the same functional unit

• I5 depends upon value produced by I4

• I5 & I6 conflict for a functional unit

In-Order Issue -- Out-of-Order Completion (Example)

How does this affect interrupts?

Again:

• I1 requires 2 cycles to execute

• I3 & I4 conflict for the same functional unit

• I5 depends upon value produced by I4

• I5 & I6 conflict for a functional unit

Out-of-Order Issue -- Out-of-Order Completion

• Decouple decode pipeline from execution pipeline

• Can continue to fetch and decode until the “window” is full

• When a functional unit becomes available an instruction can be executed (usually staying as close to program order as possible)

• Since instructions have been decoded, processor can look ahead

Out-of-Order Issue -- Out-of-Order Completion (Example)

Note: I5 depends upon I4, but I6 does not

Again:

• I1 requires 2 cycles to execute

• I3 & I4 conflict for the same functional unit

• I5 depends upon value produced by I4

• I5 & I6 conflict for a functional unit

Register Renaming to avoid hazards

• Output and antidependencies occur because register contents may not reflect the correct ordering from the program

• Can require a pipeline stall

• One solution: Allocate Registers dynamically (renaming registers)

Register Renaming example

(I1) Add R3, R3, R5 ; R3b := R3a + R5a

(I2) Add R4, R3, 1 ; R4b := R3b + 1

(I3) Add R3, R5, 1 ; R3c := R5a + 1

(I4) Add R7, R3, R4 ; R7b := R3c + R4b

• A register name without a subscript refers to the logical register in the instruction

• A register name with a subscript is the hardware register allocated: R3a, R3b, R3c

Note: Allocating R3c for I3 avoids the antidependency on I2 and the output dependency on I1
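
The renaming in this example can be reproduced with a small table that tracks the current hardware copy of each logical register. This is a minimal sketch rather than how any particular processor does it: the `rename` function and its tuple format are hypothetical, every register is assumed to start in copy "a", and letter suffixes limit it to 26 copies per register.

```python
import string


def rename(program):
    """program: list of (dest, [source operands]) using logical register names.

    Each write allocates a fresh hardware copy of the destination register;
    each read uses the most recent copy. Returns the renamed instructions as
    strings such as 'R3b := R3a + R5a'.
    """
    letters = string.ascii_lowercase
    version = {}  # logical register -> index of its current hardware copy

    def current(reg):
        return f"{reg}{letters[version.setdefault(reg, 0)]}"

    renamed = []
    for dest, sources in program:
        # Sources are looked up before the destination gets its new copy.
        srcs = [current(s) if s.startswith("R") else s for s in sources]
        version[dest] = version.get(dest, 0) + 1
        renamed.append(f"{dest}{letters[version[dest]]} := {' + '.join(srcs)}")
    return renamed


program = [
    ("R3", ["R3", "R5"]),  # I1: Add R3, R3, R5
    ("R4", ["R3", "1"]),   # I2: Add R4, R3, 1
    ("R3", ["R5", "1"]),   # I3: Add R3, R5, 1
    ("R7", ["R3", "R4"]),  # I4: Add R7, R3, R4
]
renamed = rename(program)
for line in renamed:
    print(line)
```

Because I3 writes to R3c instead of reusing R3b, it no longer has to wait for I2 to read R3b or for I1 to finish writing it.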

Recapping: Machine Parallelism Support

• Duplication of Resources

• Out-of-order issue hardware

• Windowing to decouple execution from decode

• Register Renaming capability

Speedups of Machine Organizations (Without Procedural Dependencies)

• Not worth duplication of functional units without register renaming

• Need instruction window large enough (more than 8, probably not more than 32)

Branch Prediction in Superscalar Machines

• Delayed branch not used much. Why? Multiple instructions need to execute in the delay slot.

This leads to much complexity in recovery.

• Branch prediction should be used - Branch history is very useful
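
To illustrate why branch history helps, here is a minimal sketch of one common history-based scheme, a 2-bit saturating counter per branch. The slide does not say which scheme is meant, so this choice is an assumption, and the class name, the initial counter value, and the branch address `0x400` are all hypothetical.

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address (0..3; >= 2 means taken)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value

    def predict(self, address):
        # Unseen branches start at 1 ("weakly not taken"), an arbitrary choice.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        c = self.counters.get(address, 1)
        self.counters[address] = min(3, c + 1) if taken else max(0, c - 1)


def count_correct(outcomes, address=0x400):
    """Run the predictor over a list of actual outcomes for one branch."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct


# A loop branch taken 9 times and then falling through, executed 3 times over.
loop = ([True] * 9 + [False]) * 3
correct = count_correct(loop)
print(correct, "of", len(loop))  # 26 of 30
```

After the first warm-up miss, the only mispredictions are the loop exits, because a single not-taken outcome is not enough to flip the counter.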

View of Superscalar Execution

Committing or Retiring Instructions

Results need to be put into order (commit or retire)

• Results sometimes must be held in temporary storage until it is certain they can be placed in “permanent” storage.

(either committed or retired/flushed)

• Temporary storage requires regular clean up – overhead – done in hardware.
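
The idea of holding results in temporary storage and committing them in program order can be sketched with a queue. This is an illustrative model only: the `ReorderBuffer` class and its methods are hypothetical, and flushing of wrong-path results is left out.

```python
from collections import deque


class ReorderBuffer:
    """Holds issued instructions in program order and commits them in order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction on the left
        self.committed = []     # names in the order they reached permanent state

    def issue(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        """Mark an instruction finished (possibly out of order), then retire."""
        entry["done"] = True
        # Only the oldest entries may leave, and only once they are finished.
        while self.entries and self.entries[0]["done"]:
            self.committed.append(self.entries.popleft()["name"])


rob = ReorderBuffer()
i1, i2, i3 = (rob.issue(name) for name in ("I1", "I2", "I3"))

rob.complete(i3)                # I3 finishes first but must wait
after_i3 = list(rob.committed)  # []
rob.complete(i1)
after_i1 = list(rob.committed)  # ['I1']
rob.complete(i2)
after_i2 = list(rob.committed)  # ['I1', 'I2', 'I3']
print(after_i3, after_i1, after_i2)
```

Execution finishes in the order I3, I1, I2, yet the permanent state is updated as I1, I2, I3.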

Superscalar Hardware Support

• Facilities to simultaneously fetch multiple instructions

• Logic to determine true dependencies involving register values and Mechanisms to communicate these values

• Mechanisms to initiate multiple instructions in parallel

• Resources for parallel execution of multiple instructions

• Mechanisms for committing process state in correct order

Example: Pentium 4

A Superscalar CISC Machine

Pentium 4 alternate view

Pentium 4 pipeline

20 stages !

a) Generation of Micro-ops (stages 1 & 2)

• Using the Branch Target Buffer and Instruction Translation Lookaside Buffer, the x86 instructions are fetched 64 bytes at a time from the L2 cache

• The instruction boundaries are determined and each instruction is decoded into 1 to 4 RISC micro-ops of 118 bits each

• Micro-ops are stored in the trace cache

b) Trace cache next instruction pointer (stage 3)

• The Trace Cache Branch Target Buffer contains dynamically gathered history information (4-bit tag)

• If target is not in BTB:

- Branch not PC relative: predict branch taken if it is a return, predict not taken otherwise

- For PC-relative backward conditional branches, predict taken; otherwise predict not taken
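
The fallback rules for a branch that misses in the BTB amount to a small decision function. The sketch below simply restates the two rules above; the function name and parameters are hypothetical, and "backward" is assumed to mean a negative PC-relative offset.

```python
def static_predict_taken(pc_relative, is_return=False, offset=0):
    """Static prediction used when the branch has no BTB history.

    pc_relative: True if the branch target is PC relative.
    is_return:   True if the branch is a return (only used when not PC relative).
    offset:      signed PC-relative displacement (negative means backward).
    """
    if not pc_relative:
        return is_return  # returns predicted taken, everything else not taken
    return offset < 0     # backward branches (typically loops) predicted taken


print(static_predict_taken(pc_relative=False, is_return=True))  # True
print(static_predict_taken(pc_relative=False))                  # False
print(static_predict_taken(pc_relative=True, offset=-32))       # True
print(static_predict_taken(pc_relative=True, offset=16))        # False
```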

c) Trace Cache fetch (stage 4)

• Orders micro-ops in program-ordered sequences called traces

• These are fetched in order, subject to branch prediction

• Some instructions require many micro-ops (complex CISC instructions). These are coded into the ROM and fetched from the ROM

d) Drive (stage 5)

• Delivers instructions from the Trace Cache to the Rename/Allocator module for reordering

e) Allocate: register renaming (stages 6, 7, & 8)

• Allocates resources for execution (3 micro-ops arrive per clock cycle):

- Each micro-op is allocated to a slot in the 126-position circular Reorder Buffer (ROB), which tracks the progress of the micro-ops. Buffer entries include:

  - State: scheduled, dispatched, completed, ready for retire

  - Address that generated the micro-op

  - Operation

- An alias register (one of 128) is assigned when a micro-op references one of the 16 architectural registers, to remove data dependencies

• The micro-ops are dispatched out of order as resources are available

• Allocates an entry to one of the 2 scheduler queues - memory access or not

• The micro-ops are retired in order from the ROB

f) Micro-op queuing (stage 9)

• Micro-ops are loaded into one of 2 queues:

- one for memory operations

- one for non memory operations

• Each queue operates on a FIFO policy

g) Micro-op scheduling (stages 10, 11, & 12)

• The 2 schedulers retrieve micro-ops based upon having all the operands ready and dispatch them to an available unit (up to 6 per clock cycle)

• If two micro-ops need the same unit, they are dispatched in sequence.
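
The scheduling rule described above (dispatch only micro-ops whose operands are ready, at most 6 per cycle, and serialize micro-ops that need the same unit) can be sketched as a single-cycle selection function. This is a simplified model, not the actual Pentium 4 logic: the function name, the dictionary fields, the unit names, and the assumption that each unit accepts one micro-op per cycle are all hypothetical.

```python
def dispatch_cycle(waiting, max_dispatch=6):
    """Select the micro-ops to dispatch in one clock cycle.

    waiting: list of dicts with 'name', 'unit' and 'ready', oldest first.
    Returns (names dispatched this cycle, micro-ops still waiting).
    """
    busy_units = set()
    dispatched, remaining = [], []
    for uop in waiting:
        can_go = (
            uop["ready"]                        # all operands available
            and uop["unit"] not in busy_units   # unit not already claimed
            and len(dispatched) < max_dispatch  # per-cycle dispatch limit
        )
        if can_go:
            busy_units.add(uop["unit"])
            dispatched.append(uop["name"])
        else:
            remaining.append(uop)
    return dispatched, remaining


waiting = [
    {"name": "u1", "unit": "alu0", "ready": True},
    {"name": "u2", "unit": "alu0", "ready": True},   # same unit as u1
    {"name": "u3", "unit": "load", "ready": False},  # operand not ready yet
    {"name": "u4", "unit": "fp", "ready": True},
]
first, waiting = dispatch_cycle(waiting)
second, waiting = dispatch_cycle(waiting)
print(first)   # ['u1', 'u4']
print(second)  # ['u2']
```

In the first cycle u4 is dispatched ahead of the older u2 and u3, which is the out-of-order behavior the slide describes; u2 follows one cycle later because it shares a unit with u1.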

h) Dispatch (stages 13 & 14)

i) Register file (stages 15 & 16)

j) Execute: flags (stages 17 & 18)

• The register files are the sources for pending fixed-point and FP operations

• A separate stage is used to compute the flags

k) Branch check (stage 19)

l) Branch check results (stage 20)

• Checks flags and compares results with predictions

• If the branch prediction was wrong:

- all incorrect micro-ops must be flushed (don’t want to be wrong!)

- the correct branch destination is provided to the Branch Predictor

- the pipeline is restarted from the new target address