Superscalar Organization Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.

Jan 15, 2016

Page 1

Superscalar Organization

Prof. Mikko H. Lipasti
University of Wisconsin-Madison

Lecture notes based on notes by John P. Shen
Updated by Mikko Lipasti

Page 2

Limitations of Scalar Pipelines

Scalar upper bound on throughput
– IPC <= 1 or CPI >= 1

Inefficient unified pipeline
– Long latency for each instruction

Rigid pipeline stall policy
– One stalled instruction stalls all newer instructions

Page 3

Parallel Pipelines

[Figure: (a) No Parallelism; (b) Temporal Parallelism; (c) Spatial Parallelism; (d) Parallel Pipeline]

Page 4

Spatial Pipeline Unrolling

Page 5

Pipeline Unrolling - Power

12-stage pipeline, 60% latch power, 25% latch delay+setup, 12% area on latches, 20% leakage power

© Shen, Lipasti

Page 6

Intel Pentium Parallel Pipeline

[Figure: a single five-stage pipeline (IF, D1, D2, EX, WB) beside a two-wide parallel pipeline with the same five stages duplicated as the U-pipe and the V-pipe]

Page 7

Diversified Pipelines

[Figure: shared IF, ID, and RD stages feed a diversified EX stage with four pipes: ALU; MEM1, MEM2; FP1, FP2, FP3; BR. All pipes rejoin at WB.]

Page 8

Power4 Diversified Pipelines

[Figure: Power4 block diagram. Front end: PC, I-Cache, BR Scan, BR Predict, Fetch Q, Decode, Reorder Buffer. Issue queues and units: BR/CR Issue Q (CR Unit, BR Unit); FX/LD 1 Issue Q (FX1 Unit, LD1 Unit); FX/LD 2 Issue Q (LD2 Unit, FX2 Unit); FP Issue Q (FP1 Unit, FP2 Unit). Memory side: StQ, D-Cache.]

Page 9

Rigid Pipeline Stall Policy

[Figure: a stalled instruction in a pipeline. Labels: Stalled Instruction; Backward Propagation of Stalling; Bypassing of Stalled Instruction (Not Allowed)]

Page 10

Dynamic Pipelines

[Figure: IF, ID, and RD stages feed a Dispatch Buffer (entered in order, left out of order); a diversified EX stage with pipes ALU; MEM1, MEM2; FP1, FP2, FP3; BR; then a Reorder Buffer (entered out of order, left in order) ahead of WB.]

Page 11

Interstage Buffers

[Figure: three kinds of buffer between Stage i and Stage i + 1. (a) Buffer (1): a single entry, in order. (b) Buffer (n): n entries, in order in and in order out. (c) Buffer (> n): more than n entries, in order in and out of order out.]

Page 12

Superscalar Pipeline Stages

Fetch
– Instruction Buffer
Decode
– Dispatch Buffer
Dispatch
– Issuing Buffer
Execute
– Completion Buffer
Complete
– Store Buffer
Retire

Fetch through Dispatch: in program order. Execute: out of order. Complete and Retire: in program order.

Page 13

Limitations of Scalar Pipelines

Scalar upper bound on throughput
– IPC <= 1 or CPI >= 1
– Solution: wide (superscalar) pipeline

Inefficient unified pipeline
– Long latency for each instruction
– Solution: diversified, specialized pipelines

Rigid pipeline stall policy
– One stalled instruction stalls all newer instructions
– Solution: out-of-order execution, distributed execution pipelines
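
As a small worked illustration of the throughput bound, the sketch below computes IPC and CPI from instruction and cycle counts and states the ideal bound for a pipeline of a given width. The function names and the sample counts are assumptions made for illustration; they do not come from the lecture.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle over a measured run."""
    return instructions / cycles


def cpi(instructions: int, cycles: int) -> float:
    """Cycles per instruction, the reciprocal of IPC."""
    return cycles / instructions


def ideal_ipc_bound(width: int) -> float:
    """Upper bound on IPC for a pipeline that handles `width` instructions per cycle."""
    return float(width)


# Hypothetical run: 1000 instructions in 1250 cycles on a scalar pipeline.
scalar_ipc = ipc(1000, 1250)       # stays below the scalar bound of 1
scalar_cpi = cpi(1000, 1250)       # stays above 1
four_wide_bound = ideal_ipc_bound(4)
```

A scalar pipeline has width 1, so its IPC can never exceed 1; widening the pipeline raises only the bound, not the achieved IPC.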

Page 14

Impediments to High IPC

[Figure: superscalar pipeline annotated with three flows. Instruction Flow: I-cache, Branch Predictor, FETCH, Instruction Buffer, DECODE. Register Data Flow: EXECUTE (Integer, Floating-point, Media, Memory), Reorder Buffer (ROB), COMMIT. Memory Data Flow: Store Queue, D-cache.]

Page 15

Superscalar Pipeline Design

Instruction Fetching Issues
Instruction Decoding Issues
Instruction Dispatching Issues
Instruction Execution Issues
Instruction Completion & Retiring Issues

Page 16

Instruction Flow

Objective: Fetch multiple instructions per cycle

Challenges:
– Branches: control dependences
– Branch target misalignment
– Instruction cache misses

Solutions:
– Code alignment (static vs. dynamic)
– Prediction/speculation

[Figure: PC indexing Instruction Memory, 3 instructions fetched]
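
The misalignment problem can be sketched with a simple model. It assumes that one fetch reads a single I-cache row and cannot cross into the next row in the same cycle, and that the PC is expressed as an instruction index. The function names, the row size, and the fetch width are all illustrative assumptions.

```python
def instructions_fetched(pc_index: int, width: int, row_size: int) -> int:
    """Instructions delivered in one cycle when a fetch group cannot cross a cache row.

    pc_index: index of the first instruction to fetch (in instructions, not bytes)
    width:    maximum instructions the fetch stage accepts per cycle
    row_size: instructions stored in one physical cache row
    """
    offset = pc_index % row_size           # position of the PC inside its row
    return min(width, row_size - offset)   # only the rest of the row is available


def average_fetch(width: int, row_size: int) -> float:
    """Mean fetch count if every starting offset in a row is equally likely."""
    total = sum(instructions_fetched(start, width, row_size) for start in range(row_size))
    return total / row_size


aligned = instructions_fetched(0, 4, 4)      # 4: the group starts on a row boundary
misaligned = instructions_fetched(1, 4, 4)   # 3: one slot is lost to misalignment
```

Under these assumptions a 4-wide fetch from 4-instruction rows averages 2.5 instructions per cycle across all starting offsets, which is why branch targets are often aligned by the compiler or by hardware.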

Page 17

I-Cache Organization

[Figure: two I-cache organizations, each with an Address input, a Row Decoder, TAG fields, and Cache Lines: one where 1 cache line = 1 physical row, and one where 1 cache line = 2 physical rows]

Page 18

Fetch Alignment

Page 19

RIOS-I Fetch Hardware

Page 20

Issues in Decoding

Primary Tasks
– Identify individual instructions (!)
– Determine instruction types
– Determine dependences between instructions

Two important factors
– Instruction set architecture
– Pipeline width
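
To make the dependence-detection task concrete, here is a minimal sketch that finds read-after-write dependences inside one decode group and counts the register comparisons needed as the group gets wider. The `Instr` type, the register names, and the assumption of two source operands per instruction are illustrative, not taken from the lecture.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str                # destination register name
    srcs: Tuple[str, ...]    # source register names


def raw_dependences(group: List[Instr]) -> List[Tuple[int, int, str]]:
    """Return (producer_index, consumer_index, register) for each RAW dependence in the group."""
    deps = []
    for j, consumer in enumerate(group):
        for src in consumer.srcs:
            # Search backward so the consumer is linked to the nearest earlier writer.
            for i in range(j - 1, -1, -1):
                if group[i].dest == src:
                    deps.append((i, j, src))
                    break
    return deps


def comparator_count(width: int, srcs_per_instr: int = 2) -> int:
    """Comparisons needed when each source is checked against every earlier destination."""
    return srcs_per_instr * width * (width - 1) // 2


group = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r1", "r5")),
    Instr("r1", ("r4", "r7")),
    Instr("r6", ("r1", "r2")),
]
deps = raw_dependences(group)
```

The comparison count grows quadratically with pipeline width, which is one reason width is listed as an important factor in decode complexity.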

Page 21

Pentium Pro Fetch/Decode

Page 22

Predecoding in the AMD K5

Page 23

Instruction Dispatch and Issue

Parallel pipeline
– Centralized instruction fetch
– Centralized instruction decode

Diversified pipeline
– Distributed instruction execution

Page 24

Necessity of Instruction Dispatch

Page 25

Centralized Reservation Station

Page 26

Distributed Reservation Station

Page 27

Issues in Instruction Execution

Current trends
– More parallelism: bypassing very challenging
– Deeper pipelines
– More diversity

Functional unit types
– Integer
– Floating point
– Load/store: most difficult to make parallel
– Branch
– Specialized units (media)

Very wide datapaths (256 bits/register or more)

Page 28

Bypass Networks

O(n^2) interconnect from/to FU inputs and outputs
Associative tag-match to find operands
Solutions (hurt IPC, help cycle time)
– Use RF only (IBM Power4) with no bypass network
– Decompose into clusters (Alpha 21264)

[Figure: Power4 block diagram, repeated from the Power4 Diversified Pipelines slide]
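
The quadratic growth of the bypass interconnect can be sketched with a simple count. The model assumes a full bypass in which every functional-unit output can feed every functional-unit input, with two operand inputs per unit. The clustered variant counts only paths inside each cluster and ignores the slower inter-cluster forwarding. All names and parameters here are illustrative assumptions.

```python
def bypass_paths(num_fus: int, inputs_per_fu: int = 2, bypass_stages: int = 1) -> int:
    """Point-to-point paths in a full bypass network: every FU output to every FU input."""
    return num_fus * num_fus * inputs_per_fu * bypass_stages


def clustered_bypass_paths(num_fus: int, clusters: int, inputs_per_fu: int = 2) -> int:
    """Paths when FUs are split evenly into clusters and bypassing stays inside a cluster."""
    if clusters <= 0 or num_fus % clusters != 0:
        raise ValueError("num_fus must divide evenly into a positive number of clusters")
    per_cluster = num_fus // clusters
    return clusters * bypass_paths(per_cluster, inputs_per_fu)


full_4 = bypass_paths(4)                     # 32
full_8 = bypass_paths(8)                     # 128: doubling the FUs quadruples the paths
two_clusters = clustered_bypass_paths(8, 2)  # 64
```

Splitting eight units into two clusters halves the local path count in this model, at the cost of extra latency for values that must cross clusters, which matches the slide's note that such solutions hurt IPC but help cycle time.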

Page 29

Specialized units

Intel Pentium 4 staggered adders
– Fireball
Run at 2x clock frequency
Two 16-bit bitslices
Dependent ops execute on half-cycle boundaries
Full result not available until full cycle later

[Figure: two adder bitslices linked by a Carry signal]
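
The bitslice idea can be illustrated functionally. The sketch below performs a 32-bit add as two 16-bit slices, with the low slice producing a carry that the high slice consumes. It models only the data flow, not the half-cycle timing, and the function name and return shape are assumptions for illustration.

```python
MASK16 = 0xFFFF


def staggered_add32(a: int, b: int):
    """Add two 32-bit values as two 16-bit slices.

    Returns (low_half, carry, full_result). The low half is ready first and can
    feed a dependent operation; the full result needs the high slice as well.
    """
    # First step: low 16 bits, producing a carry out.
    low_sum = (a & MASK16) + (b & MASK16)
    low = low_sum & MASK16
    carry = low_sum >> 16

    # Second step: high 16 bits, consuming the carry from the low slice.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    full = ((high & MASK16) << 16) | low    # wraps around at 32 bits
    return low, carry, full


low, carry, full = staggered_add32(0x0000FFFF, 0x00000001)
```

A dependent add needs only the producer's low half to start its own low slice, which is how dependent operations can begin on half-cycle boundaries.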

Page 30

Specialized units

FP multiply-accumulate
R = (A x B) + C
Doubles FLOP/instruction
Lose RISC instruction format symmetry:
– 3 source operands
Widely used
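
The "doubles FLOP/instruction" point can be checked with a dot product, where each step is one multiply and one add. This sketch assumes a hypothetical `fma` helper that models R = (A x B) + C as one instruction. Plain Python rounds the product and the sum separately, so it illustrates the instruction count rather than the numerical behavior of a hardware unit.

```python
def fma(a: float, b: float, c: float) -> float:
    """Model of R = (A x B) + C as a single three-source operation (illustrative only)."""
    return a * b + c


def dot(xs, ys) -> float:
    """Dot product written as one multiply-accumulate per element."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fma(x, y, acc)
    return acc


def flops_per_instruction(n: int, fused: bool) -> float:
    """FLOPs per instruction for an n-element dot product."""
    flops = 2 * n                            # n multiplies plus n adds
    instructions = n if fused else 2 * n     # one FMA versus a separate multiply and add
    return flops / instructions


result = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The three arguments of `fma` also show where the format symmetry is lost: the instruction needs three source operands rather than two.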

Page 31

Media Data Types

Subword parallel vector extensions
– Media data (pixels, quantized data) often 1-2 bytes
– Several operands packed in single 32/64b register: {a,b,c,d} and {e,f,g,h} stored in two 32b registers
– Vector instructions operate on 4/8 operands in parallel
– New instructions, e.g. sum of abs. differences (SAD): me = |a – e| + |b – f| + |c – g| + |d – h|

Substantial throughput improvement
– Usually requires hand-coding of critical loops
– Shuffle ops (gather/scatter of vector elements)

[Figure: two packed registers holding {a,b,c,d} and {e,f,g,h}]
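
The SAD formula can be sketched on packed 32-bit values. The helpers below assume four unsigned 8-bit elements per register, packed most significant byte first. That layout and the function names are illustrative assumptions, not a description of any particular instruction set.

```python
def pack_bytes(values) -> int:
    """Pack four unsigned bytes into one 32-bit value, first element in the top byte."""
    if len(values) != 4 or any(not 0 <= v <= 0xFF for v in values):
        raise ValueError("expected four values in the range 0..255")
    reg = 0
    for v in values:
        reg = (reg << 8) | v
    return reg


def unpack_bytes(reg: int):
    """Split a 32-bit value into four unsigned bytes, top byte first."""
    return [(reg >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def sad(reg_x: int, reg_y: int) -> int:
    """Sum of absolute differences: |a - e| + |b - f| + |c - g| + |d - h|."""
    return sum(abs(x - y) for x, y in zip(unpack_bytes(reg_x), unpack_bytes(reg_y)))


abcd = pack_bytes([10, 20, 30, 40])
efgh = pack_bytes([12, 15, 30, 50])
me = sad(abcd, efgh)    # 2 + 5 + 0 + 10
```

A hardware SAD instruction would compute the four differences in parallel; the loop here only shows the arithmetic being performed on the packed operands.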

Page 32

Issues in Completion/Retirement

Out-of-order execution
– ALU instructions
– Load/store instructions

In-order completion/retirement
– Precise exceptions
– Memory coherence and consistency

Solutions
– Reorder buffer
– Store buffer
– Load queue snooping (later)
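
A reorder buffer's role can be sketched as a queue that accepts completions in any order but releases instructions only from its head. The class below is a minimal illustration under that assumption; the method names and the string tags are hypothetical, and real designs also track results, exceptions, and capacity.

```python
from collections import deque


class ReorderBuffer:
    """Minimal reorder buffer: out-of-order completion, in-order retirement."""

    def __init__(self):
        self._entries = deque()   # oldest instruction at the left
        self.retired = []         # tags in the order they were retired

    def dispatch(self, tag: str) -> None:
        """Allocate an entry in program order."""
        self._entries.append({"tag": tag, "done": False})

    def complete(self, tag: str) -> None:
        """Mark an instruction finished; this may happen in any order."""
        for entry in self._entries:
            if entry["tag"] == tag:
                entry["done"] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Retire finished instructions from the head, stopping at the first unfinished one."""
        out = []
        while self._entries and self._entries[0]["done"]:
            out.append(self._entries.popleft()["tag"])
        self.retired.extend(out)
        return out


rob = ReorderBuffer()
for tag in ("i0", "i1", "i2"):
    rob.dispatch(tag)
rob.complete("i2")            # finishes early, but cannot retire yet
first = rob.retire()          # nothing retires: i0 is still pending
rob.complete("i0")
second = rob.retire()         # only i0 retires: i1 blocks i2
rob.complete("i1")
third = rob.retire()          # i1 and i2 retire together, in program order
```

Because nothing younger than an unfinished instruction ever retires, architectural state at any retirement point corresponds to a prefix of the program, which is what precise exceptions require.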

Page 33

A Dynamic Superscalar Processor

Page 34

Impediments to High IPC

[Figure: repeat of the earlier Impediments to High IPC diagram, showing FETCH, DECODE, EXECUTE, and COMMIT with the Instruction Flow, Register Data Flow, and Memory Data Flow paths]

Page 35

Superscalar Overview

Instruction flow
– Branches, jumps, calls: predict target, direction
– Fetch alignment
– Instruction cache misses

Register data flow
– Register renaming: RAW/WAR/WAW

Memory data flow
– In-order stores: WAR/WAW
– Store queue: RAW
– Data cache misses
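
As an illustration of the register renaming entry, the sketch below maps architectural registers to fresh physical registers. It assumes an unlimited supply of physical registers and a simple (dest, sources) instruction format; both are simplifications, and the names are hypothetical.

```python
from itertools import count


def rename(program):
    """Rename a list of (dest, (src, ...)) instructions.

    Sources read the current mapping, so RAW dependences are preserved.
    Every destination gets a fresh physical register, so WAR and WAW
    dependences on architectural names disappear.
    """
    fresh = count()
    mapping = {}      # architectural register -> current physical register
    renamed = []

    def lookup(reg):
        # A register read before any write gets an initial physical register.
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    for dest, srcs in program:
        new_srcs = tuple(lookup(s) for s in srcs)   # read mappings before updating dest
        mapping[dest] = f"p{next(fresh)}"           # allocate a new physical destination
        renamed.append((mapping[dest], new_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),   # writes r1
    ("r2", ("r1",)),        # RAW on r1, WAR on r2 against the first instruction
    ("r1", ("r2",)),        # RAW on r2, WAW on r1 against the first instruction
]
renamed = rename(program)
```

In the renamed sequence the second instruction still reads the first one's result, but its write no longer touches the register the first instruction read, and the two writes to r1 land in different physical registers.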