Superscalar Processors by Sherri Sparks

Jan 03, 2016

Transcript
Page 1: Superscalar Processors by Sherri Sparks. Overview 1.What are superscalar processors? 2.Program Representation, Dependencies, & Parallel Execution 3.Micro.

Superscalar Processors

by Sherri Sparks

Page 2

Overview

1. What are superscalar processors?

2. Program Representation, Dependencies, & Parallel Execution

3. Micro architecture of a typical superscalar processor

4. A look at 3 superscalar implementations

5. Conclusion: The future of superscalar processing

Page 3

What are superscalars and how do they differ from pipelines?

In simple pipelining, you are limited to fetching a single instruction into the pipeline per clock cycle. This creates a performance bottleneck.

Superscalar processors overcome the 1 instruction per clock cycle limit of simple pipelines and possess the ability to fetch multiple instructions during the same clock cycle. They also employ advanced techniques like “branch prediction” to ensure an uninterrupted stream of instructions.
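The effect of fetch width can be illustrated with a toy cycle count. This is a minimal sketch, not a real performance model: it ignores stalls, branches, and cache misses, and the instruction count and widths are hypothetical.

```python
import math

def cycles_to_fetch(num_instructions, fetch_width):
    """Cycles needed just to fetch a straight-line run of instructions,
    ignoring stalls, branches, and cache misses (a deliberate simplification)."""
    return math.ceil(num_instructions / fetch_width)

print(cycles_to_fetch(100, 1))  # 100 cycles on a simple 1-wide pipeline
print(cycles_to_fetch(100, 4))  # 25 cycles with a 4-wide superscalar fetch
```

In practice the 4x gain is an upper bound; dependencies and branches keep real machines well below it.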

Page 4

Development & History of Superscalars

Pipelining was developed in the late 1950s and became popular in the 1960s.

Examples of early pipelined architectures are the CDC 6600 and the IBM 360/91 (Tomasulo’s algorithm).

Superscalars appeared in the mid-to-late 1980s.

Page 5

Instruction Processing Model

Need to maintain software compatibility.

The instruction set was chosen as the level at which to maintain compatibility, because this does not require changes to existing software.

Need to maintain at least a semblance of a “sequential execution model” for programmers who rely on the concept of sequential execution in software design.

A superscalar processor may execute instructions out of order at the hardware level, but execution must *appear* sequential at the programming level.

Page 6

Superscalar Implementation

Instruction fetch strategies that simultaneously fetch multiple instructions, often by using branch prediction techniques.

Methods for determining data dependencies and keeping track of register values during execution

Methods for issuing multiple instructions in parallel

Resources for parallel execution of many instructions including multiple pipelined functional units and memory hierarchies capable of simultaneously servicing multiple memory references.

Methods for communicating data values through memory via load and store instructions.

Methods for committing the process state in correct order. This is to maintain the outward appearance of sequential execution.

Page 7

From Sequential to Parallel…

Parallel execution often results in instructions completing out of sequential order.

Speculative execution means that some instructions may be executed when they would not have been executed at all according to the sequential model (i.e. incorrect branch prediction).

To maintain the outward appearance of sequential execution for the programmer, storage cannot be updated immediately. The results must be held in temporary status until the storage is updated. Meanwhile, these temporary results must be usable by dependent instructions.

When it is determined that the sequential model would have executed an instruction, the temporary results are made permanent by updating the outward state of the machine. This process is called “committing” the instruction.

Page 8

Dependencies

Parallel Execution introduces 2 types of dependencies

Control dependencies due to incrementing or updating the program counter in response to conditional branch instructions.

Data dependencies due to resource contention, as instructions may need to read or write the same register or memory locations.

Page 9

Overcoming Control Dependencies Example

Block 1:
    L2: mov  r3,r7
        lw   r8,(r3)
        add  r3,r3,4
        lw   r9,(r3)
        ble  r8,r9,L3

Block 2:
        move r3,r7
        sw   r9,(r3)
        add  r3,r3,4
        sw   r8,(r3)
        add  r5,r5,1

Block 3:
    L3: add  r6,r6,1
        add  r7,r7,4
        blt  r6,r4,L2

Blocks are initiated into the “window of execution”.

Page 10

Control Dependencies & Branch Prediction

To gain the most parallelism, control dependencies due to conditional branches must be overcome.

Branch prediction attempts to overcome this by predicting the outcome of a branch and speculatively fetching and executing instructions from the predicted path.

If the predicted path is correct, the speculative status of the instructions is removed and they affect the state of the machine like any other instruction.

If the predicted path is wrong, then recovery actions are taken so as not to incorrectly modify the state of the machine.

Page 11

Data Dependencies

Data dependencies occur because instructions may access the same register or memory location

3 Types of data dependencies or “hazards”

RAW (“read after write”): occurs because a later instruction can only read a value after a previous instruction has written it.

WAR (“write after read”) : occurs when an instruction needs to write a new value into a storage location but must wait until all preceding instructions needing to read the old value have done so.

WAW (“write after write”) : occurs when multiple instructions update the same storage location; it must appear that these updates occur in the proper sequence.
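The three hazard definitions above can be expressed as set checks on each instruction's read and write registers. This is a minimal sketch; the dict-based instruction encoding and the `classify_hazards` helper are hypothetical, not part of the original slides.

```python
def classify_hazards(first, second):
    """Classify register hazards between an earlier instruction ('first')
    and a later one ('second'). Each instruction is a dict with 'reads'
    and 'writes' register sets. Returns the subset of {RAW, WAR, WAW}."""
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # later read must wait for the earlier write
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # later write must wait for the earlier read
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # both writes must appear to occur in order
    return hazards

# mov r3,r7 followed by add r3,r3,4: the add both reads and rewrites r3
mov = {"reads": {"r7"}, "writes": {"r3"}}
add = {"reads": {"r3"}, "writes": {"r3"}}
print(sorted(classify_hazards(mov, add)))  # ['RAW', 'WAW']
```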

Page 12

Data Dependency Example

mov r3,r7      ; writes r3
lw  r8,(r3)    ; RAW on r3 (reads the value mov wrote)
add r3,r3,4    ; WAR on r3 (must write after lw’s read); WAW on r3 (second write after mov)
lw  r9,(r3)    ; RAW on r3 (reads the value add wrote)
ble r8,r9,L3   ; RAW on r8 and r9

Page 13

Parallel Execution Method

1. Instructions are fetched using branch prediction to form a dynamic stream of instructions

2. Instructions are examined for dependencies and dependencies are removed

3. Examined instructions are dispatched to the “window of execution”. (These instructions are no longer in sequential order, but are ordered according to their data dependencies.)

4. Instructions are issued from the window in an order determined by their dependencies and hardware resource availability.

5. Following execution, instructions are put back into their sequential program order and then “committed” so their results update the machine state.
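Steps 3-5 above can be sketched as a toy scheduler: instructions issue as soon as their producers finish, but commit in program order. This is a simplified model under loud assumptions: dependencies are given as an acyclic RAW graph (renaming is assumed to have removed WAW/WAR constraints), every instruction takes one cycle, and functional units are unlimited.

```python
def schedule(deps):
    """deps[i] = set of earlier instruction indices that i depends on (RAW).
    Returns (issue groups, commit order): issue is dependence-driven and
    possibly out of order; commit is forced back to sequential order."""
    n = len(deps)
    done = set()
    issue_groups = []
    while len(done) < n:
        # Issue every not-yet-done instruction whose producers have finished.
        issued_now = [i for i in range(n) if i not in done and deps[i] <= done]
        issue_groups.append(issued_now)
        done.update(issued_now)
    commit_order = list(range(n))  # results update machine state in program order
    return issue_groups, commit_order

# 0 and 2 are independent; 1 needs 0; 3 needs both 1 and 2
groups, commits = schedule([set(), {0}, set(), {1, 2}])
print(groups)   # [[0, 2], [1], [3]] -- 0 and 2 issue together, out of order
print(commits)  # [0, 1, 2, 3]      -- commit restores sequential order
```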

Page 14

Superscalar Microarchitecture

The parallel execution method can be summarized in 5 phases:

1. Instruction Fetch & Branch Prediction

2. Decode & Register Dependence Analysis

3. Issue & Execution

4. Memory Operation Analysis & Execution

5. Instruction Reorder & Commit

Page 15

Superscalar Microarchitecture

Page 16

Instruction Fetch & Branch Prediction

Fetch phase must fetch multiple instructions per cycle from cache memory to keep a steady feed of instructions going to the other stages.

The number of instructions fetched per cycle should match or be greater than the peak instruction decode & execution rate (to allow for cache misses or occasions where the max # of instructions can’t be fetched)

For conditional branches, fetch mechanism must be redirected to fetch instructions from branch targets.

4 steps to processing conditional branch instructions:
1. Recognizing that an instruction is a conditional branch
2. Determining the branch outcome (taken or not taken)
3. Computing the branch target
4. Transferring control by redirecting instruction fetch (in the case of a taken branch)

Page 17

Processing Conditional Branches

STEP 1: Recognizing Conditional Branches

Instruction decode information is held in the instruction cache; these extra bits are used to identify the basic instruction types.

Page 18

Processing Conditional Branches
STEP 2: Determining Branch Outcome

Static predictions (information determined from the static binary). Ex: certain opcode types might result in more taken branches than others, or a backward branch direction might be more likely in loops.

Predictions based on profiling information (execution statistics collected during a previous run of the program).

Dynamic Predictions (information gathered during program execution about past history of branch outcomes). Branch history outcomes are stored in a “branch history table” or a “branch prediction table”.
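A common organization for such a branch history table is one 2-bit saturating counter per branch. The sketch below illustrates that idea under stated assumptions: the table is an unbounded dict keyed by branch PC (real tables are finite and can alias), counters start weakly not-taken, and the class name and example addresses are hypothetical.

```python
class TwoBitPredictor:
    """Dynamic branch prediction with a 2-bit saturating counter per branch.
    Counter states 0-1 predict not-taken, 2-3 predict taken; one mispredict
    does not flip a strongly biased branch."""

    def __init__(self):
        self.counters = {}  # branch pc -> counter in 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2  # default: weakly not-taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

bp = TwoBitPredictor()
outcomes = [True, True, False, True, True]  # loop-like branch: mostly taken
hits = 0
for taken in outcomes:
    hits += bp.predict(0x40) == taken
    bp.update(0x40, taken)
print(hits)  # 3 of the 5 outcomes predicted correctly
```

Note how the single not-taken outcome only weakens the counter (3 to 2), so the predictor still guesses "taken" on the next iteration.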

Page 19

Processing Conditional Branches
STEP 3: Computing Branch Targets

Branch targets are usually relative to the program counter and are computed as:

branch target = program counter + offset

Finding target addresses can be sped up by having a “branch target buffer” which holds the target address used the last time the branch was executed.

EX: Branch Target Address Cache used in PowerPC 604
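A branch target buffer reduces to a small cache from branch PC to last-seen target, consulted before the pc + offset adder finishes. The sketch below is a hypothetical minimal model (the class name, the 64-entry default, and the crude FIFO-ish eviction are assumptions, not details of the PowerPC 604's design).

```python
class BranchTargetBuffer:
    """Maps a branch's PC to the target address it jumped to last time."""

    def __init__(self, size=64):
        self.size = size
        self.entries = {}  # branch pc -> last target address

    def lookup(self, pc):
        # None means a miss: fall back to computing pc + offset the slow way.
        return self.entries.get(pc)

    def record(self, pc, target):
        if len(self.entries) >= self.size and pc not in self.entries:
            self.entries.pop(next(iter(self.entries)))  # crude eviction
        self.entries[pc] = target

btb = BranchTargetBuffer()
pc, offset = 0x1000, 0x40
assert btb.lookup(pc) is None    # first execution: must compute pc + offset
btb.record(pc, pc + offset)
print(hex(btb.lookup(pc)))       # '0x1040' on later fetches, no adder delay
```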

Page 20

Processing Conditional Branches
STEP 4: Transferring Control

Problem: There is often a delay between recognizing a branch, modifying the program counter, and fetching the target instructions.

Several Solutions:

1. Use the stockpiled instructions in the instruction buffer to mask the delay.
2. Use a buffer that contains instructions from both the “taken” and “not taken” branch paths.
3. Delayed branches – the branch does not take effect until the instruction after the branch. This allows the fetch of target instructions to overlap execution of the instruction following the branch. Delayed branches also introduce assumptions about pipeline structure, so they are rarely used anymore.

Page 21

Instruction Decoding, Renaming, & Dispatch

Instructions are removed from the fetch buffers, decoded and examined for control and data dependencies.

Instructions are dispatched to buffers associated with hardware functional units for later issuing and execution.

Page 22

Instruction Decoding

The decode phase sets up “execution tuples” for each instruction.

An “execution tuple” contains:
- The operation to be executed
- The identities of the storage elements where the input operands will eventually reside
- The locations where the instruction’s result must be placed

Page 23

Register Renaming

Used to eliminate WAW and WAR dependencies. (RAW dependencies are true data dependencies and cannot be removed by renaming.)

2 Types:

Physical register file is larger than logical register file and a mapping table is used to associate physical register values with logical register values. Physical registers are assigned from a “free list”.

Reorder Buffer: Uses the same size physical and logical register files, plus a “reorder buffer” that contains 1 entry per active instruction and maintains the sequential ordering of instructions. It is a circular queue implemented in hardware. As instructions are dispatched, they enter the queue at the tail. As instructions complete, their results are inserted into their assigned locations in the reorder buffer. When an instruction reaches the head of the queue, its entry is removed and its result is placed in the register file.
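The first scheme (mapping table plus free list) can be sketched in a few lines. This is a hypothetical minimal model: the `Renamer` class is not from the slides, and the register names follow the slides' example on the next page (logical r*, physical R*).

```python
class Renamer:
    """Mapping-table register renaming with a free list of physical registers."""

    def __init__(self, mapping, free_list):
        self.map = dict(mapping)    # logical register -> physical register
        self.free = list(free_list)

    def rename(self, dest, src):
        """Rename 'op dest, src, imm': the source reads its current mapping;
        the destination gets a fresh physical register from the free list,
        which is what removes WAW and WAR conflicts with earlier writers."""
        phys_src = self.map[src]
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_src

r = Renamer({"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
            ["R2", "R6", "R13"])
d, s = r.rename("r3", "r3")   # add r3,r3,4  becomes  add R2,R1,4
print(d, s)                   # R2 R1
print(r.map["r3"], r.free)    # R2 ['R6', 'R13']
```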

Page 24

Register Renaming I

Before: add r3,r3,4
    Mapping Table: r0->R8, r1->R7, r2->R5, r3->R1, r4->R9
    Free List: R2, R6, R13

After: add R2,R1,4
    Mapping Table: r0->R8, r1->R7, r2->R5, r3->R2, r4->R9
    Free List: R6, R13

(The source operand r3 reads its current mapping, R1; the destination r3 is assigned a fresh physical register, R2, taken from the free list.)

Page 25

Register Renaming II (using a reorder buffer)

Before: add r3,r3,4
    Mapping Table: r0->r0, r1->r1, r2->r2, r3->rob6, r4->r4
    Reorder Buffer (partial): entry 6 holds the pending value of r3

After: add r3,rob6,4 (result tagged rob8)
    Mapping Table: r0->r0, r1->r1, r2->r2, r3->rob8, r4->r4
    Reorder Buffer (partial): entry 8 is allocated for the new value of r3; entry 6 still holds the prior one

(The source operand r3 is replaced by its current reorder buffer tag, rob6; the result is assigned the next free entry, rob8, and the mapping table is updated so r3 points at it.)

Page 26

Instruction Issuing & Parallel Execution

Instruction issuing is defined as the run-time checking for availability of data and resources.

Constraints on instruction issue:

Availability of physical resources like instruction units, interconnect, and register file

Organization of buffers holding execution tuples

Page 27

Single Queue Method

If there is no out-of-order issuing, operand availability can be managed via reservation bits assigned to each register.

A register is reserved when an instruction that modifies it issues.

The reservation is cleared when the instruction completes.

An instruction may issue if there are no reservations on its operands.
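The reservation-bit rule above can be sketched directly. This is a minimal model under assumptions: the `try_issue` helper and tuple-based instruction encoding are hypothetical, and the set of reserved registers stands in for one reservation bit per register.

```python
def try_issue(instr, reserved):
    """instr: (dest_register, source_registers). Issue only if no operand
    is still reserved by an earlier, still-executing instruction; on issue,
    reserve the destination until its writer completes."""
    dest, sources = instr
    if any(reg in reserved for reg in sources):
        return False       # an operand is still being produced: stall
    reserved.add(dest)     # set the reservation bit for the destination
    return True

reserved = set()
print(try_issue(("r1", ["r2", "r3"]), reserved))  # True: issues, reserves r1
print(try_issue(("r4", ["r1"]), reserved))        # False: stalls on reserved r1
reserved.discard("r1")                            # r1's writer completes
print(try_issue(("r4", ["r1"]), reserved))        # True: now free to issue
```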

Page 28

Multiple Queue Method

There are multiple queues organized according to instruction type.

Instructions issue from individual queues in sequential order.

Individual queues may issue out of order with respect to one another.

Page 29

Reservation Stations

Instructions issue out of order

Reservation stations hold information about source operands for an operation.

When all operands are present, the instruction may issue.

Reservation stations may be partitioned according to instruction type or pooled into a single large block.

Fields of a reservation station entry: Operation | Source 1 | Data 1 | Valid 1 | Source 2 | Data 2 | Valid 2 | Destination
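The entry fields above translate naturally into a record type. A minimal sketch, assuming a tag-broadcast result bus: the class, the `capture`/`ready` method names, and the `rob*` tags are hypothetical, and real entries carry encoded widths rather than Python strings.

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    """One reservation station entry; the instruction may issue once both
    source operands have been captured (both valid bits set)."""
    operation: str
    source1: str          # tag of the producer of operand 1
    data1: int = 0
    valid1: bool = False
    source2: str = ""
    data2: int = 0
    valid2: bool = False
    destination: str = ""

    def capture(self, tag, value):
        """Snoop a completing result broadcast as (tag, value)."""
        if not self.valid1 and self.source1 == tag:
            self.data1, self.valid1 = value, True
        if not self.valid2 and self.source2 == tag:
            self.data2, self.valid2 = value, True

    def ready(self):
        return self.valid1 and self.valid2

rs = ReservationStation("add", source1="rob3", source2="rob5", destination="rob7")
rs.capture("rob3", 10)
print(rs.ready())                      # False: still waiting on rob5
rs.capture("rob5", 32)
print(rs.ready(), rs.data1 + rs.data2) # True 42: both operands captured
```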

Page 30

Memory Operation Analysis & Execution

To reduce latency, memory hierarchies are used & may contain primary and secondary caches.

Address translation to physical addresses is sped up by using a “translation lookaside buffer”, which caches recently used page translations.

“Multiported” memory hierarchy is used to allow multiple memory requests to be serviced simultaneously. Multiporting is achieved by having multiple memory banks or making multiple serial requests during the same cycle.

“Store address buffers” are used to make sure memory operations don’t violate hazard conditions. Store address buffers contain the addresses of all pending store operations.
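The store address buffer check reduces to a membership test: a load may proceed past pending stores only if its address conflicts with none of them. A deliberately conservative sketch; the helper name is hypothetical, and real designs must also handle stores whose addresses are not yet computed and partial-word overlaps.

```python
def load_may_proceed(load_addr, store_buffer):
    """store_buffer: set of addresses of all pending (uncommitted) stores.
    A matching address would be a RAW hazard through memory, so the load
    must wait (or forward the store's data, in more aggressive designs)."""
    return load_addr not in store_buffer

pending_stores = {0x100, 0x108}                 # addresses of in-flight stores
print(load_may_proceed(0x200, pending_stores))  # True: no conflict
print(load_may_proceed(0x108, pending_stores))  # False: RAW hazard through memory
```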

Page 31

Memory Hazard Detection

Page 32

Instruction Reorder & Commit

When an instruction is “committed”, its result is allowed to modify the logical state of the machine.

The purpose of the commit phase is to maintain the illusion of a sequential execution model.

2 methods

1. The state of the machine is saved in a history buffer. Instructions update the state of the machine as they execute, and when there is a problem, the state of the machine can be recovered from the history buffer. The commit phase discards history state that is no longer needed.

2. The state of the machine is separated into a physical state and a logical state. The physical state is updated in memory as instructions complete. The logical state is updated in a sequential order as the speculative status of instructions is cleared. The speculative state is maintained in a reorder buffer and during the commit phase, the result of an operation is moved from the reorder buffer to a logical register or memory.
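The second method's commit step can be sketched as draining completed entries from the head of a reorder buffer into the logical register file. A minimal model under assumptions: the dict-based entries and `commit_ready` helper are hypothetical, single-result register writes only, and the deque stands in for the hardware circular queue.

```python
from collections import deque

def commit_ready(rob, regfile):
    """Retire completed entries from the head of the reorder buffer, in
    program order, moving each result into the logical register file.
    A completed-but-younger entry must wait behind an incomplete older one."""
    retired = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        regfile[entry["dest"]] = entry["value"]  # update logical state
        retired.append(entry["dest"])
    return retired

rob = deque([
    {"dest": "r3", "done": True,  "value": 7},
    {"dest": "r5", "done": False, "value": None},  # still executing
    {"dest": "r6", "done": True,  "value": 9},     # finished early, must wait
])
regfile = {}
print(commit_ready(rob, regfile))  # ['r3'] -- r6 waits behind incomplete r5
print(regfile)                     # {'r3': 7}
```

This waiting is exactly what preserves the illusion of sequential execution: r6's result exists physically but does not touch the logical state until r5 commits.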

Page 33

The Role of Software

Superscalars can be made more efficient if parallelism in software can be increased.

1. By increasing the likelihood that a group of instructions can be issued simultaneously

2. By decreasing the likelihood that an instruction has to wait for the result of a previous instruction

Page 34

A Look At 3 Superscalar Processors

1. MIPS R10000

2. DEC Alpha 21164

3. AMD K5

Page 35

MIPS R10000

- “Typical” superscalar processor
- Able to fetch 4 instructions at a time
- Uses predecode to generate bits that assist with branch prediction (512-entry prediction table)
- A resume cache is used to fetch “not taken” instructions and has space to handle 4 branch predictions at a time
- Register renaming uses a physical register file 2x the size of the logical register file; physical registers are allocated from a free list
- 3 instruction queues: memory, integer, and floating point
- 5 functional units (an address adder, 2 integer ALUs, a floating-point multiplier/divider/square-rooter, and a floating-point adder)
- On-chip primary data cache (32 KB, 2-way set associative) and an off-chip secondary cache
- Uses the reorder buffer mechanism to maintain machine state during exceptions
- Instructions are committed 4 at a time

Page 36

Alpha 21164

- Simple superscalar that forgoes the advantage of dynamic scheduling in favor of a high clock rate
- 4 instructions at a time are fetched from an 8K instruction cache
- 2 instruction buffers that issue instructions in program order
- Branches are predicted using a history table associated with the instruction cache
- Uses the single-queue method of instruction issuing
- 4 functional units (2 ALUs, a floating-point adder, and a floating-point multiplier)
- 2-level cache memory (primary 8K cache & secondary 96K 3-way set-associative cache)
- Sequential machine state is maintained during interrupts because instructions are not issued out of order
- The pipeline functions as a simple reorder buffer, since instructions in the pipeline are maintained in sequential order

Page 37

Alpha 21164 Superscalar Organization

Page 38

AMD-K5

- Implements the complex Intel x86 instruction set
- Uses 5 predecode bits for decoding variable-length instructions
- Instructions are fetched from the instruction cache at a rate of 16 bytes/cycle & placed in a 16-element queue
- Branch prediction is integrated with the instruction cache; there is 1 prediction entry per cache line
- Due to instruction set complexity, 2 cycles are required to decode:
  1. Instructions are converted to ROPs (simple RISC-like operations)
  2. Instructions read operand data & are dispatched to functional unit reservation stations
- There are 6 functional units: 2 integer ALUs, 1 floating-point unit, 2 load/store units, & a branch unit
- Up to 4 ROPs can be issued per clock cycle
- Has an 8K data cache with 4 banks; dual loads/stores are allowed to different banks
- A 16-entry reorder buffer maintains machine state when there is an exception and recovers from incorrect branch predictions

Page 39

AMD K5 Superscalar Organization

Page 40

The Future of Superscalar Processing

Superscalar design = performance gain, BUT increasing hardware parallelism may be a case of diminishing returns:

1. There are limits to the instruction-level parallelism in programs that can be exploited.
2. Simultaneously issuing more instructions increases complexity and requires more cross-checking. This will eventually affect the clock rate.
3. There is a widening gap between processor and memory performance.
4. Many believe that the 8-way superscalar is the limit and that we will reach this limit within 2 years.

Some believe VLIW will replace superscalars and offers advantages:

1. Because software is responsible for creating the execution schedule, the size of the instruction window that can be examined for parallelism is larger than a superscalar can manage in hardware.
2. Since there is no dependence checking by the processor, VLIW hardware is simpler to implement and may allow a faster clock.

Page 41

Reference

J. E. Smith and G. S. Sohi, “The Microarchitecture of Superscalar Processors,” Proceedings of the IEEE, 1995.

Page 42

?