Instruction Level Parallelism 2. Superscalar and VLIW processors

Dec 14, 2015

Page 1: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Instruction Level Parallelism

2. Superscalar and VLIW processors

Page 2: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Superscalar and VLIW Processors

• Scalar processors fetch and issue at most one operation per clock cycle.

• Multiple-issue processors:
  • Superscalar: issue a varying number of instructions per clock cycle.
  • VLIW: issue a fixed number of instructions per clock cycle.

Vittorio Zaccaria – Alari @ ST 2001

Page 3: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Superscalar Processors

• Issue from 1 to 8 instructions per clock cycle.

• If an instruction depends on an earlier one in the same issue group, only the instructions preceding it are issued in that cycle (in-order issue).

• This decision is made at run time by the processor.

=> Variability in the issue rate.
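The in-order issue rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Instr` record and `issue_in_order` function are hypothetical names, and the check covers only RAW dependences inside the current issue group, not in-flight instructions or structural hazards.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dst: Optional[str]          # destination register, None for stores/branches
    srcs: Tuple[str, ...]       # source registers


def issue_in_order(window: List[Instr], width: int) -> List[Instr]:
    """Return the instructions issued this cycle from the head of `window`."""
    issued = []
    written = set()
    for ins in window[:width]:
        # Stop at the first instruction that reads a register written by an
        # instruction already selected in this cycle (RAW dependence).
        if any(src in written for src in ins.srcs):
            break
        issued.append(ins)
        if ins.dst is not None:
            written.add(ins.dst)
    return issued


window = [
    Instr("LD", "F0", ("R1",)),
    Instr("LD", "F6", ("R1",)),
    Instr("ADDD", "F4", ("F0", "F2")),   # depends on the first LD
    Instr("SD", None, ("R1", "F4")),
]
print([i.op for i in issue_in_order(window, 4)])
```

With a 4-wide window only the two loads issue, because the ADDD reads F0; this is the run-time variability in issue rate that the slide refers to.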

Page 4: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Superscalar Processors

Can be:

• Statically scheduled: instructions behind a stalled instruction are not allowed to issue, or

• Dynamically scheduled and speculative: instructions behind RAW hazards are allowed to proceed.

Page 5: Instruction Level Parallelism 2. Superscalar and VLIW processors.

How to optimize code for Superscalar Processors (1)

Loop: LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
      SUBI R1,R1,#8
      BNEZ R1,LOOP

The loop body (load/add/store) is unrolled 4 times. RAW hazards have been reduced, but there are still resource conflicts on the pipelines:

Loop: LD   F0,0(R1)
      LD   F6,-8(R1)
      LD   F10,-16(R1)
      LD   F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD   0(R1),F4
      SD   -8(R1),F8
      SD   -16(R1),F12
      SUBI R1,R1,#32
      BNEZ R1,LOOP
      SD   8(R1),F16     ; 8-32 = -24

Page 6: Instruction Level Parallelism 2. Superscalar and VLIW processors.

How to optimize code for Superscalar processors (2)

      Integer instruction      FP instruction

Loop: LD   F0,0(R1)
      LD   F6,-8(R1)
      LD   F10,-16(R1)         ADDD F4,F0,F2
      LD   F14,-24(R1)         ADDD F8,F6,F2
      LD   F18,-32(R1)         ADDD F12,F10,F2
      SD   0(R1),F4            ADDD F16,F14,F2
      SD   -8(R1),F8           ADDD F20,F18,F2
      SD   -16(R1),F12
      SD   -24(R1),F16
      SUBI R1,R1,#40
      BNEZ R1,LOOP
      SD   8(R1),F20           ; 8-40 = -32

• The loop is unrolled 5 times.
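As a rough check on what the two-issue schedule buys, the sketch below counts issue rows and slot usage. It assumes each row of the schedule occupies exactly one clock cycle with no stalls, which the slide does not state; `schedule_stats` is a hypothetical helper.

```python
# (integer slot, FP slot) per issue row; None marks an empty slot.
schedule = [
    ("LD", None), ("LD", None),
    ("LD", "ADDD"), ("LD", "ADDD"), ("LD", "ADDD"),
    ("SD", "ADDD"), ("SD", "ADDD"),
    ("SD", None), ("SD", None),
    ("SUBI", None), ("BNEZ", None), ("SD", None),
]
UNROLL = 5  # loop iterations (array elements) covered by one pass


def schedule_stats(rows, elements):
    """Summarise a dual-issue schedule, assuming one row per clock cycle."""
    cycles = len(rows)
    used = sum(op is not None for row in rows for op in row)
    return {
        "cycles": cycles,
        "cycles_per_element": cycles / elements,
        "slot_utilization": used / (2 * cycles),
    }


stats = schedule_stats(schedule, UNROLL)
print(stats)
```

Under that assumption the 12 rows cover 5 elements, and the FP column is empty in 7 of the 12 rows, which shows how much issue bandwidth stays unused even after unrolling.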

Page 7: Instruction Level Parallelism 2. Superscalar and VLIW processors.

The PowerPC 620 [’94]

• Superscalar architecture, similar to:
  • MIPS R10000
  • HP PA 8000

• Fetch, issue and completion of up to 4 instructions per clock cycle.

• Six separate execution units buffered with reservation stations.

Page 8: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC functional units

• 2 integer units (XSU0, XSU1), 0 cycles latency [+,-,shift..]

• 1 complex integer function unit (MCFXU) for integer multiply (pipelined) and divide (unpipelined). Latency from 3 to 20 cycles.

• 1 Load store unit. Latency=1 for integer loads, 2 for FP loads.

Page 9: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC functional units

• 1 FPU, fully pipelined except for divide, with latencies of:
  • 2 cycles for multiply, add, and multiply-add
  • 31 cycles for DP FP divide

• 1 BRU, which completes branches and informs the fetch unit of mispredictions. It includes the condition register used for conditional branches.

Page 10: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC Architecture

• Speculative Tomasulo with register renaming. An extended register file holds the speculative result of an instruction until the instruction commits.

• The ROB enforces only in-order commit.

• Advantage: operands are available from a single location (no need for additional complex logic to access ROB result values).

Page 11: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC 620 architecture

Page 12: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC Pipeline

• Fetch: The fetch unit loads the decode queue with instructions from the cache. The next address is predicted through a 256-entry, two-way set-associative BTB (branch target buffer). A BPB (branch prediction buffer) is used if there is a miss in the BTB.
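To make the "256-entry, two-way set-associative" organisation concrete, here is a minimal Python sketch of such a lookup structure. The indexing scheme (word address modulo the number of sets) and the replacement policy (evict the entry updated least recently) are assumptions for illustration only; the slides do not describe them, and `BranchTargetBuffer` is a hypothetical class.

```python
class BranchTargetBuffer:
    """Toy set-associative BTB: 256 entries, 2 ways => 128 sets."""

    def __init__(self, entries=256, ways=2):
        self.ways = ways
        self.num_sets = entries // ways
        # Each set holds up to `ways` (tag, target) pairs, oldest first.
        self.sets = [[] for _ in range(self.num_sets)]

    def _index_tag(self, pc):
        word = pc >> 2  # PowerPC instructions are 4 bytes long
        return word % self.num_sets, word // self.num_sets

    def lookup(self, pc):
        """Return the predicted target, or None on a BTB miss."""
        idx, tag = self._index_tag(pc)
        for entry_tag, target in self.sets[idx]:
            if entry_tag == tag:
                return target
        return None

    def update(self, pc, target):
        """Record a taken branch; evict the oldest entry if the set is full."""
        idx, tag = self._index_tag(pc)
        entries = [e for e in self.sets[idx] if e[0] != tag]
        entries.append((tag, target))
        if len(entries) > self.ways:
            entries.pop(0)  # assumed policy: drop the least recently updated
        self.sets[idx] = entries


btb = BranchTargetBuffer()
btb.update(0, 0x40)
btb.update(512, 0x80)     # maps to the same set as address 0
print(btb.lookup(0), btb.lookup(512), btb.lookup(4))
```

A `None` result corresponds to the BTB miss case, where the slide says the BPB is consulted instead.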

Page 13: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC Pipeline

• Instruction decode: Instructions are decoded and inserted into an 8-entry instruction queue.

• Instruction issue: Up to 4 instructions are taken from the 8-entry instruction queue and issued to the reservation stations (RS). A rename register and a reorder buffer entry are allocated for each issued instruction. If they cannot be allocated, issue stalls.

Page 14: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC Pipeline

• Execution: An instruction proceeds with execution when all its operands are available. At the end, the result is written on the result bus, and the completion unit is notified that the instruction has completed.

Page 15: Instruction Level Parallelism 2. Superscalar and VLIW processors.

PowerPC Pipeline

• If the instruction is a (mispredicted) branch, the IFU and the instruction completion unit (ICU) are notified. Instruction fetch restarts, and the ICU discards all the speculated instructions after the branch and frees their rename buffers.

• Commit: When all previous instructions have been committed, the result is committed into the RF and the rename buffer is freed. Stores also commit from the store buffer to memory.
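The in-order commit rule can be sketched as a small reorder-buffer model in Python. It is a simplified illustration, not the 620's actual design: the `ReorderBuffer` class, its method names, and the instruction tags are hypothetical, and the limit of 4 commits per cycle is taken from the "up to 4 instructions per clock cycle" figure given earlier.

```python
from collections import deque


class ReorderBuffer:
    """Instructions complete in any order but commit strictly in program order."""

    def __init__(self):
        self.entries = deque()   # program order, oldest entry on the left
        self.regfile = {}        # architectural register file

    def issue(self, tag, dest):
        self.entries.append({"tag": tag, "dest": dest, "value": None, "done": False})

    def complete(self, tag, value):
        # Execution finished: the result is held speculatively until commit.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["done"] = True
                return
        raise KeyError(tag)

    def commit(self, max_per_cycle=4):
        committed = []
        # Only the oldest entry may commit, and only once it has completed.
        while (self.entries and self.entries[0]["done"]
               and len(committed) < max_per_cycle):
            entry = self.entries.popleft()
            self.regfile[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed


rob = ReorderBuffer()
for tag, dest in [("i1", "r3"), ("i2", "r4"), ("i3", "r5")]:
    rob.issue(tag, dest)
rob.complete("i3", 30)
rob.complete("i2", 20)
blocked = rob.commit()      # i1 has not completed, so nothing commits
rob.complete("i1", 10)
drained = rob.commit()      # now all three commit in program order
print(blocked, drained)
```

Discarding speculated instructions after a mispredicted branch would amount to dropping every entry younger than the branch, which this sketch leaves out.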

Page 16: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Performance results

• IPC ranges from under 1 to 1.8.

• IPC = 4 is not reached because:
  • FUs are not replicated for each instruction (structural hazards).
  • Instruction-level parallelism is limited, or buffering is limited (insufficient buffers).

Page 17: Instruction Level Parallelism 2. Superscalar and VLIW processors.

P6 Processor Family: Intel Pentium II/III

• 3-way superscalar.

• Basic idea, three engines:

Page 18: Instruction Level Parallelism 2. Superscalar and VLIW processors.

P6 Pipeline

• Fetch/Decode Unit: decodes instructions and puts them in the instruction pool in order.
  • Converts the instructions into micro-ops that represent the instruction code.

• Dispatch/Execute Unit: issues micro-ops out of order from the instruction pool through a reservation station and executes them out of order.

• Retire Unit: reorders the instructions and commits speculative results to the architectural state.

Page 19: Instruction Level Parallelism 2. Superscalar and VLIW processors.

P6 Instruction Decode

• The decoder fetches 16 bytes per clock cycle from the cache.

• 3 parallel decoders convert most instructions into one or more triadic micro-ops. Some instructions need microcode (several micro-ops) to be executed.

• The Register Alias Table unit converts logical register references into physical register references in the ROB (register renaming).
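A minimal Python sketch of the renaming step follows. It assumes one new ROB entry per micro-op and names entries `rob0`, `rob1`, and so on; the `RegisterAliasTable` class and this naming are hypothetical, chosen only to show how the mapping removes false dependences.

```python
class RegisterAliasTable:
    """Maps each logical register to the ROB entry holding its newest value."""

    def __init__(self):
        self.alias = {}       # logical register -> ROB entry name
        self.next_entry = 0

    def rename(self, dst, srcs):
        # Sources use the current mapping; a register with no mapping is
        # read from the architectural register file under its own name.
        phys_srcs = [self.alias.get(src, src) for src in srcs]
        # The destination always gets a fresh ROB entry.
        phys_dst = f"rob{self.next_entry}"
        self.next_entry += 1
        self.alias[dst] = phys_dst
        return phys_dst, phys_srcs


rat = RegisterAliasTable()
first = rat.rename("eax", ["ebx", "ecx"])    # eax <- ebx op ecx
second = rat.rename("eax", ["eax", "edx"])   # eax <- eax op edx
third = rat.rename("esi", ["eax"])           # esi <- f(eax)
print(first, second, third)
```

The two writes to `eax` land in different ROB entries, so the second no longer has to wait for the first to be written back; only the true (RAW) dependence through `rob0` and `rob1` remains.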

Page 20: Instruction Level Parallelism 2. Superscalar and VLIW processors.

P6 Instruction Dispatch/Execute

• The dispatch unit dispatches the micro-ops in the instruction pool out of order through the reservation station unit.

• This happens when:
  • All the operands are ready.
  • The resource needed is ready.

• Maximum throughput: 5 micro-ops/cycle.

If a micro-op is a branch, its outcome is compared with the address predicted in the fetch phase. On a misprediction, the JEU (jump execution unit) changes the status of all the micro-ops behind the branch and removes them from the instruction pool.
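The two dispatch conditions and the 5 micro-ops/cycle limit can be sketched as a simple selection loop. The dictionary fields, the unit names, and the oldest-first scan order are assumptions made for this illustration; the slides do not specify the selection policy.

```python
def select_for_dispatch(pool, free_units, max_per_cycle=5):
    """Pick micro-ops whose operands and execution resource are both ready."""
    available = set(free_units)
    chosen = []
    for uop in pool:                        # oldest first (an assumption)
        if len(chosen) == max_per_cycle:
            break
        if uop["operands_ready"] and uop["unit"] in available:
            available.remove(uop["unit"])   # one micro-op per unit per cycle
            chosen.append(uop["name"])
    return chosen


pool = [
    {"name": "u0", "operands_ready": False, "unit": "alu0"},
    {"name": "u1", "operands_ready": True, "unit": "alu0"},
    {"name": "u2", "operands_ready": True, "unit": "alu0"},
    {"name": "u3", "operands_ready": True, "unit": "load"},
]
print(select_for_dispatch(pool, ["alu0", "load"]))
```

Here `u1` overtakes the older `u0`, whose operands are not ready, while `u2` waits because its unit is already taken in this cycle.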

Page 21: Instruction Level Parallelism 2. Superscalar and VLIW processors.

P6 Instruction Retire

• The retire unit looks for micro-ops that have been executed and can be removed from the pool.

• The original architectural target of the micro-ops is written.

• This is done in order, committing an instruction only if:
  • Previous instructions have been committed.
  • The instruction has been executed.

• Up to 3 micro-ops can be retired per clock cycle.

Page 22: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Pentium 4

Page 23: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Pentium 4

• New NetBurst micro-architecture
  • 20 pipeline stages (hyper-pipeline)
  • 1.4 GHz to 2 GHz

• 3 prefetching mechanisms
  • Hardware instruction prefetcher (based on the BTB).
  • Software-controlled data cache prefetching.
  • L3->L2 data and instruction hardware prefetcher.

Page 24: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Pentium 4

• Execution Trace Cache (TC)
  • The TC stores decoded IA-32 instructions (micro-ops).
  • Removes decoding costs.
  • 12K micro-ops, with a fetch bandwidth of 3 micro-ops per cycle.
  • It stores traces built across predicted branches.
  • However, some instructions need microcode from ROM.

Page 25: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Pentium 4

• The branch penalty can be much more than 10 cycles.
• Uses a BTB.
• On a miss in the BTB, static prediction is used (backward = taken, forward = not taken).
• Software branch hints, used during trace construction, override the static prediction.
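The BTB-then-static fallback can be written as a short decision function. This is a sketch only: the BTB is modelled as a plain dictionary, `predict_next` and its parameters are hypothetical names, and software branch hints are left out.

```python
def predict_next(pc, branch_target, fall_through, btb):
    """Predict the next fetch address for the branch located at `pc`."""
    if pc in btb:
        return btb[pc]              # BTB hit: use the recorded target
    # BTB miss: static rule, backward taken / forward not taken.
    if branch_target < pc:
        return branch_target        # backward branch (typically a loop)
    return fall_through             # forward branch: assume not taken


btb = {0x100: 0x80}
print(hex(predict_next(0x100, 0x200, 0x104, btb)))  # BTB hit
print(hex(predict_next(0x300, 0x280, 0x304, btb)))  # miss, backward branch
print(hex(predict_next(0x300, 0x380, 0x304, btb)))  # miss, forward branch
```

The backward-taken rule favours loop-closing branches, which are taken on every iteration except the last.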

Page 26: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Pentium 4

• Execution Units and Issue Ports

Page 27: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Pentium 4

• 1 load and 1 store can be issued each cycle.

• Loads can be reordered with respect to other loads and stores.

• Loads can be executed speculatively.
• Up to 4 outstanding load misses.
• Load/store forwarding.

Page 28: Instruction Level Parallelism 2. Superscalar and VLIW processors.

AMD Athlon K7

• Nine-issue (micro-ops), super-pipelined, superscalar x86 processor.
• Multiple x86 instruction decoders (into triadic micro-ops).
• Three out-of-order, superscalar, fully pipelined floating point execution units.
• Three out-of-order, superscalar, pipelined integer units.
• Three out-of-order, superscalar, pipelined address calculation units.
• 72-entry instruction control unit (ROB).

Page 29: Instruction Level Parallelism 2. Superscalar and VLIW processors.

AMD Athlon K7

Page 30: Instruction Level Parallelism 2. Superscalar and VLIW processors.

AMD Athlon K7

• The Instruction Control Unit contains a reorder buffer and distributed reservation stations to hold operands while OPs wait to be scheduled.

• The Integer Instruction Scheduler is the scheduling logic that picks OPs for execution based on their operand availability and issues them to functional units or address generation units.

• The functional units perform transformations on data and return their results to the reorder buffer, while the address generation units send calculated memory addresses to the Load/Store Unit for further processing.

Page 31: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Clustered VLIW

Page 32: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Multi-Ported Register File Limits

• Area of the register file grows approximately with the square of the number of ports

[Figure: two register-file bit cells. One has 1 write port and 2 read ports (Write1, Read1A, Read1B; outputs Dout1A, Dout1B). The other has 2 write ports and 4 read ports (Write1, Write2, Read1A, Read1B, Read2A, Read2B; outputs Dout1A, Dout1B, Dout2A, Dout2B).]

Page 33: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Multiported Register File

• Read access time of a register file grows approximately linearly with the number of ports.
  • Internal bit cell loading becomes larger.
  • Larger area of the register file causes longer wire delays.

• What is reasonable today in terms of number of ports?
  • Changes with technology; 15-20 ports (read ports + write ports) is currently about the maximum.
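A quick back-of-the-envelope calculation shows why the port count becomes the bottleneck. The sketch below assumes every execution unit needs its own read and write ports and applies the "area grows with the square of the number of ports" rule literally; both helper functions are hypothetical.

```python
def register_file_ports(units, reads_per_op=2, writes_per_op=1):
    """Ports needed if every unit reads and writes the file each cycle."""
    return units * (reads_per_op + writes_per_op)


def relative_area(ports, baseline_ports):
    """Area relative to a baseline, assuming area ~ ports squared."""
    return (ports / baseline_ports) ** 2


one_unit = register_file_ports(1)    # 2 read + 1 write
two_units = register_file_ports(2)   # 4 read + 2 write
print(one_unit, two_units, relative_area(two_units, one_unit))
```

Doubling the number of units doubles the ports but, under this model, quadruples the register file area, while the read access time grows roughly linearly as stated above.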

Page 34: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Clustered VLIW

• To solve the bottleneck, create partitioned register files, each connected to a small number of execution units.

[Figure: three register files, each connected to its own execution unit (EU), linked by a global bus.]

Page 35: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Register File Communication

• Architecturally Invisible
  • Partitioned RFs appear as one large register file to the compiler.
  • Copying between RFs is done by the control logic.
  • Detecting when copying is needed can be complicated; it goes against the VLIW philosophy of minimal control overhead.

Page 36: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Register File Communication

• Architecturally Visible
  • Remote and local versions of instructions
  • Explicit copy primitives

• Remote instructions:
  • Have one or more operands in a non-local RF.
  • Copying remote operands to local RFs takes clock cycles.
  • Because copying is an ‘atomic’ part of the remote instruction, the execution unit is idle while copying is done => performance loss.

Page 37: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Register File Communication

• Copy instructions:
  • Separating copy and execution allows more flexible scheduling by the compiler.

move r1, r60           // r60 is in another RF
independent instr a    // do not waste useful
independent instr b    // clock cycles
add r2, r1, r3

Page 38: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Instruction Compression

• Embedded Processors often put a limit on code size

• How to reduce size?
  • NOPs are common; use only a few bits (2-3) to represent a NOP.
  • Explicitly mark the start and stop of the long instruction and do not insert NOPs.
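The second option can be sketched as follows. In this illustration only the stop of each long instruction is marked (the start is implied by the previous stop), each kept operation carries its slot number, and an all-NOP instruction keeps a single NOP so that it is not lost. These encoding choices, and the names `compress` and `decompress`, are assumptions, not the scheme of any particular processor.

```python
NOP = "nop"


def compress(bundle):
    """Drop NOPs; keep (slot, op, stop) triples, with stop marking the last op."""
    ops = [(slot, op) for slot, op in enumerate(bundle) if op != NOP]
    if not ops:
        ops = [(0, NOP)]   # an all-NOP bundle still needs one marked entry
    last = len(ops) - 1
    return [(slot, op, i == last) for i, (slot, op) in enumerate(ops)]


def decompress(stream, width):
    """Rebuild full-width long instructions, re-inserting the NOPs."""
    bundles, current = [], [NOP] * width
    for slot, op, stop in stream:
        current[slot] = op
        if stop:
            bundles.append(current)
            current = [NOP] * width
    return bundles


b1 = ["add", NOP, NOP, "ld", NOP, NOP, NOP, NOP]
b2 = [NOP] * 8
stream = compress(b1) + compress(b2)
print(stream)
print(decompress(stream, 8) == [b1, b2])
```

The 16 original slots shrink to 3 stored entries, at the cost of a decompression step before the operations can be routed to their units.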

Page 39: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Instruction Decompression

• On instruction cache fill
  • The ICache has to hold uncompressed instructions, which limits cache size.

• On instruction fetch
  • Decompression is in the critical path of the fetch stage; one or more pipeline stages may have to be added just for decompression.

Page 40: Instruction Level Parallelism 2. Superscalar and VLIW processors.

VLIW Architectures: Some Real-World Examples

Page 41: Instruction Level Parallelism 2. Superscalar and VLIW processors.

TMS320C6X CPU

• 8 independent execution units
• Execution unit types:
  • L: Integer adder, Logical, Bit Counting, FP adder, FP conversion
  • S: Integer adder, Logical, Bit Manipulation, Shifting, Constant, Branch/Control, FP compare
  • D: Integer adder, Load-Store
  • M: Integer Multiplier, FP multiplier

• Split into two identical datapaths, each containing the same four units (L, S, D, M)

Page 42: Instruction Level Parallelism 2. Superscalar and VLIW processors.

TMS320C6X CPU (cont).

• Max clock speed of 200 MHz
• Each datapath has a 16 x 32-bit register file

[Figure: two datapaths, each with a 16 x 32 register file feeding its L, S, M and D units, connected by a global bus.]

Page 43: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Instruction Encoding

• The internal execution path is 256 bits wide.

• Each operation is 32 bits wide => 8 operations per clock.

• A fetch packet is a group of instructions fetched simultaneously. A fetch packet has 8 instructions.

• An execute packet is a group of instructions beginning execution in parallel. An execute packet has 8 instructions.

Page 44: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Instruction Encoding

• Instructions in the ICache have an associated P-bit (parallel bit).
• A fetch packet is expanded into 1 to 8 execute packets during the fetch stage, depending on the P-bits.

Page 45: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Fetch Packet to Execute Packet Expansion

Fetch packet (8 instructions):

A|B|C|D|E|F|G|H

P-bits (A-H executed serially):

0|0|0|0|0|0|0|0

Execute packets (64 instructions):

n|n|A|n|n|n|n|n
n|B|n|n|n|n|n|n
n|n|n|n|n|C|n|n
n|n|n|n|n|D|n|n
n|n|n|E|n|n|n|n
F|n|n|n|n|n|n|n
n|n|n|n|n|n|G|n
n|n|n|n|n|n|n|H

Page 46: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Fetch Packet to Execute Packet Expansion

Fetch packet:

A|B|C|D|E|F|G|H

P-bits:

1|1|0|1|0|0|1|0

Execution order: A||B||C, D||E, F, G||H

Execute packets:

n|B|A|n|n|C|n|n
n|n|n|E|n|D|n|n
F|n|n|n|n|n|n|n
n|n|n|n|n|n|G|H

A run of P-bits set to ‘1’ followed by a ‘0’ marks a group of instructions that execute in parallel. A P-bit of ‘0’ ends the group, so the instruction that follows a ‘0’ starts a new execute packet (sequential execution).

40 instructions
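The P-bit rule can be expressed as a short grouping function. This Python sketch only groups the instructions; it does not model routing each instruction to its functional-unit slot or padding with NOPs, and `split_execute_packets` is a hypothetical name.

```python
def split_execute_packets(instructions, p_bits):
    """Group a fetch packet into execute packets according to the P-bits."""
    packets, current = [], []
    for instr, p in zip(instructions, p_bits):
        current.append(instr)
        if p == 0:                 # P-bit 0 closes the current execute packet
            packets.append(current)
            current = []
    if current:                    # assumption: a packet never spans fetch packets
        packets.append(current)
    return packets


fetch_packet = list("ABCDEFGH")
parallel = split_execute_packets(fetch_packet, [1, 1, 0, 1, 0, 0, 1, 0])
serial = split_execute_packets(fetch_packet, [0] * 8)
print(parallel)
print(len(serial))
```

The first call reproduces the grouping A||B||C, D||E, F, G||H from the slide, and the all-zero P-bit string yields eight single-instruction execute packets.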

Page 47: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Philips TM 1000/Multimedia Processor

Page 48: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Philips Trimedia

• Five execution units => five operations issued per clock.
• 15 read and 5 write ports on the register file.

• 15 read ports are needed for 5 execution units because each operation requires two operands and a guard operand.

• The guard operand makes each operation conditional on the value of the LSB of the guard operand => predicated execution.

• 128 registers (r0 always reads 0, r1 always reads 1)
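Guarded (predicated) execution can be illustrated with a tiny register-file model in Python. The register names, the dictionary representation, and the `guarded_op` helper are hypothetical; the only behaviour taken from the slide is that the LSB of the guard operand decides whether the result is written.

```python
import operator


def guarded_op(regs, guard, dst, fn, src1, src2):
    """Write fn(src1, src2) to dst only if the LSB of the guard register is 1."""
    if regs[guard] & 1:
        regs[dst] = fn(regs[src1], regs[src2])
    return regs


regs = {"r2": 5, "r3": 7, "r4": 0, "r10": 1, "r11": 2}
guarded_op(regs, "r10", "r4", operator.add, "r2", "r3")   # guard LSB = 1: executes
after_add = regs["r4"]
guarded_op(regs, "r11", "r4", operator.sub, "r2", "r3")   # guard LSB = 0: squashed
after_sub = regs["r4"]
print(after_add, after_sub)
```

Because both operations occupy issue slots regardless of the guard, the compiler can schedule both sides of a short if/else in the same instruction without a branch.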

Page 49: Instruction Level Parallelism 2. Superscalar and VLIW processors.

Philips Trimedia Instructions

• Multiple operation sizes
  • 2 bits for a NOP; 26 bits, 34 bits, and 44 bits for other operations.