INSTRUCTION LEVEL PARALLELISM AND ITS EXPLOITATION (PART 3)
CPE731 - Dr. Iyad Jafar
Chapter 3, Appendix H
2017-03-26

Jul 17, 2020


dariahiddleston

OUTLINE

• Dynamic Scheduling, Multiple Issue and Speculation (3.8)
• Advanced Techniques for Instruction Delivery and Speculation (3.9)
• Multithreading (3.12)

DYNAMIC, MULTIPLE ISSUE AND SPECULATION

• Microarchitecture that is used in modern processors
• Issuing multiple instructions dynamically is complex due to dependencies!
• The key is assigning a reservation station and updating the pipeline control tables
• Approaches
  - Issue one instruction in each half of the cycle → suitable for two-issue
  - Build the logic necessary to handle two or more instructions at once
  - Hybrid!
• This "issue step" is one of the most fundamental bottlenecks
• Multiple completion/commit!

• We will consider a simple implementation
  - Issue rate of 2 instructions per cycle
• Extend Tomasulo to support a multiple-issue superscalar pipeline with integer, load/store and FP units that can each initiate an operation every cycle
• Instructions are issued in order!
• The pipeline issues any combination of two instructions each cycle using scheduling hardware
• Issue and completion logic is enhanced to allow multiple instructions to issue and be processed each cycle
• All datapaths are widened to allow multiple issue


• Example: Consider the execution of the following loop, which increments each element of an integer array, on a two-issue processor, once without speculation and once with speculation.

Loop: LD     R2, 0(R1)      ; R2 = current array element
      DADDIU R2, R2, #1     ; increment the element
      SD     R2, 0(R1)      ; store the result back
      DADDIU R1, R1, #8     ; advance the pointer to the next element
      BNE    R1, R3, Loop   ; repeat until R1 reaches R3

Assume there are separate integer functional units for effective address calculation, for ALU operations and for branch condition evaluation.

Create a table for the first two iterations of this loop for both processors. Assume two instructions of any type can commit per cycle.

• No Speculation (entries are cycle numbers; "-" means the stage is not used)

Iter.  Instruction         Issue  Execute  Memory  Write
1      LD R2, 0(R1)        1      2        3       4
1      DADDIU R2, R2, #1   1      5        -       6
1      SD R2, 0(R1)        2      3        7       -
1      DADDIU R1, R1, #8   2      3        -       4
1      BNE R1, R3, Loop    3      5        -       -
2      LD R2, 0(R1)        6      7        8       9
2      DADDIU R2, R2, #1   6      10       -       11
2      SD R2, 0(R1)        7      8        12      -
2      DADDIU R1, R1, #8   7      11       -       12
2      BNE R1, R3, Loop    8      13       -       -

• With Speculation (entries are cycle numbers; "-" means the stage is not used)

Iter.  Instruction         Issue  Execute  Memory  Write  Commit
1      LD R2, 0(R1)        1      2        3       4      5
1      DADDIU R2, R2, #1   1      5        -       6      7
1      SD R2, 0(R1)        2      3        -       -      7
1      DADDIU R1, R1, #8   2      3        -       4      8
1      BNE R1, R3, Loop    3      5        -       -      8
2      LD R2, 0(R1)        4      5        6       7      9
2      DADDIU R2, R2, #1   4      8        -       9      10
2      SD R2, 0(R1)        5      6        -       -      10
2      DADDIU R1, R1, #8   5      6        -       7      11
2      BNE R1, R3, Loop    6      8        -       -      11
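The difference between the two tables can be summarized by how many cycles separate the start of iteration 1 from the start of iteration 2. The following Python sketch only re-reads the issue cycles listed in the tables; the names (`ISSUE_NO_SPECULATION`, `iteration_gap`, `issue_ipc`) are illustrative, and extrapolating a sustained rate from two iterations is an assumption rather than something the slides state.

```python
# Issue cycle of each instruction, copied from the two tables.
# Order within an iteration: LD, DADDIU R2, SD, DADDIU R1, BNE.
ISSUE_NO_SPECULATION = {1: [1, 1, 2, 2, 3], 2: [6, 6, 7, 7, 8]}
ISSUE_WITH_SPECULATION = {1: [1, 1, 2, 2, 3], 2: [4, 4, 5, 5, 6]}


def iteration_gap(issue_cycles):
    """Cycles between the first issue of iteration 1 and of iteration 2."""
    return issue_cycles[2][0] - issue_cycles[1][0]


def issue_ipc(issue_cycles):
    """Instructions issued per cycle, ASSUMING every later iteration
    repeats the same gap (an extrapolation, not given in the tables)."""
    instructions_per_iteration = len(issue_cycles[1])
    return instructions_per_iteration / iteration_gap(issue_cycles)


for name, table in [("no speculation", ISSUE_NO_SPECULATION),
                    ("with speculation", ISSUE_WITH_SPECULATION)]:
    print(name, iteration_gap(table), round(issue_ipc(table), 2))
```

Under that assumption, the processor without speculation starts a new iteration every 5 cycles, while the speculative one starts a new iteration every 3 cycles, because it does not wait for the branch before issuing the next LD.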

ADVANCED TECHNIQUES FOR INSTRUCTION DELIVERY AND SPECULATION

• Multiple-issue processors require a high-bandwidth instruction stream
  - Widen paths to instruction cache!
  - Branches are difficult!
• Increasing Instruction Fetch Bandwidth
  - Branch-Target Buffer
  - Return Address Predictors
  - Integrated Instruction Fetch Units
• Speculation: Implementation Issues and Extensions
  - Register Renaming versus Reorder Buffers
  - How Much to Speculate?
  - Value Prediction!

INCREASING INSTRUCTION FETCH BANDWIDTH

• Branch-Target Buffer
  - The branch penalty is reduced if we know that the not-yet-decoded instruction is a branch and we also know its target address
  - Zero branch penalty
  - Branch-target buffer (BTB)!

[Figures: Branch-Target Buffer]

• Branch-Target Buffer
  - One possible variation: allow the BTB to store the target instruction(s) instead of, or in addition to, the predicted target address
  - We skip the IF of the next instruction!
  - CPI for branch (unconditional and sometimes conditional) is 0?
  - Branch folding!
  - Check example on p. 205
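The basic BTB idea (look up the fetch address before decode, and redirect fetch on a hit) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the table is an unbounded dictionary rather than a small tagged cache, only taken branches are kept, and the variation that stores target instructions is not modeled.

```python
class BranchTargetBuffer:
    """Toy branch-target buffer: maps the PC of a taken branch to its target.

    Assumptions (not from the slides): unbounded capacity, fixed-size
    instructions, and entries kept only for branches that were last taken.
    """

    def __init__(self, instruction_size=4):
        self.entries = {}  # branch PC -> predicted target PC
        self.instruction_size = instruction_size

    def predict_next_pc(self, pc):
        """Called during fetch, before the instruction at pc is decoded."""
        # Hit: treat the instruction as a taken branch and fetch its target.
        # Miss: fall through to the next sequential instruction.
        return self.entries.get(pc, pc + self.instruction_size)

    def update(self, pc, taken, target):
        """Called once the real branch outcome is known."""
        if taken:
            self.entries[pc] = target
        else:
            # Drop the entry so the branch is predicted not taken next time.
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
first_guess = btb.predict_next_pc(0x40)   # miss: sequential fetch
btb.update(0x40, taken=True, target=0x10)
second_guess = btb.predict_next_pc(0x40)  # hit: fetch from the stored target
```

When the prediction is correct, the next instruction is fetched from the right address with no bubble, which is where the zero branch penalty comes from.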

• Return Address Predictors
  - Indirect jumps
    * Switch/case statements, indirect procedure calls and procedure returns
    * Destination address varies at runtime
    * Hard to predict
  - For SPEC95, 15% of branches are procedure returns → focus on procedure returns
  - Use the BTB → low accuracy if the procedure is called from multiple sites
    * <60% accuracy in SPEC CPU95
  - Use a small buffer that stores return addresses as a stack! → RAS
    * A small buffer that caches the most recent return addresses!
    * A call pushes the return address onto the stack
    * A return pops the return address off the stack
    * LIFO!
  - Used in Intel Core processors and AMD Phenom processors
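The push-on-call, pop-on-return behavior of a return address stack can be sketched as follows. The class name, the default depth of 8 and the policy of discarding the oldest entry on overflow are assumptions made for illustration; the slides only say that the buffer is small and LIFO.

```python
from collections import deque


class ReturnAddressStack:
    """Toy return address stack (RAS) with a fixed number of entries."""

    def __init__(self, depth=8):
        # A deque with maxlen silently discards the oldest entry on overflow.
        self.stack = deque(maxlen=depth)

    def on_call(self, return_address):
        """A call pushes the address of the instruction after the call."""
        self.stack.append(return_address)

    def on_return(self):
        """A return pops the most recent address; None means no prediction."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for address in (0x100, 0x200, 0x300):  # three nested calls, only two slots
    ras.on_call(address)
predictions = [ras.on_return(), ras.on_return(), ras.on_return()]
```

With a depth of 2, the deepest two returns are predicted correctly and the outermost one is lost, which illustrates why accuracy depends on the buffer size relative to the call depth.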

[Figure: Return Address Predictors]

• Integrated Instruction Fetch Units
  - In multiple-issue, IF is not as simple as in a single pipeline
  - Implement the instruction fetch unit as a separate autonomous unit that feeds the instructions to the rest of the pipeline
  - The unit includes
    * Integrated branch prediction
    * Instruction prefetch
    * Instruction memory access and buffering

SPECULATION: IMPLEMENTATION ISSUES AND EXTENSIONS

• Explicit Register Renaming vs. Reorder Buffer
  - The values of architecturally visible registers are distributed between actual registers, reservation stations and the ROB → complicates scheduling!
  - Register renaming
    * Decouple renaming from scheduling!
    * A single, large set of physical registers holds both architectural registers and temporary values
    * With the aid of a HW renaming map, a physical register is allocated for every instruction that writes a register
    * This allows data to be fetched from a single register file
    * No need to bypass values from the reorder buffer
    * Balancing pipeline
    * Still need ROB to commit in-order!
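A rename map can be sketched as a table from architectural register names to physical register numbers plus a free list. The sketch below is illustrative only: the class name, the register counts and the first-in-first-out free list are assumptions, and it deliberately omits returning physical registers to the free list at commit, which a real design must do.

```python
class RenameMap:
    """Toy register renaming: architectural names -> physical register numbers."""

    def __init__(self, num_arch_regs=4, num_phys_regs=8):
        # Initially architectural register Ri lives in physical register i.
        self.map = {f"R{i}": i for i in range(num_arch_regs)}
        self.free = list(range(num_arch_regs, num_phys_regs))

    def rename(self, dest, sources):
        """Rename one instruction; returns (physical dest, physical sources)."""
        # Read the sources BEFORE updating the map, so an instruction such
        # as DADDIU R2, R2, #1 reads the old R2 and writes a new one.
        phys_sources = [self.map[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register: issue must stall")
        phys_dest = self.free.pop(0)
        self.map[dest] = phys_dest
        return phys_dest, phys_sources


rmap = RenameMap()
load = rmap.rename("R2", ["R1"])  # LD     R2, 0(R1)
add = rmap.rename("R2", ["R2"])   # DADDIU R2, R2, #1
```

Because each write to R2 gets a fresh physical register, the second instruction's source points at the first instruction's destination, and later writers of R2 no longer conflict with earlier readers.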

• How much to speculate?
  - Speculation helps reduce stalls!
  - Cost? Time, area, energy and recovery from incorrect speculation!
  - Performance??
  - What if a speculative instruction results in an expensive exception (TLB or cache miss)?
    * Most speculative processors allow only low-cost exceptions to be handled in speculative mode!
    * Otherwise, wait until the instruction is no longer speculative before serving the exception
  - This policy is efficient for programs with high exception frequencies coupled with inefficient branch prediction
  - It may degrade the performance of other programs

• Speculating through Multiple Branches
  - So far, we have considered the case in which a speculated branch is resolved before we need to speculate on another one!
  - We may need to speculate on multiple branches
    * High branch frequency
    * Significant clustering of branches
    * Slow functional units
  - However, speculation through multiple branches complicates speculation recovery
    * As of 2011, no processor had implemented speculation through multiple branches per cycle

• Speculating and Energy Efficiency
  - It might be argued that speculation decreases power efficiency
    * Speculated instructions consume energy
    * Undoing incorrect speculation consumes additional energy
  - However, if speculation lowers execution time by more than it increases average power, then the total energy could be less
    * i.e. speculation must improve performance by enough to offset the extra power
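The energy argument follows from energy = average power × execution time. The short Python sketch below makes the trade-off concrete; the function name and the example ratios (20% more power, 25% less time) are made-up illustrative values, not measurements from the slides or the textbook.

```python
def relative_energy(power_ratio, time_ratio):
    """Energy with speculation divided by energy without speculation.

    Uses E = P_avg * T, so the ratio of energies is the product of the
    power ratio and the execution-time ratio.
    """
    return power_ratio * time_ratio


# Hypothetical case: average power rises by 20%, execution time drops by 25%.
helpful = relative_energy(power_ratio=1.20, time_ratio=0.75)

# Hypothetical case: same power increase, but execution time drops by only 10%.
harmful = relative_energy(power_ratio=1.20, time_ratio=0.90)

print("speculation saves energy:", helpful < 1.0)
print("speculation wastes energy:", harmful > 1.0)
```

A ratio below 1 means speculation reduces total energy; a ratio above 1 means the extra power outweighs the time saved.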

• Value Prediction
  - Attempt to predict the value that will be produced by an instruction
  - Limited success in general!
  - How about
    * A load that loads from a constant pool?
    * A load that loads a value that changes infrequently?
    * An instruction that produces a value chosen from a set of potential values?
  - Results so far are not sufficient to encourage actual incorporation in processors
• Address Alias Prediction
  - Predicts whether two stores, or a load and a store, refer to the same address, i.e. we don't predict the address!
  - If such references don't refer to the same address, they can be interchanged. Otherwise, wait!
  - Simple and stable, and used in several processors
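To see why a load whose value changes infrequently is a plausible target for value prediction, consider a last-value predictor, which simply guesses that an instruction will produce the same value it produced the previous time. This scheme is chosen here only as an illustration; the slides do not name a specific predictor, and the class and field names are hypothetical.

```python
class LastValuePredictor:
    """Toy value predictor: predict the value an instruction produced last time."""

    def __init__(self):
        self.last = {}    # instruction PC -> last value produced
        self.correct = 0  # predictions that matched the actual value
        self.total = 0    # predictions made

    def predict(self, pc):
        """Return the predicted value, or None if this PC has no history."""
        return self.last.get(pc)

    def update(self, pc, actual):
        """Record the real result and score the prediction, if one was made."""
        prediction = self.predict(pc)
        if prediction is not None:
            self.total += 1
            if prediction == actual:
                self.correct += 1
        self.last[pc] = actual


predictor = LastValuePredictor()
for value in (7, 7, 7, 8):  # a load at PC 0x20 whose value rarely changes
    predictor.update(0x20, value)
```

A value that is stable across executions is predicted correctly most of the time, while one that changes on every execution would never be predicted correctly by this scheme.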

MULTITHREADING

• ILP is transparent and efficient; however,
  - It can be quite difficult to exploit for some applications
  - Off-chip cache misses are less likely to be hidden
• Can we use other levels of parallelism?
  - Online transaction systems have multiple concurrent queries
  - Scientific applications have natural parallelism
  - OS runs multiple active applications
• Thread-level parallelism
  - Allows multiple threads to share the functional units of a single processor in an overlapping fashion
  - Most of the processor core is shared (Cache, TLB …)
  - Requires duplicating state elements (separate registers, page tables and PC for each thread)
  - HW should switch between threads quickly
  - OS should be optimized and aware!

• Fine-grained Multithreading
  - Switch on every cycle in an interleaved round-robin fashion
  - Hides both short and long stalls
  - Improves throughput, but slows down the execution of a single thread (latency)
  - Sun Niagara processor and GPUs
• Coarse-grained Multithreading
  - Switch on costly stalls only
  - Less likely to slow down the execution of any thread
  - Limited ability to overcome throughput losses!
  - Research community only!
• Simultaneous Multithreading (SMT)
  - A variation of fine-grained multithreading implemented on a multiple-issue, dynamically scheduled processor
  - Issue multiple instructions from multiple threads every CPU cycle
  - Intel Hyper-Threading (HT) Technology
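The difference between the fine-grained and coarse-grained switching policies can be sketched with a single-issue toy model. The function names and the `ready` table are hypothetical, every stall is treated as a costly one, the thread-switch penalty of coarse-grained multithreading is ignored, and SMT (which needs several issue slots per cycle) is not modeled.

```python
def fine_grained(ready, num_cycles):
    """Pick a thread every cycle, round-robin, skipping stalled threads.

    ready[t][c] is True if thread t can issue in cycle c.
    Returns the thread chosen in each cycle, or None for an idle cycle.
    """
    schedule, next_thread, n = [], 0, len(ready)
    for cycle in range(num_cycles):
        chosen = None
        for offset in range(n):
            t = (next_thread + offset) % n
            if ready[t][cycle]:
                chosen = t
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % n  # move on even if no stall occurred
    return schedule


def coarse_grained(ready, num_cycles):
    """Stay on the current thread and switch only when it stalls."""
    schedule, current, n = [], 0, len(ready)
    for cycle in range(num_cycles):
        if not ready[current][cycle]:
            for offset in range(1, n):
                t = (current + offset) % n
                if ready[t][cycle]:
                    current = t
                    break
        schedule.append(current if ready[current][cycle] else None)
    return schedule


# Thread 0 stalls in cycle 2; thread 1 is always ready.
ready = [[True, True, False, True],
         [True, True, True, True]]
fine = fine_grained(ready, 4)
coarse = coarse_grained(ready, 4)
```

In this toy run the fine-grained policy alternates between threads from the first cycle, whereas the coarse-grained policy keeps running thread 0 until its stall and only then hands the pipeline to thread 1.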

[Figure: use of issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading. Legend: Thread 1 to Thread 5, idle slot]

• Further investigation
  - Security!
  - Power!
  - Thread Scheduler!
  - Super threading!
• Read pages 226-232
  - Effectiveness of Fine-Grained Multithreading on the Sun T1
  - Effectiveness of Simultaneous Multithreading on Superscalar Processors
• Read Section 3.13 "Putting it All Together"