Exploring Wakeup-Free Instruction Scheduling Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin Microsystems Design Lab The Pennsylvania State University
Dec 22, 2015

Exploring Wakeup-Free Instruction Scheduling

Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin
Microsystems Design Lab
The Pennsylvania State University

Outline

Motivation
Case study: Cyclone
Towards a high-performance wakeup-free scheduler
    A general model
    Employing a pre-check scheme
    A segmented issue queue
Conclusions and future work

Superscalar Issue Queue

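[Figure: wakeup logic of a superscalar issue queue. Each entry (inst0 ... instN-1) holds left and right operand tags (opd tagL, opd tagR) with ready flags (rdyL, rdyR). Broadcast result tags (tag1 ... tagIW) are compared (==) against each operand tag, and the match signals are ORed to set the ready flag.]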

Wakeup logic: $\text{Delay} = T_{tagdrive} + T_{tagmatch} + T_{matchOR}$

$T_{tagdrive} = c_0 + (c_1 + c_2 \cdot IW) \cdot N + (c_3 + c_4 \cdot IW + c_5 \cdot IW^2) \cdot N^2$

$T_{tagmatch},\ T_{matchOR} = c_0 + c_1 \cdot IW + c_2 \cdot IW^2$

S. Palacharla et al., ISCA24
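
The delay model on this slide can be evaluated numerically to see how the wakeup delay scales with queue size N and issue width IW. The following is a minimal sketch only: the function names are hypothetical, and the coefficient values are placeholders chosen for illustration, not values from the slides or from Palacharla et al.

```python
def tag_drive_delay(n, iw, c):
    """T_tagdrive = c0 + (c1 + c2*IW)*N + (c3 + c4*IW + c5*IW^2)*N^2."""
    c0, c1, c2, c3, c4, c5 = c
    return c0 + (c1 + c2 * iw) * n + (c3 + c4 * iw + c5 * iw**2) * n**2


def iw_quadratic_delay(iw, c):
    """Shared form of T_tagmatch and T_matchOR: c0 + c1*IW + c2*IW^2."""
    c0, c1, c2 = c
    return c0 + c1 * iw + c2 * iw**2


def wakeup_delay(n, iw, drive_c, match_c, or_c):
    """Total wakeup delay = T_tagdrive + T_tagmatch + T_matchOR."""
    return (tag_drive_delay(n, iw, drive_c)
            + iw_quadratic_delay(iw, match_c)
            + iw_quadratic_delay(iw, or_c))


# Placeholder coefficients (assumed, technology dependent in practice).
DRIVE_C = (1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001)
MATCH_C = (1.0, 0.1, 0.01)
OR_C = (1.0, 0.1, 0.01)

for n, iw in [(32, 4), (64, 4), (64, 8)]:
    print(n, iw, wakeup_delay(n, iw, DRIVE_C, MATCH_C, OR_C))
```

With any positive coefficients, the N-squared term in the tag drive delay dominates as the queue grows, which is the scaling concern the slide points at.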

Superscalar Issue Queue

Selection logic: $T_{selection} = c_0 + c_1 \cdot \log_4 N$

S. Palacharla et al., ISCA24

[Figure: selection logic drawn as a tree of 4-input arbiter cells over the issue queue. Each cell has inputs req0-req3 and outputs grant0-grant3, plus a combined req output and an enb input; cells connect from/to other subtrees and end at a root cell.]
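
The log4(N) term in the selection delay follows from the tree of 4-input arbiter cells: requests propagate up to the root and a grant propagates back down. A minimal sketch of that structure is given below. The names `tree_levels` and `select_one` are hypothetical, and the leftmost-request priority inside each cell is an assumption made for illustration; the slides do not state the priority policy.

```python
def tree_levels(n):
    """Number of levels of 4-input cells needed to cover n requesters."""
    levels, span = 0, 1
    while span < n:
        span *= 4
        levels += 1
    return levels


def select_one(requests):
    """Return the index of the granted requester, or None if nobody requests.

    requests: list of booleans, one per issue queue entry.
    """
    if not any(requests):
        return None
    # Upward pass: each cell ORs the requests of its (up to) four children.
    levels = [list(requests)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([any(cur[i:i + 4]) for i in range(0, len(cur), 4)])
    # Downward pass: the enabled cell grants its first requesting child
    # (leftmost priority is an assumption of this sketch).
    idx = 0
    for cur in reversed(levels[:-1]):
        base = idx * 4
        children = cur[base:base + 4]
        idx = base + children.index(True)
    return idx


reqs = [False] * 16
reqs[5] = reqs[9] = True
print(tree_levels(len(reqs)), select_one(reqs))
```

The depth returned by `tree_levels` grows by one each time the queue size grows fourfold, matching the logarithmic form of the delay formula.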

Challenges in Dynamic Instruction Scheduling

Broadcast-based dynamic scheduler
    Higher complexity
    Power hungry
    A major limiter to clock frequency: increasing issue queue size, issue width, and wire delay, and fewer logic levels per pipeline stage
Complexity-effective issue
    Speculative wakeup [Stark et al.]
    Dependency-chain-based ordering [Canal/Gonzalez, ICS 00/01; Michaud/Seznec, HPCA 01]
    Segmented issue queue [Raasch et al., ISCA 2002]
    Wakeup-free dynamic scheduler [Ernst et al., ISCA 2003]
        Lower complexity
        Lower power consumption
        Better scalability
        Comes at the cost of some performance loss

Our Goals

Explore the predictability of instruction issue latency

Identify the performance impediments in wakeup-free architectures

Design high-performance wakeup-free schedulers

Cyclone: Conflict in the Main Queue

[Figure: conflicts in the Cyclone main queue; panels: FP benchmarks, Int benchmarks; label: Order Enforced]

Enforce ordered placement to avoid conflict between instructions with different latencies

Possible Structural Problems

Instruction promotion/forwarding incurs conflicts along the path
Very limited instruction pool for selection
    Only entries in column 0 of the main queue can be issued
    Ready instructions (not in column 0) are delayed due to conflicts
A limited number of issue ports is less tolerant of instructions mispredicted as ready
    They waste issue ports
    They prevent ready instructions from issuing
    They compete with newly decoded instructions due to replay

A General Model: WF-Replay

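[Figure: WF-Replay block diagram. Instructions from the decoder pass through Rename and Pre-schedule (with a Timing Table) into the Wakeup-Free Issue Queue, where every entry carries a predicted latency (lat). Selection Logic issues instructions to the FUs. Register file ready bits, updated from the FUs, decide whether an issued instruction must replay.]

Collapsing issue queue without promotion
Conventional random selection logic
Given much wider issue width
How to relax the structural constraints?
An instruction is removed if no replay is needed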
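
The WF-Replay behaviour described on this slide can be sketched as a one-cycle step function. This is an illustrative model under stated assumptions, not the authors' simulator: the `Entry` fields, the `step` function, and the oldest-first choice among requesters are all hypothetical (the slide itself mentions conventional random selection).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    srcs: tuple   # source register names
    lat: int      # predicted cycles until the operands are ready


def step(queue, ready_regs, issue_width):
    """Advance one cycle. Returns (issued, replayed) lists of entry names."""
    # An entry requests selection once its predicted latency reaches 0.
    requesters = [e for e in queue if e.lat == 0]
    selected = requesters[:issue_width]   # oldest-first is an assumption
    issued, replayed = [], []
    for e in selected:
        if all(r in ready_regs for r in e.srcs):
            issued.append(e.name)
            queue.remove(e)               # removed only if no replay is needed
        else:
            replayed.append(e.name)       # misprediction: stays and retries
    for e in queue:
        if e.lat > 0:
            e.lat -= 1                    # count down the remaining predictions
    return issued, replayed


queue = [Entry("a", ("r1",), 0), Entry("b", ("r2",), 0), Entry("c", ("r1",), 1)]
print(step(queue, {"r1"}, 2))
print(step(queue, {"r1", "r2"}, 2))
```

In the first cycle of the example, entry b is selected although r2 is not ready, so it consumes an issue port and replays. That is the kind of wasted slot the later slides on issue port competition address.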

Instruction Pre-scheduling

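[Figure: instruction pre-scheduling over two stages, Rename/PSCHED0 and PSCHED1, for instructions I0-I3. The Timing Table and Register Mapping Table are read, max and + units combine operand ready times, a dependence check (depcheck) drives the MUX control, and the outputs are the predicted latencies lat0-lat3. A "reschedule?" signal is also shown.]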

Adapted from Cyclone, D. Ernst et al., ISCA'03
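
The max and + units in the pre-scheduling figure suggest the following computation: an instruction's predicted latency is the largest expected ready time among its source registers, and its destination becomes ready one execution latency later. The sketch below makes that assumption explicit. The function name, the tuple layout, and the latency values are hypothetical, and the per-cycle countdown of the timing table is not modelled.

```python
def preschedule(group, timing_table, exec_latency):
    """Predict issue latencies for a group of renamed instructions.

    group:        list of (dest, srcs, op) tuples in program order
    timing_table: dict mapping register -> cycles until its value is expected
                  (updated in place); missing registers count as ready (0)
    exec_latency: dict mapping op -> execution latency in cycles (assumed)
    """
    predicted = []
    for dest, srcs, op in group:
        # "max" unit: wait for the latest source operand.
        lat = max((timing_table.get(s, 0) for s in srcs), default=0)
        predicted.append(lat)
        # "+" unit: the destination is ready after the instruction executes.
        if dest is not None:
            timing_table[dest] = lat + exec_latency[op]
    return predicted


table = {"r1": 2}
group = [("r3", ("r1", "r2"), "add"), ("r4", ("r3",), "mul")]
print(preschedule(group, table, {"add": 1, "mul": 3}))
```

Processing the group in program order lets the second instruction see the first one's updated entry. In hardware that intra-group dependence is presumably what the depcheck and MUX control in the figure resolve in parallel.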

Latency Triggered Selection

[Figure: latency-triggered selection. The predicted latency (lat) field of each Wakeup-Free Issue Queue entry drives that entry's request line into a tree of 4-input arbiter cells (req0-req3, grant0-grant3, req, enb) ending at a root cell.]

WF-Replay IPC (F4-I8 vs F4-I4)

[Figure: WF-Replay IPC; panels: Issue Width: 8, Issue Width: 4]

WF-Replay loses 9.7% performance (IPC) relative to Base when the issue width is reduced to 4 instructions per cycle

Competition at Issue Ports?

[Figure: competition at issue ports; panels: Issue Width: 8, Issue Width: 4]

Precheck to Avoid Competition

Competition at the issue ports may delay instructions that are predicted ready

Delayed instructions may in turn compete with the instructions that depend on them

This causes more instructions to be falsely ready or delayed

Wider issue ports can avoid unnecessary competition, at the cost of higher complexity

Solution: prevent falsely ready instructions from being selected by pre-checking the register ready bits

WF-Precheck Scheduler

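[Figure: WF-Precheck block diagram. Instructions from the decoder pass through Rename and Pre-schedule (with a Timing Table) into the Wakeup-Free Issue Queue, where every entry carries a ready bit (ry) and a predicted latency (lat). The Register Ready Bit Register, updated from the issuing stage and from memory, sets the ry bits. Selection Logic issues instructions to the FUs.]

Pre-check the register ready bits when the predicted latency reaches 0
Selection requests are filtered by the 'ry' bit
Trade replay for pre-check
Only issue truly ready instructions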
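
The pre-check filter can be sketched as a small change to a latency-triggered issue step: an entry raises a selection request only if its predicted latency is 0 and its 'ry' bit is set. As before, this is a hedged illustration. `Entry`, `step_precheck`, and the oldest-first choice among requesters are hypothetical names and assumptions, not details from the slides.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    srcs: tuple        # source register names
    lat: int           # predicted cycles until the operands are ready
    ry: bool = False   # set by the pre-check of the register ready bits


def step_precheck(queue, ready_regs, issue_width):
    """Advance one cycle. Returns the names of the issued entries."""
    # Pre-check: once the predicted latency is 0, sample the ready bits.
    for e in queue:
        if e.lat == 0:
            e.ry = all(r in ready_regs for r in e.srcs)
    # Only entries with ry set may request selection.
    requesters = [e for e in queue if e.lat == 0 and e.ry]
    issued = requesters[:issue_width]    # oldest-first is an assumption
    for e in issued:
        queue.remove(e)
    for e in queue:
        if e.lat > 0:
            e.lat -= 1
    return [e.name for e in issued]


queue = [Entry("b", ("r2",), 0), Entry("a", ("r1",), 0)]
print(step_precheck(queue, {"r1"}, 1))
print(step_precheck(queue, {"r1", "r2"}, 1))
```

With a single issue port, the older entry b is mispredicted as ready but never requests selection, so the truly ready entry a issues in the first cycle instead of losing the port to a replay.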

Complexity of Pre-checking

On average, 40.2% of instructions have both source operands ready and 45.4% have one source operand ready at the pre-schedule stage.

Fewer than 2 pre-check requests are made per cycle.

Issue Port Competition (F4-I4)

WF-Precheck IPC (F4-I4)

Impact of Load Related Predictions

How about Selection Logic?

Selection logic: $T_{selection} = c_0 + c_1 \cdot \log_4 N$

S. Palacharla et al., ISCA24

[Figure: selection logic drawn as a tree of 4-input arbiter cells over the issue queue. Each cell has inputs req0-req3 and outputs grant0-grant3, plus a combined req output and an enb input; cells connect from/to other subtrees and end at a root cell.]

WF-Segment Issue Queue

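[Figure: WF-Segment issue queue. Instructions from the decoder pass through Rename / Pre-scheduling (with a Time Table) and Dispatch Routing into queue segments labelled 0, 1-2, 3-4, and >4. Entries carry 'ry' bits set from the Register Ready Bits, which are updated from the FUs and memory. Selection Logic drives 4 issue ports to the FUs. A switchback path is also shown.]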
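
If the segment labels 0, 1-2, 3-4, and >4 in the figure are read as ranges of predicted latency, which is an assumption since the slide does not say so explicitly, the dispatch routing reduces to a small lookup. The names below are hypothetical.

```python
# Segment labels as they appear in the figure.
SEGMENTS = ("0", "1-2", "3-4", ">4")


def segment_for(lat):
    """Map a predicted latency (in cycles) to a segment index.

    Assumes the figure's labels are predicted-latency ranges.
    """
    if lat <= 0:
        return 0
    if lat <= 2:
        return 1
    if lat <= 4:
        return 2
    return 3


for lat in (0, 1, 3, 7):
    print(lat, SEGMENTS[segment_for(lat)])
```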

WF-Segment Issue Queue

On average, WF-Segment gives up 3% IPC relative to WF-Precheck and 5% relative to Base in exchange for optimized selection logic.

Conclusions

Explore and identify the performance impediments in wakeup-free scheduling

High-performance wakeup-free dynamic schedulers
    WF-Replay: eliminates structural constraints
    WF-Precheck: avoids unnecessary competition at issue ports
    WF-Segment: optimizes selection logic for high clock speed

Future Work

Routing complexity analysis in WF-Segment scheduler

Power analysis for wakeup-free schedulers

Sophisticated pre-scheduler

Wire Delay Challenges

Increasing pipeline depth for high performance

Clock period (FO4) decreases dramatically

Cross-chip wire delay will be up to 10 cycles as technology shrinks

M. S. Hrishikesh et al., ISCA29

Stephen W. Keckler et al., ISSCC'03

Precheck as A Single Stage

Load/Store Dependence Predictor