Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

David Tarjan*, Michael Boyer, and Kevin Skadron*

University of Virginia, Department of Computer Science

* Currently on internship/sabbatical at NVIDIA Research

Feb 01, 2016

Transcript
Page 1: Federation:  Repurposing Scalar Cores for Out-of-Order Instruction Issue

David Tarjan*, Michael Boyer, and Kevin Skadron*

University of Virginia

Department of Computer Science

* Currently on internship/sabbatical at NVIDIA Research

Page 2: Motivation

[Figure: three chip organizations labeled Homogeneous, Heterogeneous, and Adaptive (Federation). Legend: multithreaded scalar IO core; 2-way OO core; L2 cache blocks.]

Page 3: Basic Insights

A multithreaded in-order core has many registers which can be reused for a reorder buffer or active list

If cores are small, single-cycle communication between neighbors is feasible

Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible

Page 4: In-order & Out-of-order Pipelines

[Figure: two pipeline diagrams. In-order: Fetch, Decode, Execute, Mem, Writeback. Out-of-order: the same five stages plus Bpred, Allocate, Rename, Issue, and Commit.]

Page 5: Issue Queue Example

[Figure: issue queue table with rows 1 to 5 and columns Ready Bits, Subscriber Slot 1, and Subscriber Slot 2; subscriber entries name consumer slots such as IQ2 and IQ3.]

Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002

Sassone et al., Matrix Scheduler Reloaded, ISCA 2007

Page 6: Simplified Load-Store Queue

Memory Alias Table (MAT)

No store forwarding

No conservative waiting on stores

Only detect memory order violations after they have occurred and flush the pipeline when the offending instruction commits

Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005

Page 7: MAT Example

[Figure: instructions st 0x13, r5 and ld r1, 0x13; MAT with entries 0 to 7, all counters at 0.]

Page 8: MAT Example

[Figure: instructions st 0x13, r5 and ld r1, 0x13, with stage label EXE; MAT entry 3 now holds 1, all other entries hold 0.]

ld executes and increments counter

Page 9: MAT Example

[Figure: st 0x13, r5 with stage label COM; MAT entry 3 holds 1 and is flagged (!); ld r1, 0x13 is still in flight.]

st commits and sets flag

Page 10: MAT Example

[Figure: ld r1, 0x13 with stage label COM; MAT entry 3 holds 1 and is flagged (!); label: Flush.]

ld commits, sees flag, and flushes pipeline

Page 11: MAT Example

[Figure: ld r1, 0x13; all MAT entries 0 to 7 are back at 0.]

MAT is reset and execution resumes
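The MAT example on the preceding slides can be sketched in Python. This is a minimal model under stated assumptions, not the authors' hardware design: the index function (address modulo the table size), the counter decrement when a load commits without a flag, and all class and method names are hypothetical. Only the three events shown on the slides (load executes, store commits, load commits) are modeled.

```python
class MemoryAliasTable:
    """Toy model of the MAT walk-through (names and hash are assumptions)."""

    def __init__(self, entries=8):
        self.counters = [0] * entries    # loads executed but not yet committed
        self.flags = [False] * entries   # set when a store commits under such a load

    def _index(self, addr):
        # Assumption: a simple modulo hash; 0x13 % 8 == 3.
        return addr % len(self.counters)

    def load_execute(self, addr):
        # "ld executes and increments counter"
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr):
        # "st commits and sets flag": a non-zero counter means a load to the
        # same entry already executed, possibly with stale data.
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        """Return True if the pipeline must be flushed."""
        i = self._index(addr)
        if self.flags[i]:
            # "ld commits, sees flag, and flushes pipeline"; the MAT is reset.
            self.reset()
            return True
        # Assumption: a load that commits cleanly releases its counter.
        self.counters[i] -= 1
        return False

    def reset(self):
        self.counters = [0] * len(self.counters)
        self.flags = [False] * len(self.flags)


mat = MemoryAliasTable()
mat.load_execute(0x13)           # load runs ahead of the older store
mat.store_commit(0x13)           # store commits and finds a non-zero counter
flushed = mat.load_commit(0x13)  # load commits, sees the flag, flushes
```

Because unrelated addresses can share an entry, a model like this can also flush when no real violation occurred, which is the false-positive case the Commit slide refers to.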

Page 12: Performance Impact

[Figure: bar chart of Average IPC Loss (axis 0% to 6%) for four design choices: consumer-based issue queue, pseudo-random scheduling, MAT, and commit-time branch recovery. Data labels as extracted: 0.00%, 2.67%, 1.71%, 5.46%.]

Page 13: Performance

[Figure: bar chart of Average IPC (axis 0 to 1.4) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO; series: spec, specint, specfp.]

Page 14: Energy Efficiency

[Figure: bar chart of Normalized BIPS^3/Watt (axis 0 to 2.5) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO; series: spec, specint, specfp.]

Page 15: Area Efficiency

[Figure: bar chart of Normalized BIPS^3/(Watt*mm^2) (axis 0 to 1.2) for Scalar IO, 2-way IO, Federated OO, 2-way OO, and 4-way OO; series: spec, specint, specfp.]
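For readers unfamiliar with the two efficiency metrics, the sketch below shows how BIPS^3/Watt and BIPS^3/(Watt*mm^2) could be computed from IPC, clock frequency, power, and area. The function names are hypothetical, the slides do not say which design serves as the normalization baseline, and the numbers in the usage lines are placeholders chosen for easy arithmetic, not results from the talk.

```python
def bips(ipc, freq_hz):
    """Billions of instructions per second."""
    return ipc * freq_hz / 1e9


def bips3_per_watt(ipc, freq_hz, watts):
    """Energy-efficiency metric: cubing BIPS weights performance heavily."""
    return bips(ipc, freq_hz) ** 3 / watts


def bips3_per_watt_mm2(ipc, freq_hz, watts, area_mm2):
    """Area-efficiency metric: the same quantity divided by die area."""
    return bips3_per_watt(ipc, freq_hz, watts) / area_mm2


def normalize(values, baseline):
    """Scale every entry so that the chosen baseline design equals 1.0."""
    return {name: v / values[baseline] for name, v in values.items()}


# Placeholder inputs only (not measurements from the presentation).
example = {
    "design_a": bips3_per_watt(1.0, 2e9, 4.0),
    "design_b": bips3_per_watt(1.0, 2e9, 2.0),
}
relative = normalize(example, "design_a")
```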

Page 16: Conclusions

Two in-order cores can be federated at run-time to form a 2-way OO core

Almost doubling IPC of throughput core is possible with very little extra hardware

Don’t want traditional OO structures because their performance comes at too high a price

Best combined area- and energy-efficiency

Page 17: Q & A

Page 18: Backup

Page 19: Core Fusion Data

Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007

Page 20: Overall Results

Scalar in-order core is 8KB I/D, 256KB L2

Base 2-way core has 16KB I and D-Caches, 256KB L2, 32 entry ROB, 16 entry issue queue, 16 entry LSQ, bimodal bpred

4-way core is 32KB I/D, 2MB L2, 128 entry ROB, 32 IQ and LSQ, tournament bpred

Page 21: Branch Prediction

Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS)

NLS is OK if the instruction working set is no larger than the I$ size

A small bimodal predictor is OK for a small-window processor
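As a reminder of what a bimodal predictor does, here is a minimal sketch using the textbook scheme of 2-bit saturating counters indexed by the branch PC. The table size, the index function (which assumes 4-byte instructions), and the initial counter state are assumptions made for illustration; the slides give none of these parameters.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch PC."""

    def __init__(self, entries=256):
        self.entries = entries
        # Counter values: 0,1 predict not-taken; 2,3 predict taken.
        # Assumption: start every counter at 1 (weakly not-taken).
        self.counters = [1] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the two low PC bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = BimodalPredictor()
first_guess = predictor.predict(0x40)   # weakly not-taken initially
predictor.update(0x40, taken=True)
second_guess = predictor.predict(0x40)  # one taken outcome flips the guess
```

Each branch needs only one counter and no history register, which is part of why such a predictor is cheap to add to a small core.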

Page 22: Fetch

Two I$’s act as an I$ of twice the size and associativity (and random replacement)

More logic and buffers to capture two instructions

Extra cycle to route instructions from two I$’s to two decoders

Page 23: Decode

Cancel second instruction if first turns out to be a branch

Extra cycle to route decoded instructions to new allocate stage

Page 24: Allocate

New logic and free lists to allocate ROB, IQ entries

Page 25: Rename

New table since it has too many ports

One centralized rename table, not distributed

Has a separate table (or a field in each RAT entry) for each register’s producer instruction’s IQ-slot number (see our new issue queue)

Page 26: Issue

Uses a simple lookup table as wakeup structure, where instructions subscribe to their input instructions (explained in detail later)

Centralized, one IQ for the two cores
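The subscription idea can be sketched as a small table model: each issue-queue entry holds ready bits for its operands and a fixed number of subscriber slots naming the consumers to wake. This is an illustrative sketch, not the authors' design. The two-slot limit follows the Issue Queue Example slide, but the overflow policy (refusing the insert), waking consumers at issue time rather than at completion, and all names are assumptions. The producer slot numbers passed to `insert` stand in for the IQ-slot numbers that the rename table is said to record.

```python
class SubscriptionIssueQueue:
    """Toy wakeup table: consumers subscribe to their producers' IQ slots."""

    SUBSCRIBER_SLOTS = 2  # per the example slide: Subscriber Slot 1 and 2

    def __init__(self, size=16):
        self.valid = [False] * size
        self.ready = [[] for _ in range(size)]        # one bit per operand
        self.subscribers = [[] for _ in range(size)]  # (consumer, operand)

    def insert(self, slot, producer_slots):
        """producer_slots[i] is the IQ slot producing operand i, or None if
        the operand is already available. Returns False when a producer has
        no free subscriber slot (assumed policy: the caller must stall)."""
        demand = {}
        for p in producer_slots:
            if p is not None:
                demand[p] = demand.get(p, 0) + 1
        for p, n in demand.items():
            if len(self.subscribers[p]) + n > self.SUBSCRIBER_SLOTS:
                return False
        self.valid[slot] = True
        self.subscribers[slot] = []
        self.ready[slot] = [p is None for p in producer_slots]
        for operand, p in enumerate(producer_slots):
            if p is not None:
                self.subscribers[p].append((slot, operand))
        return True

    def is_ready(self, slot):
        return self.valid[slot] and all(self.ready[slot])

    def issue(self, slot):
        """Issue a ready entry and wake its subscribers by direct lookup,
        without broadcasting a tag to every entry."""
        if not self.is_ready(slot):
            raise ValueError("entry is not ready to issue")
        for consumer, operand in self.subscribers[slot]:
            self.ready[consumer][operand] = True
        self.subscribers[slot] = []
        self.valid[slot] = False


iq = SubscriptionIssueQueue()
iq.insert(1, [None, None])  # no pending inputs
iq.insert(2, [1, None])     # subscribes to slot 1
iq.insert(3, [1, 2])        # subscribes to slots 1 and 2
iq.issue(1)                 # wakes slot 2 fully and slot 3 partially
```

The wakeup cost in this model scales with the number of subscriber slots per entry rather than with the size of the queue.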

Page 27: Register File

Register file is mirrored in the two cores

No extra copy instructions or load-balancing questions

Page 28: Execute

Add extra cycle for copying result to other core’s register file (like EV6)

Page 29: Memory Access

The two D$s are checked in parallel, each responsible for half of the merged D$’s ways

No standard LSQ, only a Memory Alias Table (details later)

Only detects ordering violations and sends a signal to the pipeline
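The merged D$ lookup can be pictured as one set-associative cache whose ways are split across the two cores. The sketch below is a software analogy under stated assumptions: the line size, set count, and ways per core are hypothetical, and the replacement policy (fill an empty way first, otherwise evict at random) is borrowed from the random-replacement remark on the Fetch slide rather than stated for the D$.

```python
import random

LINE_BYTES = 64  # hypothetical line size


class FederatedDCache:
    """Two per-core caches that together act as one cache with twice the ways."""

    def __init__(self, num_sets=4, ways_per_core=2, seed=0):
        self.num_sets = num_sets
        # halves[c][s] is the list of tags held by core c in set s.
        self.halves = [
            [[None] * ways_per_core for _ in range(num_sets)] for _ in range(2)
        ]
        self.rng = random.Random(seed)

    def _split(self, addr):
        line = addr // LINE_BYTES
        return line % self.num_sets, line // self.num_sets  # (set, tag)

    def lookup(self, addr):
        s, tag = self._split(addr)
        # Both halves are probed (in hardware, in parallel); a hit in either
        # half is a hit in the merged cache.
        return any(tag in half[s] for half in self.halves)

    def fill(self, addr):
        if self.lookup(addr):
            return
        s, tag = self._split(addr)
        ways = [(half, w) for half in self.halves for w in range(len(half[s]))]
        empty = [(half, w) for half, w in ways if half[s][w] is None]
        # Assumed policy: use an empty way if one exists, else evict at random.
        half, w = empty[0] if empty else self.rng.choice(ways)
        half[s][w] = tag


dcache = FederatedDCache()
same_set = [i * 4 * LINE_BYTES for i in range(5)]  # five lines mapping to set 0
for a in same_set[:4]:
    dcache.fill(a)
hits = [dcache.lookup(a) for a in same_set]
```

With two ways per core, four conflicting lines fit in the merged set at once, where a single core's cache in this model would hold only two.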

Page 30: Commit

Centralized commit, no slippage

Recover from branch mispredictions at commit, since there are no checkpoints of the RAT on branches

Recover from memory order violations (or false positives) from MAT
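Commit-time recovery without RAT checkpoints can be sketched as follows: the speculative rename table is simply overwritten with the committed one when a mispredicted branch reaches the head of the ROB. This is a simplified model with hypothetical names. It tracks only register mappings, models only branch mispredictions (not MAT-triggered flushes), and assumes the mispredicted branch itself commits before younger instructions are discarded.

```python
from collections import deque


class CommitStage:
    """Toy model of recovery at commit using a committed copy of the RAT."""

    def __init__(self):
        self.rob = deque()   # program order, oldest entry on the left
        self.arch_rat = {}   # mappings of committed instructions only
        self.spec_rat = {}   # mappings updated speculatively at rename

    def rename(self, dest, tag, mispredicted=False):
        """Record an instruction; dest is None for branches and stores."""
        if dest is not None:
            self.spec_rat[dest] = tag
        self.rob.append((dest, tag, mispredicted))

    def commit(self):
        """Commit the oldest instruction; return True if a flush occurred."""
        dest, tag, mispredicted = self.rob.popleft()
        if dest is not None:
            self.arch_rat[dest] = tag
        if mispredicted:
            # No checkpoint exists, so discard all younger instructions and
            # rebuild the speculative table from the committed state.
            self.rob.clear()
            self.spec_rat = dict(self.arch_rat)
            return True
        return False


stage = CommitStage()
stage.rename("r1", "p1")
stage.rename(None, None, mispredicted=True)  # a mispredicted branch
stage.rename("r1", "p2")                     # wrong-path instructions
stage.rename("r2", "p3")
stage.commit()              # the older instruction commits normally
recovered = stage.commit()  # the branch commits and triggers recovery
```

The cost of this choice is that wrong-path work continues until the branch reaches the head of the ROB, which corresponds to the commit-time branch recovery entry on the Performance Impact slide.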