Scalable Load and Store Processing in Latency Tolerant Processors
Amit Gandhi 1,2, Haitham Akkary 1, Ravi Rajwar 1, Srikanth T. Srinivasan 1, Konrad Lai 1
1 Intel, 2 Portland State University

Page 1

Scalable Load and Store Processing in Latency Tolerant Processors

Amit Gandhi 1,2
Haitham Akkary 1
Ravi Rajwar 1
Srikanth T. Srinivasan 1
Konrad Lai 1

1 Intel   2 Portland State University

Page 2

Problem: tolerating miss latencies

• Increasing miss latencies to memory
  – large instruction windows tolerate latencies
  – naïve window scaling is impractical

• Resource-efficient large instruction windows
  – sustain 1000s of instructions in flight
  – need only small register files and schedulers
  – do not address memory-buffer efficiency

Must track all memory operations: memory consistency, ordering, and forwarding

Page 3

Why is this a problem?

• Memory operations tracked in load & store buffers
  – buffers require CAMs for scanning and matching
  – CAMs have high area and power requirements

• Large memory buffers are not always needed
  – L2 cache hit → small buffers sufficient
  – L2 cache miss → large buffers necessary

• Scaling CAMs is difficult
• Why pay the price when it is not necessary?

Must eliminate CAMs from large buffers

Page 4

Loads: Unordered buffer

• Hierarchical load buffers
• Conventional level-one load buffer
  – effective in the absence of a miss

• Un-ordered level-two load buffer
  – used only when a long-latency miss occurs
  – set-associative cache structure
    • no scan; only an indexed lookup is necessary
  – does not track the precise order of loads
    • sufficient to know that a violation occurred (not where)
    • recover via checkpoint rollback
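A minimal sketch of how such an un-ordered buffer might look, assuming a simple set-associative table of load addresses; the class name, set/way counts, and eviction policy here are illustrative, not from the talk:

```python
class UnorderedLoadBuffer:
    """Set-associative buffer of executed-load addresses; no age or
    program-order tracking, so no CAM scan is ever needed."""

    def __init__(self, sets=64, ways=4):
        self.sets = sets
        self.table = [[] for _ in range(sets)]  # each set holds <= ways addrs
        self.ways = ways

    def insert(self, addr):
        s = self.table[addr % self.sets]        # indexed lookup, no scan
        if addr not in s:
            if len(s) == self.ways:
                s.pop(0)                        # illustrative eviction; a real
                                                # design must handle overflow
                                                # conservatively
            s.append(addr)

    def store_snoop(self, addr):
        """A completing store checks only one set. A hit means *some* load
        to this address ran too early: signal a violation and roll back to
        the checkpoint (the design never learns *which* load)."""
        return addr in self.table[addr % self.sets]
```

Because only membership matters, the structure can stay indexed (SRAM-like) rather than fully associative, which is the point of the slide.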

Page 5

Stores: CAM-free buffers

• Hierarchical store queue
• Conventional level-one store queue
  – effective in the absence of a miss

• CAM-free level-two store queue
  – used only when a long-latency miss occurs
  – used only for ordering
  → no scanning or matching necessary in the queue

Decouple ordering from forwarding

1. Redo stores to enforce order
2. Forward from the cache instead of the queue

Page 6

Outline

• Motivation
• Resource efficient processors
  – Continual Flow Pipelines
  – memory buffer demands
• Store processing
• Results
• Summary

Page 7

Implications of a miss

• Long-latency misses to memory
  – place pressure on critical resources
  – pipeline quickly stalls once resources block

• Large instruction-window processors
  – execute useful instructions in the shadow of a miss
  – tolerate latency by overlapping the miss with useful work
  – naïve scaling is impractical

• Resource-efficient instruction windows
  – scale the window to thousands of instructions
  – do not require scaling cycle-critical structures

Page 8

Resource-efficient latency tolerance

A significant fraction of the instructions in the shadow of a miss are independent of the miss

Exploit this program property

Treat and process miss-dependent and miss-independent instructions differently

Page 9

Continual Flow Pipeline processor

• Miss-dependent instructions
  – release critical resources
  – leave the pipeline and wait outside it in the slice buffer

• Miss-independent instructions
  – execute
  – release critical resources and retire

• When the miss returns
  – miss-dependent instructions re-acquire resources
  – execute and retire

• After miss-dependent instructions execute
  – results are automatically integrated
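The partition step above can be sketched as a toy model; this is not Intel's implementation, and the instruction encoding (destination register plus source registers) is an assumption for illustration:

```python
def cfp_step(instructions, miss_regs):
    """Toy Continual Flow Pipeline partition: instructions whose sources
    depend (transitively) on an outstanding miss drain to the slice
    buffer; everything else executes and retires immediately.

    instructions : list of (dest_reg, [src_regs]) in program order
    miss_regs    : registers whose values await the miss
    """
    slice_buffer, retired = [], []
    poisoned = set(miss_regs)               # "poison" marks miss dependence
    for dst, srcs in instructions:
        if poisoned & set(srcs):
            poisoned.add(dst)               # dependence propagates via dst
            slice_buffer.append((dst, srcs))  # wait outside the pipeline
        else:
            retired.append((dst, srcs))     # independent: execute and retire
    return slice_buffer, retired
```

The key property the slide relies on: only the (typically small) poisoned slice holds resources past the miss, so the window scales without scaling the register file or scheduler.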

Page 10

Continual Flow Pipeline processor

• Critical-resource efficient
  – do not require large register files or large schedulers

• Still need to track all memory operations
  – large load buffer → large CAM footprint and power
  – hierarchical store queue
    • small, fast L1 store queue (32 entries)
    • large, slow L2 store queue (~512 entries)
      – large CAM footprint
      – high leakage power
      – good performance

Page 11

Why track all memory operations?

• Stores must update memory in program order
• Load/store dependence speculation
• Multiprocessor memory consistency

• Continual Flow Pipeline processors
  – execute independents ahead of dependents
  – aggressively reorder memory-operation execution

Page 12

Outline

• Motivation
• Resource efficient processors
• Store processing
  – store queue overview
  – SRL key idea
  – SRL workings
• Results
• Summary

Page 13

Functions of a store queue

• Ordering
  – ensure memory updates occur in program order
  – needed for correctness

• Forwarding
  – provide data to subsequent loads
  – needed for performance
  – implemented with a CAM

[Diagram: a store queue (STQ) holding address/data (A, D) entries; a load (LD) to address Z associatively matches an STQ entry, and the matching store forwards its data to the load.]
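For contrast with the CAM-free design that follows, the associative scan a conventional store queue performs on every load can be sketched in software; the function name and list-of-tuples layout are illustrative:

```python
def stq_forward(store_queue, load_addr):
    """Conventional store-queue forwarding sketch.

    store_queue : list of (addr, data) tuples, oldest first (program order)
    Returns data forwarded from the *youngest* older store to the same
    address, or None (the load then reads the data cache instead).
    """
    for addr, data in reversed(store_queue):   # scan youngest -> oldest
        if addr == load_addr:                  # the match a hardware CAM
            return data                        # performs in parallel
    return None
```

In hardware this whole-queue match happens every cycle for every load, which is exactly the area- and power-hungry CAM behaviour the talk sets out to eliminate.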

Page 14

Conventional store queue

• Single structure for both ordering and forwarding
• Large sizes increase CAM area & leakage
  – the CAM's contribution to area and power dominates

Efficiency → eliminate CAMs

Page 15

Decoupling ordering from forwarding

[Diagram: the CAM-based L2 STQ of address/data entries is replaced by an SRAM-based Store Redo Log.]

Store Redo Log (SRL)
• FIFO
• program order
• no CAM

Data cache
• forwarding
• no CAM

No CAMs for ordering or forwarding!

Page 16

Store Redo Log workings (1)

In the shadow of a miss
• Allocate a FIFO L2 store queue (SRL) entry for every store
  – records the program order of stores
• Dependent stores
  – not ready; release the L1 store queue entry and enter the SRL
• Independent stores
  – update the cache temporarily and enter the SRL
• Loads
  – independent loads forward from the cache & retire
  – dependent loads go to the slice buffer
  – no scan of the L2 store queue for forwarding

Page 17

Store Redo Log workings (2)

When the miss returns
• Discard all independent-store updates to the cache
  – these stores don't re-execute
  – their dependents don't re-execute

• Drain the SRL in program order
  – reconstructs the memory live-outs
  – program order maintained
  – no re-execution, only re-update
    • no extra cache ports required
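The allocate-and-drain flow of the two slides above can be sketched as a plain FIFO; the class name and the dict-as-cache model are illustrative assumptions, not the hardware design:

```python
class StoreRedoLog:
    """Sketch of the SRL: an SRAM FIFO of (addr, data) pairs appended in
    program order. No scan, no match, no CAM - ordering comes entirely
    from the FIFO discipline."""

    def __init__(self):
        self.fifo = []                      # (addr, data), oldest first

    def append(self, addr, data):
        self.fifo.append((addr, data))      # allocation in the miss shadow

    def drain(self, cache):
        """When the miss returns: re-update (not re-execute) each store
        oldest-first. A later write to the same address overwrites an
        earlier one, so WAW order falls out of the FIFO for free."""
        for addr, data in self.fifo:
            cache[addr] = data
        self.fifo.clear()
        return cache
```

Using the values from the WAW slide, draining ST X (12), ST Y (17), ST X (5) leaves X = 5 and Y = 17, i.e., the program-order memory live-outs.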

Page 18

Hazards

• Write after Write (WAW)
• Write after Read (WAR)
• Read after Write (RAW)

Page 19

Handling hazards: WAW

Program order: ST X (12), ST Y (17), ST X (5)

[Diagram: all three stores enter the SRL in FIFO order from the L1 STQ. When the miss returns, draining the SRL re-updates the cache oldest-first, so the later ST X (5) overwrites the earlier ST X (12) — WAW order is enforced by the FIFO discipline alone.]

Page 20

Handling hazards: WAR

Program order: LD X, ST X (5), ST Y (17)

[Diagram: the miss-dependent LD X waits in the slice buffer while the younger ST X temporarily updates the cache and enters the SRL. When the miss returns, the speculative cache update is discarded, LD X reads the original value of X (38), and only then is the SRL drained — the older load never observes the younger store.]

Page 21

Handling hazards: RAW

• Detect by snooping completed stores
• Restart execution on a violation
  – restore to a checkpoint
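A minimal sketch of this policy, under the assumption that the unordered load buffer supplies the set of addresses read by already-executed loads; the function name and state model are illustrative:

```python
def complete_store(addr, early_load_addrs, checkpoint, state):
    """RAW-hazard check sketch: a completing store snoops the addresses
    of loads that already executed. A hit means some load read stale
    data, but not which one - so recovery is wholesale: restore the
    checkpointed state and re-execute from there.

    Returns the (possibly rolled-back) architectural state.
    """
    if addr in early_load_addrs:        # snoop hit: a load ran too early
        return dict(checkpoint)         # coarse recovery via checkpoint
    return state                        # no violation: keep current state
```

Trading precise per-load repair for checkpoint rollback is what lets the load buffer drop order tracking (and its CAM) entirely.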

Page 22

Outline

• Motivation
• Latency tolerant processor background
• Store processing
• Results
• Summary

Page 23

Evaluation

• Ideal store queue
  – large L1 STQ (latency = 3 cycles)
  – gives an upper bound (impractical to build)

• Hierarchical store queue
  – L1 STQ (latency = 3 cycles)
  – L2 STQ with CAMs (latency = 8 cycles)

• SRL store processing
  – L1 STQ (latency = 3 cycles)
  – FIFO CAM-free Store Redo Log

• Baseline
  – L1 STQ (latency = 3 cycles)

Page 24

SRL performance

[Chart: % speedup over baseline (y-axis 0–30%) on SFP2K, SINT2K, WEB, MM, PROD, SERVER, and WS for SRL store processing, the hierarchical STQ, and the ideal STQ.]

Performance within 6% of ideal store queue

Page 25

Power and area comparison

• Hierarchical store queue methodology
  – 90nm CMOS technology
  – SPICE simulations
  – circuit optimized to reduce leakage power
  – banked structure to reduce dynamic power

• SRL vs. hierarchical STQ
  – more than 50% reduction in leakage power
  – more than 90% reduction in dynamic power
  – 75% reduction in area

Page 26

Summary

• CAM-free secondary structures
• Set-associative L2 load buffer
• FIFO L2 store queue
  – don't constantly enforce order
  – ensure correct order by redoing the stores

• 75% area and 50% leakage-power savings
• Scalable design with no CAMs