Page 1:

Temporal Memory Streaming

Babak Falsafi

Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch
Collaborators: Anastassia Ailamaki & Andreas Moshovos

STEMS
Computer Architecture Lab, Carnegie Mellon
http://www.ece.cmu.edu/CALCM

© 2005 Babak Falsafi

Page 2:

The Memory Wall

Logic/DRAM speed gap continues to increase!

[Figure: log-scale plot (0.01 to 1000) of clocks per instruction for the core(s) and clocks per DRAM access for memory, across VAX/1980, PPro/1996, and 2010+. Data labels shown in the figure: 0.33, 10, 0.04, 80, 6.]

Page 3:

Current Approach

Cache hierarchies:
• Trade off capacity for speed
• Exploit “reuse”

But, in modern servers
• Only 50% utilization of one proc. [Ailamaki, VLDB’99]
• Much bigger problem in MPs

What is wrong?
• Demand fetch/repl. data

[Figure: memory hierarchy below the CPU: L1 64K (1 clk), L2 2M (10 clk), L3 64M (100 clk), DRAM 100G (1000 clk)]
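The cost of demand fetching through such a hierarchy can be estimated with the standard average memory access time recurrence. The sketch below uses the latencies shown on this slide (1, 10, 100, 1000 clk); the local miss rates are hypothetical values chosen only for illustration, not measurements from this talk.

```python
def average_access_time(latencies, miss_rates):
    """Average memory access time for a multi-level hierarchy.

    latencies:  hit latency of each level in clocks, ending with main memory.
    miss_rates: local miss rate of each cache level (one fewer than latencies).
    """
    if len(latencies) != len(miss_rates) + 1:
        raise ValueError("need one miss rate per cache level")
    amat = latencies[-1]  # an access that reaches DRAM pays the full DRAM latency
    # Fold the hierarchy from the level closest to memory back toward the L1.
    for latency, miss_rate in zip(reversed(latencies[:-1]), reversed(miss_rates)):
        amat = latency + miss_rate * amat
    return amat


LATENCIES_CLK = [1, 10, 100, 1000]        # L1, L2, L3, DRAM (from the slide)
ASSUMED_MISS_RATES = [0.10, 0.50, 0.50]   # hypothetical local miss rates

print(average_access_time(LATENCIES_CLK, ASSUMED_MISS_RATES))
```

Under these assumed miss rates, most of the average comes from the few accesses that fall through to the slower levels, which is the cost that demand fetching leaves exposed.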

Page 4:

Prior Work (SW-Transparent)

Prefetching [Joseph 97] [Roth 96] [Nesbit 04] [Gracia Pérez 04]

Simple patterns or low accuracy

Large Exec. Windows / Runahead [Mutlu 03]

Fetch dependent addresses serially

Coherence Optimizations [Stenström 93] [Lai 00] [Huh 04]

Limited applicability (e.g., migratory)

Need solutions for arbitrary access patterns

Page 5:

Our Solution: Spatio-Temporal Memory Streaming

Observation:
• Data spatially/temporally correlated
• Arbitrary, yet repetitive, patterns

Approach: Memory Streaming
• Extract spat./temp. patterns
• Stream data to/from CPU
  - Manage resources for multiple blocks
  - Break dependence chains
• In HW, SW or both

[Figure: CPU with an L1 holding blocks A, B, C, D, ...; STEMS holding blocks W, X, Z; arrows labeled stream, replace, and fetch; backed by DRAM or other CPUs]

Page 6:

Contribution #1: Temporal Shared-Memory Streaming

• Recent coherence miss sequences recur
  - 50% of misses closely follow a previous sequence
  - Large opportunity to exploit MLP
• Temporal streaming engine
  - Ordered streams allow practical HW
  - Performance improvement:
    7%-230% in scientific apps.
    6%-21% in commercial Web & OLTP apps.

Page 7:

Contribution #2: Last-touch Correlated Data Streaming

• Last-touch prefetchers
  - Cache block deadtime >> livetime
  - Fetch on a predicted “last touch”
  - But, designs impractical (> 200MB on-chip)
• Last-touch correlated data streaming
  - Miss order ~ last-touch order
  - Stream table entries from off-chip
  - Eliminates 75% of all L1 misses with ~200KB

Page 8:

Outline

• STEMS Overview

• Example Temporal Streaming

1. Temporal Shared-Memory Streaming

2. Last-Touch Correlated Data Streaming

• Summary

Page 9:

Temporal Shared-Memory Streaming [ISCA’05]

• Record sequences of memory accesses
• Transfer data sequences ahead of requests

[Figure: Baseline System: CPU sends Miss A to Mem and receives Fill A, then sends Miss B and receives Fill B. Streaming System: CPU sends Miss A to Mem and receives Fill A, B, C, …]

• Accelerates arbitrary access patterns
  - Parallelizes critical path of pointer-chasing

Page 10:

Relationship Between Misses

• Intuition: Miss sequences repeat
  - Because code sequences repeat
• Observed for uniprocessors in [Chilimbi’02]
• Temporal Address Correlation
  - Same miss addresses repeat in the same order

Correlated miss sequence = stream

Miss seq.: Q W A B C D E R T A B C D E Y …
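As a rough illustration of temporal address correlation, the sketch below scans the example miss sequence from this slide for the longest run of addresses that recurs in the same order. The function name and the dynamic-programming search are illustrative assumptions; the hardware described in this talk does not search for streams this way.

```python
def longest_repeated_stream(misses):
    """Return the longest run of addresses that occurs twice, in the same
    order and without overlapping itself, in the miss sequence."""
    n = len(misses)
    best = []
    # prev[j] = length of the matching run ending at misses[i-2] and misses[j-1]
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            if misses[i - 1] == misses[j - 1]:
                # Cap the run length so the two occurrences never overlap.
                cur[j] = min(prev[j - 1] + 1, j - i)
                if cur[j] > len(best):
                    best = misses[j - cur[j]:j]
        prev = cur
    return best


miss_seq = "Q W A B C D E R T A B C D E Y".split()
stream = longest_repeated_stream(miss_seq)
print(stream)  # ['A', 'B', 'C', 'D', 'E']
```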

Page 11:

Relationship Between Streams

• Intuition: Streams exhibit temporal locality
  - Because working set exhibits temporal locality
  - For shared data, repetition often across nodes
• Temporal Stream Locality
  - Recent streams likely to recur

Node 1: Q W A B C D E R
Node 2: T A B C D E Y

Addr. correlation + stream locality = temporal correlation

Page 12:

Memory Level Parallelism

• Streams create MLP for dependent misses
• Not possible with larger windows / runahead

Temporal streaming breaks dependence chains

[Figure: Baseline: the CPU fetches A, B, C one after another and must wait to follow pointers. Temporal Streaming: the CPU fetches A, B, C in parallel.]
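A back-of-the-envelope model shows why breaking the dependence chain matters. The sketch below is deliberately idealized: the 100-clock miss latency is a hypothetical number, and the model ignores bandwidth limits, the time needed to locate a stream, and mispredicted streams.

```python
def serial_latency(num_dependent_misses, miss_latency_clk):
    """Baseline: each miss address comes from the previous fill, so latencies add up."""
    return num_dependent_misses * miss_latency_clk


def streamed_latency(num_dependent_misses, miss_latency_clk):
    """Idealized temporal streaming: the first miss is paid in full, then the
    remaining blocks of the stream are fetched in parallel (one more latency)."""
    if num_dependent_misses <= 1:
        return num_dependent_misses * miss_latency_clk
    return 2 * miss_latency_clk


ASSUMED_MISS_LATENCY_CLK = 100  # hypothetical value for illustration
chain = ["A", "B", "C"]         # pointer chain from the figure
baseline = serial_latency(len(chain), ASSUMED_MISS_LATENCY_CLK)
streaming = streamed_latency(len(chain), ASSUMED_MISS_LATENCY_CLK)
print(baseline, streaming)  # 300 200
```

In this model the gap grows with the length of the chain, since the baseline cost scales with the number of dependent misses while the streamed cost does not.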

Page 13:

Temporal Streaming

[Figure: timelines for Node 1, Directory, and Node 2.
 Record: Node 1 incurs Miss A, Miss B, Miss C, Miss D.]

Page 14:

Temporal Streaming

[Figure: timelines for Node 1, Directory, and Node 2.
 Record: Node 1 incurs Miss A, Miss B, Miss C, Miss D.
 Node 2 then incurs Miss A (Req. A, Fill A).]

Page 15:

Temporal Streaming

[Figure: timelines for Node 1, Directory, and Node 2.
 Record: Node 1 incurs Miss A, Miss B, Miss C, Miss D.
 Locate: Node 2 incurs Miss A (Req. A, Fill A); Locate A; Stream B, C, D.]

Page 16:

Temporal Streaming

[Figure: timelines for Node 1, Directory, and Node 2.
 Record: Node 1 incurs Miss A, Miss B, Miss C, Miss D.
 Locate: Node 2 incurs Miss A (Req. A, Fill A); Locate A; Stream B, C, D.
 Stream: Fetch B, C, D; Hit B; Hit C.]

Page 17:

Temporal Streaming Engine

Record
• Coherence Miss Order Buffer (CMOB)
  - ~1.5MB circular buffer per node
  - In local memory
  - Addresses only
  - Coalesced accesses

[Figure: CPU and cache ($) with local memory holding the CMOB: Q W A B C D E R T Y; label: Fill E]
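The record step can be pictured as a circular log of miss addresses. The sketch below is a software toy under stated assumptions: the class name, the method names, and the tiny capacity are hypothetical, and it does not model coalescing of appends into cache-block-sized writes.

```python
class CoherenceMissOrderBuffer:
    """Toy circular buffer that records miss addresses in order (addresses only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.appended = 0  # total number of appends so far

    def append(self, addr):
        """Record a miss address; return its position (a pointer into the buffer)."""
        pos = self.appended
        self.slots[pos % self.capacity] = addr
        self.appended += 1
        return pos

    def read_after(self, pos, count):
        """Return up to `count` addresses recorded after position `pos`,
        skipping anything that has already been overwritten."""
        start = max(pos + 1, self.appended - self.capacity)
        end = min(pos + 1 + count, self.appended)
        return [self.slots[p % self.capacity] for p in range(start, end)]


cmob = CoherenceMissOrderBuffer(capacity=8)
positions = {addr: cmob.append(addr) for addr in "QWABCDE"}
following_a = cmob.read_after(positions["A"], 3)
print(following_a)  # ['B', 'C', 'D']
```

Returning the position from `append` is what lets another structure point back into the log later, so that a stream can be read starting just after a given miss.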

Page 18:

Temporal Streaming Engine

Locate
• Annotate directory
  - Already has coherence info for every block
  - CMOB append → send pointer to directory
  - Coherence miss → forward stream request

Directory:
  A  shared    Node 4 @ CMOB[23]
  B  modified  Node 11 @ CMOB[401]
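The locate step amounts to one extra field per directory entry. The sketch below reuses the two directory entries from this slide; the function names and the shape of the forwarded request are hypothetical, chosen only to show the flow of information.

```python
directory = {}  # block address -> coherence state plus the stream annotation


def on_cmob_append(block, state, node, cmob_index):
    """A node appended `block` to its CMOB and sent the pointer to the directory."""
    directory[block] = {"state": state, "node": node, "cmob_index": cmob_index}


def on_coherence_miss(block, requester):
    """Build the stream request the directory would forward, or None if the
    block has no annotation yet."""
    entry = directory.get(block)
    if entry is None:
        return None
    return {
        "forward_to": entry["node"],         # node whose CMOB holds the stream
        "start_after": entry["cmob_index"],  # stream begins after this CMOB slot
        "deliver_to": requester,
    }


on_cmob_append("A", "shared", node=4, cmob_index=23)
on_cmob_append("B", "modified", node=11, cmob_index=401)
request = on_coherence_miss("A", requester=2)
print(request)
```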

Page 19:

Temporal Streaming Engine

Stream
• Fetch data to match use rate
  - Addresses in FIFO stream queue
  - Fetch into streamed value buffer

[Figure: Node i sends stream {A,B,C…}; addresses F E D C B wait in the stream queue; Fetch A returns A data into the streamed value buffer (~32 entries) next to the CPU's L1 $]
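One way to read "fetch data to match use rate" is that a new block is fetched only when the CPU consumes one from the streamed value buffer. The sketch below models that throttling; the class, its method names, and the stand-in `fetch` callback are assumptions for illustration, and the buffer is shrunk to 2 entries in the example so the behavior is easy to follow.

```python
from collections import OrderedDict, deque


class StreamEngine:
    """Toy FIFO stream queue feeding a small streamed value buffer (SVB)."""

    def __init__(self, stream, svb_entries=32, fetch=lambda addr: f"data({addr})"):
        self.queue = deque(stream)   # addresses still waiting to be fetched
        self.svb = OrderedDict()     # addr -> data already fetched
        self.svb_entries = svb_entries
        self.fetch = fetch           # stand-in for a memory or remote-node fetch
        self._refill()

    def _refill(self):
        # Fetch ahead only while the SVB has free entries.
        while self.queue and len(self.svb) < self.svb_entries:
            addr = self.queue.popleft()
            self.svb[addr] = self.fetch(addr)

    def access(self, addr):
        """Return streamed data on an SVB hit (freeing its entry), else None."""
        data = self.svb.pop(addr, None)
        if data is not None:
            self._refill()  # consumption frees a slot, so fetch the next address
        return data


engine = StreamEngine(["B", "C", "D", "E", "F"], svb_entries=2)
first = engine.access("B")
print(first, list(engine.svb), list(engine.queue))
```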

Page 20:

Practical HW Mechanisms

• Streams recorded/followed in order
  - FIFO stream queues
  - ~32-entry streamed value buffer
  - Coalesced cache-block size CMOB appends
• Predicts many misses from one request
  - More lookahead
  - Allows off-chip stream storage
  - Leverages existing directory lookup

Page 21:

Methodology: Infrastructure

SimFlex [SIGMETRICS’04]
• Statistical sampling → uArch sim. in minutes
• Full-system MP simulation (boots Linux & Solaris)
• Uni, CMP, DSM timing models
• Real server software (e.g., DB2 & Oracle)
• Software publicly available for download
  http://www.ece.cmu.edu/~simflex

Page 22:

Methodology: Benchmarks & Parameters

Model Parameters
• 16 4GHz SPARC CPUs
• 8-wide OoO; 8-stage pipe
• 256-entry ROB/LSQ
• 64K L1, 8MB L2
• TSO w/ speculation

Benchmark Applications
• Scientific: em3d, moldyn, ocean
• OLTP: TPC-C 3.0 100 WH
  - IBM DB2 7.2
  - Oracle 10g
• SPECweb99 w/ 16K con.
  - Apache 2.0
  - Zeus 4.3

Page 23:

TSE Coverage Comparison

[Figure: stacked bars of % coherent read misses (0% to 250%), split into Coverage and Discards, for Stride, G/DC, G/AC, and TSE on em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus]

TSE outperforms Stride and GHB for coherence misses

Page 24:

Stream Lengths

[Figure: cumulative % of all hits (0% to 100%) vs. stream length in # of streamed blocks (0 to 128K, doubling at each tick), for Apache, DB2, Oracle, Zeus, em3d, moldyn, ocean]

• Comm: Short streams; low base MLP (1.2-1.3)
• Sci: Long streams; high base MLP (1.6-6.6)
• Temporal Streaming addresses both cases

Page 25:

TSE Performance Impact

[Figure, panel "Time Breakdown": normalized time (0 to 1.0) for base vs. TSE on em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus, split into Busy, Other Stalls, and Coherent Read Stalls. Panel "Speedup": speedup per application (axis 1.0 to 1.3, one bar labeled 3.3) with 95% CI.]

• TSE eliminates 25%-95% of coherent read stalls
  - 6% to 230% performance improvement

Page 26:

TSE Conclusions

• Temporal Streaming
  - Intuition: Recent coherence miss sequences recur
  - Impact: Eliminates 50-100% of coherence misses
• Temporal Streaming Engine
  - Intuition: In-order streams enable practical HW
  - Impact: Performance improvement
    7%-230% in scientific apps.
    6%-21% in commercial Web & OLTP apps.

Page 27:

Outline

• Big Picture

• Example Streaming Techniques

1. Temporal Shared Memory Streaming

2. Last-Touch Correlated Data Streaming

• Summary

Page 28:

Enhancing Lookahead

Observation [Mendelson, Wood&Hill]:
• Few live sets
  - Use until last “hit”
  - Data reuse → high hit rate
  - ~80% dead frames!

Exploit for lookahead:
• Predict last “touch” prior to “death”
• Evict, predict and fetch next line

[Figure: L1 @ Time T1 and L1 @ Time T2, showing live sets vs. dead sets]

Page 29:

How Much Lookahead?

[Figure: distribution of frame deadtimes (cycles), with L2 latency and DRAM latency marked]

Predicting last-touches will eliminate all latency!

Page 30:

Dead-Block Prediction [ISCA’00 & ’01]

• Per-block trace of memory accesses to a block
  - Predicts repetitive last-touch events

Accesses to a block frame:
  PC0: load/store A0 (hit)
  PC1: load/store A1 (miss) [first touch]
  PC3: load/store A1 (hit)
  PC3: load/store A1 (hit) [last touch]
  PC5: load/store A3 (miss)

Trace = A1 (PC1, PC3, PC3)

Page 31:

Dead-Block Prefetcher (DBCP)

[Figure: the current access (PC3) to the latest block A1 extends the History Table (HT) trace PC1,PC3 to the signature A1,PC1,PC3,PC3; the Correlation Table entry for that signature holds A3 with counter 11, so the prefetcher evicts A1 and fetches A3]

• History & correlation tables
  - History ~ L1 tag array
  - Correlation ~ memory footprint
• Encoding: truncated addition
• Two-bit saturating counter
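The two tables can be sketched in a few lines. In the toy model below, the signature is a truncated sum of the block address and the PCs that touched it, and each correlation entry carries a two-bit saturating counter. The 16-bit signature width, the prediction threshold of 2, the initial counter value, the replacement rule, and the numeric addresses and PCs are all assumptions made for illustration; the slides do not specify them.

```python
SIGNATURE_BITS = 16                       # assumed signature width
SIGNATURE_MASK = (1 << SIGNATURE_BITS) - 1
COUNTER_MAX = 3                           # two-bit saturating counter
PREDICT_THRESHOLD = 2                     # assumed confidence threshold


class DeadBlockPrefetcherSketch:
    def __init__(self):
        self.history = {}      # cache frame -> (resident block, running signature)
        self.correlation = {}  # last-touch signature -> [next block, counter]

    def access(self, frame, block, pc):
        """Process one access; return a block to prefetch into `frame`, or None."""
        resident = self.history.get(frame)
        if resident is not None and resident[0] == block:
            # Hit: extend the trace by truncated addition of the PC.
            signature = (resident[1] + pc) & SIGNATURE_MASK
        else:
            # Miss: the previous signature marked that block's last touch.
            if resident is not None:
                self._train(resident[1], block)
            signature = (block + pc) & SIGNATURE_MASK
        self.history[frame] = (block, signature)
        entry = self.correlation.get(signature)
        if entry is not None and entry[1] >= PREDICT_THRESHOLD:
            return entry[0]
        return None

    def _train(self, signature, next_block):
        entry = self.correlation.get(signature)
        if entry is None or (entry[0] != next_block and entry[1] == 0):
            self.correlation[signature] = [next_block, 1]
        elif entry[0] == next_block:
            entry[1] = min(entry[1] + 1, COUNTER_MAX)
        else:
            entry[1] -= 1


A1, A3 = 0x1000, 0x3000                  # hypothetical block addresses
PC1, PC3, PC5 = 0x401, 0x403, 0x405      # hypothetical program counters
trace = [(A1, PC1), (A1, PC3), (A1, PC3), (A3, PC5)]

dbcp = DeadBlockPrefetcherSketch()
for _ in range(3):  # the third pass runs with trained counters
    predictions = [dbcp.access(0, block, pc) for block, pc in trace]
print(predictions)  # [None, None, 12288, 4096]
```

On the third pass the second PC3 access to A1 is recognized as a last touch and A3 is predicted, which mirrors the Evict A1 / Fetch A3 step in the figure.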

Page 32:

DBCP Coverage with Unlimited Table Storage

[Figure: stacked bars of % of cache misses (0% to 120%) for Olden, SPEC INT, SPEC FP, split into correct, incorrect, train, early]

• High average L1 miss coverage
• Low misprediction (2-bit counters)

Page 33:

Impractical On-Chip Storage Size

[Figure: % of achievable coverage (0% to 100%), average and worst-case, vs. on-chip correlation table size (160KB, 640KB, 2MB, 10MB, 40MB, 160MB)]

Needs over 150MB to achieve full potential!

Page 34:

Our Observation: Signatures are Temporally Correlated

Signatures need not reside on chip

1. Last-touch sequences recur
   • Much as cache miss sequences recur [Chilimbi’02]
   • Often due to large structure traversals
2. Last-touch order ~ cache miss order
   • Off by at most L1 cache capacity

Key implications:
• Can record last touches in miss order
• Store & stream signatures from off-chip

Page 35:

Last-Touch Correlated Data Streaming (LT-CORDS)

• Streaming signatures on chip
  - Keep all sigs. in sequences in off-chip DRAM
  - Retain sequence “heads” on chip
  - “Head” signals a stream fetch
• Small (~200KB) on-chip stream cache
  - Tolerate order mismatch
  - Lookahead for stream startup

DBCP coverage with moderate on-chip storage!
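The head-and-stream idea can be sketched as follows. Signature sequences live in a list standing in for off-chip DRAM, only each sequence's head is indexed on chip, and a small stream cache holds the next few signatures so that lookups succeed even when last touches arrive slightly out of recorded order. The class, its fields, the string signatures, and the tiny cache size are hypothetical; this is not the LT-CORDS hardware organization.

```python
from collections import OrderedDict


class LTCordsSketch:
    def __init__(self, offchip_sequences, cache_entries=4):
        # Each off-chip sequence is a list of (signature, block to prefetch).
        self.offchip = offchip_sequences
        # On chip: only the head signature of each sequence.
        self.heads = {seq[0][0]: sid for sid, seq in enumerate(offchip_sequences)}
        self.cache = OrderedDict()   # on-chip stream cache: signature -> block
        self.cache_entries = cache_entries
        self.active = None           # (sequence id, next off-chip index)

    def _fill(self):
        """Stream signatures from off-chip, in order, until the cache is full."""
        if self.active is None:
            return
        sid, idx = self.active
        seq = self.offchip[sid]
        while idx < len(seq) and len(self.cache) < self.cache_entries:
            signature, block = seq[idx]
            self.cache[signature] = block
            idx += 1
        self.active = (sid, idx)

    def observe(self, signature):
        """Return the block to prefetch if `signature` is a known last touch."""
        if signature in self.heads and signature not in self.cache:
            # Seeing a head is the cue to start streaming its sequence.
            self.cache.clear()
            self.active = (self.heads[signature], 0)
            self._fill()
        prediction = self.cache.pop(signature, None)
        if prediction is not None:
            self._fill()
        return prediction


offchip = [[("s1", "B"), ("s2", "C"), ("s3", "D"), ("s4", "E")]]
cords = LTCordsSketch(offchip, cache_entries=2)
results = [cords.observe(s) for s in ["s1", "s3", "s2", "zz"]]
print(results)  # ['B', 'D', 'C', None]
```

In the example, `s3` is observed before `s2` and still hits, because both are resident in the stream cache at that point; that is the sense in which a small cache can tolerate order mismatch. A signature that is never observed would occupy its entry indefinitely in this toy, which a real design would have to handle.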

Page 36:

DBCP Mechanisms

[Figure: Core, L1, L2, DRAM; on-chip history table (HT) and signature table, Sigs. (160MB)]

All signatures in random-access on-chip table

Page 37:

What LT-CORDS Does

• Only a subset needed at a time … and only in order
• “Head” as cue for the “stream”
• Signatures stored off-chip

[Figure: Core, L1, L2, DRAM, with on-chip HT]

Page 38:

LT-CORDS Mechanisms

[Figure: Core, L1, L2, DRAM; on-chip HT, Heads (10K), and stream cache (SC) with Sigs. (200K)]

On-chip storage independent of footprint

Page 39:

Methodology

• SimpleScalar CPU model with Alpha ISA
  - SPEC CPU2000 & Olden benchmarks
• 8-wide out-of-order processor
  - 2 cycle L1, 16 cycle L2, 180 cycle DRAM
  - FU latencies similar to Alpha EV8
  - 64KB 2-way L1D, 1MB 8-way L2
• LT-CORDS with 214KB on-chip storage
• Apps. with significant memory stalls

Page 40:

LT-CORDS vs. DBCP Coverage

[Figure: stacked bars of % of cache misses (0% to 120%) for DBCP and LT-CORDS on Olden, SPEC INT, SPEC FP, split into correct, incorrect, train, early]

LT-CORDS matches the coverage of DBCP with unlimited table storage

Page 41:

LT-CORDS Speedup

[Figure: speedup (1 to 5) of Infinite Cache, LT-CORDS, and GHB (PC/DC) for treeadd, mgrid, swim, equake, applu, mcf, fma3d, art, em3d, facerec, wupwise, bh, ammp, gcc, parser, sixtrack]

LT-CORDS hides large fraction of memory latency

Page 42:

LT-CORDS Conclusions

• Intuition: Signatures temporally correlated
  - Cache miss & last-touch sequences recur
  - Miss order ~ last-touch order
• Impact: eliminates 75% of all misses
  - Retains DBCP coverage, lookahead, accuracy
  - On-chip storage indep. of footprint
  - 2x speedup over best prior work

Page 43:

For more information
Visit our website: http://www.ece.cmu.edu/CALCM