Stash: Have Your Scratchpad and Cache it Too

Matthew D. Sinclair

with: Rakesh Komuravelli, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, and Vikram S. Adve

University of Illinois @ Urbana-Champaign

[email protected]

Source: rsim.cs.uiuc.edu/Talks/15-sinclair-stash.pdf

Jul 16, 2020

Page 2

SoCs Need an Efficient Memory Hierarchy

• Energy-efficient memory hierarchy is essential

– Heterogeneous SoCs use specialized memories

– E.g., scratchpads, FIFOs, stream buffers, …

Feature                                       Scratchpad   Cache
Directly addressed: no tags/TLB/conflicts     ✓            X
Compact storage: no holes in cache lines      ✓            X

Page 3

SoCs Need an Efficient Memory Hierarchy

• Energy-efficient memory hierarchy is essential

– Heterogeneous SoCs use specialized memories

– E.g., scratchpads, FIFOs, stream buffers, …

Feature                                       Scratchpad   Cache
Directly addressed: no tags/TLB/conflicts     ✓            X
Compact storage: no holes in cache lines      ✓            X
Global address space: implicit data movement  X            ✓
Coherent: reuse, lazy writebacks              X            ✓

Can specialized memories be globally addressable, coherent?

Can we have our scratchpad and cache it too?

Page 4

Can We Have Our Scratchpad and Cache it Too?

• Make specialized memories globally addressable, coherent

– Efficient address mapping

– Efficient coherence protocol

• Focus: CPU-GPU systems with scratchpads and caches

– Up to 31% less execution time, 51% less energy

Stash = Scratchpad + Cache

– From scratchpad: + Directly addressable, + Compact storage

– From cache: + Global address space, + Coherent

Page 5

Outline

• Motivation

• Background: Scratchpads & Caches

• Stash Overview

• Implementation

• Results

• Conclusion

Page 6

Global Addressability

• Scratchpads

– Part of private address space: not globally addressable

Explicit movement, pollution, poor conditional access support

• Cache

+ Globally addressable: part of global address space

Implicit copies, no pollution, support for conditional accesses

[Figure: system diagram; labels: CPU, GPU, Cache (×2), Scratchpad, Registers, Interconnection n/w, L2 $ Bank (×2)]

Page 7

Coherence: Globally Visible Data

• Scratchpads

– Part of private address space: not globally visible

Eager writebacks and invalidations on synchronization

• Cache

+ Globally visible: data kept coherent

Lazy writebacks as space is needed, reuse data across synch

Page 8

Stash – A Scratchpad, Cache Hybrid

Feature                                       Scratchpad   Cache   Stash
Directly addressed: no tags/TLB/conflicts     ✓            X       ✓
Compact storage: no holes in cache lines      ✓            X       ✓
Global address space: implicit data movement  X            ✓       ✓
Coherent: reuse, lazy writebacks              X            ✓       ✓

Page 9

Related Work

• Caches:

– Changing Data Layout [HPCA ’99, SC ’11]

– Elide Tag Accesses [MICRO ’13, ISLPED ’14]

• Scratchpads:

– Bypassing L1 cache [Southern Island ’09]

– Virtualizing Private Memories [ISLPED ’11, ISLPED ’12, UC-B MS ’09, TACO ’12]

– Scratchpads with DMA support [SC ’11, PACT ’14]

• Compare stash to scratchpads with DMA support

Page 11

Stash: Directly & Globally Addressable

• Like scratchpad: directly addressable (for hits)

• Like cache: globally addressable (for misses)

– Implicit loads, no cache pollution

Scratchpad version:

    // A is global mem addr
    // scratch_base == 500
    for (i = 500; i < 600; i++) {
        reg ri = load[A+i-500];
        scratch[i] = ri;
    }
    reg r = scratch_load[505];

[Figure: accelerator with scratchpad; entries 500 … 505]

Stash version:

    // A is global mem addr
    // Compiler info: stash_base[500] -> A (M0)
    // Rk = M0 (index in map)
    reg r = stash_load[505, Rk];

[Figure: accelerator with stash; entries 500 … 505; map entry M0 (stash base 500 -> A) is used to generate load[A+5]]
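The difference between the two versions can be sketched as a toy software model. This is only an illustration under stated assumptions: the class and method names (`Stash`, `MapEntry`, `add_map`, `miss_address`), the value chosen for `A`, and the memory contents are hypothetical and are not taken from the slides. The sketch shows a hit being served directly and a miss being translated through the map entry into a global load.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    stash_base: int   # first stash address of the allocation (500 on the slide)
    global_base: int  # global address the allocation maps to (A on the slide)


class Stash:
    """Toy model: hits are served directly, misses are translated via the map."""

    def __init__(self, global_mem):
        self.global_mem = global_mem  # hypothetical global memory, indexed by address
        self.maps = []                # map entries, indexed by map index (Rk on the slide)
        self.data = {}                # stash address -> value currently held
        self.misses = 0

    def add_map(self, stash_base, global_base):
        self.maps.append(MapEntry(stash_base, global_base))
        return len(self.maps) - 1     # map index to pass with each stash load

    def miss_address(self, stash_addr, map_index):
        entry = self.maps[map_index]
        return entry.global_base + (stash_addr - entry.stash_base)

    def load(self, stash_addr, map_index):
        if stash_addr not in self.data:   # miss: implicit global load
            self.misses += 1
            global_addr = self.miss_address(stash_addr, map_index)
            self.data[stash_addr] = self.global_mem[global_addr]
        return self.data[stash_addr]      # hit: direct access, no translation


A = 40                                 # hypothetical global address of the array
global_mem = list(range(1000, 1200))   # hypothetical memory contents
stash = Stash(global_mem)
rk = stash.add_map(stash_base=500, global_base=A)
r = stash.load(505, rk)                # first access misses and generates load[A+5]
```

Unlike the scratchpad version, no copy loop runs up front; only the element that is actually accessed is brought in.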

Page 12

Stash: Globally Visible

• Stash data can be accessed by other units

• Needs coherence support

• Like cache

– Keep data around – lazy writebacks

– Intra- or inter-kernel data reuse on the same core

[Figure: system diagram; labels: CPU, GPU, Cache, Stash, Registers, Map, Interconnection n/w, L2 $ Bank (×2)]

Page 13

Stash: Compact Storage

• Caches: cache line granularity storage (“holes” waste)

– Do not compact data

• Like scratchpad, stash compacts data

[Figure: system diagram; labels: CU 1, CU 2, CU 3, Interconnection n/w, L2 $ Bank (×2)]
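A small arithmetic sketch can make the cost of "holes" concrete. It assumes objects laid out contiguously from address 0, one field (at offset 0) accessed per object, and a 64-byte cache line; these numbers and the function names are illustrative assumptions, not figures from the talk.

```python
def cache_bytes(num_objects, object_size, field_size, line_size=64):
    """Bytes of cache occupied when one field per object is accessed.

    A cache stores whole lines, so every line touched by a field is kept
    in full, including the bytes of the object that are never used.
    """
    lines = set()
    for i in range(num_objects):
        first = (i * object_size) // line_size
        last = (i * object_size + field_size - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines) * line_size


def stash_bytes(num_objects, field_size):
    """Bytes occupied when only the accessed fields are stored back to back."""
    return num_objects * field_size


# Hypothetical example: 16 objects of 64 bytes each, one 4-byte field used.
in_cache = cache_bytes(num_objects=16, object_size=64, field_size=4)
in_stash = stash_bytes(num_objects=16, field_size=4)
```

Under these assumptions the cache holds 16 full lines for 64 useful bytes, while compact storage holds only the 64 bytes.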

Page 15

Stash Software Interface

• Software gives a mapping for each stash allocation

– One map entry (instruction) per stash array per thread block

– Map 2D non-contiguous global regions to stash

[Figure: a region of global memory mapped to the stash]

Page 16

Stash Hardware

[Figure: stash hardware; components: Data Array (with State), Map index table, Stash-Map, and VP-map (TLB and RTLB, between VA and PA); example access: stash_load[505, Rk]]

Stash-Map entry fields: V, Stash base, VA base, Field size, Object size, Row size, Stride size, #strides, isCoh, #Dirty Data
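One way to read the stash-map fields is as the parameters of a 2D tile translation from a stash offset to a global virtual address. The sketch below is an interpretation, not the hardware's definition: it assumes `row_size` counts elements per tile row, `stride_size` is the byte distance between the starts of consecutive rows in global memory, and stash offsets are byte offsets from the stash base. The class name and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StashMapEntry:
    va_base: int      # global virtual address of the first mapped field
    field_size: int   # bytes of each object that are kept in the stash
    object_size: int  # bytes between consecutive objects in global memory
    row_size: int     # assumed: number of objects per tile row
    stride_size: int  # assumed: bytes between the starts of consecutive rows
    num_strides: int  # number of rows in the tile

    def stash_capacity(self):
        """Bytes of stash space the mapping occupies (fields stored compactly)."""
        return self.field_size * self.row_size * self.num_strides

    def to_global(self, stash_offset):
        """Translate a byte offset within the stash allocation to a global VA."""
        if not 0 <= stash_offset < self.stash_capacity():
            raise ValueError("stash offset outside the mapped allocation")
        element, byte_in_field = divmod(stash_offset, self.field_size)
        row, col = divmod(element, self.row_size)
        return (self.va_base
                + row * self.stride_size
                + col * self.object_size
                + byte_in_field)


# Hypothetical tile: 2 rows of 4 objects, 4-byte field out of 16-byte objects.
entry = StashMapEntry(va_base=0x1000, field_size=4, object_size=16,
                      row_size=4, stride_size=256, num_strides=2)
```

With this reading, 32 bytes of stash cover fields spread over two 256-byte-apart rows of global memory, which is the non-contiguous case the software interface describes.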

Page 17

Coherence Support for Stash

• Stash data needs to be kept coherent

• Extend a coherence protocol for three features

– Track stash data at word granularity

– Capability to merge partial lines when stash sends data

– Modify directory to record the modifier and stash-map ID

• We choose to extend the DeNovo protocol

– Simple, low overhead, hybrid of CPU and GPU protocols
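The word-granularity features listed above can be illustrated with a short sketch. It is not the protocol's implementation: the line is modeled as a plain list of words, a stash writeback as a dictionary of word index to value, and the directory as a dictionary keyed by word address. All names here are hypothetical.

```python
def merge_partial_line(line, partial_words):
    """Return a copy of `line` with only the words sent by the stash replaced."""
    merged = list(line)
    for index, value in partial_words.items():
        if not 0 <= index < len(merged):
            raise IndexError("word index outside the line")
        merged[index] = value
    return merged


class Directory:
    """Per-word record of the current modifier and, for a stash, its map ID."""

    def __init__(self):
        self.registered = {}

    def register(self, word_addr, core_id, map_id=None):
        # map_id stays None when the modifier is an ordinary cache.
        self.registered[word_addr] = (core_id, map_id)

    def owner(self, word_addr):
        return self.registered.get(word_addr)


# Hypothetical 4-word line; the stash writes back only words 1 and 3.
old_line = [0, 0, 0, 0]
new_line = merge_partial_line(old_line, {1: 7, 3: 9})

directory = Directory()
directory.register(0x40, core_id=2, map_id=0)
```

Because a stash holds compacted fields rather than whole lines, it can only supply some words of a line, which is why the merge step works per word.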

Page 18

DeNovo Coherence (1/2)

• Read hit – don’t return stale data

– Before next parallel phase, selectively self-invalidates

• Needn’t invalidate data it accessed in previous phase

• Read miss – Find one up-to-date copy

– Before end of phase, write miss registers at “directory”

– Shared LLC data arrays double as directory

• Keep valid data or registered core ID

• Stash extension: store map ID at registry

Page 19

DeNovo Coherence (2/2)

• Assume (for now): Private L1, shared L2; single-word line

– Data-race freedom at word granularity

• Line-based DeNovo: word coherence, line address/transfer

[Figure: state diagram with states Invalid, Valid, and Registered. Read: Invalid -> Valid. Write: Invalid -> Registered and Valid -> Registered. Read stays in Valid; Read, Write stay in Registered.]

No transient states

No invalidation traffic

No directory storage overhead

No false sharing (word coherence)
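The three stable states and the self-invalidation step can be sketched as a small state machine. This is a simplified reading of the two DeNovo slides, with assumptions labeled: accesses are just the strings "read" and "write", the phase-boundary step is modeled by a hypothetical `self_invalidate` helper, and the `keep` set stands in for whatever data the protocol decides it need not invalidate.

```python
from enum import Enum


class State(Enum):
    INVALID = "Invalid"
    VALID = "Valid"
    REGISTERED = "Registered"


def next_state(state, access):
    """Per-word transition for a local access; there are no transient states."""
    if access == "write":
        return State.REGISTERED          # a write registers the word
    if access == "read":
        # A read of an Invalid word fetches an up-to-date copy and becomes Valid.
        return State.VALID if state is State.INVALID else state
    raise ValueError("access must be 'read' or 'write'")


def self_invalidate(states, keep=frozenset()):
    """Assumed phase-boundary step: drop Valid words unless they are in `keep`.

    Registered words stay, since this core holds the up-to-date copy.
    """
    return {
        addr: State.INVALID if st is State.VALID and addr not in keep else st
        for addr, st in states.items()
    }


# Hypothetical words 0..2 at the end of a phase.
before = {0: State.VALID, 1: State.REGISTERED, 2: State.VALID}
after = self_invalidate(before, keep={2})
```

Since each core invalidates its own stale words, no invalidation messages from writers are needed, which matches the "no invalidation traffic" point.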

Page 21

Evaluation

• Simulation Environment

– GEMS + Simics + Princeton Garnet N/W + GPGPU-Sim

– Extend McPAT and GPUWattch for energy evaluations

• Workloads:

– 4 microbenchmarks: implicit, reuse, pollution, on-demand

– Heterogeneous workloads: Rodinia, Parboil, SURF

• 1 CPU Core (15 for microbenchmarks)

• 15 GPU Compute Units (1 for microbenchmarks)

• 32 KB L1 Caches, 16 KB Stash/Scratchpad

Pages 22–27

Evaluation (Microbenchmarks) – Execution Time

[Figure: bar chart of execution time, y-axis 0% to 100%; bars Scr, C, Scr+D, and St for each of Implicit, Pollution, On-Demand, Reuse, and Average]

Scr = Baseline configuration

C = All requests use cache

Scr+D = All requests use scratchpad w/ DMA

St = Converts scratchpad requests to stash

• Implicit: no explicit loads/stores

• Pollution: no cache pollution

• On-Demand: only bring needed data

• Reuse: data compaction, reuse

• Avg. reduction: 27% vs. Scratch, 13% vs. Cache, 14% vs. DMA

Page 28

Evaluation (Microbenchmarks) – Energy

• Avg. reduction: 53% vs. Scratch, 36% vs. Cache, 32% vs. DMA

[Figure: stacked bar chart of energy, y-axis 0% to 100%; bars Scr, C, Scr+D, and St for each of Implicit, Pollution, On-Demand, Reuse, and Average; stack components: GPU Core+, L1 D$, Scratch/Stash, L2 $, N/W]

Pages 29–30

Evaluation (Apps) – Execution Time

[Figure: bar chart of execution time, y-axis 0% to 100%; bars Scr, C, and St for each of LUD, SURF, BP, NW, PF, SGEMM, ST, and AVERAGE; off-scale bar labels: 121, 106, 102, 103, 103]

Scr = Reqs use type specified by original app

C = All reqs use cache

St = Converts scratchpad reqs to stash

• Avg. reduction: 10% vs. Scratch, 12% vs. Cache (max: 22%, 31%)

– Source: implicit data movement

• Comparable to Scratchpad+DMA

Page 31

Evaluation (Apps) – Energy

• Avg. reduction: 16% vs. Scratch, 32% vs. Cache (max: 30%, 51%)

[Figure: stacked bar chart of energy, y-axis 0% to 100%; bars Scr, C, and St for each of LUD, SURF, BP, NW, PF, SGEMM, ST, and AVERAGE; stack components: GPU Core+, L1 D$, Scratch/Stash, L2 $, N/W; off-scale bar labels: 168, 120, 180, 126, 108, 128]

Page 32

Conclusion

• Make specialized memories globally addressable, coherent

– Efficient address mapping (only for misses)

– Efficient software-driven hardware coherence protocol

• Stash = scratchpad + cache

– Like scratchpads: Directly addressable and compact storage

– Like caches: Globally addressable and globally visible

• Reduced execution time and energy

• Future Work:

– More accelerators & specialized memories; consistency models