Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Dec 31, 2015

Transcript
Page 1: Thread Cluster Memory Scheduling : Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter.

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Page 2

Motivation

• Memory is a shared resource
• Threads’ requests contend for memory
– Degradation in single-thread performance
– Can even lead to starvation
• How to schedule memory requests to increase both system throughput and fairness?

[Figure: four cores sharing one memory]

Page 3

Previous Scheduling Algorithms are Biased

[Figure: Maximum Slowdown (y-axis, lower is better fairness) vs. Weighted Speedup (x-axis, higher is better system throughput) for FRFCFS, STFM, PAR-BS, and ATLAS, with regions labeled "System throughput bias", "Fairness bias", and "Ideal"]

No previous memory scheduling algorithm provides both the best fairness and system throughput

Page 4

Why do Previous Algorithms Fail?

Throughput biased approach
• Prioritize less memory-intensive threads
• Good for throughput
• Starvation → unfairness

Fairness biased approach
• Take turns accessing memory
• Does not starve
• Not prioritized → reduced throughput

[Figure: threads A, B, and C ordered by memory intensity, with less memory-intensive threads given higher priority in the throughput biased approach]

Single policy for all threads is insufficient

Page 5

Insight: Achieving Best of Both Worlds

For Throughput
• Prioritize memory-non-intensive threads

For Fairness
• Unfairness is caused by memory-intensive threads being prioritized over each other
– Shuffle threads
• Memory-intensive threads have different vulnerability to interference
– Shuffle asymmetrically

Page 6

Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion

Page 7

Overview: Thread Cluster Memory Scheduling

1. Group threads into two clusters
2. Prioritize non-intensive cluster
3. Different policies for each cluster

[Figure: the threads in the system are split into a non-intensive cluster (memory-non-intensive threads, prioritized, managed for throughput) and an intensive cluster (memory-intensive threads, managed for fairness)]

Page 8

Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion

Page 9

TCM Outline

1. Clustering

Page 10

Clustering Threads

Step 1: Sort threads by MPKI (misses per kiloinstruction)
Step 2: Memory bandwidth usage αT divides clusters
• T = Total memory bandwidth usage
• α < 10% (ClusterThreshold)

[Figure: threads sorted by increasing MPKI; the lowest-MPKI threads, accounting for αT of the bandwidth usage, form the non-intensive cluster, and the remaining higher-MPKI threads form the intensive cluster]
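The two clustering steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `ThreadStats` record, the `cluster_threads` function, and the rule that every thread after the first one that does not fit goes to the intensive cluster are assumptions made for the example, and the default threshold of 0.10 is only an illustrative upper value for α.

```python
from dataclasses import dataclass


@dataclass
class ThreadStats:
    name: str
    mpki: float       # misses per kiloinstruction (hypothetical measurement)
    bandwidth: float  # memory bandwidth used, in arbitrary units


def cluster_threads(threads, cluster_threshold=0.10):
    """Split threads into (non_intensive, intensive) lists.

    Assumption: threads join the non-intensive cluster in increasing MPKI
    order until their combined bandwidth would exceed alpha * T.
    """
    total = sum(t.bandwidth for t in threads)  # T
    budget = cluster_threshold * total         # alpha * T
    non_intensive, intensive = [], []
    used = 0.0
    for t in sorted(threads, key=lambda t: t.mpki):  # Step 1: sort by MPKI
        # Step 2: the bandwidth budget alpha * T divides the clusters.
        if not intensive and used + t.bandwidth <= budget:
            non_intensive.append(t)
            used += t.bandwidth
        else:
            intensive.append(t)
    return non_intensive, intensive
```

With four hypothetical threads using 2, 3, 45, and 50 units of bandwidth (in increasing MPKI order) and a threshold of 0.10, the budget is 10 units, so the first two threads land in the non-intensive cluster and the last two in the intensive cluster.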

Page 11

TCM Outline

1. Clustering

2. Between Clusters

Page 12

Prioritization Between Clusters

Prioritize non-intensive cluster

• Increases system throughput
– Non-intensive threads have greater potential for making progress
• Does not degrade fairness
– Non-intensive threads are “light”
– Rarely interfere with intensive threads

[Figure: non-intensive cluster > intensive cluster in priority]

Page 13

TCM Outline

1. Clustering

2. Between Clusters

3. Non-Intensive Cluster

Throughput

Page 14

Non-Intensive Cluster

Prioritize threads according to MPKI

• Increases system throughput
– Least intensive thread has the greatest potential for making progress in the processor

[Figure: threads ranked from highest priority (lowest MPKI) to lowest priority (highest MPKI)]

Page 15

TCM Outline

1. Clustering

2. Between Clusters

3. Non-Intensive Cluster

4. Intensive Cluster

Throughput

Fairness

Page 16

Intensive Cluster

Periodically shuffle the priority of threads

• Increases fairness
• Is treating all threads equally good enough?
• BUT: Equal turns ≠ Same slowdown

[Figure: the priority order of the threads in the intensive cluster changes over time, so that a different thread becomes the most prioritized]

Page 17

Case Study: A Tale of Two Threads

Two intensive threads contending:
1. random-access
2. streaming

Which is slowed down more easily?

[Figure: bar charts of slowdown for the two threads. Prioritize random-access: random-access 1x (prioritized), streaming 7x. Prioritize streaming: random-access 11x, streaming 1x (prioritized).]

random-access thread is more easily slowed down

Page 18

Why are Threads Different?

random-access
• All requests parallel
• High bank-level parallelism
• Vulnerable to interference

streaming
• All requests to the same row
• High row-buffer locality

[Figure: memory with Banks 1 to 4; the random-access thread's requests are spread across the banks, the streaming thread's requests all target one activated row, and a request is shown stuck]

Page 19

TCM Outline

1. Clustering

2. Between Clusters

3. Non-Intensive Cluster

4. Intensive Cluster

Fairness

Throughput

Page 20

Niceness

How to quantify difference between threads?

• Bank-level parallelism: vulnerability to interference → increases niceness (+)
• Row-buffer locality: causes interference → decreases niceness (−)

[Figure: scale of niceness from High to Low]
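One way to turn these two signals into a single score is a rank difference. The slide gives only the direction of each effect (bank-level parallelism raises niceness, row-buffer locality lowers it), so the formula below and the `niceness` function name are assumptions for illustration, not the authors' stated definition.

```python
def niceness(blp, rbl):
    """Return a niceness score per thread.

    blp, rbl: dicts mapping thread name -> measured bank-level parallelism
    and row-buffer locality (hypothetical measurements).
    Assumed formula: niceness = rank by BLP - rank by RBL, rank 1 = lowest.
    """
    def ranks(metric):
        ordered = sorted(metric, key=lambda name: metric[name])
        return {name: i for i, name in enumerate(ordered, start=1)}

    blp_rank = ranks(blp)
    rbl_rank = ranks(rbl)
    # High BLP pushes the score up; high RBL pushes it down.
    return {name: blp_rank[name] - rbl_rank[name] for name in blp}
```

For a random-access thread with high bank-level parallelism and low row-buffer locality, and a streaming thread with the opposite profile, this sketch scores the random-access thread as nicer.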

Page 21

Shuffling: Round-Robin vs. Niceness-Aware

1. Round-Robin shuffling
2. Niceness-Aware shuffling

What can go wrong?

[Figure: priority (most prioritized at the top) vs. time in units of ShuffleInterval for threads A to D, ranging from nice to least nice; the most prioritized thread in successive intervals is D, A, B, C, D]

GOOD: Each thread prioritized once

Page 22

Shuffling: Round-Robin vs. Niceness-Aware

1. Round-Robin shuffling
2. Niceness-Aware shuffling

What can go wrong?

[Figure: priority (most prioritized at the top) vs. time in units of ShuffleInterval for threads A to D, ranging from nice to least nice; the most prioritized thread in successive intervals is D, A, B, C, D, with the full priority order shown for each interval]

GOOD: Each thread prioritized once
BAD: Nice threads receive lots of interference

Page 23

Shuffling: Round-Robin vs. Niceness-Aware

1. Round-Robin shuffling
2. Niceness-Aware shuffling

[Figure: priority (most prioritized at the top) vs. time in units of ShuffleInterval for threads A to D, ranging from nice to least nice; the most prioritized thread in successive intervals is D, C, B, A, D]

GOOD: Each thread prioritized once

Page 24

Shuffling: Round-Robin vs. Niceness-Aware

1. Round-Robin shuffling
2. Niceness-Aware shuffling

[Figure: priority (most prioritized at the top) vs. time in units of ShuffleInterval for threads A to D, ranging from nice to least nice; the most prioritized thread in successive intervals is D, C, B, A, D, with the full priority order shown for each interval]

GOOD: Each thread prioritized once
GOOD: Least nice thread stays mostly deprioritized
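The round-robin variant can be sketched as a rotation of the priority order once per shuffle interval. This is a minimal sketch under assumptions: the function name `round_robin_schedule` is hypothetical, and rotating the whole order by one position is an interpretation of the figure, of which only the sequence of most prioritized threads (D, A, B, C, D) is recoverable. The niceness-aware variant is not sketched because the slides do not spell out its rule.

```python
def round_robin_schedule(order, intervals):
    """Return the priority order used in each shuffle interval.

    order: list of thread names, index 0 = most prioritized.
    Assumption: the order rotates by one position every interval.
    """
    schedule = []
    current = list(order)
    for _ in range(intervals):
        schedule.append(list(current))
        # The most prioritized thread moves to the bottom of the order.
        current = current[1:] + current[:1]
    return schedule


# Starting from D on top, the top thread in five intervals is D, A, B, C, D.
tops = [o[0] for o in round_robin_schedule(["D", "A", "B", "C"], 5)]
```

Over any four consecutive intervals every thread is most prioritized exactly once, which is the "GOOD" property noted on the slide. The rotation ignores niceness entirely, which is the weakness the niceness-aware variant addresses.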

Page 25

TCM Outline

1. Clustering
2. Between Clusters
3. Non-Intensive Cluster (Throughput)
4. Intensive Cluster (Fairness)

Page 26

Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion

Page 27

Quantum-Based Operation

During quantum:
• Monitor thread behavior
1. Memory intensity
2. Bank-level parallelism
3. Row-buffer locality

Beginning of quantum:
• Perform clustering
• Compute niceness of intensive threads

[Figure: timeline showing the previous quantum (~1M cycles), the current quantum (~1M cycles), and a shuffle interval (~1K cycles)]

Page 28

TCM Scheduling Algorithm

1. Highest-rank: Requests from higher ranked threads prioritized
• Non-Intensive cluster > Intensive cluster
• Non-Intensive cluster: lower intensity → higher rank
• Intensive cluster: rank shuffling
2. Row-hit: Row-buffer hit requests are prioritized
3. Oldest: Older requests are prioritized
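The three rules form a lexicographic ordering over pending requests, which can be sketched as a sort key. This is an illustrative sketch only: the `Request` fields and the `pick_next` function are hypothetical names, and the thread rank is assumed to have been computed already from the cluster and shuffling rules.

```python
from dataclasses import dataclass


@dataclass
class Request:
    thread_rank: int  # higher value = higher ranked thread (assumed precomputed)
    row_hit: bool     # True if the request hits the open row buffer
    arrival: int      # arrival cycle; smaller = older


def pick_next(requests):
    """Pick the request to service: highest rank, then row hit, then oldest."""
    return min(
        requests,
        key=lambda r: (-r.thread_rank, not r.row_hit, r.arrival),
    )
```

Because tuples compare element by element, the row-hit rule only breaks ties among requests from equally ranked threads, and age only breaks ties among requests that agree on both rank and row-hit status.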

Page 29

Implementation Costs

Required storage at memory controller (24 cores)

Thread memory behavior    Storage
MPKI                      ~0.2 kbits
Bank-level parallelism    ~0.6 kbits
Row-buffer locality       ~2.9 kbits
Total                     < 4 kbits

• No computation is on the critical path

Page 30

Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion

Page 31

Metrics & Methodology

• Metrics
– System throughput: $\text{Weighted Speedup} = \sum_i \frac{IPC_i^{shared}}{IPC_i^{alone}}$
– Unfairness: $\text{Maximum Slowdown} = \max_i \frac{IPC_i^{alone}}{IPC_i^{shared}}$
• Methodology
– Core model
  • 4 GHz processor, 128-entry instruction window
  • 512 KB/core L2 cache
– Memory model: DDR2
– 96 multiprogrammed SPEC CPU2006 workloads
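The two metrics on this slide can be computed directly from per-thread IPC values: weighted speedup sums each thread's shared-to-alone IPC ratio, and maximum slowdown takes the largest alone-to-shared ratio. The function names and the sample numbers below are illustrative assumptions, not data from the evaluation.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """System throughput: sum over threads of IPC_shared / IPC_alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def maximum_slowdown(ipc_shared, ipc_alone):
    """Unfairness: largest IPC_alone / IPC_shared over all threads."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))


# Hypothetical two-thread workload: thread 0 runs at half its alone speed,
# thread 1 at a quarter of its alone speed.
alone = [2.0, 1.0]
shared = [1.0, 0.25]
ws = weighted_speedup(shared, alone)  # 0.5 + 0.25 = 0.75
ms = maximum_slowdown(shared, alone)  # max(2.0, 4.0) = 4.0
```

A higher weighted speedup means better system throughput, while a lower maximum slowdown means better fairness, which is why the result plots put the best configurations toward high weighted speedup and low maximum slowdown.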

Page 32

Previous Work

FRFCFS [Rixner et al., ISCA00]: Prioritizes row-buffer hits
– Thread-oblivious → Low throughput & Low fairness

STFM [Mutlu et al., MICRO07]: Equalizes thread slowdowns
– Non-intensive threads not prioritized → Low throughput

PAR-BS [Mutlu et al., ISCA08]: Prioritizes oldest batch of requests while preserving bank-level parallelism
– Non-intensive threads not always prioritized → Low throughput

ATLAS [Kim et al., HPCA10]: Prioritizes threads with less memory service
– Most intensive thread starves → Low fairness

Page 33

Results: Fairness vs. Throughput

[Figure: Maximum Slowdown (y-axis, lower is better fairness) vs. Weighted Speedup (x-axis, higher is better system throughput) for TCM, ATLAS, PAR-BS, STFM, and FRFCFS, averaged over 96 workloads; the plot is annotated with differences of 5%, 39%, 8%, and 5%]

TCM provides best fairness and system throughput

Page 34

Results: Fairness-Throughput Tradeoff

When configuration parameter is varied…

[Figure: Maximum Slowdown (y-axis, lower is better fairness) vs. Weighted Speedup (x-axis, higher is better system throughput) for FRFCFS, STFM, PAR-BS, ATLAS, and TCM; the TCM curve is labeled "Adjusting ClusterThreshold"]

TCM allows robust fairness-throughput tradeoff

Page 35

Operating System Support

• ClusterThreshold is a tunable knob
– OS can trade off between fairness and throughput
• Enforcing thread weights
– OS assigns weights to threads
– TCM enforces thread weights within each cluster

Page 36

Outline
• Motivation & Insights
• Overview
• Algorithm
• Bringing it All Together
• Evaluation
• Conclusion

Page 37

Conclusion

• No previous memory scheduling algorithm provides both high system throughput and fairness
– Problem: They use a single policy for all threads
• TCM groups threads into two clusters
1. Prioritize non-intensive cluster → throughput
2. Shuffle priorities in intensive cluster → fairness
3. Shuffling should favor nice threads → fairness
• TCM provides the best system throughput and fairness

Page 38

THANK YOU

Page 39

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior

Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol-Balter

Page 40

Thread Weight Support

• Even if heaviest weighted thread happens to be the most intensive thread…
– Not prioritized over the least intensive thread

Page 41

Harmonic Speedup

[Figure: fairness vs. system throughput plot for the harmonic speedup metric, with directions labeled "Better system throughput" and "Better fairness"]

Page 42

Shuffling Algorithm Comparison

• Niceness-Aware shuffling
– Average of maximum slowdown is lower
– Variance of maximum slowdown is lower

Shuffling Algorithm       Round-Robin    Niceness-Aware
E(Maximum Slowdown)       5.58           4.84
VAR(Maximum Slowdown)     1.61           0.85

Page 43

Sensitivity Results

ShuffleInterval (cycles)    500     600     700     800
System Throughput           14.2    14.3    14.2    14.7
Maximum Slowdown            6.0     5.4     5.9     5.5

Number of Cores                          4      8       16      24      32
System Throughput (compared to ATLAS)    0%     3%      2%      1%      1%
Maximum Slowdown (compared to ATLAS)     -4%    -30%    -29%    -30%    -41%