Thread Clustering: Sharing-Aware Thread Scheduling on SMP-CMP-SMT Multiprocessors. David Tam, Reza Azimi, Michael Stumm. University of Toronto. {tamda, azimi, stumm}@eecg.toronto.edu
Transcript
Page 1: thread-clustering

Thread Clustering

Thread Clustering: Sharing-Aware Thread Scheduling on SMP-CMP-SMT Multiprocessors

David Tam, Reza Azimi, Michael Stumm

University of Toronto
{tamda, azimi, stumm}@eecg.toronto.edu

Page 3: thread-clustering

Multiprocessors Today

Example: IBM Power 5 system

[Figure: system diagram with labels SMP, CMP, SMT, and SHARED CACHE]

Page 4: thread-clustering

Multiprocessors Today

Example: IBM Power 5 system

[Figure: system diagram annotated with the values 120 and 14]

Disparity in L2 latencies

Page 5: thread-clustering

Operating Systems Today

CPU Schedulers:
● Ignore disparity in L2 latencies
● Ignore data sharing among threads
● Distribute threads poorly
● Cross-chip traffic
● Remote L2 cache accesses
● Causes performance problems

Page 6: thread-clustering

Our Goal: Sharing-Aware Scheduling
● Detect sharing patterns
● Cluster threads

Benefits:
● Decrease cross-chip traffic
● Increase on-chip cache locality
● Exploit shared L2 caches

Page 7: thread-clustering

Our Online Technique

STEPS (REPEAT):
1) Monitor remote cache access rate
2) Detect thread sharing patterns
3) Determine thread clusters
4) Migrate thread clusters
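The four steps repeat as a loop. The sketch below shows one way to sequence a single round of that loop. It assumes that step 1 gates the later steps through a threshold, which the slide does not state. The name `scheduling_round`, the constant `REMOTE_ACCESS_THRESHOLD` and its value, and the three callback parameters are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical threshold on a normalized remote cache access rate.
# The slides do not give a value or a unit; 0.05 is a placeholder.
REMOTE_ACCESS_THRESHOLD = 0.05


def scheduling_round(
    remote_access_rate: float,
    detect: Callable[[], Dict[int, List[int]]],
    cluster: Callable[[Dict[int, List[int]]], List[List[int]]],
    migrate: Callable[[List[List[int]]], None],
) -> bool:
    """Run steps 1-4 once; return True if clusters were migrated."""
    # Step 1: monitor the remote cache access rate.
    # Assumption: a low rate means there is nothing worth fixing yet.
    if remote_access_rate < REMOTE_ACCESS_THRESHOLD:
        return False
    # Step 2: detect thread sharing patterns (thread id -> signature vector).
    signatures = detect()
    # Step 3: determine thread clusters (each cluster is a list of thread ids).
    clusters = cluster(signatures)
    # Step 4: migrate the thread clusters.
    migrate(clusters)
    return True
```

A caller would invoke `scheduling_round` periodically, passing in whatever detection, clustering, and migration routines the system provides.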

Page 8: thread-clustering

Sharing Detection
● To observe remote cache accesses:
    ● Exploit HPCs (hardware performance counters)
    ● Sample remote cache miss addresses
        ● Local cache misses satisfied by remote cache
    ● IBM Power 5 continuous data sampling

Page 11: thread-clustering

Sharing Signatures
● Construct for each thread
● Counts remote cache accesses

[Figure (conceptual): a vector of 8-bit counters, one per block, spanning virtual address 0 to virtual address 2^64]

Page 12: thread-clustering

Sharing Signatures
● Construct for each thread
● Counts remote cache accesses

[Figure (conceptual): a remote cache access to block i increments that block's 8-bit counter (ctr_i++); blocks span virtual address 0 to virtual address 2^64]
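The conceptual signature can be modeled as a per-thread map from block number to an 8-bit counter. This is a minimal sketch, not the authors' implementation. The block size of 128 bytes and the choice to saturate at 255 instead of wrapping are assumptions, and the class and method names are hypothetical.

```python
from collections import defaultdict

BLOCK_SIZE = 128    # assumed block size in bytes; not given on the slide
COUNTER_MAX = 255   # largest value an 8-bit counter can hold


class ConceptualSignature:
    """Per-thread map: block number -> count of remote cache accesses."""

    def __init__(self) -> None:
        self.counters = defaultdict(int)

    def record_remote_access(self, virtual_address: int) -> None:
        # All addresses inside one block share one counter.
        block = virtual_address // BLOCK_SIZE
        # Assumption: the counter saturates at 255 rather than wrapping.
        if self.counters[block] < COUNTER_MAX:
            self.counters[block] += 1
```

A map keyed by every block in a 64-bit address space is only a mental model. The sampling optimizations on the next slide replace it with a small fixed-size vector.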

Page 13: thread-clustering

Optimizations
● CPU: Temporal Sampling
    ● Sample every Nth remote cache access
● Memory: Spatial Sampling
    ● 256-entry vector
    ● Hash function
    ● Block ID filter
● Vectors still effective at indicating sharing
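Temporal sampling can be illustrated with a small software model that lets through every Nth event. The class name `TemporalSampler` is hypothetical, and the slides do not say what value of N is used.

```python
class TemporalSampler:
    """Software model of 'sample every Nth remote cache access'."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.seen = 0  # events observed since the last sample

    def should_sample(self) -> bool:
        # Count the event; take a sample when the Nth one arrives.
        self.seen += 1
        if self.seen == self.n:
            self.seen = 0
            return True
        return False
```

With N = 10, for example, 100 remote cache accesses yield 10 samples, so the CPU cost of recording drops by roughly the same factor.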

Page 14: thread-clustering

Spatial Sampling
● Hash collision & alias removal

[Figure: a 256-entry vector (entries 0 to 255) alongside a 256-entry Block ID filter (entries 0 to 255); filter legend: Empty, Reserved]

Page 15: thread-clustering

Spatial Sampling
● Hash collision & alias removal

[Figure: a Block ID is hashed to a filter entry (0 to 255) that is EMPTY; filter legend: Empty, Reserved]

Page 16: thread-clustering

Spatial Sampling
● Hash collision & alias removal

[Figure: the hashed Block ID reserves the filter entry (First-Come-First-Reserved); filter legend: Empty, Reserved]

Page 17: thread-clustering

Spatial Sampling
● Hash collision & alias removal

[Figure: a Block ID hashes to a reserved filter entry: MATCH Block ID; filter legend: Empty, Reserved]

Page 18: thread-clustering

Spatial Sampling
● Hash collision & alias removal

[Figure: a Block ID hashes to a reserved filter entry: MISMATCH Block ID, ALIASING PREVENTED; filter legend: Empty, Reserved]
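The filter behavior in the figure sequence (empty entry reserved by the first block, match accepted, mismatch rejected) can be sketched as follows. This is one reading of the figures, not a confirmed implementation. Several details are assumptions: the modulo hash stands in for the unspecified hash function, counters saturate at 255, a matching sample increments the counter, and one filter is shared by all threads' vectors so that entry i refers to the same block in each. All names are hypothetical.

```python
from typing import List, Optional

VECTOR_SIZE = 256   # 256-entry vector, as on the slide
COUNTER_MAX = 255   # 8-bit counter limit


def hash_block(block_id: int) -> int:
    """Stand-in hash; the slides do not specify the real function."""
    return block_id % VECTOR_SIZE


class BlockIdFilter:
    """One entry per vector slot: None means Empty, else the reserved block."""

    def __init__(self) -> None:
        self.reserved: List[Optional[int]] = [None] * VECTOR_SIZE

    def admit(self, block_id: int) -> Optional[int]:
        """Return the slot for this block, or None if it aliases another."""
        slot = hash_block(block_id)
        if self.reserved[slot] is None:
            self.reserved[slot] = block_id   # first come, first reserved
            return slot
        if self.reserved[slot] == block_id:
            return slot                      # MATCH: same block as before
        return None                          # MISMATCH: aliasing prevented


def record(vector: List[int], block_filter: BlockIdFilter, block_id: int) -> bool:
    """Count one sampled remote access; return False if it was filtered out."""
    slot = block_filter.admit(block_id)
    if slot is None:
        return False
    vector[slot] = min(vector[slot] + 1, COUNTER_MAX)
    return True
```

The trade-off is that a block arriving second at an occupied slot is never counted, which is the spatial sampling the slide title refers to.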

Page 19: thread-clustering

Automated Clustering

Clustering Heuristic:
● Simple, one-pass algorithm
● Compare vector against existing clusters
● If not similar, create a new cluster

Similarity Metric:

$$\sum_{i=0}^{N} V_1[i] \cdot V_2[i]$$

● Shared blocks amplified
● Non-shared blocks nullified
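The similarity metric is a dot product of two sharing vectors: an entry contributes only when both threads have a nonzero count for it, and large counts on both sides multiply. Below is a minimal sketch of the one-pass heuristic built on that metric. Comparing against the first member of each cluster and using a fixed `threshold` are assumptions, since the slide does not say how "similar" is decided. The function names are hypothetical.

```python
from typing import List, Sequence


def similarity(v1: Sequence[int], v2: Sequence[int]) -> int:
    """Dot product: sum over i of V1[i] * V2[i]."""
    return sum(a * b for a, b in zip(v1, v2))


def cluster_threads(vectors: Sequence[Sequence[int]], threshold: int) -> List[List[int]]:
    """One pass over the threads; returns clusters as lists of thread indices."""
    clusters: List[List[int]] = []
    representatives: List[Sequence[int]] = []  # assumed: first vector of each cluster
    for tid, vec in enumerate(vectors):
        for members, rep in zip(clusters, representatives):
            if similarity(vec, rep) >= threshold:
                members.append(tid)        # similar enough: join this cluster
                break
        else:
            clusters.append([tid])         # not similar to any: new cluster
            representatives.append(vec)
    return clusters


# Threads 0 and 1 share the first two blocks; threads 2 and 3 share the last two.
example = [[9, 9, 0, 0], [8, 7, 0, 0], [0, 0, 5, 6], [0, 0, 9, 9]]
print(cluster_threads(example, threshold=50))  # prints [[0, 1], [2, 3]]
```

In the example, threads 0 and 2 have no block in common, so their product is 0 (non-shared blocks nullified), while threads 0 and 1 score 9*8 + 9*7 = 135 (shared blocks amplified).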

Page 20: thread-clustering

Experimental Platform
● 8-way Power 5, 1.5GHz
● Linux 2.6
● IBM J2SE 5.0 JVM

[Figure: two-chip system diagram; labels on each side: 1.9MB L2, 36MB, 4 GB]

Page 21: thread-clustering

Workloads

Microbenchmark
● expect 4 clusters
    ● 4 threads per cluster

SPECjbb2000 (modified)
● expect 2 clusters
    ● 2 warehouses, 8 threads per warehouse

RUBiS + MySQL
● expect 2 clusters
    ● 2 databases, 16 threads per database

VolanoMark chat server
● expect 2 clusters
    ● 2 rooms, 8 threads per room

Page 22: thread-clustering

Visualizing Clusters
● An example

[Figure: sharing-signature vectors grouped into Cluster A (4 vectors) and Cluster B (4 vectors); counter value scale: 0, 64, 128, 255]

Page 25: thread-clustering

Visualizing Clusters
● Microbenchmark

[Figure: sharing-signature vectors; a brace marks a group of 4 vectors]

Page 26: thread-clustering

Visualizing Clusters
● Modified SPECjbb2000 (4 warehouses)

[Figure: sharing-signature vectors; a brace marks a group of 16 vectors]

Page 27: thread-clustering

Visualizing Clusters
● RUBiS + MySQL (2 databases)

[Figure: sharing-signature vectors; a brace marks a group of 24 vectors]

Page 28: thread-clustering

Visualizing Clusters
● VolanoMark (4 rooms)

[Figure: sharing-signature vectors]

Page 29: thread-clustering

Remote Cache Impact
● Normalized to default Linux

[Figure: bar chart; data labels as extracted, not attributable to individual bars in this transcript: 32, 90, 43, 22, 72, 70, 92, -17]

Page 30: thread-clustering

Performance Impact
● IPC: instructions per cycle
● Normalized to default Linux

[Figure: bar chart; data labels as extracted, not attributable to individual bars in this transcript: 7.4, 6.1, 6.1, 7.1, 5.1, 7.4, 5.0, 3.7, -0.8]

Page 31: thread-clustering

Summary

BEFORE: Current Operating Systems

AFTER: Operating System With Thread Clustering

Page 32: thread-clustering

Conclusions
● HPCs can detect sharing
● Sharing signatures are effective
● Automated thread clustering:
    ● Reduces remote cache accesses by up to 70%
    ● Improves performance by up to 7%
● All with low overhead

Future Work:
● More workloads
● Improve clustering algorithm
● Integration with load-balancing aspects

Page 34: thread-clustering

Sampling Overhead
● Modified SPECjbb2000