Operating System Management of Shared Caches on Multicore Processors -- Ph.D. Thesis Presentation -- Apr. 20, 2010. David Tam. Supervisor: Michael Stumm

Ph.D. thesis presentation

Dec 04, 2014


Slides from my Ph.D. thesis presentation.

"Operating System Management of Shared Caches on Multicore Processors"
Transcript
Page 1: Ph.D. thesis presentation

1

Operating System Management of Shared Caches

on Multicore Processors

-- Ph.D. Thesis Presentation --

Apr. 20, 2010

David Tam

Supervisor: Michael Stumm

Page 2: Ph.D. thesis presentation

2

Multicores Today

Multicores are Ubiquitous
● Unexpected by most software developers
● Software support is lacking (e.g., OS)

General Role of OS
● Manage shared hardware resources

New Candidate
● Shared cache: performance critical
● Focus of thesis

[Diagram: cores with private caches versus cores sharing one on-chip cache]

Page 3: Ph.D. thesis presentation

3

Thesis: The OS should manage the on-chip shared caches of multicore processors

Demonstrate:
● Properly managing shared caches at the OS level can increase performance

Management Principles
1. Promote sharing
● For threads that share data
● Maximize major advantage of shared caches
2. Provide isolation
● For threads that do not share data
● Minimize major disadvantage of shared caches

Supporting Role
● Provision the shared cache online

Page 4: Ph.D. thesis presentation

4

#1 – Promote Sharing
Problem: Cross-chip accesses are slow
Solution: Exploit major advantage of shared caches: fast access to shared data
OS Actions: Identify & localize data sharing
View: Match software sharing to hardware sharing

[Diagram: Thread A on Chip A and Thread B on Chip B, with shared-data traffic crossing between the two L2 caches]

Page 5: Ph.D. thesis presentation

5

#1 – Promote Sharing

[Diagram: after migration, Thread A and Thread B both run on Chip A; Shared Data sits in the local L2]

(Problem/Solution/OS Actions/View text repeated from slide 4.)

Page 6: Ph.D. thesis presentation

6

Identify Data Sharing
● Detect sharing online with hardware performance counters
● Monitor remote cache accesses (data addresses)
● Track on a per-thread basis
● Data addresses indicate memory regions shared with other threads

Localize Data Sharing
● Identify clusters of threads that access the same memory regions
● Migrate threads of a cluster onto the same chip

[Diagram: Thread A and Thread B on separate chips, with shared-data traffic crossing between the L2 caches]

Thread Clustering [EuroSys'07]
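The clustering step can be sketched in a few lines. This is a minimal illustration, not the thesis's kernel implementation: it assumes each thread's performance-counter samples have already been condensed into a per-region remote-access vector (a "sharing signature"), and it groups threads greedily by cosine similarity. The signature format and the 0.5 threshold are illustrative assumptions.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sharing signatures."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_threads(signatures, threshold=0.5):
    """Greedily group threads whose region-access vectors are similar.

    signatures: dict thread_id -> list of per-region remote-access counts.
    Returns a list of clusters (lists of thread ids) that the OS could
    then migrate onto the same chip to localize sharing.
    """
    clusters = []          # each entry: (centroid sums, [thread ids])
    for tid, sig in signatures.items():
        for centroid, members in clusters:
            if cosine(centroid, sig) >= threshold:
                members.append(tid)
                for i, c in enumerate(sig):
                    centroid[i] += c          # fold into the centroid
                break
        else:
            clusters.append(([float(c) for c in sig], [tid]))
    return [members for _, members in clusters]

# Threads 0 and 1 hit the same region; thread 2 hits a different one.
sigs = {0: [9, 0, 1], 1: [8, 1, 0], 2: [0, 9, 0]}
print(cluster_threads(sigs))   # → [[0, 1], [2]]
```

Each resulting cluster is a candidate set of threads to co-locate on one chip so that their shared data stays in the local L2.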

Page 7: Ph.D. thesis presentation

7

[Diagram: after clustering, Thread A and Thread B share Chip A's L2; Shared Data is local]

(Identify/Localize Data Sharing text repeated from slide 6.)

Thread Clustering [EuroSys'07]

Page 8: Ph.D. thesis presentation

8

Visualization of Clusters
● SPECjbb 2000
● 4 warehouses, 16 threads per warehouse
● Threads have been sorted by cluster for visualization

[Heat map: threads (rows, in groups of 16) versus memory regions (columns, virtual addresses 0 to 2^64); sharing intensity shaded High / Medium / Low / None]

Page 9: Ph.D. thesis presentation

9

(Slide content repeated from slide 8.)

Page 10: Ph.D. thesis presentation

10

Performance Results

● Multithreaded commercial workloads: RUBiS, VolanoMark, SPECjbb2k
● 8-way IBM POWER5 Linux system
  ● 22%, 32%, 70% reduction in stalls caused by cross-chip accesses
  ● 7%, 5%, 6% performance improvement
● 32-way IBM POWER5+ Linux system
  ● 14% SPECjbb2k potential improvement

[Diagram: two POWER5 chips, each with a shared 1.9MB L2, a 36 MB L3, and 4 GB of memory]

Page 11: Ph.D. thesis presentation

11

#2 – Provide Isolation

[Diagram: Apache and MySQL competing for space in a shared cache]

Problem: Major disadvantage of shared caches: cache space interference
Solution: Provide cache space isolation between applications
OS Actions: Enforce isolation during physical page allocation
View: Partition into smaller private caches

Page 12: Ph.D. thesis presentation

(Slides 12–15 repeat slide 11 as animation steps: a partition Boundary appears between Apache's and MySQL's regions of the shared cache.)

Page 16: Ph.D. thesis presentation

16

Cache Partitioning [WIOSCA'07]
● Apply page-coloring technique
● Guide physical page allocation to control cache line usage
● Works on existing processors

[Diagram: the application's virtual pages are mapped by the OS to physical pages of Color A; the fixed hardware mapping then places those pages in the Color A region (N sets) of the L2 cache]

Page 17: Ph.D. thesis presentation

17

[Diagram: slide 16 extended to two applications; Application A's virtual pages map to Color A physical pages and Application B's to Color B pages, so each application occupies a disjoint region (N sets each) of the L2 cache]

(Cache Partitioning bullets repeated from slide 16.)

Cache Partitioning [WIOSCA'07]
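A page's color falls out of the hardware's fixed mapping: the physical-address bits that select a cache set also identify the page, so pages whose set-index bits differ land in disjoint groups of cache sets. The sketch below computes colors for an illustrative cache geometry; the numbers are examples, not the POWER5's actual L2 parameters.

```python
def num_colors(cache_size, associativity, line_size, page_size):
    """Number of distinct page colors for a physically indexed cache.

    A page's color names the group of cache sets its lines map to;
    pages of the same color compete for the same sets.
    """
    sets = cache_size // (associativity * line_size)
    return (sets * line_size) // page_size

def page_color(phys_addr, colors, page_size=4096):
    """Color of the physical page containing phys_addr."""
    return (phys_addr // page_size) % colors

# Example geometry (illustrative): 2 MB, 8-way, 64 B lines, 4 KB pages.
colors = num_colors(2 * 1024 * 1024, 8, 64, 4096)
print(colors)                          # → 64
print(page_color(0x12345000, colors))  # → 5
```

A page-coloring allocator keeps one free list per color and serves each application's page faults only from its assigned colors, confining that application to the matching fraction of cache sets.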

Page 18: Ph.D. thesis presentation

18

Impact of Partitioning

[Chart: performance of art and mcf across L2 cache sizes (16 down to 0 colors), versus performance without isolation]

Performance of Other Combos
● 10 pairs of applications: SPECcpu2k, SPECjbb2k
● 4% to 17% improvement (36MB L3 cache)
● 28%, 50% improvement (no L3 cache)

Page 19: Ph.D. thesis presentation

19

[Chart: miss rate (%) of Application X versus allocated cache size (%)]

Provisioning the Cache
Problem: How to determine cache partition size
Solution: Use the L2 cache miss rate curve (MRC) of the application
Criteria: Obtain the MRC rapidly, accurately, online, with low overhead, on existing hardware
OS Actions: Monitor L2 cache accesses using hardware performance counters
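Once each application's MRC is known, sizing the partition reduces to a small search over the possible splits. A toy sketch with hypothetical curves (mrc[k] = misses incurred when the application holds k of the cache's colors); the curves and color count here are made up for illustration:

```python
def best_partition(mrc_a, mrc_b, total_colors):
    """Pick the split (colors for A, rest for B) minimizing combined
    misses, given each application's miss-rate curve.

    mrc_a[k] / mrc_b[k]: misses with k colors (indices 0..total_colors).
    """
    best = None
    for k in range(1, total_colors):          # each app gets >= 1 color
        cost = mrc_a[k] + mrc_b[total_colors - k]
        if best is None or cost < best[1]:
            best = (k, cost)
    return best[0]

# Hypothetical MRCs for a 16-color cache: A's misses fall off quickly,
# while B keeps benefiting from additional cache space.
mrc_a = [100, 40, 20, 12] + [10] * 13        # indices 0..16
mrc_b = [100 - 5 * k for k in range(17)]
print(best_partition(mrc_a, mrc_b, 16))      # → 3
```

The scan is linear in the number of colors, so the OS can cheaply re-run it whenever fresh MRCs arrive.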

Page 20: Ph.D. thesis presentation

20

Design
● Upon every L2 access:
  ● Update sampling register with the data address
  ● Trigger an interrupt to copy the register to a trace log in main memory
● Feed the trace log into Mattson's stack algorithm [1970] to obtain the L2 MRC
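Mattson's insight is that one pass over the trace with an LRU recency stack yields miss counts for every cache size at once: an access's "stack distance" is the depth of its line in the stack, and a cache of c lines misses exactly the accesses with distance greater than c, plus cold misses. Below is a toy version for a fully associative LRU cache; RapidMRC's real input is a sampled hardware trace and it also accounts for set-associativity, which this sketch ignores.

```python
def mrc_from_trace(addresses, line_size=64):
    """Mattson's stack algorithm [1970] over an address trace.

    Returns a function misses(c): the miss count of a fully
    associative LRU cache holding c lines, computed from the
    stack-distance histogram built in a single pass.
    """
    stack = []                # index 0 = most recently used line
    hist = {}                 # stack distance -> reuse count
    cold = 0                  # first-touch (compulsory) misses
    for addr in addresses:
        line = addr // line_size
        if line in stack:
            depth = stack.index(line) + 1     # stack distance
            hist[depth] = hist.get(depth, 0) + 1
            stack.remove(line)
        else:
            cold += 1
        stack.insert(0, line)                 # becomes most recent

    def misses(cache_lines):
        return cold + sum(n for d, n in hist.items() if d > cache_lines)
    return misses

# Trace touching cache lines A, B, A, C, B, A (line-aligned addresses).
misses = mrc_from_trace([0, 64, 0, 128, 64, 0])
print(misses(1), misses(2), misses(3))   # → 6 5 3
```

Evaluating misses() at each candidate partition size (number of colors) produces the MRC directly, with no need to re-simulate the trace per size.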

Results
● Workloads: 30 apps from SPECcpu2k, SPECcpu2k6, SPECjbb2k
● Latency: 227 ms to generate an online L2 MRC
● Accuracy: good, e.g. up to 27% performance improvement when applied to cache partitioning

RapidMRC [ASPLOS'09]

Page 21: Ph.D. thesis presentation

21

Accuracy of RapidMRC
● Execution slice at 10 billion instructions

[Charts: RapidMRC versus real miss rate (MPKI) across cache sizes (# colors) for xalancbmk, jbb, mcf 2k, gzip, mgrid, and ammp]

Page 22: Ph.D. thesis presentation

22

Effectiveness on Provisioning

[Chart: performance of twolf and equake across L2 cache sizes (16 down to 0 colors), comparing Without Isolation, RapidMRC, and Real MRC]

Performance of Other Combos Using RapidMRC
● 12% improvement for vpr+applu
● 14% improvement for ammp+3applu

Page 23: Ph.D. thesis presentation

23

Contributions
On commodity multicores, first to demonstrate:
● Mechanism: detect data sharing online & automatically cluster threads
  Benefit: promoting sharing [EuroSys'07]
● Mechanism: partition the shared cache by applying page coloring
  Benefit: providing isolation [WIOSCA'07]
● Mechanism: approximate L2 MRCs online in software
  Benefit: provisioning the cache [ASPLOS'09]

...all performed by the OS.

Page 24: Ph.D. thesis presentation

24

Concluding Remarks
Demonstrated Performance Improvements
● Promoting Sharing
  ● 5% – 7%: SPECjbb2k, RUBiS, VolanoMark (2 chips)
  ● 14% potential: SPECjbb2k (8 chips)
● Providing Isolation
  ● 4% – 17%: 8 combos of SPECcpu2k, SPECjbb2k (36MB L3 cache)
  ● 28%, 50%: 2 combos of SPECcpu2k (no L3 cache)
● Provisioning the Cache Online
  ● 12% – 27%: 3 combos of SPECcpu2k

OS should manage on-chip shared caches

Page 25: Ph.D. thesis presentation

25

Thank You

Page 26: Ph.D. thesis presentation

26

24-9=15 slides

Page 27: Ph.D. thesis presentation

27

Future Research Opportunities
Shared cache management principles can be applied to other layers:
● Application, managed runtime, virtual machine monitor

Promoting Sharing
● Improve locality on NUMA multiprocessor systems

Providing Isolation
● Finer granularity, within one application [MICRO'08]
  ● Regions
  ● Objects

RapidMRC
● Online L2 MRCs
  ● Reducing energy
  ● Guiding co-scheduling
● Underlying tracing mechanism
  ● Trace other hardware events