Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies
Po-An Tsai, Changping Chen, and Daniel Sanchez

Aug 13, 2020

Die-stacking has enabled near-data processing

Conventional multicore processors use a multi-level deep cache hierarchy to reduce data movement.

[Figure: conventional multicore processor; labels: cores, private caches, shared LLC]

Near-data processors place cores close to main memory to reduce data movement.

[Figure: memory stack; labels: DRAM dies, logic layer, NDP core, vault controller, private cache only (shallow hierarchy)]

Neither shallow nor deep hierarchies work well for all applications…

Asymmetric hierarchies get the best of both worlds

Prior work proposes hybrid systems with asymmetric memory hierarchies to get the best of both.

[Ahn et al., ISCA’15][Gao et al., PACT’15][Hsieh et al., ISCA’16][Boroumand et al., ASPLOS’18]

Applications have strong hierarchy preferences

[Figure: access latency (ns) for a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss]

[Figure: Performance/J of milc on different hierarchies; normalized Perf/J for the deep and shallow hierarchies]

[Figure: Performance/J of xalanc on different hierarchies; normalized Perf/J for the deep and shallow hierarchies]

How well each application can use the shared LLC is critical to its preference.

Scheduling programs to the right hierarchy is hard

Many applications prefer different hierarchies over time because they have different phases.

[Figure: Performance/J of gems]

Applications may prefer different hierarchies due to resource contention with other applications.

[Figure: Performance/J of xalanc; normalized Perf/J for the shallow hierarchy and for the deep hierarchy with 2MB, 4MB, 8MB, and 16MB LLCs]

Prior schedulers focus on different systems and constraints

Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12])
Focuses on symmetric memory systems (multi-socket LLCs/NUMA)
[Figure: two LLCs, LLC 1 and LLC 2, 8MB each]

Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12][Cong, ISLPED’11])
Focuses on asymmetric core microarchitectures (big.LITTLE systems)
[Figure: in-order cores and OoO cores]

NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16])
Focuses on single workloads and requires software modifications or compiler support

By contrast, our goal is to schedule threads considering both memory and core asymmetries, with no program modifications and transparently to users

AMS: An asymmetry-aware scheduler

Hardware: utility monitors sample accesses and produce miss curves (misses vs. cache size).

Software, first contribution: an analytical model that estimates performance under different hierarchies.

Software, second contribution: two thread placement algorithms (AMS-Greedy/AMS-DP) that extend techniques originally designed for cache partitioning. Their output is used to schedule threads.
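
The hardware monitors are described only at a high level here. As a rough software analogue, the sketch below derives a miss curve from an address trace using LRU stack distances. The function name `miss_curve`, the fully associative LRU assumption, and measuring cache size in lines are all illustrative assumptions, not details from the talk.

```python
from collections import OrderedDict


def miss_curve(trace, max_lines):
    """Return misses[c] for cache sizes c = 0..max_lines (in lines).

    Hypothetical helper: assumes a fully associative LRU cache and an
    unsampled trace, which real hardware monitors would not use.
    """
    stack = OrderedDict()                    # most recently used entry is last
    hits_at_distance = [0] * (max_lines + 1)
    for addr in trace:
        if addr in stack:
            # Stack distance: position of addr counted from the MRU end (1-based).
            distance = len(stack) - list(stack).index(addr)
            if distance <= max_lines:
                hits_at_distance[distance] += 1
            stack.move_to_end(addr)
        else:
            stack[addr] = True               # first touch: a miss at every size
    misses, hits = [], 0
    for size in range(max_lines + 1):
        hits += hits_at_distance[size]       # a cache of `size` lines hits all distances <= size
        misses.append(len(trace) - hits)
    return misses


# Made-up six-access trace over three lines.
print(miss_curve(list("ABACBA"), 3))
```

The resulting list is non-increasing in cache size, which is the shape the later partitioning steps rely on.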

AMS analytical model

AMS estimates application preferences using total memory access latency

Deep hierarchy has a shared LLC
Lat = (# accesses × Latency of LLC) + (# misses × Latency of deep mem)
# misses is a function of LLC capacity (the miss curve from hardware monitors)

Shallow hierarchy has no shared LLC
Lat = # accesses × Latency of shallow mem

[Figure: miss curve from hardware monitors; # misses vs. LLC capacity (MB)]

[Figure: latency curve model; latency vs. LLC capacity (MB) for the processor-die core and the NDP core, with regions labeled "Use processor-die core" and "Use NDP core"]
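
The two latency formulas can be written directly as code. This is a minimal sketch: the function names, the dictionary representation of the miss curve, and every number in the example are made up for illustration and do not come from the talk.

```python
def deep_latency(accesses, misses_at, llc_mb, llc_lat_ns, deep_mem_lat_ns):
    """Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem)."""
    return accesses * llc_lat_ns + misses_at[llc_mb] * deep_mem_lat_ns


def shallow_latency(accesses, shallow_mem_lat_ns):
    """Lat = # accesses x Latency of shallow mem."""
    return accesses * shallow_mem_lat_ns


def preferred_hierarchy(accesses, misses_at, llc_mb,
                        llc_lat_ns, deep_mem_lat_ns, shallow_mem_lat_ns):
    """Pick the hierarchy with the lower total memory access latency."""
    deep = deep_latency(accesses, misses_at, llc_mb, llc_lat_ns, deep_mem_lat_ns)
    shallow = shallow_latency(accesses, shallow_mem_lat_ns)
    return "deep" if deep < shallow else "shallow"


# Hypothetical miss curve: LLC capacity (MB) -> misses out of 1000 accesses.
misses = {0: 1000, 2: 800, 4: 300, 8: 100}
for capacity in (2, 4):
    print(capacity, preferred_hierarchy(1000, misses, capacity, 10, 60, 40))
```

With these invented latencies the thread prefers the shallow hierarchy at 2MB and the deep hierarchy at 4MB, which mirrors the crossover between the two regions of the latency curve.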

Handling heterogeneous cores

Combine model from prior work (PIE) with our memory latency model

Latency curves: memory latency vs. LLC capacity (MB) for the processor-die core and the NDP core

Weight by MLP to get memory stall curves: memory stalls vs. LLC capacity (MB) for each core type

Add the non-memory component, weighted by ILP, to get core cycle curves: core cycles (memory stalls plus non-mem cycles) vs. LLC capacity (MB) for each core type

Can be extended to other asymmetries, like frequencies (see paper)
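
One possible reading of the three steps above is sketched below. Dividing memory latency by MLP and dividing the instruction count by ILP are assumptions about what "weight by" means here; the paper gives the actual model. All names and numbers are hypothetical.

```python
def memory_stalls(latency_curve, mlp):
    """Scale each point of a latency curve by memory-level parallelism.

    Assumption: overlapping `mlp` misses hides latency, so stalls = latency / mlp.
    """
    return {cap: lat / mlp for cap, lat in latency_curve.items()}


def core_cycles(latency_curve, mlp, instructions, ilp):
    """Add a non-memory component to the memory stall curve.

    Assumption: non-memory cycles = instructions / ilp, independent of LLC capacity.
    """
    non_mem = instructions / ilp
    return {cap: stall + non_mem
            for cap, stall in memory_stalls(latency_curve, mlp).items()}


# Hypothetical latency curve in cycles: LLC capacity (MB) -> total memory latency.
curve = {2: 1000.0, 4: 400.0}
print(core_cycles(curve, mlp=2.0, instructions=600, ilp=3.0))
```

Evaluating the same thread with the MLP and ILP of each core type would give one core cycle curve per core type, which can then be compared like the latency curves.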

AMS-Greedy overview

Solve an optimization problem that seeks to minimize total cost

Starts by mapping all threads to the deep hierarchy (processor die) and moves some threads to the NDP cores over multiple rounds

Input: cost curves of all threads for the deep hierarchy
Run a cache partitioning algorithm from prior work to get a partition plan (e.g., T1: 3MB, T2: 1MB, T3: 4MB)
Compare the cost of the deep and shallow hierarchies according to the plan
Map some threads to the shallow hierarchy
Do the remaining threads fit in the deep hierarchy?
Yes: done
No: repeat with the cost curves for threads still mapped to the deep hierarchy

AMS-Greedy: Leveraging cache partitioning to schedule threads

Uses opportunity cost to decide which thread should give up the processor die

[Figure: cost vs. LLC capacity (MB) curves for Thread 1, Thread 2, and Thread 3; the 8MB LLC is partitioned among threads 1-3 into 3MB, 4MB, and 1MB shares, with the opportunity cost marked; opportunity cost < 0: move to NDP]

Perform multiple rounds of partitioning until the processor die is not oversubscribed

Overhead: 0.1% of system cycles when scheduling every 50ms
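
The round structure can be sketched as follows. This is not the authors' implementation. The partitioner is a simple marginal-gain stand-in for the prior-work algorithm, opportunity cost is assumed to be shallow cost minus deep cost under the current plan, the tie-break that forces one thread off an oversubscribed die is an assumption, and NDP core limits are ignored. All names and cost values are hypothetical.

```python
def partition_llc(deep_cost, llc_units):
    """Stand-in partitioner: hand out LLC units one at a time by largest cost drop."""
    alloc = {t: 0 for t in deep_cost}
    for _ in range(llc_units):
        best = max(alloc, key=lambda t: deep_cost[t][alloc[t]] - deep_cost[t][alloc[t] + 1])
        alloc[best] += 1
    return alloc


def ams_greedy_sketch(deep_cost, shallow_cost, llc_units, deep_cores):
    """deep_cost[t][u]: cost of thread t with u LLC units; shallow_cost[t]: cost on NDP."""
    on_deep, on_ndp = set(deep_cost), set()
    while on_deep:
        # Round: partition the LLC among threads still on the processor die.
        alloc = partition_llc({t: deep_cost[t] for t in sorted(on_deep)}, llc_units)
        # Assumed opportunity cost of leaving the die under this plan.
        opp = {t: shallow_cost[t] - deep_cost[t][alloc[t]] for t in sorted(on_deep)}
        movers = {t for t in on_deep if opp[t] < 0}
        if not movers and len(on_deep) > deep_cores:
            # Assumption: if nobody prefers NDP, evict the thread that loses the least.
            movers = {min(opp, key=opp.get)}
        on_deep -= movers
        on_ndp |= movers
        if len(on_deep) <= deep_cores:   # remaining threads fit: done
            break
    return on_deep, on_ndp


# Hypothetical cost curves for 0..4 LLC units, and costs on the shallow hierarchy.
deep = {"t1": [100, 60, 40, 30, 25], "t2": [90, 88, 86, 85, 84], "t3": [80, 50, 32, 20, 15]}
shallow = {"t1": 70, "t2": 60, "t3": 65}
stay, move = ams_greedy_sketch(deep, shallow, llc_units=4, deep_cores=2)
print(sorted(stay), sorted(move))   # ['t1', 't3'] ['t2']
```

In this example the streaming-like thread `t2` gets no LLC capacity in the first plan, so its opportunity cost is negative and it moves to an NDP core; the two remaining threads fit on the die and the loop stops after one round.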

AMS-DP: Scheduling threads with dynamic programming

Prior work has shown that dynamic programming (DP) can solve cache partitioning optimally in polynomial time

We propose an algorithm using DP to solve our optimization problem optimally

AMS-DP serves as the upper bound for AMS-Greedy, but it is more expensive
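
For reference, the cache partitioning DP that the first statement alludes to can be sketched as below. This covers only plain cache partitioning, not the AMS-DP formulation, which is left to the paper. The function name, the unit granularity, and the cost values are hypothetical.

```python
def dp_partition(cost, capacity):
    """Optimal cache partitioning by dynamic programming.

    cost[i][u] is the cost of thread i when given u cache units (u = 0..capacity).
    Returns (minimum total cost, units per thread). Runs in O(threads * capacity^2).
    """
    inf = float("inf")
    n = len(cost)
    # best[i][b]: minimum total cost of the first i threads using at most b units.
    best = [[inf] * (capacity + 1) for _ in range(n + 1)]
    choice = [[0] * (capacity + 1) for _ in range(n + 1)]
    best[0] = [0] * (capacity + 1)
    for i in range(1, n + 1):
        for budget in range(capacity + 1):
            for give in range(budget + 1):
                total = best[i - 1][budget - give] + cost[i - 1][give]
                if total < best[i][budget]:
                    best[i][budget] = total
                    choice[i][budget] = give
    # Walk the choices backwards to recover each thread's allocation.
    alloc, budget = [0] * n, capacity
    for i in range(n, 0, -1):
        alloc[i - 1] = choice[i][budget]
        budget -= alloc[i - 1]
    return best[n][capacity], alloc


# Hypothetical cost curves for three threads over 0..4 cache units.
curves = [[100, 60, 40, 30, 25], [90, 88, 86, 85, 84], [80, 50, 32, 20, 15]]
print(dp_partition(curves, 4))   # (162, [2, 0, 2])
```

The quadratic dependence on the number of capacity units is one reason a DP-based scheduler costs more than a greedy one.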

Data placement for asymmetric hierarchies

NDP systems have different constraints from NUMA systems

NDP cores have plentiful intra-stack bandwidth but limited inter-stack bandwidth

We use simple heuristics to keep data from a thread in a single stack

Threads try to allocate to the same stack so long as the stack has enough capacity
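
A minimal sketch of such a heuristic is shown below. Only the "stay in the same stack while it has capacity" rule comes from the slide; the fallback to the stack with the most free capacity, the function name, and the byte-based bookkeeping are assumptions for illustration.

```python
def pick_stack(home_stack, free_bytes, request_bytes):
    """Choose the memory stack for an allocation made by one thread.

    home_stack: index of the stack that already holds the thread's data.
    free_bytes: free capacity per stack.
    Returns a stack index, or None if no stack can hold the request.
    """
    # Rule from the slide: keep allocating in the same stack while it has capacity.
    if free_bytes[home_stack] >= request_bytes:
        return home_stack
    # Assumed fallback: spill to the stack with the most free capacity.
    fallback = max(range(len(free_bytes)), key=lambda s: free_bytes[s])
    return fallback if free_bytes[fallback] >= request_bytes else None


free = [10, 5, 8, 2]   # hypothetical free capacity of 4 stacks
print(pick_stack(1, free, 4), pick_stack(3, free, 4))
```

Keeping a thread's data in one stack lets an NDP core in that stack serve most accesses from intra-stack bandwidth instead of the limited inter-stack links.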

See paper for more details

Handling multithreaded workloads

AMS-DP formulation

Different system scenarios

Oversubscribed systems

Short-lived workloads or latency-critical workloads

Evaluation

15

Modeled system:

Workloads

Multi-programmed SPECCPU

Multithreaded SPECOMP/PARSEC

(see paper)

Deep hierarchy: 8-core processor

32KB L1, 256KB L2, 16MB shared LLC

Shallow hierarchy: 4 memory stacks,

each with 2 NDP cores. Each core has

private 32KB L1 + 256KB L2

Compared schedulers

Random (baseline that we normalize to)

Page 73: Adaptive Scheduling for Systems with Asymmetric Memory ... · NDP Core Vault Controller Private cache only (shallow hierarchy) Die-stacking has enabled near-data processing Conventional

Evaluation

15

Modeled system:

Workloads

Multi-programmed SPECCPU

Multithreaded SPECOMP/PARSEC

(see paper)

Deep hierarchy: 8-core processor

32KB L1, 256KB L2, 16MB shared LLC

Shallow hierarchy: 4 memory stacks,

each with 2 NDP cores. Each core has

private 32KB L1 + 256KB L2

Compared schedulers

Random (baseline that we normalize to)

Always NDP/Always processor-die

Page 74: Adaptive Scheduling for Systems with Asymmetric Memory ... · NDP Core Vault Controller Private cache only (shallow hierarchy) Die-stacking has enabled near-data processing Conventional

Evaluation

15

Modeled system:

Workloads

Multi-programmed SPECCPU

Multithreaded SPECOMP/PARSEC

(see paper)

Deep hierarchy: 8-core processor

32KB L1, 256KB L2, 16MB shared LLC

Shallow hierarchy: 4 memory stacks,

each with 2 NDP cores. Each core has

private 32KB L1 + 256KB L2

Compared schedulers

Random (baseline that we normalize to)

Always NDP/Always processor-die

Extended CRUISE [ASPLOS’12]/PIE [ISCA’11]

Page 75: Adaptive Scheduling for Systems with Asymmetric Memory ... · NDP Core Vault Controller Private cache only (shallow hierarchy) Die-stacking has enabled near-data processing Conventional

Evaluation

Modeled system:

Deep hierarchy: 8-core processor with 32KB L1, 256KB L2, 16MB shared LLC

Shallow hierarchy: 4 memory stacks, each with 2 NDP cores. Each core has private 32KB L1 + 256KB L2

Workloads

Multi-programmed SPECCPU

Multithreaded SPECOMP/PARSEC (see paper)

Compared schedulers

Random (baseline that we normalize to)

Always NDP/Always processor-die

Extended CRUISE [ASPLOS’12]/PIE [ISCA’11]

AMS-Greedy/AMS-DP

AMS finds the right hierarchy for each application

Always processor never leverages the NDP capability of the asymmetric system and is 8% worse than Random

Always NDP sometimes hurts applications that prefer deep hierarchies because it never leverages the LLC. It is only 9% better than Random

AMS-Greedy never hurts performance and improves weighted speedup by up to 37% and by 18% on average
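The slides do not spell out how weighted speedup is computed. The sketch below assumes the common definition (the sum, over co-running applications, of each application's IPC in the mix divided by its IPC when running alone) and shows how a gain relative to the Random baseline would then be derived. The function names are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application IPC ratios (assumed definition, see lead-in)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def gain_over_baseline(ws_scheduler, ws_baseline):
    """Relative improvement of a scheduler over the baseline (e.g. Random)."""
    return ws_scheduler / ws_baseline - 1.0
```

Under this reading, "18% on average" corresponds to `gain_over_baseline(...)` averaging about 0.18 across workload mixes.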

AMS handles resource contention better than prior work

Run workloads with 100% utilization to stress contention

AMS-Greedy performs very close to AMS-DP, only 1% worse

Both AMS-Greedy and AMS-DP outperform CRUISE

AMS handles asymmetric core + memory well

Deep hierarchy uses Haswell-like cores

Shallow hierarchy uses Silvermont-like cores

AMS-Greedy with the PIE model improves performance more than handling core and memory asymmetries separately

See paper for more evaluation results

A case study to show AMS adapts to application phases

Multithreaded workloads

Detailed runtime overheads

Sensitivity study for system parameters

Number of cores, LLC capacity, main memory capacity

Performance with and without hardware support for cache partitioning

Conclusion

Scheduling computation in asymmetric systems is very challenging

We present AMS, an adaptive scheduler for asymmetric systems

AMS uses analytical models to adapt quickly and thread mapping algorithms inspired by cache partitioning algorithms to find high-quality mappings

[Figure: AMS overview. In hardware, utility monitors sample accesses and produce miss curves (misses vs. cache size). In software, the miss curves feed an analytical model whose estimates are used to schedule threads.]

First contribution: Analytical model that estimates performance under different hierarchies

Second contribution: Two thread placement algorithms that extend techniques originally designed for cache partitioning
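To make the two contributions concrete, the sketch below goes from a sampled miss curve to an estimated CPI under each hierarchy, then to a greedy placement. It is not the AMS model or the AMS-Greedy algorithm from the paper: it ignores LLC contention and data placement, and the CPI formula, the latency values (`far_penalty`, `near_penalty`), and all function names are assumptions chosen only for illustration.

```python
from bisect import bisect_right


def mpki_at(miss_curve, size_kb):
    """Look up misses per kilo-instruction at a given cache size.

    miss_curve is a list of (cache_size_kb, mpki) points sorted by size;
    the lookup uses the largest sampled size that does not exceed size_kb.
    """
    sizes = [size for size, _ in miss_curve]
    idx = bisect_right(sizes, size_kb) - 1
    return miss_curve[max(idx, 0)][1]


def estimate_cpi(base_cpi, mpki, miss_penalty_cycles):
    """Assumed model: core CPI plus average memory stall cycles per instruction."""
    return base_cpi + (mpki / 1000.0) * miss_penalty_cycles


def greedy_place(threads, ndp_cores, llc_share_kb, l2_kb=256,
                 far_penalty=200, near_penalty=100):
    """Give NDP cores to the threads that gain the most from the shallow hierarchy.

    threads maps a name to (base_cpi, miss_curve). The penalties are
    made-up cycle counts for a miss served from the processor die (far)
    versus from an NDP core (near).
    """
    gains = []
    for name, (base_cpi, curve) in threads.items():
        deep = estimate_cpi(base_cpi, mpki_at(curve, llc_share_kb), far_penalty)
        shallow = estimate_cpi(base_cpi, mpki_at(curve, l2_kb), near_penalty)
        gains.append((deep - shallow, name))  # positive gain: NDP is faster

    gains.sort(reverse=True)
    placement, free = {}, ndp_cores
    for gain, name in gains:
        if gain > 0 and free > 0:
            placement[name] = "NDP"
            free -= 1
        else:
            placement[name] = "processor"
    return placement
```

A streaming thread whose miss curve stays flat as the cache grows sees little benefit from the LLC and lands on an NDP core, while a thread whose misses drop sharply at LLC sizes stays on the processor die.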

Thanks! Any questions?
