
Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies

Po-An Tsai, Changping Chen, and Daniel Sanchez

Die-stacking has enabled near-data processing


Conventional multicore processors use a multi-level deep cache hierarchy to reduce data movement

Near-data processors place cores close to main memory

[Figure: deep hierarchy with cores, private caches, and a shared LLC; die stack with DRAM dies and a logic layer holding NDP cores and vault controllers; NDP cores have a private cache only (shallow hierarchy)]

Neither shallow nor deep hierarchies work well for all applications…

Asymmetric hierarchies get the best of both worlds


Prior work proposes hybrid systems with asymmetric memory hierarchies to get the best of both


[Ahn et al., ISCA’15][Gao et al., PACT’15]

[Hsieh et al., ISCA’16][Boroumand et al., ASPLOS’18]

Applications have strong hierarchy preferences

[Figure: access latency (ns) for a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss]

[Figure: normalized performance/J of milc on the deep and shallow hierarchies]

[Figure: normalized performance/J of xalanc on the deep and shallow hierarchies]

How well each application can use the shared LLC is critical to its preference

Scheduling programs to the right hierarchy is hard

Many applications prefer different hierarchies over time because they have different phases

Applications may prefer different hierarchies due to resource contention with other applications

[Figure: performance/J of gems and performance/J of xalanc; bars for the shallow hierarchy and for the deep hierarchy with a 2MB LLC; y-axis: normalized perf/J]

Prior schedulers focus on different systems and constraints

Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12])

Focuses on symmetric memory systems (multi-socket LLCs/NUMA)

Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12][Cong, ISLPED’11])

Focuses on asymmetric core microarchitectures (big.LITTLE systems)

NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16])

Focuses on single workloads and requires software modifications or compiler support

By contrast, our goal is to schedule threads considering both memory and core asymmetries, with no program modifications and transparently to users

[Figure: two 8MB LLCs (LLC 1, LLC 2); in-order cores and OoO cores]

AMS: An asymmetry-aware scheduler

Hardware: hardware utility monitors sample accesses and produce miss curves (misses vs. cache size)

Software, first contribution: analytical model that estimates performance under different hierarchies

Software, second contribution: two thread placement algorithms (AMS-Greedy/AMS-DP) that extend techniques originally designed for cache partitioning; they are used to schedule threads
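
The monitor-to-miss-curve step can be sketched as follows. This is a minimal sketch, assuming a UMON-style monitor that counts sampled hits per LRU stack position; the slides do not specify the monitor design. The names `hit_counters`, `sampled_accesses`, and `mb_per_way` are hypothetical, and the numbers are made up.

```python
def miss_curve(hit_counters, sampled_accesses, mb_per_way=1.0):
    """Return [(capacity_mb, misses)] for 0..len(hit_counters) ways.

    Assumption: hit_counters[i] is the number of sampled hits at LRU stack
    position i, so under LRU a cache with w ways hits on positions < w.
    """
    curve = []
    hits_so_far = 0
    for ways in range(len(hit_counters) + 1):
        # Misses at this size = all sampled accesses minus hits captured so far.
        curve.append((ways * mb_per_way, sampled_accesses - hits_so_far))
        if ways < len(hit_counters):
            hits_so_far += hit_counters[ways]
    return curve


# Made-up counters: 4 ways of 2MB each, 100 sampled accesses.
curve = miss_curve([40, 25, 10, 5], sampled_accesses=100, mb_per_way=2.0)
```

Each point of the resulting curve gives the misses the thread would incur if it were given that much cache.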

AMS analytical model

AMS estimates application preferences using total memory access latency

[Figure: miss curve from hardware monitors; x-axis: LLC capacity (MB), 2 to 8; y-axis: # misses]

Deep hierarchy has a shared LLC

Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem)

# misses is a function of LLC capacity

Shallow hierarchy has no shared LLC

Lat = # accesses x Latency of shallow mem

[Figure: latency curve model; latency curves for the processor-die core and the NDP core, with regions marked "Use processor-die core" and "Use NDP core"]
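
The two latency formulas on this slide can be sketched in a few lines. This is a minimal sketch: the function names, the miss curve, and the latency values (20, 80, and 40 ns) are made-up illustrations, not numbers from the talk.

```python
def deep_latency(accesses, misses, lat_llc, lat_deep_mem):
    # Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem)
    return accesses * lat_llc + misses * lat_deep_mem


def shallow_latency(accesses, lat_shallow_mem):
    # Lat = # accesses x Latency of shallow mem
    return accesses * lat_shallow_mem


def preferred_hierarchy(accesses, miss_curve, llc_mb,
                        lat_llc, lat_deep_mem, lat_shallow_mem):
    """Pick the hierarchy with the lower total memory access latency.

    miss_curve maps an LLC capacity in MB to a miss count (hypothetical input
    standing in for the miss curve from the hardware monitors).
    """
    deep = deep_latency(accesses, miss_curve[llc_mb], lat_llc, lat_deep_mem)
    shallow = shallow_latency(accesses, lat_shallow_mem)
    return "deep" if deep < shallow else "shallow"


# Made-up thread: 1000 accesses, with misses dropping as LLC capacity grows.
curve = {2: 800, 4: 500, 6: 200, 8: 100}
choices = {mb: preferred_hierarchy(1000, curve, mb, 20, 80, 40) for mb in curve}
```

With these numbers the thread prefers the shallow hierarchy when it can get only 2MB or 4MB of LLC, and the deep hierarchy at 6MB or 8MB. This is the crossover the latency curve figure marks.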

Handling heterogeneous cores

Combine model from prior work (PIE) with our memory latency model

Latency curves (memory latency for the processor-die core and the NDP core)

Weigh by MLP

Memory stall curves (memory stalls for the processor-die core and the NDP core)

Add non-memory component weighed by ILP

Core cycle curves (core cycles, including non-mem cycles, for the processor-die core and the NDP core)

Can be extended to other asymmetries, like frequencies (see paper)
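
The pipeline from latency curves to core cycle curves can be sketched as below. This is a minimal sketch under loud assumptions: "weigh by MLP" is read as dividing memory latency by the core's MLP, and "weighed by ILP" as dividing the non-memory work by the core's ILP. The exact weighting in PIE and in the paper may differ, and all names and numbers are hypothetical.

```python
def core_cycle_curve(latency_curve, mlp, non_mem_work, ilp):
    """Turn a memory latency curve into a core cycle curve for one core type.

    latency_curve: {llc_mb: total memory latency} for this core's hierarchy.
    mlp, ilp: per-core-type parallelism estimates (assumed inputs).
    """
    curve = {}
    for capacity_mb, latency in latency_curve.items():
        mem_stalls = latency / mlp           # overlapping misses hide latency
        non_mem_cycles = non_mem_work / ilp  # wider cores finish compute sooner
        curve[capacity_mb] = mem_stalls + non_mem_cycles
    return curve


# Made-up inputs: an OoO processor-die core vs. a simpler NDP core.
deep = core_cycle_curve({2: 84000, 8: 28000}, mlp=2.0, non_mem_work=30000, ilp=3.0)
# The shallow hierarchy has no LLC, so its latency does not depend on capacity.
ndp = core_cycle_curve({2: 40000, 8: 40000}, mlp=1.0, non_mem_work=30000, ilp=1.5)
```

Comparing the two curves at a given LLC capacity then accounts for both the memory asymmetry and the core asymmetry in one cost.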

AMS-Greedy overview

Solve an optimization problem that seeks to minimize total cost

AMS-Greedy starts by mapping all threads to the deep hierarchy (processor-die) and moves some threads to the NDP cores over multiple rounds

Input: cost curves of all threads for the deep hierarchy

Cache partitioning algo. from prior work produces a partition plan (T1: 3MB, T2: 1MB, T3: 4MB, …)

Compare cost of deep/shallow hier. according to the plan

Map some threads to the shallow hierarchy

Do the remaining threads fit in the deep hier.?

Yes: done

No: repeat with the cost curves for threads still mapped to the deep hierarchy

AMS-Greedy: Leveraging cache partitioning to schedule threads

[Figure: cost curves for Thread 1, Thread 2, and Thread 3]

Partition the 8MB LLC among threads 1-3 (partitions of 3MB, 4MB, and 1MB)

Uses opportunity cost to decide which thread should give up the processor die

Opportunity cost < 0: move to NDP

Perform multiple rounds of partitioning until the processor die is not oversubscribed

Overhead: 0.1% of system cycles when scheduling every 50ms
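
The rounds described above can be sketched as follows. This is a minimal sketch, not the authors' implementation. It assumes the opportunity cost of a thread is its shallow-hierarchy cost minus its deep-hierarchy cost at its planned LLC share, and it uses a simple 1MB-at-a-time hill-climbing partitioner as a stand-in for the cache partitioning algorithm from prior work. All names and cost values are hypothetical.

```python
def partition_llc(cost_curves, total_mb):
    """Stand-in partitioner (assumption): hand out 1MB at a time to the
    thread whose cost drops the most. cost_curves[t][s] = cost with s MB."""
    alloc = {t: 0 for t in cost_curves}
    if not alloc:
        return alloc
    for _ in range(total_mb):
        best = max(alloc, key=lambda t: cost_curves[t][alloc[t]]
                   - cost_curves[t][alloc[t] + 1])
        alloc[best] += 1
    return alloc


def ams_greedy_sketch(deep_cost, shallow_cost, llc_mb, deep_cores, ndp_cores):
    on_deep, on_ndp = set(deep_cost), set()  # start with every thread on the deep hierarchy
    while True:
        plan = partition_llc({t: deep_cost[t] for t in sorted(on_deep)}, llc_mb)
        # Assumed opportunity cost of leaving the processor die:
        # shallow-hierarchy cost minus deep-hierarchy cost at the planned share.
        opp = {t: shallow_cost[t] - deep_cost[t][plan[t]] for t in on_deep}
        free = ndp_cores - len(on_ndp)
        # Threads that are cheaper on NDP move first, most negative first.
        movers = sorted((t for t in on_deep if opp[t] < 0), key=opp.get)[:free]
        if not movers and len(on_deep) > deep_cores and free > 0:
            # Processor die still oversubscribed: evict the cheapest thread to move.
            movers = [min(on_deep, key=opp.get)]
        if not movers:
            return on_deep, on_ndp, plan
        on_deep -= set(movers)
        on_ndp |= set(movers)


# Made-up cost curves for a 4MB LLC (index = MB of LLC given to the thread).
deep_cost = {"T1": [100, 60, 40, 30, 25],
             "T2": [90, 85, 82, 80, 79],
             "T3": [120, 70, 52, 45, 44]}
shallow_cost = {"T1": 55, "T2": 50, "T3": 65}
on_deep, on_ndp, plan = ams_greedy_sketch(deep_cost, shallow_cost,
                                          llc_mb=4, deep_cores=2, ndp_cores=2)
```

In this example T2 barely benefits from the LLC, so it moves to an NDP core in the first round. The second round repartitions the LLC between T1 and T3 and stops because the remaining threads fit on the processor die.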

AMS-DP: Scheduling threads with dynamic programming

Prior work has shown that dynamic programming (DP) can solve cache partitioning optimally in polynomial time

We propose an algorithm using DP to solve our optimization problem optimally

AMS-DP serves as the upper bound of AMS-Greedy

But it is more expensive
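
A DP over threads can be sketched as below. This is a minimal sketch under stated assumptions, not the AMS-DP formulation from the paper: the DP state is the pair (LLC megabytes used, NDP cores used), each thread either runs on an NDP core or gets some LLC share on the processor die, and the total cost is the sum of per-thread costs. All names and cost values are hypothetical.

```python
def ams_dp_sketch(deep_cost, shallow_cost, llc_mb, deep_cores, ndp_cores):
    """Return (min total cost, placement) or None if no placement is feasible."""
    threads = sorted(deep_cost)
    # best[(c, k)] = (cost, placement) using c MB of LLC and k NDP cores so far.
    best = {(0, 0): (0.0, {})}
    for t in threads:
        nxt = {}
        for (c, k), (cost, placed) in best.items():
            options = []
            if k < ndp_cores:  # place t on an NDP core
                options.append((c, k + 1, shallow_cost[t], ("ndp", 0)))
            for s in range(llc_mb - c + 1):  # place t on the die with s MB of LLC
                options.append((c + s, k, deep_cost[t][s], ("deep", s)))
            for c2, k2, extra, where in options:
                cand = cost + extra
                if (c2, k2) not in nxt or cand < nxt[(c2, k2)][0]:
                    nxt[(c2, k2)] = (cand, {**placed, t: where})
        best = nxt
    # Threads not on NDP cores must fit on the processor-die cores.
    feasible = [v for (c, k), v in best.items() if len(threads) - k <= deep_cores]
    return min(feasible, key=lambda v: v[0]) if feasible else None


# Made-up cost curves for a 4MB LLC (index = MB of LLC given to the thread).
deep_cost = {"T1": [100, 60, 40, 30, 25],
             "T2": [90, 85, 82, 80, 79],
             "T3": [120, 70, 52, 45, 44]}
shallow_cost = {"T1": 55, "T2": 50, "T3": 65}
total, placement = ams_dp_sketch(deep_cost, shallow_cost,
                                 llc_mb=4, deep_cores=2, ndp_cores=2)
```

With n threads, C megabytes, and K NDP cores, this sketch visits on the order of n x C x K states with up to C + 1 choices each, so it stays polynomial but costs more than a few greedy rounds.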

Data placement for asymmetric hierarchies

NDP systems have different constraints from NUMA systems

NDP cores have plentiful intra-stack bandwidth but limited inter-stack bandwidth

We use simple heuristics to keep data from a thread in a single stack

Each thread tries to allocate its data in the same stack as long as that stack has enough capacity
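
The same-stack heuristic can be sketched as follows. This is a minimal sketch; the fallback of picking the stack with the most free capacity is an assumption (the slides only state the same-stack preference), and the names `thread_stack` and `stack_free_mb` are hypothetical.

```python
def place_allocation(thread_stack, stack_free_mb, thread, size_mb):
    """Pick the memory stack for a new allocation and update free capacity."""
    home = thread_stack.get(thread)
    if home is not None and stack_free_mb[home] >= size_mb:
        target = home  # keep the thread's data in a single stack
    else:
        # Assumed fallback: spill to the stack with the most free capacity.
        target = max(range(len(stack_free_mb)), key=lambda s: stack_free_mb[s])
        if stack_free_mb[target] < size_mb:
            raise MemoryError("no stack has enough free capacity")
        thread_stack.setdefault(thread, target)  # first allocation sets the home stack
    stack_free_mb[target] -= size_mb
    return target


# Made-up system: 4 stacks with different amounts of free memory (MB).
free_mb = [100, 300, 200, 50]
homes = {}
placed = [place_allocation(homes, free_mb, "A", 150),
          place_allocation(homes, free_mb, "A", 100),
          place_allocation(homes, free_mb, "A", 100)]
```

The first two allocations of thread A land in stack 1. The third no longer fits there and spills to stack 2, while stack 1 stays the thread's home for later requests.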

See paper for more details


Handling multithreaded workloads

AMS-DP formulation

Different system scenarios

Oversubscribed systems

Short-lived workloads or latency-critical workloads

Evaluation

Modeled system:

Deep hierarchy: 8-core processor, 32KB L1, 256KB L2, 16MB shared LLC

Shallow hierarchy: 4 memory stacks, each with 2 NDP cores. Each core has private 32KB L1 + 256KB L2

Workloads:

Multi-programmed SPECCPU

Multithreaded SPECOMP/PARSEC (see paper)

Compared schedulers:

Random (baseline that we normalize to)

Always NDP/Always processor-die

Extended CRUISE [ASPLOS’12]/PIE [ISCA’12]

AMS-Greedy/AMS-DP

AMS finds the right hierarchy for each application

Always processor-die never leverages the NDP capability of the asymmetric system and is 8% worse than Random

Always NDP sometimes hurts applications that prefer deep hierarchies because it never leverages the LLC. It is only 9% better than Random

AMS-Greedy never hurts performance and improves weighted speedup by up to 37% and by 18% on average
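
For reference, the weighted speedup metric can be computed as in the sketch below. The slides do not define the metric, so this assumes the standard definition (the sum over applications of IPC in the mix divided by IPC when running alone), normalized to the Random baseline. The IPC values are made up.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application speedups relative to running alone."""
    return sum(ipc_shared[app] / ipc_alone[app] for app in ipc_shared)


# Made-up IPCs for a two-application mix under two schedulers.
ipc_alone = {"a": 2.0, "b": 1.0}
ws_random = weighted_speedup({"a": 1.0, "b": 0.5}, ipc_alone)
ws_sched = weighted_speedup({"a": 1.5, "b": 0.5}, ipc_alone)
normalized = ws_sched / ws_random  # improvement over the Random baseline
```

Here the hypothetical scheduler improves weighted speedup by 25% over Random.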

AMS handles resource contention better than prior work

Run workloads with 100% utilization to stress contention

AMS-Greedy performs very close to AMS-DP, only 1% worse

Both AMS-Greedy and AMS-DP outperform CRUISE

AMS handles asymmetric core + memory well

Deep hierarchy uses Haswell-like cores

Shallow hierarchy uses Silvermont-like cores

AMS-Greedy with the PIE model improves performance more than handling core/memory asymmetries separately

See paper for more evaluation results


A case study to show AMS adapts to application phases

Multithreaded workloads

Detailed runtime overheads

Sensitivity study for system parameters

Number of cores, LLC capacity, main memory capacity

Performance without and with hardware support for cache partitioning

Conclusion

Scheduling computation in asymmetric systems is very challenging

We present AMS, an adaptive scheduler for asymmetric systems

AMS uses analytical models to adapt quickly and thread mapping algorithms inspired by cache partitioning algorithms to find high-quality mappings

Thanks! Any questions?
