Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies
Po-An Tsai, Changping Chen, and Daniel Sanchez
Die-stacking has enabled near-data processing
Conventional multicore processors use a multi-level deep cache hierarchy to reduce data movement
[Figure: processor die with cores, private caches, and a shared LLC]
Near-data processors place cores close to main memory to reduce data movement
[Figure: memory stack with DRAM dies on a logic layer; NDP cores next to the vault controllers have a private cache only (shallow hierarchy)]
Neither shallow nor deep hierarchies work well for all applications…
Asymmetric hierarchies get the best of both worlds
Prior work proposes hybrid systems with asymmetric memory hierarchies to get the best of both
[Ahn et al., ISCA’15][Gao et al., PACT’15]
[Hsieh et al., ISCA’16][Boroumand et al., ASPLOS’18]
Applications have strong hierarchy preferences
[Figure: access latency (ns) for a deep-hierarchy LLC hit, the shallow hierarchy, and a deep-hierarchy LLC miss]
[Figure: normalized performance/J of milc on the deep and shallow hierarchies]
[Figure: normalized performance/J of xalanc on the deep and shallow hierarchies]
How well each application can use the shared LLC is critical to its preference
Scheduling programs to the right hierarchy is hard
Many applications prefer different hierarchies over time because they have different phases
[Figure: performance/J of gems]
Applications may prefer different hierarchies due to resource contention with other applications
[Figure: normalized performance/J of xalanc on the shallow hierarchy and on the deep hierarchy with 2MB, 4MB, 8MB, and 16MB LLCs]
Prior schedulers focus on different systems and constraints
Contention-aware scheduling (Bubble-up [Mars, MICRO’11], CRUISE [Jaleel, ASPLOS’12])
Focuses on symmetric memory systems (multi-socket LLCs/NUMA)
[Figure: two sockets, LLC 1 (8MB) and LLC 2 (8MB)]
Heterogeneous core-aware scheduling (PIE [Van Craeynest, ISCA’12][Cong, ISLPED’11])
Focuses on asymmetric core microarchitectures (big.LITTLE systems)
[Figure: in-order cores and OoO cores]
NDP-aware workload partitioning (PIM-enabled Instructions [Ahn, ISCA’15], TOM [Hsieh, ISCA’16])
Focuses on single workloads and requires software modifications or compiler support
By contrast, our goal is to schedule threads considering both memory and core asymmetries, with no program modifications and transparently to users
AMS: An asymmetry-aware scheduler
Hardware: utility monitors sample accesses and produce miss curves (misses vs. cache size)
Software, first contribution: analytical model that estimates performance under different hierarchies
Software, second contribution: two thread placement algorithms (AMS-Greedy/AMS-DP) that extend techniques originally designed for cache partitioning, used to schedule threads
AMS analytical model
AMS estimates application preferences using total memory access latency
Deep hierarchy has a shared LLC
Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem)
# misses is a function of LLC capacity
Shallow hierarchy has no shared LLC
Lat = # accesses x Latency of shallow mem
[Figure: miss curve from hardware monitors, # misses vs. LLC capacity (MB)]
[Figure: latency curve model, latency vs. LLC capacity (MB) for the processor-die core and the NDP core, with regions marked "Use processor-die core" and "Use NDP core"]
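The two latency expressions above can be turned into a small calculation. The following is a minimal sketch, not the authors' implementation: the function names, the miss-curve representation (a dict from LLC capacity in MB to a miss count), and every latency and access number are illustrative assumptions.

```python
def deep_latency(accesses, misses, lat_llc, lat_deep_mem):
    """Lat = (# accesses x Latency of LLC) + (# misses x Latency of deep mem)."""
    return accesses * lat_llc + misses * lat_deep_mem


def shallow_latency(accesses, lat_shallow_mem):
    """Lat = # accesses x Latency of shallow mem (there is no shared LLC)."""
    return accesses * lat_shallow_mem


def preferred_hierarchy(accesses, miss_curve, capacity_mb,
                        lat_llc, lat_deep_mem, lat_shallow_mem):
    """Return 'deep' or 'shallow' for a given LLC allocation."""
    deep = deep_latency(accesses, miss_curve[capacity_mb], lat_llc, lat_deep_mem)
    shallow = shallow_latency(accesses, lat_shallow_mem)
    return "deep" if deep < shallow else "shallow"


# Hypothetical inputs: 1000 accesses, a made-up miss curve, made-up latencies in ns.
miss_curve = {2: 900, 4: 600, 6: 200, 8: 100}
choices = {mb: preferred_hierarchy(1000, miss_curve, mb, 20, 80, 40)
           for mb in miss_curve}
print(choices)
```

With these made-up numbers the preference flips from the shallow to the deep hierarchy once the thread's LLC share is large enough to remove most misses, which is the crossover the latency curve figure marks.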
Handling heterogeneous cores
Combine model from prior work (PIE) with our memory latency model
Latency curves: memory latency vs. LLC capacity (MB) for the processor-die core and the NDP core
Weight by MLP to get memory stall curves: memory stalls vs. LLC capacity (MB)
Add non-memory component weighted by ILP to get core cycle curves: core cycles vs. LLC capacity (MB), including non-mem cycles
Can be extended to other asymmetries, like frequencies (see paper)
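The two steps on this slide can be sketched numerically. The slide does not give the exact weighting (that is in the paper and in PIE), so the formulas below are assumptions chosen for illustration: memory stalls are taken as total latency divided by MLP, and the non-memory component as instruction count divided by ILP. All names and numbers are hypothetical.

```python
def memory_stalls(total_latency, mlp):
    # Assumption: overlapped misses hide latency, so stalls = latency / MLP.
    return total_latency / mlp


def core_cycles(total_latency, mlp, instructions, ilp):
    # Assumption: non-memory cycles = instructions / ILP, added to the stalls.
    return memory_stalls(total_latency, mlp) + instructions / ilp


# Hypothetical thread on a processor-die core (higher MLP and ILP)
# versus the same thread on an NDP core (lower MLP and ILP).
die_cycles = core_cycles(total_latency=36000, mlp=4.0, instructions=20000, ilp=2.0)
ndp_cycles = core_cycles(total_latency=40000, mlp=1.0, instructions=20000, ilp=1.0)
print(die_cycles, ndp_cycles)
```

The point of the sketch is only that two cores with similar raw memory latency can end up with very different core cycle estimates once MLP and ILP are factored in.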
AMS-Greedy overview
Solve an optimization problem that seeks to minimize total cost
AMS-Greedy starts by mapping all threads to the deep hierarchy (processor-die) and moves some threads to the NDP cores over multiple rounds
Input: cost curves of all threads for the deep hierarchy
A cache partitioning algorithm from prior work produces a partition plan (T1: 3MB, T2: 1MB, T3: 4MB, …)
Compare the cost of the deep and shallow hierarchies according to the plan
Map some threads to the shallow hierarchy
Do the remaining threads fit in the deep hierarchy? If yes, done. If no, repeat with the cost curves for threads still mapped to the deep hierarchy
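The partitioning step takes per-thread cost curves and returns a partition plan. The slides do not say which prior-work algorithm is used, so the sketch below substitutes a simple marginal-gain hill climb in 1MB steps as a stand-in. The cost curves and all names are hypothetical.

```python
def partition_llc(cost_curves, total_mb):
    """cost_curves[t][c] is the cost of thread t with c MB of LLC, c = 0..total_mb."""
    plan = {t: 0 for t in cost_curves}
    for _ in range(total_mb):
        # Hand the next 1MB to the thread whose cost drops the most.
        best = max(plan, key=lambda t: cost_curves[t][plan[t]]
                   - cost_curves[t][plan[t] + 1])
        plan[best] += 1
    return plan


# Hypothetical deep-hierarchy cost curves for an 8MB LLC (index = MB allocated).
cost_curves = {
    "T1": [100, 69, 43, 22, 19, 17, 16, 15, 14],
    "T2": [50, 20, 18, 17, 16, 15, 14, 13, 12],
    "T3": [120, 95, 72, 50, 30, 28, 27, 26, 25],
}
plan = partition_llc(cost_curves, 8)
print(plan)
```

A hill climb like this is only reliable when the curves have diminishing returns; curves with cliffs need a lookahead or an exact method.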
AMS-Greedy: Leveraging cache partitioning to schedule threads
Uses opportunity cost to decide which thread should give up the processor die
[Figure: cost vs. LLC capacity (MB) curves for threads 1, 2, and 3, with each thread's opportunity cost marked]
Partition the LLC among threads 1-3: the 8MB LLC is split into 3MB, 4MB, and 1MB partitions
Opportunity cost < 0: move to NDP
Perform multiple rounds of partitioning until the processor die is not oversubscribed
Overhead: 0.1% of system cycles when scheduling every 50ms
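One round of the opportunity-cost comparison might look as follows. This is a sketch under stated assumptions, not the authors' algorithm: the slides do not define opportunity cost precisely, so it is taken here as a thread's shallow-hierarchy cost minus its deep-hierarchy cost at its planned LLC allocation. The rule used when the die is still oversubscribed (move the thread that loses least) is also an assumption, and all curves and costs are made up.

```python
def opportunity_costs(deep_curves, shallow_costs, plan):
    """Assumed definition: shallow cost minus deep cost at the planned allocation."""
    return {t: shallow_costs[t] - deep_curves[t][plan[t]] for t in plan}


def pick_threads_to_move(deep_curves, shallow_costs, plan, die_cores):
    """Return the threads that leave the processor die in this round."""
    opp = opportunity_costs(deep_curves, shallow_costs, plan)
    # Opportunity cost < 0: the thread is cheaper on an NDP core, so it moves.
    movers = [t for t in plan if opp[t] < 0]
    if not movers and len(plan) > die_cores:
        # Assumed rule: when oversubscribed, move the thread that loses least.
        movers = [min(plan, key=lambda t: opp[t])]
    return movers


# Hypothetical deep-hierarchy cost curves, indexed by LLC capacity in MB (0..8).
deep_curves = {
    "T1": [100, 69, 43, 22, 19, 17, 16, 15, 14],
    "T2": [50, 20, 18, 17, 16, 15, 14, 13, 12],
    "T3": [120, 95, 72, 50, 30, 28, 27, 26, 25],
}
plan = {"T1": 3, "T2": 1, "T3": 4}              # example partition plan
shallow_costs = {"T1": 40, "T2": 15, "T3": 60}  # made-up NDP-side costs

print(opportunity_costs(deep_curves, shallow_costs, plan))
print(pick_threads_to_move(deep_curves, shallow_costs, plan, die_cores=2))
```

After a round like this, the caller would repartition the LLC among the threads that stayed and repeat until the remaining threads fit on the processor die.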
AMS-DP: Scheduling threads with dynamic programming
Prior work has shown that dynamic programming (DP) solves cache partitioning optimally in polynomial time
We propose an algorithm using DP to solve our optimization problem optimally
AMS-DP serves as an upper bound for AMS-Greedy, but it is more expensive
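The claim that DP solves cache partitioning optimally in polynomial time can be illustrated with a table over (threads considered, capacity used). The sketch below covers only that partitioning DP, not the AMS-DP formulation, which is in the paper. The cost curves and names are hypothetical.

```python
def dp_partition(cost_curves, total_mb):
    """Minimize total cost with allocations summing to exactly total_mb.

    cost_curves[i][c] is the cost of thread i with c MB, c = 0..total_mb.
    Runs in O(threads * total_mb^2) time.
    """
    inf = float("inf")
    n = len(cost_curves)
    # best[i][c]: minimum cost of the first i threads sharing exactly c MB.
    best = [[inf] * (total_mb + 1) for _ in range(n + 1)]
    choice = [[0] * (total_mb + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(1, n + 1):
        for c in range(total_mb + 1):
            for give in range(c + 1):
                cand = best[i - 1][c - give] + cost_curves[i - 1][give]
                if cand < best[i][c]:
                    best[i][c] = cand
                    choice[i][c] = give
    # Walk the recorded choices backwards to recover each allocation.
    plan, c = [0] * n, total_mb
    for i in range(n, 0, -1):
        plan[i - 1] = choice[i][c]
        c -= choice[i][c]
    return best[n][total_mb], plan


# Thread 0 only benefits once it gets all 4MB (a cliff); thread 1 improves steadily.
curves = [[10, 10, 10, 10, 0], [10, 8, 6, 4, 2]]
print(dp_partition(curves, 4))
```

In this example a step-by-step marginal-gain heuristic would give every MB to thread 1 for a total cost of 12, while the DP finds the cost-10 plan that gives thread 0 the whole cache.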
Data placement for asymmetric hierarchies
NDP systems have different constraints from NUMA systems
NDP cores have plentiful intra-stack bandwidth but limited inter-stack bandwidth
We use simple heuristics to keep data from a thread in a single stack
Each thread tries to allocate its data in the same stack so long as the stack has enough capacity
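The same-stack heuristic can be sketched as a page allocator. Only the rule that a thread keeps allocating in one stack while that stack has capacity comes from the slide. The fallback of re-homing the thread on the stack with the most free pages, the page granularity, and all names are assumptions for illustration.

```python
def allocate_page(thread, home_stack, free_pages):
    """Pick the stack for a thread's next page; updates both dicts in place."""
    stack = home_stack.get(thread)
    if stack is None or free_pages[stack] == 0:
        # Assumed fallback: (re)home the thread on the stack with most free pages.
        stack = max(free_pages, key=free_pages.get)
        if free_pages[stack] == 0:
            raise MemoryError("no free pages in any stack")
        home_stack[thread] = stack
    free_pages[stack] -= 1
    return stack


# Hypothetical system with two stacks holding 2 and 3 free pages.
free_pages = {0: 2, 1: 3}
home_stack = {}
placements = [allocate_page("A", home_stack, free_pages) for _ in range(4)]
print(placements)
```

Thread A stays in stack 1 until it fills up and only then spills into stack 0, which keeps most of its accesses on intra-stack bandwidth.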
See paper for more details
Handling multithreaded workloads
AMS-DP formulation
Different system scenarios
Oversubscribed systems
Short-lived workloads or latency-critical workloads
Evaluation
Modeled system:
Deep hierarchy: 8-core processor
32KB L1, 256KB L2, 16MB shared LLC
Shallow hierarchy: 4 memory stacks, each with 2 NDP cores. Each core has private 32KB L1 + 256KB L2
Workloads
Multi-programmed SPEC CPU
Multithreaded SPEC OMP/PARSEC (see paper)
Compared schedulers
Random (baseline that we normalize to)
Always NDP/Always processor-die
Extended CRUISE [ASPLOS’12]/PIE [ISCA’12]
AMS-Greedy/AMS-DP
AMS finds the right hierarchy for each application
Always processor-die never leverages the NDP capability of the asymmetric system and is 8% worse than Random
Always NDP sometimes hurts applications that prefer deep hierarchies because it never leverages the LLC. It is only 9% better than Random
AMS-Greedy never hurts performance and improves weighted speedup by up to 37% and by 18% on average
AMS handles resource contention better than prior work
Run workloads with 100% utilization to stress contention
AMS-Greedy performs very close to AMS-DP, only 1% worse
Both AMS-Greedy and AMS-DP outperform CRUISE
AMS handles asymmetric core + memory well
Deep hierarchy uses Haswell-like cores
Shallow hierarchy uses Silvermont-like cores
AMS-Greedy with the PIE model improves performance more than handling core/memory asymmetries separately
See paper for more evaluation results
A case study to show AMS adapts to application phases
Multithreaded workloads
Detailed runtime overheads
Sensitivity study for system parameters
Number of cores, LLC capacity, main memory capacity
Performance without and with hardware support for cache partitioning
Conclusion
Scheduling computation in asymmetric systems is very challenging
We present AMS, an adaptive scheduler for asymmetric systems
AMS uses analytical models to adapt quickly and thread mapping algorithms inspired by cache partitioning algorithms to find high-quality mappings
Hardware: utility monitors sample accesses and produce miss curves (misses vs. cache size)
Software, first contribution: analytical model that estimates performance under different hierarchies
Software, second contribution: two thread placement algorithms that extend techniques originally designed for cache partitioning, used to schedule threads
Thanks! Any questions?