-
Active Global Address Space (AGAS): Global Virtual Memory for Dynamic Adaptive Many-Tasking (AMT) Runtimes
Abhishek Kulkarni
Center for Research in Extreme Scale Technologies
Indiana University
November 19, 2015
Dissertation Research Showcase Presentation
-
HPC Programming / Execution models
2
Message Passing:
• BSP: Compute-Communicate paradigm
• Data divided statically among available processors
Asynchronous Many-Tasking (AMT):
• Lightweight tasks concurrently operating on global data
• Well-suited for dynamic and adaptive execution
-
Asynchronous Many-Tasking (AMT) models
• Static/Dynamic task and dataflow graph
• Lightweight threads
• Inter-thread synchronization
• Active Messages
• Global Address Space
3
[Figure: task graph and data distribution with their logical representation; task-parallel runtimes map onto distributed execution through the global address space (CTT, SW-AGAS, PGAS/HW-AGAS, Parcel/GAS transport) over local memories and the network]
-
ParalleX Execution Model
• Lightweight multi-threading
– Divides work into smaller tasks
– Increases concurrency
• Message-driven computation
– Move work to data
– Keeps work local, stops blocking
• Constraint-based synchronization
– Declarative criteria for work
– Event driven
– Eliminates global barriers
• Data-directed execution
– Merger of flow control and data structure
• Shared name space
– Global address space
– Simplifies random gathers
4
-
High Performance ParalleX
• The HPX runtime system reifies the ParalleX execution model:
– Localities
– Lightweight Threads
– Processes
– Local Control Objects (LCOs)
– Parcels (Active Messages)
– Active Global Address Space (AGAS)
• Fine-grained concurrency
– Tasks, Interrupts, Lightweight Threads
• Lightweight, globally-addressable synchronization objects
– E.g.: Futures (IVars), Monotonic counters (Gate), Reductions
http://hpx.crest.iu.edu/
[Figure: HPX runtime architecture; the global address space and thread scheduler sit above the LCO, process, and action managers, parcel transport, network interface, and memory allocator, layered over cores, memory, the NIC, and the operating system]
5
-
Global Address Space in HPX-5
• Global memory is globally addressable
• Supports local, cyclic, blocked and user-defined distributions
• Two modes of operation:
– Static (PGAS)
– Active (AGAS)
• Asynchronous access to global memory
• Memory must be pinned locally prior to access (see the sketch below)
• GAS attributes, atomics, collectives
[Figure: global address space spanning local memories over the network (CTT, SW-AGAS, PGAS/HW-AGAS, Parcel/GAS transport)]
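A minimal sketch of pinned access to global memory, assuming the HPX-5 GAS interface (hpx_gas_try_pin, hpx_gas_unpin, HPX_RESEND); exact signatures may differ across HPX-5 versions:

#include <hpx/hpx.h>

/* Action body: pin the target chunk, update it, and unpin it. */
static int incr_handler(void) {
  hpx_addr_t gva = hpx_thread_current_target();
  int *local = NULL;
  /* Pinning fails if the chunk is not resident on this locality;
     returning HPX_RESEND asks the runtime to forward the parcel. */
  if (!hpx_gas_try_pin(gva, (void**)&local)) {
    return HPX_RESEND;
  }
  *local += 1;           /* operate on the pinned local mapping */
  hpx_gas_unpin(gva);    /* release the pin so the chunk may move */
  return HPX_SUCCESS;
}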
6
-
Partitioned Global Address Space
• Global physical memory
• Physical location encoded in the address
• Memory space same as address space
• Faster address translation
• Traditional PGAS supports put/get/atomics
• Static
Active Global Address Space
• Global virtual memory
• Physical location maintained separately
• Memory space distinct from address space
• Address translation potentially expensive
• Supports arbitrary AMs
• Dynamic (allows migration)
7
-
Load and Data Imbalance in Graph Algorithms
8
[Plots: data imbalance and data distribution variance for Graph500 RMAT (scale=20, partitions=1024) under pgas, agas-ll, agas-pred, agas-random, and agas-succ, shown for 1D/2D distributions and 256/512 localities]
Move the smallest partition to:
• Successor (agas-succ)
• Predecessor (agas-pred)
• Random (agas-random)
• Least loaded (agas-ll)
-
AGAS Features
9
• User-defined data distributions (see the sketch below)
– Existing PGAS models support cyclic, block-cyclic
– Similar to Chapel, ZPL
– Higher-order and composable
• Data migration
– Explicit moving of individual chunks
– Rebalancing of global data
– Implicit runtime-managed remapping based on introspection
• Co-data actions
– Relational dependencies between tasks and data
– Runtime chooses between a) moving data to computation or b) moving computation to data
• Caching, data stealing, speculative migration
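A hypothetical sketch of a higher-order, composable distribution; the callback shape is illustrative, not the exact HPX-5 interface:

#include <stdint.h>

/* A distribution maps a global block index to an owning locality. */
typedef uint32_t (*dist_fn_t)(uint64_t block, uint32_t nranks);

/* Block-cyclic: runs of four consecutive blocks per locality. */
static uint32_t block_cyclic4(uint64_t block, uint32_t nranks) {
  return (uint32_t)((block / 4) % nranks);
}

/* Composition: rotate an existing distribution by one locality,
   illustrating how distributions can be built from other ones. */
static uint32_t rotated(uint64_t block, uint32_t nranks) {
  return (block_cyclic4(block, nranks) + 1) % nranks;
}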
-
Software AGAS
• Flat, byte-addressable global address space
• Chunk size limited to 4 TB, nodes limited to 64K
• Scalable memory allocators handle local/cyclic segments
– jemalloc, TBBMalloc
• Chunk table maps LVA → GVA
• CTT (Chunk Translation Table) maps GVA → LVA
• CTT entry: see Figure 5.5
10
[Figure: the HPX-5 AGAS allocator layers the local and cyclic chunk allocators and the chunk table/CTT over scalable global memory allocators (local and cyclic segments) on OS virtual memory]
Figure 5.5: A CTT entry. Fields and starting bit offsets: count (0), owner (32), lva (64), blocks (128), on_unpin (192), dist (256-288).
Allocation is conceptually simple. The chunk size is rounded up to the nearest power of 2, and log2(size) is stored in the size-class bits of the address. This is necessary to implement blocked allocations in AGAS, as that information is not available through the address arithmetic interface. To translate an LVA offset to a GVA, we need a chunk table shared by both segments. With access to the scalable memory allocator's metadata, a shadow allocation scheme could reduce this lookup to O(1). Cyclic allocations are implemented using a symmetric cyclic heap managed by the root locality. This is conceptually simple and works well in practice because a) cyclic allocations in HPX-5 are synchronous, and b) cyclic allocations are collective and require an update to the CTT on all localities. I will introduce an asynchronous lazy cyclic allocation scheme in HPX-5 implemented using a segmented symmetric cyclic heap. For memory allocations with arbitrary distributions, the distribution map will be stored in the CTT to be referred to during address arithmetic.
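As an illustration, the stated limits (chunks up to 4 TB, up to 64K nodes) fit a 64-bit address with 16 home bits, 6 size-class bits, and a 42-bit offset; the exact field layout below is an assumption, not the HPX-5 encoding:

#include <stdint.h>

/* Round a chunk size up to a power of two; return log2(size). */
static unsigned size_class(uint64_t bytes) {
  unsigned c = 0;
  while ((1ULL << c) < bytes) c++;
  return c;
}

/* Pack home locality (16 bits), size class (6 bits), and offset
   (42 bits, enough for a 4 TB chunk) into a 64-bit GVA. */
static uint64_t make_gva(uint16_t home, unsigned cls, uint64_t off) {
  return ((uint64_t)home << 48) | ((uint64_t)cls << 42) | off;
}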
The deallocation procedure is similar to allocation, but for one complication: deallocation has to check for any outstanding "pin" counts before freeing the chunks. As HPX-5 supports asynchronous deallocation, to avoid waiting threads, deallocation inserts a "continuation parcel" in the CTT to be released when the "pin" count reaches 0.
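A minimal sketch of that deferred-free rule, with hypothetical types and helpers standing in for the runtime internals (a real implementation would update count and on_unpin atomically):

typedef struct {
  int   count;      /* outstanding pins on this chunk */
  void *on_unpin;   /* continuation parcel, fired at count == 0 */
} ctt_entry_t;

void launch_parcel(void *p);  /* hypothetical runtime helper */

/* Free now if unpinned; otherwise park the free as a continuation
   that unpin() releases once the pin count drops to zero. */
static void agas_free_async(ctt_entry_t *e, void *free_parcel) {
  if (e->count == 0) {
    launch_parcel(free_parcel);
  } else {
    e->on_unpin = free_parcel;
  }
}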
Chunk Translation: Chunk translation is performed through a try_pin operation which bumps the "pin" count associated with the chunk entry (refer to Figure 5.5) in the CTT. The chunk table and CTT are presently implemented using a fast, scalable cuckoo hashtable. The CTT must maintain the current mapping from the blocks that it owns to their corresponding base virtual addresses, as well as any additional state required. The initial residing locality of the chunk (pointed to by the distribution map) is considered the home of the block. When a chunk is moved to another locality, the owner field in the CTT entry is updated to point to the destination locality. In HPX-5, the CTT is distributed such that only the entries for the "home" chunks are maintained at each locality. When chunks move, the relevant existing entries are set to forward to the owner and new entries are inserted. When looking up a GVA, if an entry in the CTT is not found, the "home" bits of the address are used. If the chunk is owned locally, the parcel addressed to that chunk is directly spawned as a thread on that locality. In the case of arbitrary distributions, the corresponding CTT entries are broadcast to all of the localities. Going forward, I plan to separate the translation table from the routing table to aid in co-design with hardware-based directory implementations of AGAS. Should resolution be an issue, I will implement replicated concurrent chunk tries, as they typically have lower storage requirements in the average case. Tries can coalesce contiguous regions of memory but have to maintain nodes for the "holes" in the address space caused by moves.
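A sketch of the lookup path just described (consult the local CTT first, then fall back to the home bits of the address); ctt_lookup and gva_home_bits are illustrative names:

#include <stdint.h>

typedef struct ctt_entry { uint32_t owner; } ctt_entry_t;

ctt_entry_t *ctt_lookup(uint64_t gva);  /* cuckoo-hashtable lookup */
uint32_t gva_home_bits(uint64_t gva);   /* home field of the address */

/* Resolve where a parcel addressed to gva should be sent. */
static uint32_t resolve_owner(uint64_t gva) {
  ctt_entry_t *e = ctt_lookup(gva);
  if (e) {
    return e->owner;         /* resident here, or a forwarding record */
  }
  return gva_home_bits(gva); /* miss: the home locality always knows */
}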
Figure 5.6: AGAS move protocol between mover, home, owner, and target (move, rebind, owns?, and complete messages).
Move Protocol: The various AGAS implementations share the high-level move protocol diagrammed in Figure 5.6. A mover initiates a move of a source chunk to a destination locality by sending a move operation to the destination containing the address of the source. The destination can concurrently allocate space for the chunk, update the routing for the chunk if necessary, and send a completion message to the source. The source invalidates its local mapping, replacing it with a forwarding record for the destination when using software AGAS. It waits for all local readers and writers to drain away, and then sends the chunk data to the destination. Any new pin requests are disallowed by returning false. Once the destination has received the entire chunk, it installs the address translation locally.
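A source-side sketch of that sequence, with hypothetical helpers in place of the runtime internals:

#include <stdint.h>

typedef struct ctt_entry ctt_entry_t;

void set_forward(ctt_entry_t *e, uint32_t dst); /* forwarding record */
void reject_new_pins(ctt_entry_t *e);   /* try_pin now returns false */
void wait_pins_drained(ctt_entry_t *e); /* wait for pin count == 0 */
void send_chunk(ctt_entry_t *e, uint32_t dst);

/* Source side of a software-AGAS move to locality dst. */
static void move_source(ctt_entry_t *e, uint32_t dst) {
  set_forward(e, dst);     /* invalidate: local mapping now forwards */
  reject_new_pins(e);
  wait_pins_drained(e);    /* drain local readers and writers */
  send_chunk(e, dst);      /* ship the chunk data to the destination */
}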
-
Cyclic Allocation
• Symmetric cyclic segment/heap
• Can be segmented to speed up cyclic allocation
– At the expense of limiting allocation size
11
[Figure: 1. alloc(N); 2. reserve memory; 3. bcast alloc; 4. alloc local mem and insert CTT entry]
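For reference, a cyclic allocation in user code, assuming the HPX-5 hpx_gas_alloc_cyclic interface (n blocks of bsize bytes, block i placed on locality i mod N); the signature may differ between versions:

#include <hpx/hpx.h>

/* Allocate n doubles cyclically, one element per block. */
static hpx_addr_t alloc_cyclic_doubles(size_t n) {
  return hpx_gas_alloc_cyclic(n, sizeof(double), 0 /* alignment */);
}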
-
Move
• When a chunk moves, its owner is updated
• Move protocol
– Overlaps routing updates and data transfer
12
[Figure: AGAS move protocol between mover, home, owner, and target (move, rebind, owns?, and complete messages)]
• Home always "knows" where a chunk is
• Current implementation uses a two-sided transport
– Relies on being able to "resend" a parcel
• Complex interactions between move, unpin and delete (usage sketch below)
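A minimal usage sketch of an explicit move, assuming the HPX-5 hpx_gas_move and future-LCO interfaces; check hpx.h for the exact signatures:

#include <hpx/hpx.h>

/* Move the chunk at src to the locality of dst and wait for the
   rebind to complete. */
static void migrate(hpx_addr_t src, hpx_addr_t dst) {
  hpx_addr_t done = hpx_lco_future_new(0); /* completion event */
  hpx_gas_move(src, dst, done);            /* asynchronous move */
  hpx_lco_wait(done);
  hpx_lco_delete(done, HPX_NULL);
}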
-
Hardware AGAS
• Network-managed global address space
• CTT implemented by a TCAM in hardware
• The network "knows" the location of chunks
• Pros:
– Message optimal in terms of hops required for routing
• Cons:
– Centralized directory
– Existing hardware limitations
13
[Figure: a request (1. request(666)) is resolved in the network (2. resolve) against TCAM entries mapping address ranges (0x0-0x1FF through 0x800-0x9FF) to switches S0-S4, and answered (3. reply)]
-
Hardware AGAS
• Proof of concept based on the GASNet runtime system
• Uses SDN for routing table management
• Uses IB RD multicast
• IB multicast GIDs are mapped to multicast Ethernet MAC addresses
• Put performance:
– High switching latency
– Photon conduit overheads
14
[Figure: Home (rank 0), Source (rank 1), and Target (rank 2) attached to a switch. Ranks 0, 1, and 2 push OpenFlow mods (33:33:00:00:00:01 ➝ port 1, 33:33:00:00:00:02 ➝ port 2, 33:33:00:00:00:03 ➝ port 3) and locally attach to multicast addresses based on their page id (0 ➝ ff0e::ffff:1, 1 ➝ ff0e::ffff:2, 2 ➝ ff0e::ffff:3). Endpoints attach multicast addresses for all block ids (ff0e::ffff:*); the switch forwards based on the destination MAC address associated with the block id (33:33:*)]
[Plot: time taken (us) vs. message size (512 B to 1 MB) for software (IBV), software (Photon), and hardware put operations]
-
Comparing Hardware and Software AGAS
• Bounded software cache
– Cache-replacement policies (random, LRU)
– Evictions cause cache misses and degrade GAS performance
• Thread contention in the concurrent cache
– The concurrent cache is faster than a single-threaded cache, but 25% slower than the hardware implementation
15
[Plots: global updates per second vs. number of contending threads per process (hw, sw single-lock cache, sw concurrent cache), and global updates per second vs. maximum cache entries (sw random, sw LRU, hw)]
-
Remap performance in HW AGAS
• GUPS microbenchmark
– Table size 2048 words, 512 words/page, 16 nodes (192 cores)
– 4 million random updates were performed
• As page-movement frequency increases, SW becomes more expensive than HW
16
[Plots: GUPS remap performance, software vs. hardware AGAS]
-
Conclusions
• The HPX-5 runtime system provides dynamic, adaptive features
necessary for execution of large-scale irregular applications.
• Global virtual memory using an active global address space
(AGAS) enables the runtime to manage both locality and
load-imbalance concerns.
• Network-managed AGAS shows promise at smaller scales but
requires support from vendors and supercomputing centers to be
viable at larger scales.
17
-
Questions?
18