-
Active Global Address Space (AGAS): Global Virtual Memory for Dynamic Adaptive Many-Tasking (AMT) Runtimes
Abhishek Kulkarni
Center for Research in Extreme Scale Technologies
Indiana University
November 19, 2015
Dissertation Research Showcase Presentation
-
HPC Programming / Execution models
2
Message Passing:
• BSP: Compute-Communicate paradigm
• Data divided statically among available processors
Asynchronous Many-Tasking (AMT):
• Lightweight tasks concurrently operating on global data
• Well-suited for dynamic and adaptive execution
-
Asynchronous Many-Tasking (AMT) models
• Static/Dynamic task and dataflow graph
• Lightweight threads
• Inter-thread synchronization
• Active Messages
• Global Address Space
3
[Figure: task graph and data distribution with their logical representation; task-parallel runtimes map onto distributed execution through the global address space (CTT, SW-AGAS, PGAS/HW-AGAS, Parcel/GAS transport) over local memories and the network]
-
ParalleX Execution Model
• Lightweight multi-threading
– Divides work into smaller tasks
– Increases concurrency
• Message-driven computation
– Move work to data
– Keeps work local, stops blocking
• Constraint-based synchronization
– Declarative criteria for work
– Event driven
– Eliminates global barriers
• Data-directed execution
– Merger of flow control and data structure
• Shared name space
– Global address space
– Simplifies random gathers
4
-
High Performance ParalleX
• The HPX runtime system reifies the ParalleX execution model:
– Localities
– Lightweight Threads
– Processes
– Local Control Objects (LCOs)
– Parcels (Active Messages)
– Active Global Address Space (AGAS)
• Fine-grained concurrency
– Tasks, Interrupts, Lightweight Threads
• Lightweight, globally-addressable synchronization objects
– E.g.: Futures (IVars), Monotonic counters (Gate), Reductions
http://hpx.crest.iu.edu/
[Figure: HPX runtime architecture; the global address space and thread scheduler sit above the LCO, process, and action managers, parcel transport, network interface, and memory allocator, layered over cores, memory, the NIC, and the operating system]
5
-
Global Address Space in HPX-5
• Global memory is globally addressable
• Supports local, cyclic, blocked and user-defined distributions
• Two modes of operation:
– Static (PGAS)
– Active (AGAS)
• Asynchronous access to global memory
• Memory must be pinned locally prior to access (see the sketch below)
• GAS attributes, atomics, collectives
[Figure: global address space spanning local memories over the network (CTT, SW-AGAS, PGAS/HW-AGAS, Parcel/GAS transport)]
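A minimal sketch of pinned access to global memory, assuming the HPX-5 GAS interface (hpx_gas_try_pin, hpx_gas_unpin, HPX_RESEND); exact signatures may differ across HPX-5 versions:

#include <hpx/hpx.h>

/* Action body: pin the target chunk, update it, and unpin it. */
static int incr_handler(void) {
  hpx_addr_t gva = hpx_thread_current_target();
  int *local = NULL;
  /* Pinning fails if the chunk is not resident on this locality;
     returning HPX_RESEND asks the runtime to forward the parcel. */
  if (!hpx_gas_try_pin(gva, (void**)&local)) {
    return HPX_RESEND;
  }
  *local += 1;           /* operate on the pinned local mapping */
  hpx_gas_unpin(gva);    /* release the pin so the chunk may move */
  return HPX_SUCCESS;
}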
6
-
Partitioned Global Address Space
• Global physical memory
• Physical location encoded in the address
• Memory space same as address space
• Faster address translation
• Traditional PGAS supports put/get/atomics
• Static
Active Global Address Space
• Global virtual memory
• Physical location maintained separately
• Memory space distinct from address space
• Address translation potentially expensive
• Supports arbitrary AMs
• Dynamic (allows migration)
7
-
Load and Data Imbalance in Graph Algorithms
8
[Plots: data imbalance and data distribution variance for Graph500 RMAT (scale=20, partitions=1024) under pgas, agas-ll, agas-pred, agas-random, and agas-succ, shown for 1D/2D distributions and 256/512 localities]
Move the smallest partition to:
• Successor (agas-succ)
• Predecessor (agas-pred)
• Random (agas-random)
• Least loaded (agas-ll)
-
AGAS Features
9
• User-defined data distributions (see the sketch below)
– Existing PGAS models support cyclic, block-cyclic
– Similar to Chapel, ZPL
– Higher-order and composable
• Data migration
– Explicit moving of individual chunks
– Rebalancing of global data
– Implicit runtime-managed remapping based on introspection
• Co-data actions
– Relational dependencies between tasks and data
– Runtime chooses between a) moving data to computation or b) moving computation to data
• Caching, data stealing, speculative migration
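A hypothetical sketch of a higher-order, composable distribution; the callback shape is illustrative, not the exact HPX-5 interface:

#include <stdint.h>

/* A distribution maps a global block index to an owning locality. */
typedef uint32_t (*dist_fn_t)(uint64_t block, uint32_t nranks);

/* Block-cyclic: runs of four consecutive blocks per locality. */
static uint32_t block_cyclic4(uint64_t block, uint32_t nranks) {
  return (uint32_t)((block / 4) % nranks);
}

/* Composition: rotate an existing distribution by one locality,
   illustrating how distributions can be built from other ones. */
static uint32_t rotated(uint64_t block, uint32_t nranks) {
  return (block_cyclic4(block, nranks) + 1) % nranks;
}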
-
Software AGAS
• Flat, byte-addressable global address space
• Chunk size limited to 4 TB, nodes limited to 64K
• Scalable memory allocators handle local/cyclic segments
– jemalloc, TBBMalloc
• Chunk table maps LVA → GVA
• CTT (Chunk Translation Table) maps GVA → LVA
• CTT entry: see Figure 5.5
10
[Figure: the HPX-5 AGAS allocator layers the local and cyclic chunk allocators and the chunk table/CTT over scalable global memory allocators (local and cyclic segments) on OS virtual memory]
Figure 5.5: A CTT entry. Fields and starting bit offsets: count (0), owner (32), lva (64), blocks (128), on_unpin (192), dist (256-288).
Allocation is conceptually simple. The chunk size is rounded up to the nearest power of 2, and log2(size) is stored in the size-class bits of the address. This is necessary to implement blocked allocations in AGAS, as that information is not available through the address arithmetic interface. To translate an LVA offset to a GVA, we need a chunk table shared by both segments. With access to the scalable memory allocator's metadata, a shadow allocation scheme could reduce this lookup to O(1). Cyclic allocations are implemented using a symmetric cyclic heap managed by the root locality. This is conceptually simple and works well in practice because a) cyclic allocations in HPX-5 are synchronous, and b) cyclic allocations are collective and require an update to the CTT on all localities. I will introduce an asynchronous lazy cyclic allocation scheme in HPX-5 implemented using a segmented symmetric cyclic heap. For memory allocations with arbitrary distributions, the distribution map will be stored in the CTT to be referred to during address arithmetic.
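As an illustration, the stated limits (chunks up to 4 TB, up to 64K nodes) fit a 64-bit address with 16 home bits, 6 size-class bits, and a 42-bit offset; the exact field layout below is an assumption, not the HPX-5 encoding:

#include <stdint.h>

/* Round a chunk size up to a power of two; return log2(size). */
static unsigned size_class(uint64_t bytes) {
  unsigned c = 0;
  while ((1ULL << c) < bytes) c++;
  return c;
}

/* Pack home locality (16 bits), size class (6 bits), and offset
   (42 bits, enough for a 4 TB chunk) into a 64-bit GVA. */
static uint64_t make_gva(uint16_t home, unsigned cls, uint64_t off) {
  return ((uint64_t)home << 48) | ((uint64_t)cls << 42) | off;
}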
The deallocation procedure is similar to allocation, but for one complication: deallocation has to check for any outstanding "pin" counts before freeing the chunks. As HPX-5 supports asynchronous deallocation, to avoid waiting threads, deallocation inserts a "continuation parcel" in the CTT to be released when the "pin" count reaches 0.
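A minimal sketch of that deferred-free rule, with hypothetical types and helpers standing in for the runtime internals (a real implementation would update count and on_unpin atomically):

typedef struct {
  int   count;      /* outstanding pins on this chunk */
  void *on_unpin;   /* continuation parcel, fired at count == 0 */
} ctt_entry_t;

void launch_parcel(void *p);  /* hypothetical runtime helper */

/* Free now if unpinned; otherwise park the free as a continuation
   that unpin() releases once the pin count drops to zero. */
static void agas_free_async(ctt_entry_t *e, void *free_parcel) {
  if (e->count == 0) {
    launch_parcel(free_parcel);
  } else {
    e->on_unpin = free_parcel;
  }
}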
Chunk Translation: Chunk translation is performed through a try_pin operation which bumps the "pin" count associated with the chunk entry (refer to Figure 5.5) in the CTT. The chunk table and CTT are presently implemented using a fast, scalable cuckoo hashtable. The CTT must maintain the current mapping from the blocks that it owns to their corresponding base virtual addresses, as well as any additional state required. The initial residing locality of the chunk (pointed to by the distribution map) is considered the home of the block. When a chunk is moved to another locality, the owner field in the CTT entry is updated to point to the destination locality. In HPX-5, the CTT is distributed such that only the entries for the "home" chunks are maintained at each locality. When chunks move, the relevant existing entries are set to forward to the owner and new entries are inserted. When looking up a GVA, if an entry in the CTT is not found, the "home" bits of the address are used. If the chunk is owned locally, the parcel addressed to that chunk is directly spawned as a thread on that locality. In the case of arbitrary distributions, the corresponding CTT entries are broadcast to all of the localities. Going forward, I plan to separate the translation table from the routing table to aid in co-design with hardware-based directory implementations of AGAS. Should resolution be an issue, I will implement replicated concurrent chunk tries, as they typically have lower storage requirements in the average case. Tries can coalesce contiguous regions of memory but have to maintain nodes for the "holes" in the address space caused by moves.
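A sketch of the lookup path just described (consult the local CTT first, then fall back to the home bits of the address); ctt_lookup and gva_home_bits are illustrative names:

#include <stdint.h>

typedef struct ctt_entry { uint32_t owner; } ctt_entry_t;

ctt_entry_t *ctt_lookup(uint64_t gva);  /* cuckoo-hashtable lookup */
uint32_t gva_home_bits(uint64_t gva);   /* home field of the address */

/* Resolve where a parcel addressed to gva should be sent. */
static uint32_t resolve_owner(uint64_t gva) {
  ctt_entry_t *e = ctt_lookup(gva);
  if (e) {
    return e->owner;         /* resident here, or a forwarding record */
  }
  return gva_home_bits(gva); /* miss: the home locality always knows */
}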
Figure 5.6: AGAS move protocol between mover, home, owner, and target (move, rebind, owns?, and complete messages).
Move Protocol: The various AGAS implementations share the high-level move protocol diagrammed in Figure 5.6. A mover initiates a move of a source chunk to a destination locality by sending a move operation to the destination containing the address of the source. The destination can concurrently allocate space for the chunk, update the routing for the chunk if necessary, and send a completion message to the source. The source invalidates its local mapping, replacing it with a forwarding record for the destination when using software AGAS. It waits for all local readers and writers to drain away, and then sends the chunk data to the destination. Any new pin requests are disallowed by returning false. Once the destination has received the entire chunk, it installs the address translation locally.
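A source-side sketch of that sequence, with hypothetical helpers in place of the runtime internals:

#include <stdint.h>

typedef struct ctt_entry ctt_entry_t;

void set_forward(ctt_entry_t *e, uint32_t dst); /* forwarding record */
void reject_new_pins(ctt_entry_t *e);   /* try_pin now returns false */
void wait_pins_drained(ctt_entry_t *e); /* wait for pin count == 0 */
void send_chunk(ctt_entry_t *e, uint32_t dst);

/* Source side of a software-AGAS move to locality dst. */
static void move_source(ctt_entry_t *e, uint32_t dst) {
  set_forward(e, dst);     /* invalidate: local mapping now forwards */
  reject_new_pins(e);
  wait_pins_drained(e);    /* drain local readers and writers */
  send_chunk(e, dst);      /* ship the chunk data to the destination */
}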
-
Cyclic Allocation
• Symmetric cyclic segment/heap
• Can be segmented to speed up cyclic allocation
– At the expense of limiting allocation size
11
[Figure: 1. alloc(N); 2. reserve memory; 3. bcast alloc; 4. alloc local mem and insert CTT entry]
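For reference, a cyclic allocation in user code, assuming the HPX-5 hpx_gas_alloc_cyclic interface (n blocks of bsize bytes, block i placed on locality i mod N); the signature may differ between versions:

#include <hpx/hpx.h>

/* Allocate n doubles cyclically, one element per block. */
static hpx_addr_t alloc_cyclic_doubles(size_t n) {
  return hpx_gas_alloc_cyclic(n, sizeof(double), 0 /* alignment */);
}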
-
Move
• When a chunk moves, its owner is updated
• Move protocol
– Overlaps routing updates and data transfer
12
[Figure: AGAS move protocol between mover, home, owner, and target (move, rebind, owns?, and complete messages)]
• Home always "knows" where a chunk is
• Current implementation uses a two-sided transport
– Relies on being able to "resend" a parcel
• Complex interactions between move, unpin and delete (usage sketch below)
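A minimal usage sketch of an explicit move, assuming the HPX-5 hpx_gas_move and future-LCO interfaces; check hpx.h for the exact signatures:

#include <hpx/hpx.h>

/* Move the chunk at src to the locality of dst and wait for the
   rebind to complete. */
static void migrate(hpx_addr_t src, hpx_addr_t dst) {
  hpx_addr_t done = hpx_lco_future_new(0); /* completion event */
  hpx_gas_move(src, dst, done);            /* asynchronous move */
  hpx_lco_wait(done);
  hpx_lco_delete(done, HPX_NULL);
}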
-
Hardware AGAS
• Network-managed global address space
• CTT implemented by a TCAM in hardware
• The network "knows" the location of chunks
• Pros:
– Message optimal in terms of hops required for routing
• Cons:
– Centralized directory
– Existing hardware limitations
13
[Figure: a request (1. request(666)) is resolved in the network (2. resolve) against TCAM entries mapping address ranges (0x0-0x1FF through 0x800-0x9FF) to switches S0-S4, and answered (3. reply)]
-
Hardware AGAS
• Proof of concept based on the GASNet runtime system
• Uses SDN for routing table management
• Uses IB RD multicast
• IB multicast GIDs are mapped to multicast Ethernet MAC addresses
• Put performance:
– High switching latency
– Photon conduit overheads
14
[Figure: Home (rank 0), Source (rank 1), and Target (rank 2) attached to a switch. Ranks 0, 1, and 2 push OpenFlow mods (33:33:00:00:00:01 ➝ port 1, 33:33:00:00:00:02 ➝ port 2, 33:33:00:00:00:03 ➝ port 3) and locally attach to multicast addresses based on their page id (0 ➝ ff0e::ffff:1, 1 ➝ ff0e::ffff:2, 2 ➝ ff0e::ffff:3). Endpoints attach multicast addresses for all block ids (ff0e::ffff:*); the switch forwards based on the destination MAC address associated with the block id (33:33:*)]
[Plot: time taken (us) vs. message size (512 B to 1 MB) for software (IBV), software (Photon), and hardware put operations]
-
Comparing Hardware and Software AGAS
• Bounded software cache
– Cache-replacement policies (random, LRU)
– Evictions cause cache misses and degrade GAS performance
• Thread contention in the concurrent cache
– The concurrent cache is faster than a single-threaded cache, but 25% slower than the hardware implementation
15
[Plots: global updates per second vs. number of contending threads per process (hw, sw single-lock cache, sw concurrent cache), and global updates per second vs. maximum cache entries (sw random, sw LRU, hw)]
-
Remap performance in HW AGAS
• GUPS microbenchmark
– Table size 2048 words, 512 words/page, 16 nodes (192 cores)
– 4 million random updates were performed
• As page-movement frequency increases, SW becomes more expensive than HW
16
[Plots: GUPS remap performance, software vs. hardware AGAS]
-
Conclusions
• The HPX-5 runtime system provides dynamic, adaptive features
necessary for execution of large-scale irregular applications.
• Global virtual memory using an active global address space
(AGAS) enables the runtime to manage both locality and
load-imbalance concerns.
• Network-managed AGAS shows promise at smaller scales but
requires support from vendors and supercomputing centers to be
viable at larger scales.
17
-
Questions?
18