  • Active Global Address Space (AGAS): Global Virtual Memory for Dynamic Adaptive Many-Tasking (AMT) Runtimes

    Abhishek Kulkarni
    Center for Research in Extreme Scale Technologies
    Indiana University
    November 19, 2015

    Dissertation Research Showcase Presentation

  • HPC Programming / Execution models

    Message Passing:
    • BSP: compute-communicate paradigm
    • Data divided statically among available processors

    Asynchronous Many-Tasking (AMT):
    • Lightweight tasks concurrently operating on global data
    • Well-suited for dynamic and adaptive execution

  • Asynchronous Many-Tasking (AMT) models

    • Static/dynamic task and dataflow graphs
    • Lightweight threads
    • Inter-thread synchronization
    • Active Messages
    • Global Address Space

    [Figure: logical representation of an AMT program. A numbered task graph and a 3x3 data distribution map onto the runtime stack: task-parallel runtimes over a global address space (SW-AGAS or PGAS/HW-AGAS) with a CTT, local memories, and a parcel/GAS transport over the network, providing distributed execution.]

  • ParalleX Execution Model

    • Lightweight multi-threading
      – Divides work into smaller tasks
      – Increases concurrency
    • Message-driven computation
      – Move work to data
      – Keeps work local, stops blocking
    • Constraint-based synchronization
      – Declarative criteria for work
      – Event-driven
      – Eliminates global barriers
    • Data-directed execution
      – Merger of flow control and data structure
    • Shared name space
      – Global address space
      – Simplifies random gathers

  • High Performance ParalleX

    • The HPX runtime system reifies the ParalleX execution model:
      – Localities
      – Lightweight Threads
      – Processes
      – Local Control Objects (LCOs)
      – Parcels (Active Messages)
      – Active Global Address Space (AGAS)

    • Fine-grained concurrency
      – Tasks, interrupts, lightweight threads

    • Lightweight, globally-addressable synchronization objects
      – E.g., futures (IVars), monotonic counters (gates), reductions

    http://hpx.crest.iu.edu/

    [Figure: HPX runtime architecture. A global address space spans the thread scheduler, LCO manager, process manager, action manager, memory allocator, parcel transport, and network interface, layered over cores, memory, the NIC, and the operating system.]

  • Global Address Space in HPX-5

    • Global memory is globally addressable

    • Supports local, cyclic, blocked and user-defined distributions

    • Two modes of operation:
      – Static (PGAS)
      – Active (AGAS)

    • Asynchronous access to global memory

    • Memory must be pinned locally prior to access (see the sketch below)

    • GAS attributes, atomics, collectives
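
    A minimal sketch of the pin-before-access discipline, assuming the hpx_gas_try_pin/hpx_gas_unpin names and action conventions of contemporaneous HPX-5 releases:

        #include <hpx/hpx.h>
        #include <stdint.h>

        /* Action that increments one element of a global array. The block
           must be pinned before its local address may be dereferenced; if
           the pin fails, the block is remote and the parcel is resent to
           the current owner. */
        static int _incr_handler(void) {
          hpx_addr_t target = hpx_thread_current_target();
          uint64_t *element = NULL;
          if (!hpx_gas_try_pin(target, (void**)&element)) {
            return HPX_RESEND;           /* not local: chase the block */
          }
          (*element)++;                  /* safe: the block is pinned here */
          hpx_gas_unpin(target);         /* lets moves and frees proceed */
          return HPX_SUCCESS;
        }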


  • Partitioned Global Address Space (PGAS)

    • Global physical memory
    • Physical location encoded in the address
    • Memory space same as address space
    • Faster address translation
    • Traditional PGAS supports put/get/atomics
    • Static

    Active Global Address Space (AGAS)

    • Global virtual memory
    • Physical location maintained separately
    • Memory space distinct from address space
    • Address translation potentially expensive
    • Supports arbitrary active messages (AMs)
    • Dynamic (allows migration)
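
    To make the translation difference concrete, here is a minimal sketch (the field widths, struct layout, and ctt_find helper are hypothetical, not the HPX-5 encoding): a PGAS address resolves by arithmetic alone, while an AGAS address needs a table lookup that can be rewritten when data moves.

        #include <stdint.h>

        /* PGAS: the owner rank occupies fixed bits of the address, so
           translation is pure arithmetic (hypothetical 16/48-bit split). */
        static inline uint32_t pgas_owner(uint64_t gva) {
          return (uint32_t)(gva >> 48);
        }

        /* AGAS: the owner lives in a translation table (the CTT), so a
           chunk can move by rewriting its entry, not its address. */
        typedef struct {
          uint64_t gva;      /* global virtual address of the chunk */
          uint32_t owner;    /* locality currently holding the chunk */
          void    *lva;      /* local virtual address, if resident here */
        } ctt_entry_t;

        extern ctt_entry_t *ctt_find(uint64_t gva);  /* hashtable lookup */

        static inline uint32_t agas_owner(uint64_t gva) {
          ctt_entry_t *e = ctt_find(gva);
          return e ? e->owner : (uint32_t)(gva >> 48);  /* home bits */
        }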

  • Load and Data Imbalance in Graph Algorithms

    [Plots: data imbalance (panels: 1D, 2D) and data distribution variance (panels: 256, 512) for Graph500 RMAT (scale=20, partitions=1024), comparing pgas against the agas-ll, agas-pred, agas-random, and agas-succ policies.]

    Move the smallest partition to:
    • Its successor (agas-succ)
    • Its predecessor (agas-pred)
    • A random locality (agas-random)
    • The least-loaded locality (agas-ll)

  • AGAS Features

    • User-defined data distributions (sketched below)
      – Existing PGAS models support cyclic and block-cyclic
      – Similar to Chapel, ZPL
      – Higher-order and composable

    • Data migration
      – Explicit moving of individual chunks
      – Rebalancing of global data
      – Implicit runtime-managed remapping based on introspection

    • Co-data actions
      – Relational dependencies between tasks and data
      – Runtime chooses between a) moving data to computation or b) moving computation to data

    • Caching, data stealing, speculative migration
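
    As an illustration only (the type and function names are hypothetical, not the HPX-5 API), a user-defined distribution can be modeled as a composable map from block index to locality:

        #include <stdint.h>

        /* A distribution maps a block index to an owning locality. Using a
           function pointer plus an environment keeps distributions
           higher-order and composable. */
        typedef uint32_t (*dist_fn_t)(uint64_t block, uint32_t nlocalities,
                                      void *env);

        static uint32_t dist_cyclic(uint64_t block, uint32_t n, void *env) {
          (void)env;
          return (uint32_t)(block % n);        /* round-robin */
        }

        static uint32_t dist_blocked(uint64_t block, uint32_t n, void *env) {
          uint64_t nblocks = *(uint64_t *)env; /* total block count */
          uint64_t per = (nblocks + n - 1) / n;
          return (uint32_t)(block / per);      /* contiguous ranges */
        }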

  • Software AGAS

    • Flat, byte-addressable global address space
    • Chunk size limited to 4 TB; nodes limited to 64K
    • Scalable memory allocators handle local/cyclic segments
      – jemalloc, TBBMalloc
    • Chunk table maps LVA → GVA
    • CTT (Chunk Translation Table) maps GVA → LVA
    • CTT entry (Figure 5.5 below):

    [Figure: the HPX-5 AGAS allocator. Local and cyclic chunk allocators, each backed by a scalable global memory allocator (local and cyclic segments), sit over OS virtual memory; the chunk table and CTT record the resulting mappings.]

    Bit offset:  0      32     64    128     192       256    288
    Field:       count  owner  lva   blocks  on_unpin  dist

    Figure 5.5: A CTT entry.

    Allocation is conceptually simple. The chunk size is rounded up to the nearest power of 2, and log2(size) is stored in the size-class bits of the address. This is necessary to implement blocked allocations in AGAS, as that information is not available through the address-arithmetic interface. To translate an LVA offset to a GVA, we need a chunk table shared by both segments. With access to the scalable memory allocator's metadata, a shadow allocation scheme could reduce this lookup to O(1). Cyclic allocations are implemented using a symmetric cyclic heap managed by the root locality. This is conceptually simple and works fine in practice because a) cyclic allocations in HPX-5 are synchronous, and b) cyclic allocations are collective and require an update to the CTT on all localities. I will introduce an asynchronous lazy cyclic allocation scheme in HPX-5 implemented using a segmented symmetric cyclic heap. For memory allocations with arbitrary distributions, the distribution map will be stored in the CTT to be referred to during address arithmetic.
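
    A minimal sketch of the size-class encoding described above (the bit positions are illustrative, not the HPX-5 layout):

        #include <stdint.h>

        /* Round a request up to the next power of two and stash
           log2(size) in the address's size-class bits, so block
           boundaries can be recovered by address arithmetic alone. */
        #define SIZE_CLASS_SHIFT 48
        #define SIZE_CLASS_BITS  6

        static inline uint64_t size_class(uint64_t bytes) {
          uint64_t c = 0;
          while (((uint64_t)1 << c) < bytes) c++;  /* ceil(log2(bytes)) */
          return c;
        }

        static inline uint64_t gva_encode(uint64_t offset, uint64_t bytes) {
          return offset | (size_class(bytes) << SIZE_CLASS_SHIFT);
        }

        static inline uint64_t gva_block_size(uint64_t gva) {
          uint64_t mask = ((uint64_t)1 << SIZE_CLASS_BITS) - 1;
          return (uint64_t)1 << ((gva >> SIZE_CLASS_SHIFT) & mask);
        }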

    The deallocation procedure is similar to allocation, but for one complication: deallocation has to check for any outstanding "pin" counts before freeing the chunks. As HPX-5 supports asynchronous deallocation, to avoid waiting threads, deallocation inserts a "continuation parcel" in the CTT to be released when the "pin" count reaches 0.
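
    A sketch of that deferred-free path; the packed state word, parcel_launch helper, and field names are hypothetical, loosely following Figure 5.5:

        #include <stdatomic.h>
        #include <stdint.h>

        extern void parcel_launch(void *parcel);

        /* Pack the pin count with a FREEING flag in one atomic word. Only
           the unpin that drops the count to 0 under FREEING runs the
           continuation, so it fires exactly once (try_pin, sketched below,
           never adds a pin after FREEING is set). */
        #define FREEING ((uint64_t)1 << 63)

        typedef struct {
          _Atomic uint64_t state;  /* FREEING flag | pin count */
          void *on_unpin;          /* continuation parcel, run at count 0 */
        } ctt_state_t;

        void agas_unpin(ctt_state_t *e) {
          if (atomic_fetch_sub(&e->state, 1) == (FREEING | 1))
            parcel_launch(e->on_unpin);  /* last pin of a dying chunk */
        }

        /* Asynchronous delete: hold a guard pin while publishing the
           continuation, then drop it; no thread ever blocks here. */
        void agas_free_async(ctt_state_t *e, void *continuation) {
          atomic_fetch_add(&e->state, 1);       /* guard pin */
          e->on_unpin = continuation;
          atomic_fetch_or(&e->state, FREEING);  /* refuse new pins */
          agas_unpin(e);                        /* launches if drained */
        }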

    Chunk Translation: Chunk translation is performed through a try_pin operation, which bumps the "pin" count associated with the chunk entry (see Figure 5.5) in the CTT. The chunk table and CTT are presently implemented using a fast, scalable cuckoo hashtable. The CTT must maintain the current mapping from the blocks that it owns to their corresponding base virtual addresses, as well as any additional state required. The initial residing locality of a chunk (pointed to by the distribution map) is considered the home of the block. When a chunk is moved to another locality, the owner field in its CTT entry is updated to point to the destination locality. In HPX-5, the CTT is distributed such that each locality maintains only the entries for its "home" chunks. When chunks move, the affected existing entries are set to forward to the new owner and new entries are inserted. When looking up a GVA, if an entry is not found in the CTT, the "home" bits of the address are used. If the chunk is owned locally, the parcel addressed to that chunk is directly spawned as a thread on that locality. In the case of arbitrary distributions, the corresponding CTT entries are broadcast to all of the localities. Going forward, I plan to separate the translation table from the routing table to aid in co-design with hardware-based directory implementations of AGAS. Should resolution be an issue, I will implement replicated concurrent chunk tries, as they typically have lower storage requirements in the average case. Tries can coalesce contiguous regions of memory but have to maintain nodes for the "holes" in the address space caused by moves.
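
    A sketch of the try_pin fast path, continuing the hypothetical types from the deferred-free sketch above; the CAS loop is what lets moves and deletes disallow new pins:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdatomic.h>

        /* Hypothetical CTT slot: packed pin state, owner, local mapping. */
        typedef struct {
          ctt_state_t pin;
          uint32_t owner;    /* current owner locality */
          void *lva;         /* local virtual address, if resident here */
        } ctt_slot_t;

        extern ctt_slot_t *ctt_lookup(uint64_t gva);  /* cuckoo hashtable */

        /* Bump the pin count unless the chunk is being moved or freed;
           the CAS loop guarantees no pin is added after FREEING is set. */
        bool ctt_try_pin(uint64_t gva, void **lva_out) {
          ctt_slot_t *e = ctt_lookup(gva);
          if (!e) return false;              /* no entry: use home bits */
          uint64_t s = atomic_load(&e->pin.state);
          do {
            if (s & FREEING) return false;   /* move/free in progress */
          } while (!atomic_compare_exchange_weak(&e->pin.state, &s, s + 1));
          *lva_out = e->lva;
          return true;
        }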

    [Figure 5.6: AGAS move. A sequence diagram among the mover, home, owner, and target localities showing the move, rebind, owns?/rebind, and complete messages.]

    Move Protocol: The various AGAS implementations share the high-level move protocol diagrammed in Figure 5.6. A mover initiates a move of a source chunk to a destination locality by sending a move operation to the destination containing the address of the source. The destination can concurrently allocate space for the chunk, update the routing for the chunk if necessary, and send a completion message to the source. The source invalidates its local mapping, replacing it with a forwarding record for the destination when using software AGAS. It waits for all local readers and writers to drain away, and then sends the chunk data to the destination. Any new pin requests are disallowed by returning false. Once the destination has received the entire chunk, it installs the address translation locally.
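
    A condensed sketch of the source side of that protocol (all helpers are hypothetical; allocation on the destination and routing updates are elided):

        #include <stddef.h>
        #include <stdint.h>

        extern ctt_slot_t *ctt_lookup(uint64_t gva);
        extern void ctt_set_forward(ctt_slot_t *e, uint32_t dst);
        extern void block_new_pins(ctt_slot_t *e);   /* try_pin -> false */
        extern void wait_for_pins_to_drain(ctt_slot_t *e);
        extern void send_chunk_data(uint32_t dst, const void *lva, size_t n);

        void agas_move_source(uint64_t gva, uint32_t dst, size_t size) {
          ctt_slot_t *e = ctt_lookup(gva);
          ctt_set_forward(e, dst);     /* software AGAS: forwarding record */
          block_new_pins(e);           /* new try_pin calls return false */
          wait_for_pins_to_drain(e);   /* local readers/writers finish */
          send_chunk_data(dst, e->lva, size);  /* dest installs mapping */
        }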


  • Cyclic Allocation

    • Symmetric cyclic segment/heap
    • Can be segmented to speed up cyclic allocation
      – At the expense of limiting allocation size

    Allocation steps:
    1. alloc(N)
    2. Reserve memory
    3. Broadcast the allocation
    4. Allocate local memory and insert a CTT entry on each locality
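
    For context, a cyclic allocation and subsequent addressing in HPX-5 look roughly like this; the prototypes follow contemporaneous HPX-5 releases and should be treated as an assumption:

        #include <hpx/hpx.h>

        /* Allocate n blocks of bsize bytes laid out cyclically across all
           localities, then address block i with GAS address arithmetic. */
        void cyclic_example(size_t n, uint32_t bsize) {
          hpx_addr_t base = hpx_gas_alloc_cyclic(n, bsize, 0);
          for (size_t i = 0; i < n; ++i) {
            hpx_addr_t block = hpx_addr_add(base, i * bsize, bsize);
            /* ... send a parcel to `block`, or try_pin it if local ... */
            (void)block;
          }
        }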

  • Move

    • When a chunk moves, its owner is updated

    • Move protocol
      – Overlaps routing updates and data transfer

    [Figure: the AGAS move sequence diagram, repeated from Figure 5.6.]

    • Home always "knows" where a chunk is
    • Current implementation uses a two-sided transport
      – Relies on being able to "resend" a parcel
    • Complex interactions between move, unpin, and delete (see the sketch below)
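
    At the API level, a user-initiated move looks roughly like this; hpx_gas_move's signature and the LCO helpers follow contemporaneous HPX-5 releases and should be treated as an assumption:

        #include <hpx/hpx.h>

        /* Move the block at `src` to the locality owning `dst`, using a
           0-byte future LCO to wait for the rebind to complete. */
        void move_block(hpx_addr_t src, hpx_addr_t dst) {
          hpx_addr_t done = hpx_lco_future_new(0);
          hpx_gas_move(src, dst, done);
          hpx_lco_wait(done);              /* protocol of Figure 5.6 runs */
          hpx_lco_delete(done, HPX_NULL);
        }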

  • Hardware AGAS

    • Network-managed global address space

    • CTT implemented by a TCAM in hardware

    • The network "knows" the location of chunks

    • Pros:
      – Message-optimal in terms of hops required for routing

    • Cons:
      – Centralized directory
      – Existing hardware limitations

    [Figure: hardware AGAS resolution. 1. request(666) enters the network; 2. the network resolves the owner from address-range entries (0x0-0x1FF, 0x200-0x3FF, 0x400-0x5FF, 0x600-0x7FF, 0x800-0x9FF); 3. reply.]

  • Hardware AGAS

    • Proof of concept based on the GASNet runtime system
    • Uses SDN for routing-table management
    • Uses IB RD multicast
    • IB multicast GIDs are mapped to multicast Ethernet MAC addresses
    • Put performance:
      – High switching latency
      – Photon conduit overheads

    [Figure: Home (rank 0), Source (rank 1), and Target (rank 2) attached to a switch. The ranks push OpenFlow mods (33:33:00:00:00:01 → port 1, 33:33:00:00:00:02 → port 2, 33:33:00:00:00:03 → port 3) and locally attach to multicast addresses based on their page id (0 → ff0e::ffff::1, 1 → ff0e::ffff::2, 2 → ff0e::ffff::3). Endpoints attach multicast addresses for all block ids (ff0e::ffff:*); the switch forwards based on the destination MAC address associated with the block id (33:33:*).]

    [Plot: put time (us) vs. message size (512 B to 1 MB) for software (IBV), software (Photon), and hardware AGAS.]

  • Comparing Hardware and Software AGAS

    • Bounded software cache
      – Cache-replacement policies (random, LRU)
      – Evictions cause cache misses and degrade GAS performance

    • Thread contention in the concurrent cache
      – The concurrent cache is faster than a single-lock cache but 25% slower than the hardware implementation

    [Plots: global updates per second vs. number of contending threads per process (hw, sw single-lock cache, sw concurrent cache), and vs. maximum cache entries (hw, sw random, sw LRU).]

  • Remap Performance in HW AGAS

    • GUPS microbenchmark
      – Table size 2048 words, 512 words/page, 16 nodes (192 cores)
      – 4 million random updates performed

    • As page-movement frequency increases, SW becomes more expensive than HW


  • Conclusions

    • The HPX-5 runtime system provides dynamic, adaptive features necessary for execution of large-scale irregular applications.

    • Global virtual memory using an active global address space (AGAS) enables the runtime to manage both locality and load-imbalance concerns.

    • Network-managed AGAS shows promise at smaller scales but requires support from vendors and supercomputing centers to be viable at larger scales.


  • Questions?
