LSDS Large-Scale Distributed Systems Group Window-Based Hybrid Stream Processing for Heterogeneous Architectures Alexandros Koliousis [email protected] Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa & Peter Pietzuch Large-Scale Distributed Systems Group Department of Computing, Imperial College London http://lsds.doc.ic.ac.uk github.com/lsds/saber

Alexandros Koliousis - University of Cambridge

Jan 26, 2021

Transcript
  • LSDS Large-Scale Distributed Systems Group

    Window-Based Hybrid Stream Processing for Heterogeneous Architectures

    Alexandros Koliousis [email protected]

    Joint work with Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa & Peter Pietzuch

    Large-Scale Distributed Systems Group, Department of Computing, Imperial College London

    http://lsds.doc.ic.ac.uk

    github.com/lsds/saber

  • High-Throughput Low-Latency Analytics

    Google Zeitgeist: 40K user queries/s, within ms

    Feedzai: 40K card transactions/s, in 25 ms

    NovaSparks: 150M stock options/s, in less than 1 ms

    Facebook Insights: 9 GB of page metrics/s, in less than 10 s

    [Figure: a window over the stream, with time axis labels t and t+1]

  • Exploit Single-Node Heterogeneous Hardware

    Servers with CPUs and GPUs now common
    – 10x higher linear memory access throughput
    – Limited data transfer throughput

    [Figure: CPU sockets 1 and 2, each with cores C1 to C8, an L3 cache and DRAM (10s of GB of RAM), linked over the PCIe bus (command queue, DMA) to a GPU with processors 1 ... N, an L2 cache and DRAM (10s of streaming processors, 1000s of cores)]

    Use both CPU & GPU resources for stream processing

  • With Well-Defined High-Level Queries

    CQL: SQL-based declarative language for continuous queries [Arasu et al., VLDBJ’06]

    Credit card fraud detection example:
    – Find attempts to use same card in different regions within 5-min window

    select distinct W.cid
    from Payments [range 300 seconds] as W,
         Payments [partition-by 1 row] as L
    where W.cid = L.cid and W.region != L.region

    Annotation on the query: Self-join

    CQL offers correct window semantics
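To make the window semantics of the fraud-detection query concrete, here is a minimal Python sketch of what the self-join computes. It is an illustration only, not SABER code: the function name `fraud_candidates`, the tuple layout `(timestamp, cid, region)` and the choice of `cid` as the partitioning attribute of `L` are assumptions (the slide's `partition-by` clause does not name one).

```python
from collections import deque

WINDOW_SECONDS = 300  # corresponds to [range 300 seconds]


def fraud_candidates(payments):
    """Yield (timestamp, flagged card ids) after every arriving payment.

    payments: iterable of (timestamp, cid, region), ordered by timestamp.
    """
    window = deque()  # W: payments seen in the last 300 seconds
    latest = {}       # L: most recent region per card (assumed partition key: cid)
    for ts, cid, region in payments:
        window.append((ts, cid, region))
        latest[cid] = region
        # Evict payments that have fallen out of the range window.
        while window and window[0][0] <= ts - WINDOW_SECONDS:
            window.popleft()
        # Self-join: same card, different region than its latest payment.
        flagged = {c for (_, c, r) in window if latest[c] != r}
        yield ts, sorted(flagged)
```

With the payments `(0, "A", "UK")`, `(100, "B", "UK")`, `(200, "A", "US")`, `(600, "A", "UK")`, card `A` is flagged only at time 200, when its UK and US payments fall within the same 5-minute window.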

  • Challenges & Contributions

    1. How to parallelise sliding-window queries across CPU and GPU?
       Decouple query semantics from system parameters

    2. When to use CPU or GPU for a CQL operator?
       Hybrid processing: offload tasks to both CPU and GPU

    3. How to reduce GPU data movement costs?
       Amortise data movement delays with deep pipelining

    SABER: Window-Based Hybrid Stream Processing Engine for CPUs & GPUs

  • How to Parallelise Window Computation?

    Problem: Window semantics affect system throughput and latency
    – Pick task size based on window size?

    [Figure: stream tuples 1 to 6 with sliding windows (size: 4 sec, slide: 1 sec) assigned to tasks T1 and T2; window results are output in order]

    Window-based parallelism results in redundant computation

  • How to Parallelise Window Computation?

    Problem: Window semantics affect system throughput and latency
    – Pick task size based on window size? On window slide?

    [Figure: stream tuples 1 to 6 (size: 4 sec, slide: 1 sec) split into tasks T1 to T5; window results are composed from partial results]

    Slide-based parallelism limits GPU parallelism
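The trade-off between the two strategies can be sketched in a few lines of Python. This is an illustrative sketch for a sum aggregate over count-based windows (one tuple per second would match the 4 sec / 1 sec example); the function names are hypothetical, and the pane-based version assumes the slide divides the window size.

```python
def window_sums_naive(values, size, slide):
    """Window-based: every window is recomputed from scratch,
    so each tuple is read up to size / slide times."""
    return [sum(values[s:s + size])
            for s in range(0, len(values) - size + 1, slide)]


def window_sums_from_panes(values, size, slide):
    """Slide-based: one partial sum per slide-sized pane,
    then each window result is composed from consecutive panes."""
    if size % slide != 0:
        raise ValueError("this sketch assumes the slide divides the window size")
    # Only full panes are used, so no window is built from a partial pane.
    panes = [sum(values[s:s + slide])
             for s in range(0, len(values) - slide + 1, slide)]
    per_window = size // slide
    return [sum(panes[i:i + per_window])
            for i in range(len(panes) - per_window + 1)]
```

For `values = [1, 2, 3, 4, 5, 6]`, `size = 4` and `slide = 1`, both functions return `[10, 14, 18]`. The pane-based version avoids the redundant reads, but each pane is only one slide long, which leaves little data-parallel work per task when the slide is small.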

  • SABER’s Window Processing Model

    Idea: Decouple task size from window size/slide
    – Pick based on underlying hardware features
      • e.g. PCIe throughput
    – Task contains one or more window fragments
      • E.g. closing/pending/opening windows in T2

    [Figure: stream tuples 1 to 15 divided into tasks T1, T2 and T3 (5 tuples/task), overlaid with windows w1 to w5 (size: 7 rows, slide: 2 rows)]
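The relationship between fixed-size tasks and window fragments can be worked through in Python for the example above (size 7 rows, slide 2 rows, 5 tuples per task). This is a sketch under assumptions: rows are numbered from 1, window `wk` starts at row `(k - 1) * slide + 1`, and the function name `fragments` and the label `complete` (for a window that fits entirely inside one task) are introduced here for illustration.

```python
def fragments(task_start, task_end, size, slide):
    """List the window fragments inside a task covering rows
    task_start..task_end (1-based, inclusive).

    Returns tuples (window, first_row, last_row, kind).
    """
    out = []
    k = 1
    while (k - 1) * slide + 1 <= task_end:
        w_start = (k - 1) * slide + 1
        w_end = w_start + size - 1
        if w_end >= task_start:  # window overlaps the task
            opens = w_start >= task_start   # window begins inside the task
            closes = w_end <= task_end      # window ends inside the task
            if opens and closes:
                kind = "complete"
            elif opens:
                kind = "opening"
            elif closes:
                kind = "closing"
            else:
                kind = "pending"
            out.append((f"w{k}", max(w_start, task_start),
                        min(w_end, task_end), kind))
        k += 1
    return out
```

Calling `fragments(6, 10, 7, 2)` for task T2 (rows 6 to 10) gives closing fragments of w1 and w2, a pending fragment of w3, and opening fragments of w4 and w5, which is the closing/pending/opening mix the slide mentions for T2.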

  • Merging Window Fragment Results

    Idea: Decouple task size from window size/slide
    – Assemble window fragment results
    – Output them in correct order

    Worker A stores T1 results, merges window fragment results and forwards complete windows downstream

    [Figure: Worker A runs T1 (fragments of w1, w2, w3) and Worker B runs T2 (fragments of w1 to w5); the result stage holds task results in slots 1 and 2 of the output result circular buffer and produces the w1 and w2 results]
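The merge step can be sketched as follows. This is a simplified illustration, not SABER's implementation: it uses a dictionary in place of the circular buffer of slots, assumes a sum aggregate, and detects a complete window by counting merged rows. The class name `ResultStage` and its methods are hypothetical.

```python
class ResultStage:
    """Merges per-task window fragment results and emits windows in order."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.partial = {}      # window id -> (rows merged so far, partial sum)
        self.next_window = 1   # next window id to emit; keeps output ordered

    def add_task_result(self, fragment_results):
        """fragment_results: list of (window_id, row_count, partial_sum).
        Task results may arrive in any order. Returns the windows that
        became ready, as (window_id, sum) in window order."""
        for wid, rows, value in fragment_results:
            seen, total = self.partial.get(wid, (0, 0))
            self.partial[wid] = (seen + rows, total + value)
        return self._drain()

    def _drain(self):
        out = []
        while True:
            entry = self.partial.get(self.next_window)
            if entry is None or entry[0] < self.window_size:
                return out  # the next window in order is not complete yet
            out.append((self.next_window, entry[1]))
            del self.partial[self.next_window]
            self.next_window += 1
```

As a usage example, take rows 1 to 15 whose value equals the row number, windows of 7 rows sliding by 2, and tasks of 5 rows. If T2's fragment results arrive first, nothing is emitted; once T1's results arrive, w1 (sum 28) and w2 (sum 42) are emitted in order, while w3 waits for T3.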

  • Challenges & Contributions

    1. How to parallelise sliding-window queries across CPU and GPU?
       Decouple query semantics from system parameters

    2. When to use CPU or GPU for a CQL operator?
       Hybrid processing: offload tasks to both CPU and GPU

    3. How to reduce GPU data movement costs?
       Amortise data movement delays with deep pipelining

    SABER: Window-Based Hybrid Stream Processing Engine for CPUs & GPUs

  • SABER’s Hybrid Stream Processing Model

    Idea: Enable tasks to run on both processors
    – Scheduler assigns tasks to idle processors

    Task execution times per query:

    Query   CPU    GPU
    QA      3 ms   2 ms
    QB      3 ms   1 ms

    [Figure: task queue T1 to T10, each task belonging to query QA or QB, scheduled First-Come First-Served on a CPU/GPU timeline (0 to 12); timeline rows read T1 T4 T8 and T2 T3 T5 T6 T7 T9, followed by T10; an idle period is marked; annotation: "Past behavior: comes first"]

    FCFS ignores effectiveness of processor for given task
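A small Python simulation shows how first-come first-served assignment behaves with the execution times from the slide's table. The task queue passed in below is hypothetical (the slide's exact task-to-query mapping is not recoverable from the transcript), as are the function name `fcfs` and the rule that ties between idle processors go to the CPU.

```python
# Execution time per task in ms, taken from the table on the slide.
COST_MS = {"QA": {"CPU": 3, "GPU": 2}, "QB": {"CPU": 3, "GPU": 1}}


def fcfs(queue):
    """Assign each task, in queue order, to whichever processor is idle first.

    queue: list of (task, query). Returns (schedule, makespan_ms), where
    schedule is a list of (task, processor, start_ms).
    """
    free_at = {"CPU": 0, "GPU": 0}
    schedule = []
    for task, query in queue:
        # Earliest-idle processor; ties go to the CPU (an assumption).
        proc = min(free_at, key=lambda p: (free_at[p], p))
        start = free_at[proc]
        free_at[proc] = start + COST_MS[query][proc]
        schedule.append((task, proc, start))
    return schedule, max(free_at.values())
```

For the queue `[("T1", "QA"), ("T2", "QB"), ("T3", "QA"), ("T4", "QB")]`, the CPU takes T1 and later T4, so a QB task that needs 1 ms on the GPU occupies the CPU for 3 ms and the makespan is 6 ms. The scheduler never asks which processor suits the task.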

  • Heterogeneous Look-Ahead Scheduler (HLS)

    Idea: Idle processor skips tasks that could be executed faster by another processor
    – Decision based on observed query task throughput

    Task execution times per query:

    Query   CPU    GPU
    QA      3 ms   2 ms
    QB      3 ms   1 ms

    [Figure: the task queue T1 to T10 (queries QA and QB) scheduled by HLS on a CPU/GPU timeline (0 to 12); annotation: "Past behavior: comes first"]

    HLS fully utilises processors
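One plausible reading of the look-ahead rule is sketched below in Python. It is a simplified illustration, not the scheduler as implemented in SABER: the function name `pick_task` is hypothetical, the cost table is static rather than updated from observed throughput, and the accumulated-delay rule (a processor takes a task it is slower at once the other processor already has at least that much skipped work ahead of it) is an assumption of this sketch.

```python
# Execution time per task in ms, taken from the table on the slide.
# A real scheduler would refresh these from observed task throughput.
COST_MS = {"QA": {"CPU": 3, "GPU": 2}, "QB": {"CPU": 3, "GPU": 1}}


def pick_task(queue, proc):
    """Return the index of the task idle processor `proc` should take,
    or None if every queued task is better left to the other processor.

    queue: list of (task, query), head of the queue first.
    """
    delay = 0  # time the preferred processor needs for the tasks skipped so far
    for i, (_, query) in enumerate(queue):
        costs = COST_MS[query]
        preferred = min(costs, key=costs.get)  # faster processor for this query
        # Take the task if this processor is the faster one, or if the
        # faster one is already busy for at least as long as we would need.
        if proc == preferred or delay >= costs[proc]:
            return i
        delay += costs[preferred]
    return None
```

With the queue `[("T1", "QA"), ("T2", "QB"), ("T3", "QB"), ("T4", "QA")]`, an idle GPU takes T1 (index 0). An idle CPU skips T1 and T2, which the GPU can clear in 3 ms, and takes T3 (index 2), since running it on the CPU also takes 3 ms.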

  • The SABER Architecture

    Dispatching stage: Dispatch fixed-size tasks

    Scheduling & execution stage: Dequeue tasks based on HLS

    Result stage: Merge & forward partial window results

    Implementation: Java (15K LOC), C & OpenCL (4K LOC)

    [Figure: tasks T1 and T2 pass from the dispatching stage through the task queue to operator instances (op, α) on the CPU and GPU, then to the result stage]

  • Is Hybrid Stream Processing Effective?

    [Figure: stacked bar chart of throughput (10^6 tuples/s, axis 0 to 50) showing SABER (CPU contrib.) and SABER (GPU contrib.) for queries CM2 (Cluster Mgmt.), SG1 and SG2 (Smart Grid), LRB3 and LRB4 (LRB); operator annotations on the bars: aggr avg, group-by avg, group-by cnt, select. Hardware: Intel Xeon 2.6 GHz (16 cores), NVIDIA Quadro K5200 (2,304 cores)]

    Different queries result in different CPU:GPU processing split that is hard to predict offline

  • Is Hybrid Stream Processing Effective?

    [Figure: bar charts of throughput (GB/s; axes 0 to 6 and 0 to 0.3) comparing SABER (CPU only), SABER (GPU only) and SABER for aggregation, group-by and θ-join. Annotations: "GPU is faster", "CPU is faster", "Not additive due to queue contention"]

    The combined CPU and GPU throughput is always higher than that of either processor alone

  • Is Heterogeneous Look-Ahead Scheduling Effective?

    W1 benefits from static scheduling but HLS fully utilises GPU:
    – GPU also runs ~1% of group-by tasks

    W2 benefits from FCFS but HLS better utilises GPU:
    – HLS CPU:GPU split is 1:2.5 for project and 1:0.5 for aggr

    [Figure: bar chart of throughput (GB/s, axis 0 to 5) for workloads W1 and W2 under FCFS, Static and HLS. W1: project (π) and group-by cnt (γ), with a CPU/GPU table reading π 5x, γ 6x. W2: project (π) and aggr sum (α), with a CPU/GPU table reading π 1.5x, α 1.5x]

  • Summary

    Window processing model
    Decouples query semantics from system parameters

    Hybrid stream processing model
    Can achieve aggregate throughput of heterogeneous processors

    Heterogeneous Look-Ahead Scheduling (HLS)
    Allows use of both CPU and GPU opportunistically for arbitrary workloads

    Thank you! Any Questions?

    Alexandros Koliousis
    github.com/lsds/saber