Harp: Collective Communication on Hadoop
Bingjing Zhang, Yang Ruan, Judy Qiu

Jan 16, 2016

Transcript
Page 1: Harp: Collective Communication on Hadoop

Harp: Collective Communication on Hadoop

Bingjing Zhang, Yang Ruan, Judy Qiu

Page 2: Outline

Outline

• Motivations
  – Why do we bring collective communications to big data processing?
• Collective Communication Abstractions
  – Our approach to optimizing data movement
  – Hierarchical data abstractions and the operations defined on top of them
• MapCollective Programming Model
  – Extended from the MapReduce model to support collective communications
  – Two-level BSP parallelism
• Harp Implementation
  – A plugin on Hadoop
  – Component layers and the job flow
• Experiments
• Conclusion

Page 3: Motivation

Motivation

K-means Clustering in (Iterative) MapReduce vs. K-means Clustering with Collective Communication

[Diagram: in iterative MapReduce, the Map tasks (M) compute the local point sums, the Reduce tasks (R) compute the global centroids, and every iteration needs a shuffle, a gather of the reduce outputs, and a broadcast of the new centroids. With collective communication, the Map tasks control the iterations, compute the local point sums, and combine them directly with an allreduce.]

More efficient and much simpler!
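To make the pattern concrete, below is a minimal single-process sketch (in Java, the language Harp itself is written in) of the allreduce-style K-means step described above. Each "task" computes local per-centroid point sums, the partial sums are merged the way an allreduce would merge them across tasks, and the new centroids are derived from the merged sums. The class and variable names are illustrative, not Harp code.

import java.util.Arrays;
import java.util.Random;

// Sketch of the allreduce-based K-means step: local sums per centroid,
// merged globally, then new centroids derived from the merged sums.
public class KMeansAllreduceSketch {
  public static void main(String[] args) {
    int tasks = 4, k = 3, dim = 2, pointsPerTask = 100;
    Random rnd = new Random(42);
    double[][] centroids = {{0, 0}, {5, 5}, {10, 0}};
    double[][] sum = new double[k][dim]; // merged, as after an allreduce
    int[] count = new int[k];

    for (int t = 0; t < tasks; t++) {            // each Map task...
      for (int p = 0; p < pointsPerTask; p++) {
        double[] x = {rnd.nextDouble() * 10, rnd.nextDouble() * 10};
        int best = 0;                            // find the nearest centroid
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = 0;
          for (int j = 0; j < dim; j++)
            d += (x[j] - centroids[c][j]) * (x[j] - centroids[c][j]);
          if (d < bestDist) { bestDist = d; best = c; }
        }
        // ...accumulates its local sums; an allreduce would combine the
        // per-task partial sums so every task holds the global ones.
        for (int j = 0; j < dim; j++) sum[best][j] += x[j];
        count[best]++;
      }
    }
    for (int c = 0; c < k; c++)                  // compute the new centroids
      for (int j = 0; j < dim; j++)
        if (count[c] > 0) centroids[c][j] = sum[c][j] / count[c];
    System.out.println(Arrays.deepToString(centroids));
  }
}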

Page 4: Large Scale Data Analysis Applications

Large Scale Data Analysis Applications

• Iterative Applications
  – Cached and reused local data between iterations
  – Complicated computation steps
  – Large intermediate data in communications
  – Various communication patterns

Examples: Computer Vision, Complex Networks, Bioinformatics, Deep Learning

Page 5: The Models of Contemporary Big Data Tools

The Models of Contemporary Big Data Tools

[Diagram: a taxonomy of contemporary big data tools by model (MapReduce Model, DAG Model, Graph Model, BSP/Collective Model) and by purpose (for iterations/learning, for streaming, for query): Hadoop, HaLoop, Twister, Spark, Dryad, DryadLINQ, Stratosphere/Flink, Pig, Hive, Tez, MRQL, Spark SQL, Storm, S4, Samza, Spark Streaming, Giraph, Hama, GraphLab, GraphX, Harp.]

Many of them have fixed communication patterns!

Page 6: Contributions

Contributions

[Diagram, parallelism model: the MapReduce model connects Map tasks (M) to Reduce tasks (R) through a shuffle, while the MapCollective model connects the Map tasks directly through collective communication.]

[Diagram, architecture: YARN is the resource manager; MapReduce V2 and Harp sit above it at the framework layer; MapReduce applications run on MapReduce V2 and MapCollective applications run on Harp at the application layer.]

Page 7: Collective Communication Abstractions

Collective Communication Abstractions

• Hierarchical Data Abstractions
  – Basic types: arrays, key-values, vertices, edges and messages
  – Partitions: array partitions, key-value partitions, vertex partitions, edge partitions and message partitions
  – Tables: array tables, key-value tables, vertex tables, edge tables and message tables
• Collective Communication Operations
  – Broadcast, allgather, allreduce
  – Regroup
  – Send messages to vertices, send edges to vertices

Page 8: Hierarchical Data Abstractions

Hierarchical Data Abstractions

[Diagram: three layers of abstraction. Basic types (byte arrays, int arrays, long arrays, double arrays, key-values, and vertices, edges and messages, all rooted in Object and implementing Transferable) are grouped into partitions (array partition <array type>, key-value partition, vertex partition, edge partition, message partition), which support broadcast and send. Partitions are grouped into tables (array table <array type>, key-value table, vertex table, edge table, message table), which support broadcast, allgather, allreduce, regroup, message-to-vertex and other operations.]
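The layering can be modeled in a few lines of Java. The sketch below is an illustrative model of the table -> partition -> basic type hierarchy with simplified names; it is not Harp's actual class hierarchy.

import java.util.HashMap;
import java.util.Map;

// Illustrative model of the hierarchy: a table holds partitions by ID,
// and each partition wraps a basic type (here, a double array).
public class HierarchySketch {
  static class DoubleArrayPartition {
    final int id;
    final double[] array;
    DoubleArrayPartition(int id, double[] array) { this.id = id; this.array = array; }
  }

  static class DoubleArrayTable {
    // Collective operations (broadcast, allgather, allreduce, regroup)
    // are defined over tables of partitions like this one.
    final Map<Integer, DoubleArrayPartition> partitions = new HashMap<>();
    void addPartition(DoubleArrayPartition p) { partitions.put(p.id, p); }
  }

  public static void main(String[] args) {
    DoubleArrayTable table = new DoubleArrayTable();
    table.addPartition(new DoubleArrayPartition(0, new double[] {1.0, 2.0}));
    table.addPartition(new DoubleArrayPartition(1, new double[] {3.0, 4.0}));
    System.out.println(table.partitions.keySet()); // prints [0, 1]
  }
}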

Page 9: Example: regroup

Example: regroup

[Diagram: three processes (Process 0, 1, 2) each hold a table of partitions, with Partitions 0 through 4 scattered across the processes and some partition IDs present on more than one process. The regroup operation redistributes the partitions so that all partitions with the same ID end up in the table of the same process.]
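A serial sketch of the effect of regroup: partitions scattered across processes are rerouted by partition ID so that all copies of an ID land on the same process. The "id % numProcesses" destination rule is an illustrative choice, not necessarily the one Harp uses.

import java.util.ArrayList;
import java.util.List;

// Serial sketch of regroup: route every partition to the process
// chosen by its ID, so same-ID partitions end up together.
public class RegroupSketch {
  public static void main(String[] args) {
    int numProcesses = 3;
    // before.get(p) = partition IDs held by process p before regroup
    List<List<Integer>> before = List.of(
        List.of(0, 1), List.of(2, 3, 0), List.of(4, 2));

    List<List<Integer>> after = new ArrayList<>();
    for (int p = 0; p < numProcesses; p++) after.add(new ArrayList<>());

    for (List<Integer> held : before)
      for (int id : held)
        after.get(id % numProcesses).add(id); // route by partition ID

    System.out.println(after); // prints [[0, 3, 0], [1, 4], [2, 2]]
  }
}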

Page 10: Operations

Operations

Operation Name              Data Abstraction                     Algorithm
broadcast                   arrays, key-value pairs & vertices   chain
allgather                   arrays, key-value pairs & vertices   bucket
allreduce                   arrays, key-value pairs              bi-directional exchange; regroup-allgather
regroup                     arrays, key-value pairs & vertices   point-to-point direct sending
send messages to vertices   messages, vertices                   point-to-point direct sending
send edges to vertices      edges, vertices                      point-to-point direct sending
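Of the algorithms above, the bi-directional exchange used for allreduce is worth a closer look. The sketch below is a serial simulation of recursive doubling, a standard form of bi-directional exchange, assuming the process count is a power of two: in each of log2(p) rounds, process i exchanges data with process i XOR step and both sides combine, so after the last round every process holds the global sum.

import java.util.Arrays;

// Serial simulation of bi-directional exchange (recursive doubling)
// for allreduce, assuming p is a power of two.
public class BiDirectionalExchangeSketch {
  public static void main(String[] args) {
    int p = 8; // number of processes
    double[][] data = new double[p][];
    for (int i = 0; i < p; i++) data[i] = new double[] {i, 2.0 * i};

    for (int step = 1; step < p; step <<= 1) {   // log2(p) rounds
      double[][] next = new double[p][];
      for (int i = 0; i < p; i++) {
        int partner = i ^ step;                  // exchange partner this round
        double[] merged = data[i].clone();
        for (int d = 0; d < merged.length; d++) merged[d] += data[partner][d];
        next[i] = merged;                        // both sides combine
      }
      data = next;
    }
    System.out.println(Arrays.toString(data[0])); // prints [28.0, 56.0]
  }
}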

Page 11: MapCollective Programming Model

MapCollective Programming Model

• BSP parallelism
  – Inter-node parallelism and intra-node parallelism

[Diagram: synchronization happens at the process level across nodes and at the thread level within each process.]
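The thread level of this two-level parallelism can be sketched with plain Java threads: threads inside one process each compute a partial result and then meet at a barrier before the next superstep, while across processes the matching synchronization point would be a collective operation. This is a generic BSP illustration, not Harp's task code.

import java.util.concurrent.CyclicBarrier;

// Sketch of the thread level of two-level BSP: threads compute, then
// synchronize at a barrier; across processes the synchronization
// point would be a collective operation such as allreduce.
public class TwoLevelBspSketch {
  static final int THREADS = 4;
  static final double[] partial = new double[THREADS];

  public static void main(String[] args) {
    CyclicBarrier barrier = new CyclicBarrier(THREADS, () -> {
      double sum = 0;
      for (double v : partial) sum += v;
      System.out.println("superstep done, local sum = " + sum);
    });
    for (int t = 0; t < THREADS; t++) {
      final int id = t;
      new Thread(() -> {
        partial[id] = id + 1;            // thread-level computation
        try { barrier.await(); }         // thread-level synchronization
        catch (Exception e) { throw new RuntimeException(e); }
      }).start();
    }
  }
}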

Page 12: The Harp Library

The Harp Library

• A Hadoop plugin that targets Hadoop 2.2.0
• Provides an implementation of the collective communication abstractions and the MapCollective programming model
• Project Link
  – http://salsaproj.indiana.edu/harp/index.html
• Source Code Link
  – https://github.com/jessezbj/harp-project

Page 13: Component Layers

Component Layers

[Diagram: the overall stack places YARN at the bottom, MapReduce V2 and Harp above it, and MapReduce applications and MapCollective applications on top. Inside Harp, the layers build up from MapReduce through the collective communication abstractions (collective communication APIs, collective communication operators, hierarchical data types for tables and partitions, a memory resource pool, and array, key-value and graph data abstractions) to the MapCollective programming model (the MapCollective interface and task management), with applications such as K-Means, WDA-SMACOF and Graph-Drawing on top.]

Page 14: A MapCollective Job

A MapCollective Job

[Diagram: the client submits a MapCollective job through the MapCollective Runner to the YARN Resource Manager. I. The runner launches the MapCollectiveAppMaster, which 1. records the Map task locations from the original MapReduce AppMaster. II. The MapCollectiveContainerAllocator and MapCollectiveContainerLauncher launch the tasks. Each task runs a CollectiveMapper through its setup, mapCollective and cleanup phases, which 2. reads key-value pairs, 3. invokes collective communication APIs, and 4. writes output to HDFS.]
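Putting the flow together, the body of a task might look like the skeleton below. The class and phase names (CollectiveMapper, setup, mapCollective, cleanup) come from the slide; the generic parameters, the KeyValReader type, and the allreduce call shown in the comments are assumptions for illustration, not verbatim Harp API.

import java.io.IOException;

// Skeleton of a MapCollective task following steps 2-4 of the diagram.
// CollectiveMapper and KeyValReader are assumed to be provided by Harp;
// their exact signatures here are illustrative.
public class MyMapCollectiveTask
    extends CollectiveMapper<String, String, Object, Object> {

  @Override
  protected void setup(Context context) {
    // Read configuration and allocate tables before the iterations.
  }

  @Override
  protected void mapCollective(KeyValReader reader, Context context)
      throws IOException, InterruptedException {
    // 2. Read the key-value pairs assigned to this task.
    while (reader.nextKeyValue()) {
      String key = reader.getCurrentKey();
      String value = reader.getCurrentValue();
      // ... load local data into cached tables ...
    }
    for (int iter = 0; iter < 10; iter++) {
      // ... local computation on the cached data ...
      // 3. Invoke a collective communication API, e.g.:
      //    allreduce("kmeans", "allreduce-" + iter, table);
    }
    // 4. Write the output to HDFS.
  }

  @Override
  protected void cleanup(Context context) {
    // Release tables and pooled memory.
  }
}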

Page 15: Experiments

Experiments

• Applications
  – K-means Clustering
  – Force-directed Graph Drawing Algorithm
  – WDA-SMACOF
• Test Environment
  – Big Red II
    • http://kb.iu.edu/data/bcqt.html

Page 16: K-means Clustering

K-means Clustering

[Diagram: the Map tasks allreduce the centroids in each iteration.]

[Chart: execution time (seconds) and speedup vs. number of nodes for two test cases: 500M points with 10K centroids and 5M points with 1M centroids.]

Page 17: Force-directed Graph Drawing Algorithm

Force-directed Graph Drawing Algorithm

T. Fruchterman, M. Reingold. “Graph Drawing by Force-Directed Placement”, Software Practice & Experience 21 (11), 1991.

[Diagram: the Map tasks allgather the positions of the vertices in each iteration.]

[Chart: execution time (seconds) and speedup vs. number of nodes.]

Page 18: WDA-SMACOF

WDA-SMACOF

Y. Ruan et al. “A Robust and Scalable Solution for Interpolative Multidimensional Scaling With Weighting”. E-Science, 2013.

[Diagram: in each iteration, the Map tasks allreduce the stress value, and allgather and allreduce results in the conjugate gradient process.]

[Chart: execution time (seconds) vs. number of nodes for 100K, 200K, 300K and 400K points, and speedup vs. number of nodes for 100K, 200K and 300K points.]

Page 19: Conclusions

Conclusions

• Harp is implemented as a plugin that brings high performance to the Apache Big Data Stack and bridges the differences between the Hadoop ecosystem and HPC systems through a clear communication abstraction, which did not previously exist in the Hadoop ecosystem.
• The experiments show that with Harp we can scale three applications to 128 nodes with 4096 CPUs on the Big Red II supercomputer, where the speedup in most tests is close to linear.