Harp: Collective Communication on Hadoop
Bingjing Zhang, Yang Ruan, Judy Qiu

Jan 16, 2016

Transcript
Page 1: Harp: Collective Communication on Hadoop

Harp: Collective Communication on Hadoop

Bingjing Zhang, Yang Ruan, Judy Qiu

Page 2: Outline

Outline

• Motivations
  – Why do we bring collective communications to big data processing?
• Collective Communication Abstractions
  – Our approach to optimizing data movement
  – Hierarchical data abstractions and the operations defined on top of them
• MapCollective Programming Model
  – Extended from the MapReduce model to support collective communications
  – Two-level BSP parallelism
• Harp Implementation
  – A plugin on Hadoop
  – Component layers and the job flow
• Experiments
• Conclusion

Page 3: Motivation

Motivation

K-means Clustering in (Iterative) MapReduce vs. K-means Clustering with Collective Communication

[Diagram: in iterative MapReduce, the Map tasks (M) compute the local point sums, the Reduce tasks (R) compute the global centroids, and every iteration needs a shuffle, a gather of the reduce outputs, and a broadcast of the new centroids. With collective communication, the Map tasks control the iterations, compute the local point sums, and combine them directly with an allreduce.]

More efficient and much simpler!
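To make the pattern concrete, below is a minimal single-process sketch (in Java, the language Harp itself is written in) of the allreduce-style K-means step described above. Each "task" computes local per-centroid point sums, the partial sums are merged the way an allreduce would merge them across tasks, and the new centroids are derived from the merged sums. The class and variable names are illustrative, not Harp code.

import java.util.Arrays;
import java.util.Random;

// Sketch of the allreduce-based K-means step: local sums per centroid,
// merged globally, then new centroids derived from the merged sums.
public class KMeansAllreduceSketch {
  public static void main(String[] args) {
    int tasks = 4, k = 3, dim = 2, pointsPerTask = 100;
    Random rnd = new Random(42);
    double[][] centroids = {{0, 0}, {5, 5}, {10, 0}};
    double[][] sum = new double[k][dim]; // merged, as after an allreduce
    int[] count = new int[k];

    for (int t = 0; t < tasks; t++) {            // each Map task...
      for (int p = 0; p < pointsPerTask; p++) {
        double[] x = {rnd.nextDouble() * 10, rnd.nextDouble() * 10};
        int best = 0;                            // find the nearest centroid
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = 0;
          for (int j = 0; j < dim; j++)
            d += (x[j] - centroids[c][j]) * (x[j] - centroids[c][j]);
          if (d < bestDist) { bestDist = d; best = c; }
        }
        // ...accumulates its local sums; an allreduce would combine the
        // per-task partial sums so every task holds the global ones.
        for (int j = 0; j < dim; j++) sum[best][j] += x[j];
        count[best]++;
      }
    }
    for (int c = 0; c < k; c++)                  // compute the new centroids
      for (int j = 0; j < dim; j++)
        if (count[c] > 0) centroids[c][j] = sum[c][j] / count[c];
    System.out.println(Arrays.deepToString(centroids));
  }
}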

Page 4: Large Scale Data Analysis Applications

Large Scale Data Analysis Applications

• Iterative Applications
  – Cached and reused local data between iterations
  – Complicated computation steps
  – Large intermediate data in communications
  – Various communication patterns

Examples: Computer Vision, Complex Networks, Bioinformatics, Deep Learning

Page 5: The Models of Contemporary Big Data Tools

The Models of Contemporary Big Data Tools

[Diagram: a taxonomy of contemporary big data tools by model (MapReduce Model, DAG Model, Graph Model, BSP/Collective Model) and by purpose (for iterations/learning, for streaming, for query): Hadoop, HaLoop, Twister, Spark, Dryad, DryadLINQ, Stratosphere/Flink, Pig, Hive, Tez, MRQL, Spark SQL, Storm, S4, Samza, Spark Streaming, Giraph, Hama, GraphLab, GraphX, Harp.]

Many of them have fixed communication patterns!

Page 6: Contributions

Contributions

[Diagram, parallelism model: the MapReduce model connects Map tasks (M) to Reduce tasks (R) through a shuffle, while the MapCollective model connects the Map tasks directly through collective communication.]

[Diagram, architecture: YARN is the resource manager; MapReduce V2 and Harp sit above it at the framework layer; MapReduce applications run on MapReduce V2 and MapCollective applications run on Harp at the application layer.]

Page 7: Collective Communication Abstractions

Collective Communication Abstractions

• Hierarchical Data Abstractions
  – Basic types: arrays, key-values, vertices, edges and messages
  – Partitions: array partitions, key-value partitions, vertex partitions, edge partitions and message partitions
  – Tables: array tables, key-value tables, vertex tables, edge tables and message tables
• Collective Communication Operations
  – Broadcast, allgather, allreduce
  – Regroup
  – Send messages to vertices, send edges to vertices

Page 8: Hierarchical Data Abstractions

Hierarchical Data Abstractions

[Diagram: three layers of abstraction. Basic types (byte arrays, int arrays, long arrays, double arrays, key-values, and vertices, edges and messages, all rooted in Object and implementing Transferable) are grouped into partitions (array partition <array type>, key-value partition, vertex partition, edge partition, message partition), which support broadcast and send. Partitions are grouped into tables (array table <array type>, key-value table, vertex table, edge table, message table), which support broadcast, allgather, allreduce, regroup, message-to-vertex and other operations.]
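The layering can be modeled in a few lines of Java. The sketch below is an illustrative model of the table -> partition -> basic type hierarchy with simplified names; it is not Harp's actual class hierarchy.

import java.util.HashMap;
import java.util.Map;

// Illustrative model of the hierarchy: a table holds partitions by ID,
// and each partition wraps a basic type (here, a double array).
public class HierarchySketch {
  static class DoubleArrayPartition {
    final int id;
    final double[] array;
    DoubleArrayPartition(int id, double[] array) { this.id = id; this.array = array; }
  }

  static class DoubleArrayTable {
    // Collective operations (broadcast, allgather, allreduce, regroup)
    // are defined over tables of partitions like this one.
    final Map<Integer, DoubleArrayPartition> partitions = new HashMap<>();
    void addPartition(DoubleArrayPartition p) { partitions.put(p.id, p); }
  }

  public static void main(String[] args) {
    DoubleArrayTable table = new DoubleArrayTable();
    table.addPartition(new DoubleArrayPartition(0, new double[] {1.0, 2.0}));
    table.addPartition(new DoubleArrayPartition(1, new double[] {3.0, 4.0}));
    System.out.println(table.partitions.keySet()); // prints [0, 1]
  }
}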

Page 9: Example: regroup

Example: regroup

[Diagram: three processes (Process 0, 1, 2) each hold a table of partitions, with Partitions 0 through 4 scattered across the processes and some partition IDs present on more than one process. The regroup operation redistributes the partitions so that all partitions with the same ID end up in the table of the same process.]
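A serial sketch of the effect of regroup: partitions scattered across processes are rerouted by partition ID so that all copies of an ID land on the same process. The "id % numProcesses" destination rule is an illustrative choice, not necessarily the one Harp uses.

import java.util.ArrayList;
import java.util.List;

// Serial sketch of regroup: route every partition to the process
// chosen by its ID, so same-ID partitions end up together.
public class RegroupSketch {
  public static void main(String[] args) {
    int numProcesses = 3;
    // before.get(p) = partition IDs held by process p before regroup
    List<List<Integer>> before = List.of(
        List.of(0, 1), List.of(2, 3, 0), List.of(4, 2));

    List<List<Integer>> after = new ArrayList<>();
    for (int p = 0; p < numProcesses; p++) after.add(new ArrayList<>());

    for (List<Integer> held : before)
      for (int id : held)
        after.get(id % numProcesses).add(id); // route by partition ID

    System.out.println(after); // prints [[0, 3, 0], [1, 4], [2, 2]]
  }
}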

Page 10: Operations

Operations

Operation Name              Data Abstraction                     Algorithm
broadcast                   arrays, key-value pairs & vertices   chain
allgather                   arrays, key-value pairs & vertices   bucket
allreduce                   arrays, key-value pairs              bi-directional exchange; regroup-allgather
regroup                     arrays, key-value pairs & vertices   point-to-point direct sending
send messages to vertices   messages, vertices                   point-to-point direct sending
send edges to vertices      edges, vertices                      point-to-point direct sending
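Of the algorithms above, the bi-directional exchange used for allreduce is worth a closer look. The sketch below is a serial simulation of recursive doubling, a standard form of bi-directional exchange, assuming the process count is a power of two: in each of log2(p) rounds, process i exchanges data with process i XOR step and both sides combine, so after the last round every process holds the global sum.

import java.util.Arrays;

// Serial simulation of bi-directional exchange (recursive doubling)
// for allreduce, assuming p is a power of two.
public class BiDirectionalExchangeSketch {
  public static void main(String[] args) {
    int p = 8; // number of processes
    double[][] data = new double[p][];
    for (int i = 0; i < p; i++) data[i] = new double[] {i, 2.0 * i};

    for (int step = 1; step < p; step <<= 1) {   // log2(p) rounds
      double[][] next = new double[p][];
      for (int i = 0; i < p; i++) {
        int partner = i ^ step;                  // exchange partner this round
        double[] merged = data[i].clone();
        for (int d = 0; d < merged.length; d++) merged[d] += data[partner][d];
        next[i] = merged;                        // both sides combine
      }
      data = next;
    }
    System.out.println(Arrays.toString(data[0])); // prints [28.0, 56.0]
  }
}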

Page 11: MapCollective Programming Model

MapCollective Programming Model

• BSP parallelism
  – Inter-node parallelism and intra-node parallelism

[Diagram: synchronization happens at the process level across nodes and at the thread level within each process.]
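The thread level of this two-level parallelism can be sketched with plain Java threads: threads inside one process each compute a partial result and then meet at a barrier before the next superstep, while across processes the matching synchronization point would be a collective operation. This is a generic BSP illustration, not Harp's task code.

import java.util.concurrent.CyclicBarrier;

// Sketch of the thread level of two-level BSP: threads compute, then
// synchronize at a barrier; across processes the synchronization
// point would be a collective operation such as allreduce.
public class TwoLevelBspSketch {
  static final int THREADS = 4;
  static final double[] partial = new double[THREADS];

  public static void main(String[] args) {
    CyclicBarrier barrier = new CyclicBarrier(THREADS, () -> {
      double sum = 0;
      for (double v : partial) sum += v;
      System.out.println("superstep done, local sum = " + sum);
    });
    for (int t = 0; t < THREADS; t++) {
      final int id = t;
      new Thread(() -> {
        partial[id] = id + 1;            // thread-level computation
        try { barrier.await(); }         // thread-level synchronization
        catch (Exception e) { throw new RuntimeException(e); }
      }).start();
    }
  }
}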

Page 12: The Harp Library

The Harp Library

• A Hadoop plugin that targets Hadoop 2.2.0
• Provides an implementation of the collective communication abstractions and the MapCollective programming model
• Project Link
  – http://salsaproj.indiana.edu/harp/index.html
• Source Code Link
  – https://github.com/jessezbj/harp-project

Page 13: Component Layers

Component Layers

[Diagram: the overall stack places YARN at the bottom, MapReduce V2 and Harp above it, and MapReduce applications and MapCollective applications on top. Inside Harp, the layers build up from MapReduce through the collective communication abstractions (collective communication APIs, collective communication operators, hierarchical data types for tables and partitions, a memory resource pool, and array, key-value and graph data abstractions) to the MapCollective programming model (the MapCollective interface and task management), with applications such as K-Means, WDA-SMACOF and Graph-Drawing on top.]

Page 14: A MapCollective Job

A MapCollective Job

[Diagram: the client submits a MapCollective job through the MapCollective Runner to the YARN Resource Manager. I. The runner launches the MapCollectiveAppMaster, which 1. records the Map task locations from the original MapReduce AppMaster. II. The MapCollectiveContainerAllocator and MapCollectiveContainerLauncher launch the tasks. Each task runs a CollectiveMapper through its setup, mapCollective and cleanup phases, which 2. reads key-value pairs, 3. invokes collective communication APIs, and 4. writes output to HDFS.]
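Putting the flow together, the body of a task might look like the skeleton below. The class and phase names (CollectiveMapper, setup, mapCollective, cleanup) come from the slide; the generic parameters, the KeyValReader type, and the allreduce call shown in the comments are assumptions for illustration, not verbatim Harp API.

import java.io.IOException;

// Skeleton of a MapCollective task following steps 2-4 of the diagram.
// CollectiveMapper and KeyValReader are assumed to be provided by Harp;
// their exact signatures here are illustrative.
public class MyMapCollectiveTask
    extends CollectiveMapper<String, String, Object, Object> {

  @Override
  protected void setup(Context context) {
    // Read configuration and allocate tables before the iterations.
  }

  @Override
  protected void mapCollective(KeyValReader reader, Context context)
      throws IOException, InterruptedException {
    // 2. Read the key-value pairs assigned to this task.
    while (reader.nextKeyValue()) {
      String key = reader.getCurrentKey();
      String value = reader.getCurrentValue();
      // ... load local data into cached tables ...
    }
    for (int iter = 0; iter < 10; iter++) {
      // ... local computation on the cached data ...
      // 3. Invoke a collective communication API, e.g.:
      //    allreduce("kmeans", "allreduce-" + iter, table);
    }
    // 4. Write the output to HDFS.
  }

  @Override
  protected void cleanup(Context context) {
    // Release tables and pooled memory.
  }
}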

Page 15: Experiments

Experiments

• Applications
  – K-means Clustering
  – Force-directed Graph Drawing Algorithm
  – WDA-SMACOF
• Test Environment
  – Big Red II
    • http://kb.iu.edu/data/bcqt.html

Page 16: K-means Clustering

K-means Clustering

[Diagram: the Map tasks allreduce the centroids in each iteration.]

[Chart: execution time (seconds) and speedup vs. number of nodes for two test cases: 500M points with 10K centroids and 5M points with 1M centroids.]

Page 17: Force-directed Graph Drawing Algorithm

Force-directed Graph Drawing Algorithm

T. Fruchterman, M. Reingold. “Graph Drawing by Force-Directed Placement”, Software Practice & Experience 21 (11), 1991.

[Diagram: the Map tasks allgather the positions of the vertices in each iteration.]

[Chart: execution time (seconds) and speedup vs. number of nodes.]

Page 18: WDA-SMACOF

WDA-SMACOF

Y. Ruan et al. “A Robust and Scalable Solution for Interpolative Multidimensional Scaling With Weighting”. E-Science, 2013.

[Diagram: in each iteration, the Map tasks allreduce the stress value, and allgather and allreduce results in the conjugate gradient process.]

[Chart: execution time (seconds) vs. number of nodes for 100K, 200K, 300K and 400K points, and speedup vs. number of nodes for 100K, 200K and 300K points.]

Page 19: Conclusions

Conclusions

• Harp is implemented as a plugin that brings high performance to the Apache Big Data Stack and bridges the differences between the Hadoop ecosystem and HPC systems through a clear communication abstraction, which did not previously exist in the Hadoop ecosystem.
• The experiments show that with Harp we can scale three applications to 128 nodes with 4096 CPUs on the Big Red II supercomputer, where the speedup in most tests is close to linear.