Mars: A MapReduce Framework on Graphics Processors Bingsheng He 1 , Wenbin Fang, Qiong Luo Hong Kong Univ. of Sci. and Tech. Naga K. Govindaraju Tuyong Wang Microsoft Corp. Sina Corp. 1, Currently in Microsoft Research Asia

Dec 22, 2015

Mars: A MapReduce Framework on Graphics Processors

Bingsheng He 1, Wenbin Fang, Qiong Luo
Hong Kong Univ. of Sci. and Tech.

Naga K. Govindaraju (Microsoft Corp.), Tuyong Wang (Sina Corp.)

1 Currently at Microsoft Research Asia

Overview

• Motivation

• Design

• Implementation

• Evaluation

• Conclusion

Graphics Processing Units

• Massively multi-threaded co-processors
  – 240 streaming processors on NV GTX 280
  – ~1 TFLOPS of peak performance

• High bandwidth memory
  – 10+x more than the peak bandwidth of the main memory
  – 142 GB/s, 1 GB GDDR3 memory on GTX 280

Graphics Processing Units (Cont.)

• High latency GDDR memory
  – 200 clock cycles of latency
  – Latency hiding using a large number of concurrent threads (>8K)
  – Low context-switch overhead

• Better architectural support for memory
  – Inter-processor communication using a local memory
  – Coalesced access

• High speed bus with the main memory
  – Current: PCI Express (4 GB/sec)

GPGPU

• Linear algebra [Larsen 01, Fatahalian 04, Galoppo 05]

• FFT [Moreland 03, Horn 06]

• Matrix operations [Jiang 05]

• Folding@home, Seti@home

• Database applications
  – Basic operators [Naga 04]
  – Sorting [Govindaraju 06]
  – Join [He 08]

GPGPU Programming

• “Assembly languages”
  – DirectX, OpenGL

• Graphics rendering pipelines

• “C/C++”
  – NVIDIA CUDA, ATI CAL or Brook+

• Different programming models

• Low portability among different hardware vendors
  – NV GPU code cannot run on AMD GPU

• “Functional language”?

MapReduce

Without worrying about hardware details:

• Make GPGPU programming much easier.

• Effectively harness the high parallelism and high computational capability of GPUs.

MapReduce Functions

• Process lots of data to produce other data

• Input & Output: a set of records in the form of key/value pairs

• Programmer specifies two functions
  – map (in_key, in_value) -> emit list(intermediate_key, intermediate_value)
  – reduce (out_key, list(intermediate_value)) -> emit list(out_key, out_value)
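The two signatures above can be illustrated with a small word-count sketch in Python. It is not Mars code: `run_mapreduce` is a hypothetical sequential stand-in for a MapReduce runtime, and the function names and sample records are assumptions chosen only for illustration.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list of (intermediate_key, intermediate_value)."""
    return [(word, 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce(out_key, list(intermediate_value)) -> list of (out_key, out_value)."""
    return [(out_key, sum(intermediate_values))]


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical sequential runtime: map every record, group by key, reduce."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for key, value in map_fn(in_key, in_value):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorted only to make the output order deterministic
        output.extend(reduce_fn(key, groups[key]))
    return output


# Input records are (line number, line text) pairs.
records = [(0, "gpu map reduce"), (1, "map gpu map")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only `map_fn` and `reduce_fn`; grouping and scheduling belong to the runtime, which is the part a framework such as Mars would run on the GPU.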

MapReduce Workflow

[Figure: MapReduce workflow. From http://labs.google.com/papers/mapreduce.html]

MapReduce outside Google

• Hadoop [Apache project]

• MapReduce on multicore CPUs -- Phoenix [HPCA'07, Ranger et al.]

• MapReduce on Cell [07, Kruijf et al.]

• Merge [ASPLOS '08, Linderman et al.]

• MapReduce on GPUs [STMCS '08, Catanzaro et al.]

Overview

• Motivation

• Design

• Implementation

• Evaluation

• Conclusion

MapReduce on GPU

Software stack, from top to bottom:

• Web Analysis, Data Mining (applications)

• MapReduce (Mars)

• GPGPU languages (CUDA, Brook+), Rendering APIs (DirectX)

• GPU Drivers

MapReduce on Multi-core CPU (Phoenix [HPCA'07])

Input -> Split -> Map -> Partition -> Reduce -> Merge -> Output

Limitations on GPUs

• Rely on the CPU to allocate memory
  – How to support variable-length data?
  – How to allocate the output buffer on GPUs?

• Lack of lock support
  – How to synchronize to avoid write conflicts?

Data Structure for Mars

A Record = <Key, Value, Index entry>

Keys: Key1 Key2 Key3 …

Values: Value1 Value2 Value3 …

Index entries: Index entry1 Index entry2 Index entry3 …

An index entry = <key size, key offset, val size, val offset>

Supports variable-length records!
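A minimal Python sketch of this layout, assuming keys and values are byte strings packed back to back into two flat buffers. The names `IndexEntry`, `pack_records`, and `get_record` are hypothetical and not taken from Mars.

```python
from typing import List, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    """<key size, key offset, val size, val offset>, as on the slide."""
    key_size: int
    key_offset: int
    val_size: int
    val_offset: int


def pack_records(records: List[Tuple[bytes, bytes]]):
    """Pack variable-length records into a key buffer, a value buffer, and an index."""
    keys = bytearray()
    vals = bytearray()
    index = []
    for key, val in records:
        # The current buffer lengths are the offsets where this record starts.
        index.append(IndexEntry(len(key), len(keys), len(val), len(vals)))
        keys += key
        vals += val
    return bytes(keys), bytes(vals), index


def get_record(keys: bytes, vals: bytes, entry: IndexEntry):
    """Recover one record from the flat buffers using its index entry."""
    key = keys[entry.key_offset:entry.key_offset + entry.key_size]
    val = vals[entry.val_offset:entry.val_offset + entry.val_size]
    return key, val


keys, vals, index = pack_records([(b"a", b"12"), (b"url", b"7"), (b"gpu", b"345")])
second = get_record(keys, vals, index[1])
print(index[1], second)
```

Because every index entry has the same fixed size, a thread can locate record i directly even though the keys and values themselves differ in length.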

Lock-free scheme for result output

Basic idea:

Calculate the offset for each thread on the output buffer.

Lock-free scheme example

Pick out the odd numbers from the array [1, 3, 2, 3, 4, 7, 9, 8].

Use the map function as a filter that keeps all odd numbers.

Lock-free scheme example

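Input array: [ 1, 3, 2, 3, 4, 7, 9, 8 ]

Elements per thread: T1: 1, 4   T2: 3, 7   T3: 2, 9   T4: 3, 8

Step1: Histogram (odd numbers per thread): T1 = 1, T2 = 2, T3 = 1, T4 = 1

Step2: Prefix sum (output offset per thread): T1 = 0, T2 = 1, T3 = 3, T4 = 4 (total: 5)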

Lock-free scheme example

Input array: [ 1, 3, 2, 3, 4, 7, 9, 8 ]

Histogram: T1 = 1, T2 = 2, T3 = 1, T4 = 1 (total: 5)

Step3: Allocate an output buffer of 5 slots, owned in order by T1, T2, T2, T3, T4

Lock-free scheme example

Input array: [ 1, 3, 2, 3, 4, 7, 9, 8 ]

Prefix sum: T1 = 0, T2 = 1, T3 = 3, T4 = 4

Step4: Computation. Each thread writes its odd numbers starting at its own offset: [ 1, 3, 7, 9, 3 ]

Lock-free scheme

1. Histogram on key size, value size, and record count.
2. Prefix sum on key size, value size, and record count.
3. Allocate output buffer on GPU memory.
4. Perform computing.

Avoid write conflict.
Allocate output buffer exactly once.
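The four steps can be sketched sequentially in Python for the odd-number filter. This is an illustration, not the Mars implementation: the loop over `t` stands in for GPU threads, and the strided assignment of elements to threads is an assumption, chosen because it reproduces the histogram 1 2 1 1 and prefix sum 0 1 3 4 of the example. The name `lock_free_filter` is hypothetical.

```python
from itertools import accumulate


def lock_free_filter(data, num_threads, keep):
    """Filter `data` so that each simulated thread writes to a disjoint output range."""
    # Assumed assignment: thread t handles elements t, t + num_threads, ...
    chunks = [data[t::num_threads] for t in range(num_threads)]

    # Step 1: histogram, i.e. how many results each thread will emit.
    histogram = [sum(1 for x in chunk if keep(x)) for chunk in chunks]

    # Step 2: exclusive prefix sum gives each thread its write offset.
    offsets = [0] + list(accumulate(histogram))[:-1]

    # Step 3: allocate the output buffer exactly once.
    output = [None] * sum(histogram)

    # Step 4: each thread writes only inside its own range, so no locks are needed.
    for t, chunk in enumerate(chunks):
        pos = offsets[t]
        for x in chunk:
            if keep(x):
                output[pos] = x
                pos += 1
    return histogram, offsets, output


histogram, offsets, output = lock_free_filter(
    [1, 3, 2, 3, 4, 7, 9, 8], 4, lambda x: x % 2 == 1
)
print(histogram, offsets, output)
```

The cost of this design is that the filter predicate is evaluated twice, once to count and once to write, in exchange for conflict-free writes and a single allocation.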

Mars Workflow

Input
-> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU
-> Map -> Sort and Group
-> ReduceCount -> Prefix sum -> Allocate output buffer on GPU
-> Reduce
-> Output
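A sequential Python sketch of this workflow, counting page hits per URL. The stage names follow the slide, but every function signature here is a hypothetical stand-in rather than the Mars API, and plain Python lists play the role of GPU buffers.

```python
from itertools import accumulate, groupby


def exclusive_prefix_sum(counts):
    """Offsets such that item i starts where items 0..i-1 end."""
    return [0] + list(accumulate(counts))[:-1] if counts else []


# Hypothetical user-defined stages for a hit counter.
def map_count(record):
    return len(record.split())                      # pairs Map will emit


def map_emit(record):
    return [(url, 1) for url in record.split()]


def reduce_count(key, values):
    return 1                                        # pairs Reduce will emit


def reduce_emit(key, values):
    return [(key, sum(values))]


def mars_like_workflow(inputs):
    # MapCount -> Prefix sum -> Allocate intermediate buffer
    offsets = exclusive_prefix_sum([map_count(r) for r in inputs])
    intermediate = [None] * sum(map_count(r) for r in inputs)
    # Map: each record writes at its precomputed offset
    for record, offset in zip(inputs, offsets):
        for i, pair in enumerate(map_emit(record)):
            intermediate[offset + i] = pair
    # Sort and Group
    intermediate.sort(key=lambda kv: kv[0])
    groups = [(k, [v for _, v in g])
              for k, g in groupby(intermediate, key=lambda kv: kv[0])]
    # ReduceCount -> Prefix sum -> Allocate output buffer
    out_counts = [reduce_count(k, vs) for k, vs in groups]
    out_offsets = exclusive_prefix_sum(out_counts)
    output = [None] * sum(out_counts)
    # Reduce: each group writes at its precomputed offset
    for (key, values), offset in zip(groups, out_offsets):
        for i, pair in enumerate(reduce_emit(key, values)):
            output[offset + i] = pair
    return output


result = mars_like_workflow(["/home /about", "/home", "/docs /home"])
print(result)
```

Each counting stage exists only to size a buffer before the matching emitting stage runs, which is why both Map and Reduce are preceded by a count and a prefix sum.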

Mars Workflow – Map Only

Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU -> Map -> Output

Map only, without grouping and reduce

Mars Workflow – Without Reduce

Input -> MapCount -> Prefix sum -> Allocate intermediate buffer on GPU -> Map -> Sort and Group -> Output

Map and grouping, without reduce

APIs of Mars

User-defined:

• MapCount
• Map
• Compare (optional)
• ReduceCount (optional)
• Reduce (optional)

Runtime provided:

• AddMapInput
• MapReduce
• EmitInterCount
• EmitIntermediate
• EmitCount (optional)
• Emit (optional)

Overview

• Motivation

• Design

• Implementation

• Evaluation

• Conclusion

Mars-GPU

• NVIDIA CUDA

• Each map instance or reduce instance is a GPU thread.

Mars-CPU

• Operating system’s thread APIs

• Each map instance or reduce instance is a CPU thread.

Optimization According to CUDA features

• Coalesced access
  – Multiple accesses to consecutive memory addresses are combined into one transfer.

• Built-in vector types (int4, char4, etc.)
  – Multiple small data items are fetched in one memory request.
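Why the access pattern matters can be shown with a deliberately simplified Python model that counts how many aligned memory segments a group of threads touches in one step. The 64-byte segment size, the 4-byte element size, and the helper names are assumptions for illustration only; the actual coalescing rules depend on the GPU generation.

```python
ELEMENT_BYTES = 4    # assumed size of one data item
SEGMENT_BYTES = 64   # assumed size of one memory transfer


def segments_touched(indices):
    """Number of distinct aligned segments covering the given element indices."""
    return len({(i * ELEMENT_BYTES) // SEGMENT_BYTES for i in indices})


def access_pattern(num_threads, items_per_thread, step, strided):
    """Element indices read by all threads at one loop step."""
    if strided:
        # Neighbouring threads read neighbouring elements (coalescing-friendly).
        return [step * num_threads + t for t in range(num_threads)]
    # Each thread walks its own contiguous block, so threads are far apart.
    return [t * items_per_thread + step for t in range(num_threads)]


coalesced = segments_touched(access_pattern(16, 16, 0, strided=True))
scattered = segments_touched(access_pattern(16, 16, 0, strided=False))
print(coalesced, scattered)
```

In this model the strided pattern needs one transfer per step while the blocked pattern needs one per thread. Vector types such as int4 help in the same direction by letting a single thread fetch several small items in one request.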

Overview

• Motivation

• Design

• Implementation

• Evaluation

• Conclusion

Experimental Setup

• Comparison
  – CPU: Phoenix, Mars-CPU
  – GPU: Mars-GPU

                      CPU (P4 Quad)    GPU (NV GTX8800)
Processors (Hz)       2.66G*4          1.35G*128
Cache size            8MB              256KB
Bandwidth (GB/sec)    10.4             86.4
OS                    Fedora Core 7.0

Applications

• String Match (SM): Find the position of a string in a file. [S: 32MB, M: 64MB, L: 128MB]

• Inverted Index (II): Build inverted index for links in HTML files. [S: 16MB, M: 32MB, L: 64MB]

• Similarity Score (SS): Compute the pair-wise similarity score for a set of documents. [S: 512x128, M: 1024x128, L: 2048x128]

Applications (Cont.)

• Matrix Multiplication (MM): Multiply two matrices. [S: 512x512, M: 1024x1024, L: 2048x2048]

• Page View Rank (PVR): Find the top-10 hot pages in the web log. [S: 32MB, M: 64MB, L: 96MB]

• Page View Count (PVC): Count the number of distinct page views from web logs. [S: 32MB, M: 64MB, L: 96MB]

Effect of Coalesced Access

Coalesced access achieves a speedup of 1.2-2X

Effect of Built-In Data Types

Built-in data types achieve a speedup of up to 2 times

Time Breakdown of Mars-GPU

The GPU accelerates computation in MapReduce

Mars-GPU vs. Phoenix on Quad-core CPU

The speedup is 1.5-16 times with various data sizes

Mars-GPU vs. Mars-CPU

The GPU accelerates MapReduce up to 7 times

Mars-CPU vs. Phoenix

Mars-CPU is 1-5 times as fast as Phoenix

Overview

• Motivation

• Design

• Implementation

• Evaluation

• Conclusion

Conclusion

• MapReduce framework on GPUs
  – Ease of GPU application development
  – Performance acceleration

• Want a Copy of Mars? http://www.cse.ust.hk/gpuqp/Mars.html

Discussion

• A uniform co-processing framework between the CPU and the GPU

• High performance computation routines
  – Index serving
  – Data mining (on-going)

• Power consumption benchmarking of the GPU
  – The GPU is a test bed for the future CPU.

• …