
Master's Thesis

석사 학위논문

MapReduce Architecture for a Single Computing

Node of Multiprocessors

Hyochan Song(송 효 찬 宋 効 燦)

Department of Information and Communication Engineering

정보통신융합전공

DGIST

2013


MapReduce Architecture for a Single Computing Node of

Multiprocessors

Advisor : Professor Min-Soo Kim

Co-advisor : Professor Byungchan Han

By

Hyochan Song

Department of

Information and Communication Engineering

DGIST

A thesis submitted to the faculty of DGIST in partial fulfillment of the requirements for the degree of Master of Science, in the Department of Information and Communication Engineering. The study was conducted in accordance with the Code of Research Ethics¹.

11. 15. 2012

Approved by

Professor Min-Soo Kim ( Signature )

(Advisor)

Professor Byungchan Han( Signature )

(Co-Advisor)

¹ Declaration of Ethical Conduct in Research: I, as a graduate student of DGIST, hereby declare that I have not committed any acts that may damage the credibility of my research. These include, but are not limited to: falsification, thesis written by someone else, distortion of research findings, or plagiarism. I affirm that my thesis contains honest conclusions based on my own careful research under the guidance of my thesis advisor.


MapReduce Architecture for a Single Computing Node of

Multiprocessors

Hyochan Song

Accepted in partial fulfillment of the requirements for

the degree of Master of Science.

12. 05. 2012

Head of Committee 김 민 수 (Seal)

Prof. Min-Soo Kim

Committee Member 한 병 찬 (Seal)

Prof. Byungchan Han

Committee Member 장 재 은 (Seal)

Prof. Jae Eun Jang

MS 201142004

송 효 찬. Hyochan Song. MapReduce Architecture for a Single Computing Node of Multiprocessors. Department of Information and Communication Engineering. 2012. 41p. Advisor: Prof. Min-Soo Kim. Co-Advisor: Prof. Byungchan Han.

ABSTRACT

Recently, the paradigm of CPU micro-architecture design is shifting to on-chip multi-core processors and, moreover, to many-core coprocessors for general computing such as NVIDIA's Tesla and Intel's Xeon Phi. Meanwhile, the MapReduce framework has been extensively used and studied for big data analysis; it typically runs on a large cluster of cheap commodity nodes. We propose a new MapReduce framework called Hybrid-core based big Data (Real-time) Analysis (HYDRA) that regards a single node equipped with both multi-core CPUs and many-core GPUs as a cluster of nodes, where a single processor plays the role of a single node. By fully exploiting the computing power of modern heterogeneous-core systems, HYDRA can achieve performance comparable to a small-scale cluster of nodes. In particular, HYDRA is based on the shared-memory architecture and therefore has no cost of transferring data over the network in the shuffle step of MapReduce, whereas conventional MapReduce can incur a large cost in that step depending on the kind of task. Under the proposed framework, we propose two strategies, "Processor As A Node" (PAAN) and "GPU Mapper CPU Reducer" (GMCR). PAAN treats a multiprocessor of either CPU or GPU as a node in the same way. GMCR, on the other hand, treats GPUs as mapper nodes only and CPUs as reducer nodes only. The proposed strategies tackle challenging issues such as how to make the two types of processors (i.e., CPUs and GPUs) cooperate, how to manage the different memory hierarchies of those types, and how to minimize the data communication overhead between CPUs and GPUs. Extensive experimental results show that HYDRA outperforms conventional MapReduce on a cluster of eight commodity nodes by up to more than 14 times.

Keywords: MapReduce, Heterogeneous computing, GPGPU, multicore, manycore


Contents

Abstract
List of contents
List of tables
List of figures

Ⅰ. INTRODUCTION

Ⅱ. BACKGROUND
  2.1 Notations
  2.2 MapReduce
  2.3 General Purpose computing on Graphics Processing Unit (GPGPU)

Ⅲ. THE HYDRA SYSTEM
  3.1 System Overview
  3.2 Processor As A Node (PAAN)
    3.2.1 Strategy Architecture
    3.2.2 CPU Node Workflow
    3.2.3 GPU Node Workflow
  3.3 GPU Mapper CPU Reducer (GMCR)
    3.3.1 Strategy Architecture
    3.3.2 GPU Mapper Workflow
    3.3.3 CPU Reducer Workflow

Ⅳ. EVALUATION
  4.1 Experimental Setup
  4.2 Application – Word Count
  4.3 Performance Evaluation

Ⅴ. RELATED WORK
  5.1 MapReduce Framework with the CPU
  5.2 MapReduce Framework with the Accelerators
  5.3 Programming Tools for the GPGPU

Ⅵ. CONCLUSIONS


List of tables

Table 1. Notations


List of figures

Figure 1. The manycore architecture of GPU
Figure 2. Heterogeneous-core multiprocessor system
Figure 3. Shared-nothing model
Figure 4. PAAN strategy architecture
Figure 5. The map stage on the GPU
Figure 6. The data flow of PAAN
Figure 7. The simple example of PAAN
Figure 8. GMCR strategy architecture
Figure 9. The pipelining methods of GMCR
Figure 10. The data flow of GMCR
Figure 11. The simple example of GMCR
Figure 12. The execution times of word count
Figure 13. The speedup of word count
Figure 14. The execution times of page view count
Figure 15. The speedup of page view count


Ⅰ. INTRODUCTION

Recently, highly parallel architectures have enabled general-purpose computation on exa-scale databases. In particular, highly parallel Graphics Processing Units (GPUs) have enabled high performance in general-purpose computation. Meanwhile, the typical computer architecture is becoming a heterogeneous-core multiprocessor system that has both multicore CPUs and manycore GPUs. Additionally, the memory structure is a hierarchical shared-memory model that includes the main memory and the GPU device memory.

Traditionally, CPU technology developed by increasing the clock frequency (speed). For example, the IBM POWER6 CPU released in 2007 set the world's best at 5 GHz. However, increasing the clock frequency brought serious complications such as high power consumption and heat radiation; this is called the power wall. Even though transistor density kept increasing, the CPU clock frequency could not improve due to the power wall. Thus, vendors evolved a new CPU technology that puts two or more cores on a single die. The clock frequency has stayed flat since the 2000s, but as the number of cores (transistors) increases, the theoretical maximum performance of the CPU (peak FLOPS) has increased steadily.

There was a big change in the architecture of the GPU while the CPU architecture developed from single core into multi core. Traditionally, the GPU was for graphics processing; however, the GPU has become capable of general computing since it became manycore. This is called General-Purpose computing on Graphics Processing Units (GPGPU). A manycore processor usually consists of a set of about 500 to 2,000 cores, and general computing languages are available to develop GPGPU applications. GPGPU has been applied in various domains such as oil exploration, linear analysis, stock analysis, and fingerprint analysis, where executions are known to be tens to hundreds of times faster than on the CPU.

As the number of cores in a processor increases, a parallel programming model is definitely required to use all cores and to communicate among them. Traditional inter-process communication (IPC) methods, such as message passing, synchronization, shared memory, and remote procedure calls (RPC), are still utilized. However, these methods require expensive initial development costs, such as the developer manually managing computational resources. Developers also need to tune the source code when the application is ported to a system of a different scale.

The MapReduce framework is a prominent framework to support such data processing applications [1]. It was originally proposed by Google in 2004 to ease development on large-scale data sets on clusters of computers [2]. Therefore, we propose a MapReduce framework called Hybrid-core based big Data (Real-time) Analysis (HYDRA) for the heterogeneous-core multiprocessor system, so that programmers can easily obtain high performance for data processing.

The HYDRA system is designed to fully utilize the heterogeneous-core multiprocessor and hierarchical shared-memory system in a single computing node that includes both CPUs and GPUs. It has two main strategies, Processor As A Node (PAAN) and GPU Mapper CPU Reducer (GMCR). PAAN virtualizes each multiprocessor as a single node, while in GMCR the GPU is the mapper that produces intermediate data and the CPU is then the reducer that consumes the intermediate data into the output.

The organization of the paper is as follows. Section II presents the notation used in the paper, together with a detailed overview of MapReduce and GPGPU. Section III explains our implementation of the HYDRA system with its two strategies, PAAN and GMCR. Section IV evaluates performance on the applications we implemented. Finally, we present related work in Section V and conclude in Section VI.


II. BACKGROUND

2.1 Notations

In this section, we describe the notations used in this paper. First of all, main memory is equivalent to host memory, and device memory means the local memory in the GPU. In the figures, all solid lines express data or tasks on the CPU, and dotted lines indicate data or tasks on the GPU.

Table 1. Notations

  Symbol   Description
  M        map
  C&P      combine & partition
  IM       intermediate data
  S        sort
  R        reduce
  Op       partial output data
  HtoD     data communication from host to device
  DtoH     data communication from device to host

2.2 MapReduce

MapReduce is a software framework introduced by Google in 2004 to support distributed computing on large-scale data sets on clusters of computers. The framework provides two primitive operations: (a) a map function that processes input key/value pairs and generates intermediate key/value pairs, and (b) a reduce function that merges all intermediate pairs associated with the same key. Programmers implement their application logic using these two primitive functions, map and reduce. The MapReduce runtime then automatically executes the tasks on any machine. Thus, the framework reduces the complexity of parallel programming, so that developers can easily exploit parallelism for complex tasks.

The following pseudo code illustrates a program written using MapReduce. The problem of counting the number of occurrences of each word in a large collection of documents might be written as follows [2].

Algorithm 1: WordCount - Map

Input: key: document name; value: document contents
Output: list(w, "1")
Method:
1: for each word w in value:
2:     EmitIntermediate(w, "1");

Algorithm 2: WordCount - Reduce

Input: key: a word; values: a list of counts
Output: (key, result)
Method:
1: result ← 0;
2: for each v in values:
3:     result ← result + ParseInt(v);
4: Emit(AsString(result));
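The two pseudo-code functions above can be sketched concretely in C++. This is an illustrative host-side sketch under our own names (`wc_map`, `wc_reduce`), not HYDRA's implementation:

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Map: emit an intermediate pair (word, 1) for every word in the document.
std::vector<std::pair<std::string, int>> wc_map(const std::string& contents) {
    std::vector<std::pair<std::string, int>> intermediate;
    std::istringstream in(contents);
    std::string w;
    while (in >> w) intermediate.push_back({w, 1});
    return intermediate;
}

// Reduce: sum together all counts emitted for each word.
std::map<std::string, int>
wc_reduce(const std::vector<std::pair<std::string, int>>& intermediate) {
    std::map<std::string, int> result;
    for (const auto& kv : intermediate) result[kv.first] += kv.second;
    return result;
}
```

For the input "to be or not to be", `wc_reduce(wc_map(...))` yields be→2, not→1, or→1, to→2.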

The map function emits each word plus a count of occurrences. The reduce function sums together all counts emitted for a given word. More than ten thousand distinct programs have been implemented using MapReduce at Google, including algorithms for large-scale graph processing, text processing, data mining, machine learning, statistical machine translation, and many other areas. More discussion of specific applications of MapReduce can be found elsewhere [1, 3-9].


In particular, Apache Hadoop is an open-source software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion, and it enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers [10].

2.3 General Purpose computing on Graphics Processing Unit (GPGPU)

GPUs have developed from simple graphics devices into powerful coprocessors alongside the CPU. The GPU contains a large number of programmable cores, which are specialized for heavy workloads with high parallelism. Most computer systems also have at least one GPU. Figure 1 shows a simplified architecture of the GPU with the host CPU.

Figure 1. The manycore architecture of GPU: a host CPU with general-purpose cores, L1/L2/L3 cache, and main memory, connected to a GPU with streaming multiprocessors of accelerator cores, caches, and device memory.

GPGPU has recently emerged in various applications, such as matrix opera-

tions [11-13], machine learning [14, 15], bioinformatics [16-18], databases [19, 20], and


data mining [21, 22]. Such applications of the GPU have outperformed state-of-the-art multi-core CPUs. Hence, the GPU presents a new possibility for many research areas that require high computational capability. In particular, data processing is one area whose performance can be significantly increased. Recently, several GPGPU languages, including AMD CTM [23] and NVIDIA CUDA [24], have been proposed by GPU vendors. They usually expose a general-purpose, massively multithreaded parallel computing architecture and provide a programming environment similar to multi-threaded C/C++. NVIDIA CUDA has been used to implement our system.


III. THE HYDRA SYSTEM

3.1 System Overview

The modern computer system architecture is a heterogeneous-core multiprocessor system on a single computing node that includes both a multicore CPU and a manycore GPU. Hence, the HYDRA system implements MapReduce on this heterogeneous-core multiprocessor system, as shown in figure 2.

Figure 2. Heterogeneous-core multiprocessor system: CPUs 1..n with general-purpose cores and L1/L2/L3 caches attached to the main memory over an interconnect bus, and co-processors 1..m with accelerator cores and device memory attached over a PCI interface.

HYDRA aims to fully utilize the heterogeneous-core multiprocessor system. The hardware designs of the CPU and the GPU are extremely different, and HYDRA handles the cooperation between these different types of processors.

Since the GPU usually has its own memory, direct data communication between the CPU and the GPU is not possible; that is, the host and device memory construct a hierarchical shared-memory structure. This requires an additional task to transfer data through the PCI


interface. Even the device memory capacity is limited, so HYDRA must handle this complex memory structure. As a result, HYDRA faces several challenges:

(a) How to make two extremely different types of processors (CPUs and GPUs) cooperate

(b) How to manage the unique hierarchical memory structure

(c) How to minimize (hide) the data communication overhead

We propose two strategies, Processor As A Node (PAAN) and GPU Mapper CPU Reducer (GMCR), to solve these challenges and to guarantee high performance in the HYDRA runtime. PAAN virtualizes each multiprocessor as a single node (figure 4). In the other strategy, GMCR, the GPU is the mapper that produces intermediate data and the CPU is the reducer that consumes the intermediate data into the output (figure 8).

The detailed architectures and implementations of PAAN and GMCR are described in Sections 3.2 and 3.3, respectively.

3.2 Processor As A Node (PAAN)

3.2.1 Strategy Architecture

While the MapReduce programming model is originally a shared-nothing model (figure 3) for distributed computing on clusters of computers, the Processor As A Node (PAAN) strategy is a shared-memory model with a hierarchical memory structure. PAAN virtualizes each multiprocessor as a single node of the original MapReduce (figure 4).

HYDRA PAAN has two kinds of nodes, CPU nodes and GPU nodes, as well as one data queue. Specifically, a CPU node consists of one multicore CPU, its cache, and host memory, while a GPU node is made up of one manycore GPU, its cache, device memory, and host memory. The data queue must be a mutual-exclusion object to prevent duplicate processing of data. We protect the data queue with a mutex and a condition variable against race conditions among the threads.
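As a minimal sketch of the queue protection described above (the class name `DataQueue` and its interface are our own; HYDRA's actual code is not shown in the text):

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <utility>

// A mutex plus a condition variable guard the shared queue of input chunks:
// worker nodes pop chunks one at a time, so no chunk is processed twice, and
// consumers can wait until data arrives or the producer closes the queue.
class DataQueue {
public:
    void push(std::string chunk) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(chunk)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Returns std::nullopt once the queue is closed and drained.
    std::optional<std::string> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        std::string chunk = std::move(q_.front());
        q_.pop();
        return chunk;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};
```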

The CPU node repeatedly executes map tasks on chunks, which are small pieces of the data. The reduce task starts after all map tasks are completed; it is not repeated, so it can process data of a variable size. Since we cannot manage the GPU directly, a GPU node needs at least one CPU thread that operates the GPU device(s), for example for memory transfers and kernel launches. The GPU also has memory limitations in terms of size and dynamic allocation, so we need further work to handle those problems.

Figure 6 and figure 7 show the PAAN data flow and a simple PAAN example, respectively. The PAAN CPU node executes the map function until the data queue is empty. The GPU node mechanism is similar to the CPU node, but it requires additional steps for data communication. Figure 7 shows a simple example of the data flow, where the application counts different shapes and colors. We discuss the implementation of HYDRA PAAN in more detail in Sections 3.2.2 and 3.2.3.


Figure 3. Shared-nothing model: Node1..Nodei, each with its own CPU, cache, main memory, and disk, connected by an interconnect network.

Figure 4. PAAN strategy architecture: CPUs 1..n, each a node NodeC1..NodeCn with its own cache, and GPUs 1..m, where each streaming multiprocessor SM k is a node NodeG1k..NodeGmk with device memory, all sharing the main memory and disk through an interconnect bus and a PCI interface.

3.2.2 CPU Node Workflow

The workflow of PAAN MapReduce on a CPU node is straightforward. Each core has one thread that can be assigned a map or reduce task. Each map task (thread) launches the map function and then stores the intermediate data into a hash container. The original MapReduce stores key-value data with duplicated keys separately, as lists; since HYDRA is a shared-memory model, a hash container can be used for storing the intermediate data instead, which has the effect of reducing the size of the intermediate data.

Each map task has as many hash containers as there are reducers, to avoid additional shuffle cost. The reducer ID of each extracted key is computed by a simple function, and the key and value are stored in the appropriate hash container for that reducer ID. The map task repeats until the data queue is empty. The data size processed at a time depends on the L3 cache size and the number of cores in a multiprocessor.

The shuffle stage is therefore negligible: it is only necessary to pass the hash containers to each reducer. Each reducer merges all of the passed hash containers. Finally, the result needs to be integrated with the output from the GPU nodes.
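A minimal sketch of the per-reducer hash containers, assuming a hash-mod partitioning function (the names `MapTask`, `emit`, and `reduce_merge` are ours, not HYDRA's):

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Each map thread keeps one hash container per reducer. Combining counts in
// the hash container at emit time shrinks the intermediate data, and
// partitioning by reducer ID makes the shuffle a simple hand-off of
// containers.
struct MapTask {
    explicit MapTask(int num_reducers) : parts(num_reducers) {}

    // Reducer ID computed by a simple hash-mod function.
    int reducer_of(const std::string& key) const {
        return static_cast<int>(std::hash<std::string>{}(key) % parts.size());
    }

    void emit(const std::string& key, int count) {
        parts[reducer_of(key)][key] += count;  // combine on the fly
    }

    // One hash container per reducer, handed over as-is in the shuffle.
    std::vector<std::unordered_map<std::string, int>> parts;
};

// Reducer side: merge the containers passed from every map task.
std::unordered_map<std::string, int>
reduce_merge(const std::vector<std::unordered_map<std::string, int>>& passed) {
    std::unordered_map<std::string, int> out;
    for (const auto& h : passed)
        for (const auto& kv : h) out[kv.first] += kv.second;
    return out;
}
```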

3.2.3 GPU Node Workflow

The workflow of PAAN MapReduce on a GPU node is more complicated than that of a CPU node. We cannot access the GPU device directly, and the GPU does not support dynamic memory allocation in device memory, so at least one CPU thread (called the GPU master thread) has to manage GPU tasks such as transferring data between host and device memory, memory allocation/deallocation, and launching kernels. The GPU master thread is also responsible for the input/output data.

The input data is transferred into the device memory before the map stage starts, and the output data is transferred back to the host memory after the reduce stage completes.


Since dynamic memory allocation is not available in the device memory, the map stage of the GPU node is divided into counting keys, prefix sum, and mapper steps. The counting-keys step produces an array with one element per GPU thread, where each element holds the count of keys in the data given to that thread. The prefix-sum step finds each thread's index into the intermediate data in the device memory. Finally, the mapper step stores the keys into the intermediate-data array at the offsets given by the counting-keys array. Figure 5 shows the mechanism of the map stage.
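The three steps can be sketched as follows, simulated sequentially on the CPU for readability; on the GPU each loop body would run as one thread, and all names here are our own:

```cpp
#include <string>
#include <utility>
#include <vector>

// Three-step map stage: because the device cannot allocate memory
// dynamically, each "thread" first counts how many key/value pairs it will
// emit, a prefix sum turns the counts into write offsets, and only then does
// the mapper write into a pre-allocated intermediate array.
std::vector<std::pair<std::string, int>>
map_stage(const std::vector<std::vector<std::string>>& per_thread_words) {
    size_t t = per_thread_words.size();
    if (t == 0) return {};

    // Step 1: counting keys (kvcnt[tid] = pairs thread tid will emit).
    std::vector<size_t> kvcnt(t);
    for (size_t tid = 0; tid < t; ++tid)
        kvcnt[tid] = per_thread_words[tid].size();

    // Step 2: exclusive prefix sum gives each thread its write offset.
    std::vector<size_t> offset(t, 0);
    for (size_t tid = 1; tid < t; ++tid)
        offset[tid] = offset[tid - 1] + kvcnt[tid - 1];
    size_t total = offset[t - 1] + kvcnt[t - 1];

    // Step 3: mapper writes pairs into the pre-sized intermediate array.
    std::vector<std::pair<std::string, int>> im(total);
    for (size_t tid = 0; tid < t; ++tid)
        for (size_t i = 0; i < per_thread_words[tid].size(); ++i)
            im[offset[tid] + i] = {per_thread_words[tid][i], 1};
    return im;
}
```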


Figure 5. The map stage on the GPU

While the CPU node uses the hash container, the GPU node uses sort and reduction for the reduce operation, because there is no suitable hash container for the GPU. The partial output is generated after the sort and reduction. Note that since the GPU node executes the map and reduce stages locally, no shuffle stage is needed. The output of a GPU node is therefore not the final output; it is a partial output to be integrated with the other outputs of the GPU nodes and the CPU nodes.
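A sequential sketch of the sort-and-reduction reduce described above (illustrative; a real GPU version would use a parallel sort and a segmented reduction):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// With no hash container available on the GPU, the intermediate pairs are
// sorted by key so that equal keys become adjacent, and each run of equal
// keys is then reduced (here: summed) into one entry of the partial output.
std::vector<std::pair<std::string, int>>
sort_and_reduce(std::vector<std::pair<std::string, int>> im) {
    std::sort(im.begin(), im.end());             // group equal keys together
    std::vector<std::pair<std::string, int>> out;
    for (const auto& kv : im) {
        if (!out.empty() && out.back().first == kv.first)
            out.back().second += kv.second;       // reduction over a key run
        else
            out.push_back(kv);
    }
    return out;  // a partial output, merged with other nodes' outputs later
}
```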

[Figure 5: in the counting-keys step, threads TID 0-9 count KVcnt = 3, 2, 3, 1, 3, 2, 3, 1, 3, 3 key/value pairs; prefixSum() over KVcnt yields each thread's start address in the pre-allocated intermediate array pair<K,V>[0..23]; each thread then stores its pairs with KV[KVcnt[TID]] = pair<k,v>; KVcnt[TID]++;]


Figure 6. The data flow of PAAN

[Figure 6: the PAAN data flow, with n CPU nodes and m GPU nodes (legend: data in MM/DM, operations on host/device). Each CPU node runs M and C&P on input data in main memory to produce IM, which is shuffled and then sorted (S) and reduced (R) into a partial output Op. Each GPU node transfers its chunk HtoD, runs M, S, and R on the device, and copies its partial output back DtoH. All partial outputs are merged into the output data in main memory.]


Figure 7. The simple example of PAAN

[Figure 7: a worked example of PAAN for an application that counts four kinds of items. CPU nodes map their inputs to (item, 1) pairs and combine & partition them; GPU nodes repeatedly transfer a chunk host-to-device, map, sort, and reduce it on the device, and copy back partial outputs such as 2 4 3 3. The shuffled and reduced partial outputs are merged into the final output 29 31 25 23.]


3.3 GPU Mapper CPU Reducer (GMCR)

3.3.1 Strategy Architecture

While HYDRA PAAN is a symmetric task strategy, HYDRA GPU Mapper CPU Reducer (GMCR) borrows the concept of the producer-consumer problem. All GPUs become mappers that produce the intermediate data, while all CPUs are reducers that reduce the intermediate data into the output. In other words, the GPU mapper is the producer and the CPU reducer is the consumer. We utilize a synchronization mechanism to guarantee the integrity of the result.

Figure 10 and figure 11 show the GMCR data flow and a simple example, respectively. The GMCR GPU mapper executes the map function until the data queue is empty, and the CPU reducer launches the reduce function whenever an iteration of map is over. Figure 11 shows a simple example of the data flow, again for the application that counts different shapes and colors.

The limited capacity of the device memory causes the map stage to be repeated. While the original MapReduce waits until all of the map stage is finished, HYDRA GMCR can start the reduce stage on the CPU whenever a map iteration ends. This represents stage (task) pipelining, which leads to performance improvement. In addition, GMCR uses the streams supported by NVIDIA CUDA, which also allow more sophisticated pipelining methods. A more detailed implementation is described in Section 3.3.2.

The CPU reducer is similar to the reducer of the PAAN CPU node. We discuss its implementation in more detail in Section 3.3.3.


Figure 8. GMCR strategy architecture: CPUs 1..n act as reducers (Reduce1..Reducen) and GPUs 1..m act as mappers (Map1..Mapm), sharing the main memory and disk through an interconnect bus and a PCI interface.

3.3.2 GPU Mapper Workflow

We need the GPU master thread that manages the tasks of the GPU mapper due

to lack of direct access to the device. The GPU master thread is responsible for data com-

munication, memory management, streams scheduling, and so on. The GPU mapper is con-

sists of counting keys, prefix sum, and mapper since dynamic memory allocation is not pos-

sible, The work flows of each step are same as the map stage of the PAAN GPU node.

The prominent characteristic of the GPU mapper is its use of streams. A stream is a sequence of operations that execute in issue order on the GPU, and GPU operations in different streams may run concurrently. In other words, streams allow concurrent execution of GPU kernels and data communication. Figures 9-(c) and 9-(d) show that host-to-device data communication in stream 1, a GPU kernel in stream 2, and data communication


from device to host in stream 3 can execute simultaneously. Ideally, when the kernel time and the data communication time are equal, the performance improvement is dramatic.

Owing to the use of streams, we propose three pipelining methods: stage (task) pipelining, in-map pipelining (on GPU), and integrated (nested) pipelining.

Stage (task) pipelining: The map and reduce stages run concurrently on the GPU and the CPU, respectively. The GPU mapper produces intermediate data repeatedly and then transfers the data to host memory. When all the data transfer is complete, the GPU master thread sends a signal to the reducers. The CPU reducers that receive the signal start the reduce stage immediately. Figure 9-(b) illustrates the concept of stage pipelining.

In-map pipelining (on GPU): The GPU mapper stage can be divided across the streams. It consists of host-to-device data communication (HD), the GPU kernel (K), and device-to-host data communication (DH). We split each of these minor tasks to apply pipelining, as in Figure 9-(c), which shows an example with three streams. Since the reduce stage can only start once all of the map stage is finished, this method has an inefficiency: the CPU is idle while the map stage repeats.

Integrated (nested) pipelining: This method integrates stage and in-map pipelining. The GPU map stage is divided across the streams, and the reducers start their task whenever a map iteration in a stream ends. Figure 9-(d) shows the integrated pipelining, which leads to improved performance. Hence, we employ this method in HYDRA GMCR.
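The integrated pipelining can be sketched on the host side with `std::async` standing in for CUDA streams. This is an illustrative C++ sketch under stated assumptions, not the thesis's CUDA implementation; `run_integrated`, `map_chunk`, and `reduce_chunk` are hypothetical names, and `map_chunk` stands for one HD, K, DH sequence.

```cpp
#include <functional>
#include <future>
#include <vector>

// Integrated (nested) pipelining skeleton: keep up to n_streams map
// iterations in flight, and run the reduce step for each chunk as soon
// as that chunk's device-to-host transfer completes, instead of waiting
// for the whole map stage.
int run_integrated(const std::vector<int>& chunks,
                   const std::function<int(int)>& map_chunk,    // HD + K + DH
                   const std::function<void(int)>& reduce_chunk,
                   int n_streams) {
    std::vector<std::future<int>> inflight;
    int launched = 0, reduced = 0;
    while (reduced < (int)chunks.size()) {
        // fill the "streams" with map iterations
        while ((int)inflight.size() < n_streams &&
               launched < (int)chunks.size())
            inflight.push_back(std::async(std::launch::async, map_chunk,
                                          chunks[launched++]));
        reduce_chunk(inflight.front().get());  // reduce as a chunk lands
        inflight.erase(inflight.begin());
        ++reduced;
    }
    return reduced;
}
```

For simplicity this sketch harvests chunks in launch order; real CUDA streams complete out of order and would be harvested via stream events or callbacks.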


Figure 9. The pipelining methods of GMCR: (a) basic flow, (b) stage (task) pipelining, (c) in-map pipelining (on GPU), and (d) integrated (nested) pipelining. HD, K, DH, and R denote host-to-device transfer, kernel execution, device-to-host transfer, and reduce, respectively.


3.3.3 CPU Reducer Workflow

The CPU reducers store the intermediate data produced by the GPU mappers into hash containers. Each CPU thread has its own hash container, so the number of hash containers is equal to the number of CPU cores.

The GPU map stage must repeat the mapper due to the limited capacity of the device memory. Hence, the CPU reducers also alternate between waiting and running. A reducer initially waits for a signal from the GPU mapper. When a stream finishes, its mapper sends a signal meaning that the waiting reducers can start the reduce task. The reducers that catch the signal start their task immediately and then wait again. This workflow is the same regardless of the pipelining method; only the timing of the reducer executions differs. In addition, all of the hash containers must be merged at the end, since duplicated keys can be stored in two or more hash containers.

While the original MapReduce has a barrier that delays the shuffle stage until the map stage is completed, HYDRA GMCR executes in a pipelined manner, so such a barrier for the shuffle stage is meaningless. Instead, the reduce task executes incrementally.
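The per-core hash containers, incremental reduce, and final merge can be sketched as follows. This is an illustrative C++ sketch with hypothetical names (`Hash`, `incremental_reduce`, `merge_containers`); the reduce operation shown is a sum, as in the counting applications.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// One hash container per CPU core. Each signal from a GPU mapper
// triggers an incremental reduce of one batch of intermediate pairs
// into that core's container; the containers are merged at the end
// because the same key may land in more than one of them.
using Hash = std::map<std::string, int>;

void incremental_reduce(Hash& local,
                        const std::vector<std::pair<std::string, int>>& batch) {
    for (const auto& kv : batch)
        local[kv.first] += kv.second;      // reduce operation: sum
}

Hash merge_containers(const std::vector<Hash>& per_core) {
    Hash out;
    for (const auto& h : per_core)
        for (const auto& kv : h)
            out[kv.first] += kv.second;    // resolve duplicated keys
    return out;
}
```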


Figure 10. The data flow of GMCR (n: the number of CPUs, m: the number of GPUs). Input data in main memory (MM) is transferred host-to-device, mapped in device memory (DM) by the streams of each GPU, and transferred device-to-host; the CPU cores then shuffle, sort, and incrementally reduce the intermediate data into the output data in main memory.


Figure 11. A simple example of GMCR. Each stream transfers a portion of the input to the device, maps it, and transfers the intermediate counts back to the host; the reducers shuffle, sort, and incrementally reduce the partial outputs into the final output.


IV. EVALUATION

4.1 Experimental Setup

Our experiments are performed on a single computing node with two Tesla C2070 GPUs and two Intel Xeon X5675 processors running 64-bit openSUSE Linux. The hard drive is a 1.8TB SATA magnetic hard disk. Each GPU consists of 14 streaming multiprocessors, each of which has 32 cores running at 1.15GHz. In contrast, each CPU has six cores running at 3.07GHz. The host memory is 100GB, and the device memory of each GPU is 6GB. The CPU L3 cache size is 12MB, and the GPU L2 cache size is 768KB. The GPU uses a PCI-Express 2.0 bus to transfer data between the host memory and the device memory, with a theoretical bandwidth of 16GB/sec. The device memory of the Tesla C2070 achieves a bandwidth of up to 144GB/sec, whereas the X5675 CPU has 25.6GB/sec.

For comparison, we use 8 Hadoop nodes, one master and 7 slaves. Each node has an Intel i7 2600 processor running at 3.40GHz on 64-bit openSUSE, with a 500GB SATA magnetic hard disk and 16GB of host memory.

4.2 Applications

Word Count: The word count is the number of words in a document or passage of text. Word counting may be needed when a text is required to stay within a certain number of words. This may particularly be the case in academia, legal proceedings, journalism, and advertising. Word count is commonly used by translators to define the price for a translation job. Word counts may also be used to calculate measures of readability and


to measure typing and reading speeds (usually in words per minute). The Map tasks process

different sections of the input files and return intermediate data that consist of a word (key)

and a value of 1 to indicate that the word was found. The Reduce tasks add up the values

for each word (key). The word count data set is synthetic, and its pseudocode is shown in Section II.2.
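A minimal sequential sketch of this Map/Reduce logic (illustrative only; the actual pseudocode is the one given in Section II.2): Map emits (word, 1) for every word in its section of the input, and Reduce adds up the values per key.

```cpp
#include <map>
#include <sstream>
#include <string>

// Sequential word count: whitespace-delimited tokens are the keys,
// and the per-key reduce is a sum of the emitted 1s.
std::map<std::string, int> word_count(const std::string& text) {
    std::map<std::string, int> counts;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        counts[word] += 1;   // map: emit (word, 1); reduce: sum per key
    return counts;
}
```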

Page View Count: The page view count counts the number of distinct page views from the web logs of Wikimedia projects [25]. A log entry is a 4-ary tuple, and the application counts the number of page views for each page. MapReduce processes the key/value pairs generated from the input file: the Map outputs the URL as the key and the count as the value, and a sort is required before starting the reduce task.
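A sequential sketch of the page view count logic. The exact layout of the 4-ary log tuple is an assumption here (project, page URL, view count, bytes); `page_view_count` is an illustrative name, and the sorted `std::map` stands in for the sort that precedes the reduce task.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Page view count over 4-field log lines: Map emits (URL, count) per
// entry, and Reduce sums the counts per URL. std::map keeps the keys
// sorted, mirroring the sort before the reduce task.
std::map<std::string, long> page_view_count(
        const std::vector<std::string>& log_lines) {
    std::map<std::string, long> views;
    for (const auto& line : log_lines) {
        std::istringstream in(line);
        std::string project, url, bytes;
        long count;
        if (in >> project >> url >> count >> bytes)
            views[url] += count;           // reduce: sum per page
    }
    return views;
}
```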

4.3 Performance Evaluation

We have implemented word count on the single node described above. The input data file is randomly generated, and the experimental data size ranges from 64GB to 1024GB. Figures 12 and 14 show the execution times of word count and page view count on HYDRA_PAAN, HYDRA_GMCR, and Hadoop, respectively. Furthermore, Figures 13 and 15 show the speedup of each application on HYDRA_PAAN and HYDRA_GMCR compared to Hadoop. The postfix number denotes the number of GPUs used in each strategy, to show the effect of the number of GPUs.
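Speedup here is read as the usual ratio of baseline time to measured time; the thesis does not spell out the formula, so this definition is an assumption:

```cpp
// Speedup relative to Hadoop: the Hadoop execution time divided by the
// HYDRA execution time for the same data size (assumed definition).
double speedup(double hadoop_secs, double hydra_secs) {
    return hadoop_secs / hydra_secs;
}
```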


Figure 12. The execution times of word count for data sizes from 32GB to 1024GB, comparing Hydra_PAAN (1 and 2 GPUs), Hydra_GMCR (1 and 2 GPUs), and Hadoop.


Figure 13. The speedup of word count over Hadoop for data sizes from 32GB to 1024GB, for Hydra_PAAN (1 and 2 GPUs) and Hydra_GMCR (1 and 2 GPUs).


Figure 14. The execution times of page view count for data sizes from 64GB to 1024GB, comparing Hydra_PAAN (1 and 2 GPUs), Hydra_GMCR (1 and 2 GPUs), and Hadoop.


Figure 15. The speedup of page view count over Hadoop for data sizes from 64GB to 1024GB, for Hydra_PAAN (1 and 2 GPUs) and Hydra_GMCR (1 and 2 GPUs).

The graphs in Figures 12 to 15 show that HYDRA improves performance compared to Hadoop, since HYDRA is a single node with a shared memory system and it utilizes heterogeneous-core multiprocessors. The reason PAAN is faster than GMCR is that the GPU map stage has three minor steps (counting keys, prefix sum, and mapper). This leads to overhead, but a more sophisticated data structure, such as a GPU hash, might solve the problem. HYDRA GMCR with one GPU shows the worst performance



since the computation power of a single GPU is not enough as the data size increases. As the data size increases, the word count speedup decreases, whereas the page view count speedup increases, except for HYDRA GMCR with one GPU. The intermediate data in the hash containers is smaller for page view count than for word count, since page view count has many duplicated keys.

As mentioned above, the GPU has the restrictions that dynamic memory allocation is unavailable and that the device memory capacity is limited. These remain challenges for improving performance; we even need a CPU thread to handle the GPU device. Fortunately, we believe these problems will be solved in the near future, since parallel computing is a promising approach in various fields.


V. RELATED WORK

Related work can be categorized into three domains. HYDRA belongs to the MapReduce models that use accelerators.

5.1 MapReduce Framework with the CPUs

The MapReduce model is used in various research areas such as data mining, machine learning, and bioinformatics, and a number of implementations and extensions have been proposed. Hadoop [26] is a popular open-source implementation written in Java for multi-node systems. Ostrich [27] is an extension to MapReduce that applies tiling in order to optimize the use of memory, cache, and CPU resources. [9] and [1] have applied the merge operation to MapReduce for relational databases and applied MapReduce to ten machine learning algorithms on a multi-core CPU, respectively.

Phoenix [28] is an efficient implementation of the MapReduce framework on multi-core CPUs, and Phoenix++ [29] is the latest version of Phoenix, which provides improved scalability. Additionally, Phoenix provides an API for developers who are unfamiliar with parallel programming. Although Phoenix provides an efficient MapReduce implementation on multi-core CPUs, its drawback is that it cannot support different platforms, such as heterogeneous-core multiprocessor systems.

5.2 MapReduce Framework with the Accelerators

Since accelerator processors have emerged as coprocessors alongside the CPU, there have been several attempts to implement MapReduce on accelerators such as the GPU [30-32], FPGA [33], and CELL [34].


[30] proposes a MapReduce framework on the GPU, but it requires the developer to manually handle details of the GPU such as thread management and the hierarchical memory structure, and it focuses on several small tasks. FPMR [33] is a MapReduce framework on FPGA which provides a programming abstraction, a hardware architecture, and basic building blocks to developers. CellMR [34] is an efficient and scalable implementation of the MapReduce framework for asymmetric Cell-based clusters; it divides the traditional MapReduce into a pipeline of three steps: map, partial reduction, and global reduction.

Mars [31] is the first systematic framework to use the GPU, though its scalability is limited. Mars makes several changes to the traditional MapReduce due to the limitations of the GPU. There are two counting phases, one in each of the Map and Reduce steps, called MapCount and ReduceCount, respectively; for each counting phase, the emitted data is stored in GPU memory. This may be inefficient because GPU memory cannot be allocated without being handled by the CPU. Another problem is that Mars uses sorting to group the intermediate key-value pairs instead of a specific partitioning method such as hashing, probably because a hash table is hard to implement on the GPU. Consequently, Mars uses bitonic sort to partition the intermediate key-value pairs, though this is less efficient than hashing.

GPMR [32] is also a GPU MapReduce framework, one that harnesses the power of GPU clusters for large-scale computing. To better utilize the GPU, it modifies MapReduce by combining large amounts of map and reduce items into chunks and by using partial reductions and accumulation.

There are also attempts to implement MapReduce on machines with heterogeneous processors. MapCG [35] is another GPU-based MapReduce framework. Its main


goal is to allow the portability of MapReduce code between multicore CPUs and the GPU. However, MapCG has the limitation that it scales to only one GPU. [12] proposes the Merge framework for dynamically scheduling MapReduce tasks among heterogeneous processors; its scheduling method is used for co-processing between the CPU-based and GPU-based MapReduce frameworks.

5.3 Programming Tools for the GPGPU

The GPU differs from the CPU in its thread model, memory model, and instruction set architecture (ISA). Therefore, existing multi-threaded CPU code cannot execute on the GPU directly. As a result, various programming models, such as Brook [36] and CUDA [24], have been proposed for developing on the GPU. These models are specialized for the GPU; in other words, applications written in these models cannot be executed on the CPU or on any platform other than the GPU.

To minimize the programming effort, other programming models have been proposed to bridge the gap between the CPU and the GPU. OpenCL [37] tries to provide a unified view of the ISA across the CPU, the GPU, and other accelerators. It enables the developer to write the same kernel code for execution on different processors, including the CPU and the GPU. However, programmers are still responsible for handling the communication between the processors. For efficiency and generality, we have employed CUDA to implement our framework, HYDRA.


VI. CONCLUSIONS

Multicore and manycore processors have emerged as a commodity platform for parallel computing. However, developing applications for them requires knowledge of these architectures and much effort. Since MapReduce has been successful in easing such development, this paper has proposed an attractive MapReduce system for a single computing node with heterogeneous-core multiprocessors. We design Hybrid-core based big Data Real-time Analysis (HYDRA), an implementation of MapReduce that fully utilizes multicore CPUs and manycore GPUs. Additionally, within the hierarchical memory structure, it minimizes the overhead of data communication between host memory and device memory. The runtime on the heterogeneous-core multiprocessor system is completely hidden from the developer by our framework.

In particular, we have suggested two major strategies: Processor As A Node (PAAN) and GPU Mapper CPU Reducer (GMCR). PAAN virtualizes each multiprocessor as a single node, whereas in GMCR the GPUs are mappers that produce intermediate data and the CPUs are reducers that consume the intermediate data into the output. We have described their detailed architectures and implementations.

Our experimental results show that our implementation offers higher performance than Hadoop, the most popular open-source MapReduce implementation. Moreover, our experiments describe the advantages and drawbacks of each strategy. Finally, we have noted the limitations imposed by the GPU's restrictions, which can motivate further work on this topic.


References

[1] C. Chu, S. K. Kim, Y. A. Lin et al., “Map-reduce for machine learning on multicore,”

Advances in neural information processing systems, vol. 19, pp. 281, 2007.

[2] J. Dean, and S. Ghemawat, “MapReduce: simplified data processing on large clusters,”

Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.

[3] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce." pp. 681-689.

[4] R. L. Ferreira Cordeiro, C. Traina Junior, A. J. Machado Traina et al., "Clustering very large

multi-dimensional datasets with MapReduce." pp. 690-698.

[5] Y. He, H. Tan, W. Luo et al., "MR-DBSCAN: An efficient parallel density-based clustering

algorithm using MapReduce." pp. 473-480.

[6] S. Nair, and J. Mehta, "Clustering with Apache Hadoop." pp. 505-509.

[7] C. Wang, M. Guo, and Y. Liu, "EST Clustering in Large Dataset with MapReduce." pp.

968-971.

[8] B. Wu, Y. Dong, Q. Ke et al., "A parallel computing model for large-graph mining with

mapreduce." pp. 43-47.

[9] H. Yang, A. Dasdan, R. L. Hsiao et al., "Map-reduce-merge: simplified relational data

processing on large clusters." pp. 1029-1040.

[10] S. Ghemawat, H. Gobioff, and S. T. Leung, "The Google file system." pp. 29-43.

[11] D. Judd, P. K. McKinley, and A. K. Jain, “Large-scale parallel data clustering,” Pattern

Analysis and Machine Intelligence, IEEE Transactions on, vol. 20, no. 8, pp. 871-876,

1998.

[12] M. D. Linderman, J. D. Collins, H. Wang et al., “Merge: a programming model for

heterogeneous multi-core systems,” SIGOPS Oper. Syst. Rev., vol. 42, no. 2, pp. 287-296,

2008.

[13] X. Xu, J. Jäger, and H. P. Kriegel, “A fast parallel clustering algorithm for large spatial

databases,” High Performance Data Mining, pp. 263-290, 2002.

[14] B. Catanzaro, N. Sundaram, and K. Keutzer, "Fast support vector machine training and

classification on graphics processors." pp. 104-111.

[15] D. Steinkraus, I. Buck, and P. Simard, "Using GPUs for machine learning algorithms." pp.

1115-1120.


[16] S. A. Manavski, and G. Valle, “CUDA compatible GPU cards as efficient hardware

accelerators for Smith-Waterman sequence alignment,” BMC bioinformatics, vol. 9, no.

Suppl 2, pp. S10, 2008.

[17] M. C. Schatz, C. Trapnell, A. L. Delcher et al., “High-throughput sequence alignment using

Graphics Processing Units,” BMC bioinformatics, vol. 8, no. 1, pp. 474, 2007.

[18] L. Weiguo, B. Schmidt, G. Voss et al., “Streaming algorithms for biological sequence

alignment on GPUs,” Parallel and Distributed Systems, IEEE Transactions on, vol. 18, no.

9, pp. 1270-1281, 2007.

[19] N. K. Govindaraju, B. Lloyd, W. Wang et al., "Fast computation of database operations

using graphics processors." pp. 215-226.

[20] B. He, K. Yang, R. Fang et al., "Relational joins on graphics processors." pp. 511-524.

[21] W. Fang, K. K. Lau, M. Lu et al., “Parallel data mining on graphics processors,” Hong

Kong University of Science and Technology, Tech. Rep. HKUST-CS08-07, 2008.

[22] I. Chiosa, and A. Kolb, “GPU-based multilevel clustering,” Visualization and Computer

Graphics, IEEE Transactions on, vol. 17, no. 2, pp. 132-145, 2011.

[23] AMD_CTM. "http://www.amd.com/us/products/Pages/products.aspx."

[24] NVIDIA_CUDA. "http://developer.nvidia.com/object/cuda.html."

[25] Wikimedia, “http://dumps.wikimedia.org/.”

[26] Apache_Hadoop. "http://hadoop.apache.org/."

[27] R. Chen, H. Chen, and B. Zang, "Tiledmapreduce: optimizing resource usages of data-

parallel applications on multicore with tiling." pp. 523-534.

[28] C. Ranger, R. Raghuraman, A. Penmetsa et al., "Evaluating mapreduce for multi-core and

multiprocessor systems." pp. 13-24.

[29] J. Talbot, R. M. Yoo, and C. Kozyrakis, "Phoenix++: modular MapReduce for shared-

memory systems." pp. 9-16.

[30] B. Catanzaro, N. Sundaram, and K. Keutzer, "A map reduce framework for programming

graphics processors."

[31] B. He, W. Fang, Q. Luo et al., "Mars: a MapReduce framework on graphics processors." pp.

260-269.

[32] J. A. Stuart, and J. D. Owens, "Multi-GPU MapReduce on GPU clusters." pp. 1068-1079.

[33] Y. Shan, B. Wang, J. Yan et al., "FPMR: Mapreduce framework on fpga." pp. 93-102.

[34] M. M. Rafique, B. Rose, A. R. Butt et al., "CellMR: A framework for supporting mapreduce


on asymmetric cell-based clusters." pp. 1-12.

[35] C. Hong, D. Chen, W. Chen et al., "MapCG: writing parallel program portable between

CPU and GPU." pp. 217-226.

[36] I. Buck, T. Foley, D. Horn et al., "Brook for GPUs: stream computing on graphics

hardware." pp. 777-786.

[37] KHRONOS_GROUP. "http://www.khronos.org/."


Summary

MapReduce Architecture for a Single Computing Node of Multiprocessors

The recent paradigm of CPU microarchitecture design is shifting toward on-chip multicore processors and manycore coprocessors such as NVIDIA's Tesla and Intel's Xeon Phi. Meanwhile, the MapReduce framework is widely used and studied for big data analysis on large clusters of low-cost nodes. This thesis regards a single node composed of multiple multicore CPUs and manycore GPUs as a cluster of processors and proposes a new MapReduce framework called Hybrid-core based big Data (Real-time) Analysis (HYDRA), in which each processor plays the role of a single node. HYDRA is designed to exploit the computing power of modern heterogeneous-core systems as fully as possible, so that HYDRA on a single node can achieve performance similar to that of MapReduce on a small multi-node cluster. In particular, since HYDRA is based on a shared memory architecture, it avoids the heavy cost of transferring data over the network that can arise in the shuffle stage of conventional MapReduce. This thesis proposes two strategies under the HYDRA framework: "Processor As A Node" (PAAN) and "GPU Mapper CPU Reducer" (GMCR). PAAN regards each CPU or GPU as a single computing node, whereas GMCR operates the GPUs only as mapper nodes and the CPUs only as reducer nodes. The two proposed strategies present solutions to (1) the problem of cooperation between CPUs and GPUs with different characteristics, (2) the problem of managing the different memory hierarchies of these processors, and (3) the problem of reducing the cost of data transfer between CPUs and GPUs. Finally, the results of various experiments show that the proposed HYDRA achieves more than 14 times better performance than MapReduce on a small cluster of 8 nodes.

Keywords: MapReduce, heterogeneous computing, general-purpose GPU, multicore, manycore


Acknowledgement

I would like to express my gratitude to all those who gave me the possibility to complete this thesis. Above all, I am deeply indebted to my supervisor, Prof. Min-Soo Kim, whose help, stimulating suggestions, and encouragement helped me throughout the research for and writing of this thesis. My colleagues from the Department of Information and Communication Engineering supported me in my research work; I want to thank them for all their help, support, interest, and valuable hints. Especially, I would like to give my special thanks to my family, whose patient love enabled me to complete this work.


CURRICULUM VITAE

Hyochan Song

05.04.1984

Education

Master of Science in Information & Communication Engineering, Mar. 2011 – Feb. 2013

DGIST (Daegu Gyeongbuk Institute of Science and Technology), Daegu, Korea

Bachelor of Science in Computer Science, Mar. 2003 – Feb. 2011

Handong Global University, Pohang, Korea

Work Experience

Teaching Assistant in Computer Networks, Fall Semester in 2010

Prof. Koono Kim, Handong Global University, Pohang, Korea

Teaching Assistant in Introduction to Programming, Fall Semester in 2009

Prof. Kyungmi Kim, Handong Global University, Pohang, Korea

Intern in Nawooenc, May 2009 – Oct. 2009

MFC (Microsoft Foundation Class) programming

Data acquisition and processing from PLC (Power Line Communication)

Assistant Manager in KoreaEnC, Dec. 2004 – July 2007

GIS database construction and maintenance

Professional Activities

Staff in Samsung Software Membership (SSM), Jan 2010 – present

Smart Driving Assistor, Mar. 2010 – Aug. 2010, Daegu, Korea

Shopaholic, Sept. 2010 – Feb. 2011

Story Teller using HTML5, Mar. 2011 – Aug. 2011


Computer Research Association (CRA), Jun. 2003 – Feb. 2011

Managing intranet (i3) in Handong Global University, Pohang, Korea

Barcode recognizer by using OpenCV, Dec. 2007 – Mar. 2008

Smart Movie-Player, Jun. 2008 – Aug. 2008

Smart Movie-Marker, Jun. 2008 – Aug. 2008

President, Jun. 2009 – May. 2010

Honors and Awards

Delight Exhibition, Nov. 2011

Shopaholic, Samsung Software Membership, Daegu, Korea

S-class Project, May 2011

Shopaholic, Samsung Software Membership, Daegu, Korea

Popularity Award in Engineering Week, Nov. 2008

Smart Movie-Player, Handong Global University, Pohang, Korea

Excellence Award in Capstone Design, Oct. 2008

Smart Movie-Marker, Handong Global University, Pohang, Korea