
A Seminar-II Report

On

“Pipelined-MapReduce: An improved MapReduce Parallel programming model”

submitted in partial fulfilment of the requirements

for the award of the degree of

First Year Master of Engineering

in

Computer Engineering

By

Prof.Deshmukh Sachin B.

Under the guidance of

Prof. B.L.Gunjal

DEPARTMENT OF COMPUTER ENGINEERING

AMRUTVAHINI COLLEGE OF ENGINEERING, SANGAMNER

A/P-GHULEWADI-422608, TAL. SANGAMNER, DIST. AHMEDNAGAR (MS) INDIA

2013-2014


DEPARTMENT OF COMPUTER ENGINEERING AMRUTVAHINI

COLLEGE OF ENGINEERING. SANGAMNER A/P-GHULEWADI-

422608, TAL. SANGAMNER, DIST. AHMEDNAGAR (MS) INDIA

2013-14

CERTIFICATE

This is to certify that the Seminar-II report entitled

“Pipelined-MapReduce: An improved MapReduce Parallel programming model”

is submitted in partial fulfilment of the curriculum of First Year M.E.

of Computer Engineering

By

Prof. Deshmukh Sachin B.

Prof. B.L.Gunjal Prof. B.L.Gunjal

(Seminar Guide) (Seminar Coordinator)

Prof. R.L.Paikrao Prof. Dr.G.J. Vikhe Patil

(H.O.D.) (Principal)


University of Pune

CERTIFICATE

This is to Certify that

Prof.Deshmukh Sachin B.

Student of

First Year M.E.(Computer Engineering)

was examined in Dissertation Report entitled

“Pipelined-MapReduce: An improved MapReduce Parallel programming model”

on / /2013

at

Department of Computer Engineering,

Amrutvahini College of Engineering,Sangamner - 422608

(. . . . . . . . . . . . . . . . . . . . .) (. . . . . . . . . . . . . . . . . . . . .)

Prof. B.L.Gunjal Prof. A.N.Nawathe

Internal Examiner External Examiner


ACKNOWLEDGEMENT


As a wise person said about the nobility of the teaching profession: catch a fish and you feed a man his dinner, but teach a man how to catch a fish, and you feed him for life.

In this context, I would like to thank my helpful seminar guide Prof. B.L. Gunjal, who has been an incessant source of inspiration and help. Not only did she inspire me to undertake this assignment, she also advised me throughout its course and helped me during my times of trouble. I would like to thank the H.O.D. of the Department of Computer Engineering, Prof. Paikrao R. L., and the M.E. co-ordinator, Prof. Gunjal B. L., for motivating me.

I would also like to thank the other members of the Computer Engineering department who helped me to handle this assignment efficiently, and who were always ready to go out of their way to help us during our times of need.

Prof.Deshmukh Sachin B.

M.E.Computer Engineering


ABSTRACT


Cloud MapReduce (CMR) is a framework for processing large batch data sets in the cloud. The Map and Reduce phases run sequentially, one after another. This leads to:

1. Compulsory batch processing
2. No parallelization of the Map and Reduce phases
3. Increased delays.

The current implementation is not suited to processing streaming data. We propose a novel architecture that supports streaming data as input by using pipelining between the Map and Reduce phases in CMR, ensuring that the output of the Map phase is made available to the Reduce phase as soon as it is produced. This 'Pipelined MapReduce' approach leads to increased parallelism between the Map and Reduce phases, thereby:

1. Supporting streaming data as input
2. Reducing delays
3. Enabling the user to take 'snapshots' of the approximate output generated in a stipulated time frame
4. Supporting cascaded MapReduce jobs.

This cloud implementation is light-weight and inherently scalable.

Contents

Acknowledgement
Abstract
List of Figures

1 INTRODUCTION
1.1 Need of system
1.2 Map Reduce

2 LITERATURE SURVEY
2.1 Map Reduce
2.1.1 Mappers and Reducers
2.2 Hadoop
2.2.1 Introduction to Hadoop
2.2.2 Design of HDFS
2.2.3 Basic concepts of HDFS
2.3 Cloud Map Reduce
2.4 Online Map Reduce (Hadoop Online Prototype)

3 PROPOSED ARCHITECTURE
3.1 Architecture of Pipelined CMR
3.1.1 Input
3.1.2 Mapper Operation
3.1.3 Reduce Phase
3.2 The first design option
3.2.1 Handling Reducer Failures
3.3 The second design option
3.4 Hybrid Approach
3.5 Pipelined Architecture

4 Hadoop and HDFS
4.1 Hadoop System Architecture
4.1.1 Operating a Server Cluster
4.2 Data Storage and Analysis
4.3 Comparison with Other Systems
4.3.1 RDBMS
4.4 Hadoop Distributed File System (HDFS)
4.4.1 HDFS Federation
4.4.2 HDFS High-Availability
4.5 Algorithm
4.5.1 Algorithm For Word Count

5 ADVANTAGES, FEATURES AND APPLICATIONS
5.1 Advantages
5.2 Features
5.2.1 Time windows
5.2.2 Snapshots
5.2.3 Cascaded MapReduce Jobs

Conclusion

References

List of Figures

2.1 Map Reduce Process
2.2 Architecture of Complete Hadoop Structure
2.3 Architecture of Hadoop Distributed File System
3.1 Architecture of Pipelined Map Reduce
3.2 Hadoop Data Flow for Batch and Pipelined Map Reduce Data Flow
4.1 Hadoop System Architecture
4.2 Operation of Hadoop Cluster


Chapter 1

INTRODUCTION

1.1 Need of system

Cloud Map Reduce (CMR) is gaining popularity among small companies for processing large data sets in cloud environments. The current implementation of CMR is designed for batch processing of data. Today, streaming data constitutes an important portion of web data in the form of continuous click-streams, feeds, micro-blogging, stock quotes, news and other such data sources. Processing streaming data in Cloud MapReduce (CMR) requires significant changes to the existing CMR architecture. We use pipelining between the Map and Reduce phases as an approach to support stream data processing, in contrast to the current implementation, where the Reducers do not start working until all Mappers have finished. Thus, in our architecture, the Reduce phase also receives a continuous stream of data and can produce continuous output. This new architecture is explained in subsequent sections.

Compute- and data-intensive processing is increasingly prevalent. In the near future, it is expected that the data volumes processed by applications will cross the peta-scale threshold, with a corresponding increase in computational requirements.

1.2 Map Reduce

Google's MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes input key/value pairs and produces a series of intermediate key/value pairs, and a reduce function that combines all of the values associated with the same intermediate key. MapReduce allows program developers to think in a data-centric way: they focus on transformations applied to sets of data records, while the MapReduce framework handles the details of distributed processing, network communication, coordination and fault tolerance.

The MapReduce model is usually applied to large batch computations where the major concern is total completion time. The Google MapReduce framework and the open-source Hadoop framework reinforce this batch usage of the model through their implementation strategy: the entire output of each map and reduce stage is materialized to stable storage before it is consumed by the next stage or produces final output. Batch materialization allows a simple and practical checkpoint/restart fault-tolerance mechanism, which is critical for large deployments.

Pipelined-MapReduce is an improved implementation in which intermediate data is pipelined between operators, while retaining the programming interface and fault-tolerance model of the original MapReduce. Pipelined-MapReduce has many advantages: a downstream element can begin consuming data before its producer has finished, which increases the opportunities for parallelism, improves efficiency and reduces response time. Because reducers start processing data as soon as it is produced by the mappers, they can generate and refine an approximation of the final result during execution. Pipelined-MapReduce thus broadens the field to which MapReduce can be applied.


Chapter 2

LITERATURE SURVEY

2.1 Map Reduce

MapReduce is a programming model developed by Google for processing large data sets. The input data is divided into 'splits' that are independent of each other and can hence be processed in parallel fashion. Each split consists of a set of (key, value) pairs that form records, and the splits are distributed among computing nodes. The model consists of two phases: a Map phase and a Reduce phase. The splits of data are fed to independent computing nodes called Mappers. Mappers map the input (key, value) pairs into intermediate (keyi, valuei) pairs. The following stage consists of Reducers, which are computing nodes that take the intermediate (keyi, valuei) pairs generated by the Mappers. The Reducers combine the set of values associated with a particular keyi obtained from all the Mappers to produce a final output in the form of (keyi, value). The often-cited word count example elegantly illustrates the computing phases of MapReduce. A large document is to be scanned and the number of occurrences of each word is to be determined. The solution using MapReduce proceeds as follows:

1. The document is divided into several splits (ensuring that no split is such that it

results into splitting of a whole word). The ID of a split is the input key, and the actual

split (or a pointer to it) is the value corresponding to that key. Thus, the document is

divided into a number of splits, each containing a set of (key, value) pairs.

2. There is a set of m Mappers. Each mapper is assigned an input split based on

some scheduling policy (FIFO, Fair scheduling etc.) (If the mapper consumes its split, it

asks for the next split). Each Mapper processes each input (key, value) pair in the split

assigned to it according to some user defined Map function to produce intermediate key

value pairs. In this case, a Mapper, for 2 occurrences of the word ’cake’ and 3 occurrences


of the word ’pastry’ outputs the following: (Cake, 1) (Cake, 1) (Pastry, 1) (Pastry, 1)

(Pastry, 1) Here, each intermediate (key, value) pair indicates one occurrence of the key.

3. The set of intermediate (key, value) pairs is then pulled by the Reducer workers

from the Mappers.

4. The Reducers then work to combine all the values associated with a particular

intermediate key (keyi) according to a user defined Reduce function. Thus, in the above

example, a Reducer worker, on processing the set of intermediate (key, value) pairs of

word count, would output: (Cake, 2) (Pastry, 3)

5. This output is then written to disk as the final output of the MapReduce job, which

gives the count of all words in the document in this example. An optional Combiner phase

is usually included at the Mapper worker to combine the local result of the Mapper before

sending it over to the Reducer. This works to reduce network traffic between the Mapper

and the Reducer.

2.1.1 Mappers and Reducers:

The Map Reduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. Key-value pairs form the basic data structure in Map Reduce. Keys and values may be primitives such as integers, floating point values, strings, and raw bytes, or they may be arbitrarily complex structures (lists, tuples, associative arrays, etc.).

In Map Reduce, the programmer defines a mapper and a reducer with the following signatures:

map: (k1, v1) -> [(k2, v2)]

reduce: (k2, [v2]) -> [(k3, v3)]

Hence, the complete Hadoop cluster, i.e. HDFS together with Map Reduce, looks as shown in the following figure.

As the figure suggests, the programmer can specify that the system sort each bin's contents

according to any provided criterion. This customization occasionally can be useful. The

programmer can also specify partitioning methods to more coarsely or more finely bin

the map results. Although every job must specify both a mapper and a reducer, either


Figure 2.1: Map Reduce Process

Figure 2.2: Architecture Of Complete Hadoop Structure

of them can be the identity, or an empty stub that creates output records equal to the

input records. In particular, an algorithm frequently calls for input categorization or

ordering for many successive stages, suggesting a reducer sequence without the need for

intermediate mapping; identity mappers fill in to satisfy architectural needs.

2.2 Hadoop

Hadoop is an implementation of the MapReduce programming model developed by

Apache. The Hadoop framework is used for batch processing of large data sets on a

physical cluster of machines. It incorporates a distributed file system called Hadoop Dis-

tributed File System (HDFS), a common set of commands, a scheduler, and the MapReduce evaluation framework. As the entire complexity of cluster management, redundancy

in HDFS, consistency and reliability in case of node failure is included in the framework

itself, the code base is huge: around 3,00,000 lines of code (LOC). Hadoop is popular

for processing huge data sets, especially in social networking, targeted advertisements,

internet log processing etc.

2.2.1 Introduction to Hadoop:

Technically, Hadoop consists of two key services: reliable data storage using the

Hadoop Distributed File System (HDFS) and high-performance parallel data processing

using a technique called Map Reduce. Hadoop runs on a collection of commodity, shared-

nothing servers.

2.2.2 Design of HDFS:


Very large files:

Very large in this context means files that are hundreds of megabytes, gigabytes, or

terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access:

HDFS is built around the idea that the most efficient data processing pattern is a

write-once, read-many-times pattern.

Commodity hardware:

Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to

run on clusters of commodity hardware.

2.2.3 Basic concepts of HDFS:

Blocks:

HDFS has the concept of a block, but it is a much larger unit: 64 MB by default. Like in a

file system for a single disk, files in HDFS are broken into block-sized chunks, which are


stored as independent units. Unlike a file system for a single disk, a file in HDFS that is

smaller than a single block does not occupy a full block's worth of underlying storage.

Name Node:

The name node manages the file system namespace. It maintains the file system tree

and the metadata for all the files and directories in the tree. This information is stored

persistently on the local disk in the form of two files: the namespace image and the edit

log. The name node also knows the data nodes on which all the blocks for a given file

are located, however, it does not store block locations persistently, since this information

is reconstructed from data nodes when the system starts. Without the name node, the

file system cannot be used.

Data Node:

Data nodes are the work horses of the file system. They store and retrieve blocks

when they are told to and they report back to the name node periodically with lists of

blocks that they are storing. The following diagram shows the basic HDFS architecture.

Figure 2.3: Architecture Of Hadoop Distributed File System

2.3 Cloud Map Reduce

Cloud MapReduce is a light-weight implementation of MapReduce programming

model on top of the Amazon cloud OS, using Amazon EC2 instances. It is very compact


(around 3000 LOC) as compared to Hadoop. It also gives significant performance im-

provements (in one case, over 60x) over traditional Hadoop. The architecture of CMR,

as described in [2], consists of one input queue, multiple reduce queues which act as stag-

ing areas for holding the intermediate (key, value) pairs produced by the Mappers, a

master reduce queue that holds the pointers to the reduce queues, and an output queue

that holds the final results. In addition, S3 file system is used to store the data to be

processed, and SimpleDB is used to communicate the status of the worker nodes which

is then used to ensure consistency and reliability. Initially all the queues are set up.

The user then puts the data (key, value) pairs to be processed in the input SQS queue.

Typically, the key is a unique ID and the value is a pointer into S3. The Mappers poll

the queue for work whenever they are free. Then, they dequeue one message from the

queue and process it according to the user defined Map function. The intermediate (key,

value) pairs are pushed by the Mappers to the intermediate Reduce queues as soon as

they are produced.

The choice of a reduce queue is made by hashing on the intermediate key generated,

to ensure even load balancing, and also to ensure that the records with the same key land

up in the same reduce queue. After the Map phase is complete, each Reducer dequeues a

pointer to one of the reduce queues from the master reduce queue and processes the reduce

queue associated with that pointer. They apply the user defined Reduce function on the

reduce queue by sorting the queue by intermediate key, merging the values associated

with a key, and then writing the final output to the Output queue. This is again batch

processing, as the user submits a batch of data to be processed. Also, the Reducer work-

ers don't start working unless all the Mappers have finished all their Map tasks. Thus,

it is not suited for processing streaming data.

2.4 Online Map Reduce(Hadoop Online Prototype)

It is a modification to traditional Hadoop framework that incorporates pipelining be-

tween the Map and Reduce phases, thereby supporting parallelism between these phases,

and providing support for processing streaming data. The output of the Mappers is made

available to the Reducers as soon as it is produced. A downstream dataflow element can

begin processing before an upstream producer finishes. It carries out online aggregation

of data to produce incrementally correct output. This also supports continuous queries.


This can also be used with batch data, and gives approximately 30% better throughput

and response time because of parallelism between the Map and Reduce phases. It also

supports ’snapshots’ of output data where the user can get a view of the output produced

till some time instant instead of waiting for the entire batch of data to finish processing.

Also, Cascaded MapReduce jobs with pipelining are supported in this implementation,

whereby the output of one job is fed to the input of another job as soon as it is produced.

This leads to parallelism within a job, as well as between jobs. The architecture is par-

ticularly suitable for processing streaming data, as it is modelled as dataflow architecture

with windowing and online aggregation.


Chapter 3

PROPOSED ARCHITECTURE

The above implementations suffer from the following drawbacks:

1. HOP is unsuitable for cloud, as Hadoop is a framework for distributed computing.

Hence, it lacks the inherent scalability and flexibility of cloud.

2. In HOP, code for handling of HDFS, reliability, scheduling etc. is a part of the Hadoop

framework itself, which makes it large and heavy-weight (around 300,000 LOC).

3. Cloud MapReduce does not support stream data processing, which is an important use case.

Our proposal aims at bridging this gap between the heavyweight HOP and the light-weight, scalable Cloud MapReduce implementation, by providing support for processing stream data in Cloud MapReduce. This architecture is easier to build, maintain

and run as the underlying complexity of managing resources in cloud is handled by the

Amazon cloud OS, in contrast to a Hadoop-like framework where the issues of handling

filesystem, scheduling, availability, reliability are done in the framework itself. The chal-

lenges involved in the implementation include:

1. Providing support for streaming data at input

2. A novel design for output aggregation

3. Handling Reducer failures

4. Handling windows based on timestamps.

Currently, no open-source implementation exists for processing streaming data using

MapReduce on top of Cloud. To the best of our knowledge, this is the first such attempt

to integrate stream data processing capability with MapReduce on Amazon Web Services

using EC2 instances. Such an implementation will provide significant value to stream

processing applications such as the ones outlined in section IV. We now describe the

architecture of the Pipelined CMR approach.


3.1 Architecture of Pipelined CMR

3.1.1 Input

A drop-box concept can be used, where a folder on S3 (for example) is used to hold the

data that is to be processed by Cloud MapReduce. The user is responsible for providing

data in the drop-box from which it will be sent to the input SQS queue. That way, the

user can control the data that is sent to the drop-box and can use the implementation

in either a batch or a stream processing manner. The data provided by the user must

be in (key, value) format. An Input Handler function running on one EC2 instance polls

the drop-box for new data. When a new record of (key, value) format is available in the

drop-box, the Input Handler appends a unique MapID, Timestamp, and Unique

Tag to the record and generates a record of a new form: (Metadata, Key, Value). The

addition of Timestamp is important for taking ’snapshots’ of the output (described later)

and for preserving the order of data processing in case of stragglers. The Input Handler

then pushes this generated record into SQS for processing by the Mapper.
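To make the record-wrapping step concrete, the following is a minimal illustrative sketch in Java. The DropBox and InputQueue interfaces and all names here are assumptions introduced for illustration; they are not part of CMR or of the Amazon SDK, which would provide the actual S3 and SQS calls.

Example: Input Handler (illustrative sketch)

import java.util.UUID;

// Hypothetical sketch: wrap each (key, value) record from the drop-box with
// (MapID, Timestamp, Unique Tag) metadata before pushing it to the input queue.
class InputHandler {
    interface DropBox { String[] nextRecord(); }        // returns {key, value}, or null if empty
    interface InputQueue { void push(String message); } // stands in for the SQS input queue

    private final DropBox dropBox;
    private final InputQueue inputQueue;
    private long nextMapId = 0;

    InputHandler(DropBox dropBox, InputQueue inputQueue) {
        this.dropBox = dropBox;
        this.inputQueue = inputQueue;
    }

    void poll() {
        String[] record;
        while ((record = dropBox.nextRecord()) != null) {
            // Metadata = MapID + Timestamp + Unique Tag, as described above.
            String metadata = (nextMapId++) + "," + System.currentTimeMillis()
                    + "," + UUID.randomUUID();
            inputQueue.push(metadata + "|" + record[0] + "|" + record[1]);
        }
    }
}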

3.1.2 Mapper Operation

Input Queue-Mapper interface is unchanged from the CMR implementation [2]. The

Mapper, whenever it is free, pops one message from the input SQS queue thereby re-

moving the message from the queue for a visibility timeout and processes it according to

the user-defined Map function. If the Mapper fails, the message re-appears on the Input

queue after the visibility timeout. If the Mapper is successful, it deletes the message from

the queue and writes its status to SimpleDB. The Mapper generates a set of intermediate (key,

value) pairs for each input split processed.
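The essential point of this cycle is that a message is deleted only after the Map function has succeeded, so a failed Mapper leaves the message to reappear after the visibility timeout. A minimal sketch of that logic follows, using the same style of hypothetical queue interfaces as the Input Handler sketch above (none of these names are actual CMR or SQS APIs):

Example: Mapper poll loop (illustrative sketch)

// Hypothetical sketch of the Mapper cycle: receive a message (which hides it
// for the visibility timeout), run the user-defined Map function, and delete
// the message only on success so that failures are retried automatically.
class MapperLoop {
    interface InputQueue { String receive(); void delete(String message); }
    interface MapFunction { void map(String record); }

    static void run(InputQueue queue, MapFunction userMap) {
        String message;
        while ((message = queue.receive()) != null) {
            userMap.map(message);       // produces intermediate (key, value) pairs
            queue.delete(message);      // only reached if map() did not fail
            // the Mapper's status would also be written to SimpleDB here
        }
    }
}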

3.1.3 Reduce Phase

The Mapper writes the intermediate records produced to ReduceQueues. Reduce-

Queues are intermediate staging queues implemented using SQS for holding the mapper

output. Reducers take their input from these queues. The Reduce phase as implemented

in CMR will have to be modified to process Stream Data. For this, we considered two

design options:


Figure 3.1: Architecture of Pipelined Map Reduce

3.2 The first design option:

As in CMR, the Mapper pushes each intermediate (key, value) pair to one of the

reduce queues based on hash value of the intermediate key. The hash function can be

user- defined or a default function provided, as in CMR. The number of reduce queues

must be at least equal to the number of reducers and preferably much larger to balance

load better. Unlike the CMR implementation, where a reducer can process any of the

reduce queues by dequeueing one message from the master reduce queue (which holds

pointers to all the ReduceQueues), in this case, the reducer will have to be statically

bound to one or more reduce queues, so that records with a particular intermediate

key are always processed by the same reducer. This is essential to aggregate values

associated with the same key in the Reduce phase. Also, each reducer will be statically

bound to an output split associated with that reducer, so that during online aggregation

of values associated with a particular reduce key, a particular key does not land up

in two different output splits, giving incorrect aggregation. Thus, each reducer will

be bound statically to: one or more reduce queues and one output split. This can

be achieved by maintaining a database containing (ReducerID, ReduceQueuePointers, OutputQueuePointers, Status, timeLastUpdated), or an equivalent data structure in a non-relational database such as SimpleDB. In the static implementation, ReduceQueuePointers hold the pointers to the Reduce queues associated with the particular Reducer identified by the ReducerID, and OutputQueuePointers hold pointers to the OutputQueues (or rather OutputFilePointers) associated with that ReducerID.
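The following is a minimal sketch of this routing and binding logic, assuming string keys and a fixed number of reduce queues. The class and field names are illustrative only; in the actual design the binding record would live in SimpleDB rather than in a Java object.

Example: key-to-queue routing and static Reducer binding (illustrative sketch)

import java.util.List;

class ReduceRouting {
    // Mapper side: pick a reduce queue for an intermediate key.
    // Math.floorMod keeps the index non-negative for any hash value.
    static int reduceQueueFor(String intermediateKey, int numReduceQueues) {
        return Math.floorMod(intermediateKey.hashCode(), numReduceQueues);
    }

    // Static binding record kept per Reducer (a SimpleDB row in the real design).
    static class ReducerBinding {
        String reducerId;
        List<Integer> reduceQueueIds;   // reduce queues this reducer polls
        int outputSplitId;              // output split this reducer writes to
        String status;                  // "Live", "Dead" or "Idle"
        long timeLastUpdated;           // last heartbeat, used for failure detection
    }
}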


3.2.1 Handling Reducer Failures

The Status field is used for handling Reducer failures. Status can be one of Live,

Dead, Idle. Each Reducer periodically updates the isAlive bit in its Status field to

indicate that it has not failed (similar to the heartbeat signal in traditional Hadoop). A

thread running in an EC2 instance monitors to see if any reducer has not updated its

status in the past TIMEOUT seconds. If it finds a Reducer which has not updated its

status in such time, it sets its Status to Dead, searches for a Reducer with Idle Status

and assigns the new Reducer the ReduceQueuePointers and OutputPointers previously

held by the old Reducer. When the visibility timeout expires, the messages processed

by the failed Reducer again become visible on the ReduceQueues. As the Reducer does

not update its Status to Done in the SimpleDB unless it has finished processing a given

window of the input and removed the corresponding messages from the ReduceQueues, it

is guaranteed that if the Reducer fails, the messages will reappear on the ReduceQueues,

as SQS requires that messages be specifically deleted, or else they re-appear after the

visibility timeout. Thus, Idle Reducers are used to handle other Reducer failures.

Also, if, after taking over the work of a failed reducer, the original queues of the Idle reducer start filling up, it will process those queues as well, since the queues associated with the failed reducer are added to the Idle reducer's original list of queues rather than replacing it. The issue of which output split to write to, when a record is read from a

reduce queue (because it may be read from the original queue associated with a reducer,

or a new queue assigned because of some other Reducer’s failure) can be determined by

holding another database table that associates a particular reduce queue to a particular

output split. This is essential for correct aggregation. The Reducer then writes the output

generated to the correct output split, checking for duplicates and if the key already

exists, aggregating the new output with the output already available. This is called

online aggregation. Thus, the following sequence of steps occurs for each intermediate

(key, value) pair generated by the mapper (ignoring here the issues involved in handling

reducer failures that have been discussed above):

1. The Mapper pushes intermediate (key, value) pairs to one of the reduce queues based on the hash value of the key.

2. Each reducer is statically bound to one or more ReduceQueues, and polls only those queues for work.

3. It then writes the output to the output split assigned to it.
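A minimal sketch of the failure monitor described above follows, reusing the illustrative ReducerBinding record from the earlier sketch; the BindingStore interface stands in for SimpleDB and the timeout value is an arbitrary assumption.

Example: Reducer failure monitor (illustrative sketch)

import java.util.List;

class ReducerMonitor {
    interface BindingStore {
        List<ReduceRouting.ReducerBinding> allBindings();
        void update(ReduceRouting.ReducerBinding binding);
    }

    static final long TIMEOUT_MS = 60_000;   // assumed heartbeat timeout

    // One monitoring pass: mark stale reducers Dead and hand their reduce
    // queues to an Idle reducer, without replacing the Idle reducer's own queues.
    static void checkOnce(BindingStore store) {
        long now = System.currentTimeMillis();
        for (ReduceRouting.ReducerBinding binding : store.allBindings()) {
            if ("Live".equals(binding.status) && now - binding.timeLastUpdated > TIMEOUT_MS) {
                binding.status = "Dead";
                store.update(binding);
                for (ReduceRouting.ReducerBinding idle : store.allBindings()) {
                    if ("Idle".equals(idle.status)) {
                        idle.reduceQueueIds.addAll(binding.reduceQueueIds);
                        idle.status = "Live";
                        store.update(idle);
                        break;
                    }
                }
                // The reduce-queue-to-output-split table mentioned above tells the
                // new reducer which output split each inherited queue belongs to.
            }
        }
    }
}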


3.3 The second design option:

Alternatively, we can have a single queue between the Mappers and the Reducers,

with all the intermediate (key, value) pairs generated by all the Mappers pushed to this

intermediate queue. Reducers poll this IntermediateQueue for records. Aggregation is

carried out as follows: There are a fixed number of Output splits. Whenever a Reducer

reads a (key, value) record from the IntermediateQueue, it applies a user-defined Reduce

function to the record, to produce an output (key, value) pair. It then selects an output

split by hashing on the output key produced. If there already exists a record for that key

in the output split, the new value is merged with the existing value. Otherwise a new

record is created with (key, value) as the output record. An issue with this approach is that, when aggregating records directly in S3, the latency of S3 must be taken into account. If the delays are too long, the first design option would be better.
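A minimal sketch of the output-side aggregation in this second option, assuming word-count style numeric values; the in-memory map is only a stand-in for an S3 output split:

Example: online aggregation into an output split (illustrative sketch)

import java.util.HashMap;
import java.util.Map;

class OutputSplitAggregator {
    // Pick an output split by hashing the output key (the number of splits is fixed).
    static int outputSplitFor(String outputKey, int numSplits) {
        return Math.floorMod(outputKey.hashCode(), numSplits);
    }

    private final Map<String, Long> split = new HashMap<>();

    // Merge the new value with any value already stored for this key,
    // otherwise create a new record, as described above.
    void aggregate(String key, long value) {
        split.merge(key, value, Long::sum);
    }
}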

3.4 Hybrid Approach

Both the above approaches could be combined as follows: Have multiple Reduce-

Queues, each linked statically to a particular Reducer, but instead of linking the output

splits to the Reducer statically, use hashing on the output key of the (key, value) pair

generated by the Reducer to select an output split. This will require fewer changes to the

existing CMR architecture, but will involve static linking of Reducers to ReduceQueues.

This is the approach that we prefer.

3.5 Pipelined Architecture

Figure 3.2 depicts the dataflow of two MapReduce implementations. The dataflow on

the left corresponds to the output materialization approach used by Hadoop; the dataflow

on the right allows pipelining; we call it Pipelined-MapReduce. In the remainder

of this section, we present our design and implementation for the Pipelined-MapReduce

dataflow.

In general, reduce tasks traditionally issue HTTP requests to pull the map output from

each Task-Tracker. This means that map task execution is completely decoupled from

reduce task execution. To support pipelining, we modified the map task to instead push

data to reducers as it is produced. To give an intuition for how this works, we begin by


Figure 3.2: Hadoop Data flow for Batch And Pipelined Map Reduce Data Flow

describing a straightforward pipelined design, and then discuss the changes we had to

make to achieve good performance.

In our Pipelined-MapReduce implementation, we modified Hadoop to send data directly

from map to reduce tasks. When a client submits a new job to Hadoop, the Job- Tracker

assigns the map and reduce tasks associated with the job to the available Task-Tracker

slots. For purposes of discussion, we assume that there are enough free slots to assign all

the tasks for each job. We modified Hadoop so that each reduce task contacts every map

task upon initiation of the job, and opens a TCP socket which will be used to pipeline

the output of the map function. As each map output record is produced, the mapper

determines which partition (reduce task) the record should be sent to, and immediately

sends it via the appropriate socket.

A reduce task accepts the pipelined data it receives from each map task and stores it

in an in-memory buffer, spilling sorted runs of the buffer to disk as needed. Once the

reduce task learns that every map task has completed, it performs a final merge of all

the sorted runs and applies the user-defined reduce function as normal, writing the output to HDFS.
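A minimal sketch of the mapper-side push described above, assuming each reduce task's socket has already been opened at job initiation; the class and method names are illustrative and are not Hadoop APIs:

Example: pipelined map output push (illustrative sketch)

import java.io.DataOutputStream;
import java.io.IOException;

class PipelinedMapOutput {
    private final DataOutputStream[] reducerSockets;   // one open stream per reduce task

    PipelinedMapOutput(DataOutputStream[] reducerSockets) {
        this.reducerSockets = reducerSockets;
    }

    // Called for each map output record: pick the partition (reduce task) by
    // hashing the key, then send the record immediately over that socket
    // instead of materializing it to local disk.
    void emit(String key, String value) throws IOException {
        int partition = Math.floorMod(key.hashCode(), reducerSockets.length);
        DataOutputStream out = reducerSockets[partition];
        out.writeUTF(key);
        out.writeUTF(value);
        out.flush();
    }
}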


The data we used for the experiment is enwiki-20100904-pages-articles.xml [6]. enwiki-20100904-pages-articles.xml contains the full content of Wikipedia and is an XML dataset of around 6 GB. We implemented the Naive Bayes algorithm [7] in the Hadoop environment [8] to classify the XML dataset. First, the dataset is divided into chunks, and the chunks are written into HDFS. A Naive Bayes classifier consists of two processes: building a model that relates specific documents to their characteristics and categories, and then using this model to predict the categories of new documents whose categories are not yet known. The first step is called training: it builds a model from already-classified sample documents, recording the probability of each word occurring in the content of each category. The second step is called classification: it applies the trained model to the content of new documents, using Bayes' theorem to predict their categories. We set up training and test datasets, trained the model on the training data, and then used the model to classify the test documents. Thus, to apply the Naive Bayes classifier to the XML dataset, we need to train the model and then use it to classify new documents.

In order to evaluate the performance of the MapReduce implementations [9], we first measured how task execution time grows with the amount of data in each environment. Fig. 2 depicts our results: Hadoop and Pipelined-MapReduce show almost similar performance, possibly because both are limited by I/O bandwidth, so the overhead contributed by the MapReduce implementation itself is small compared to the overall completion time.


Chapter 4

Hadoop and HDFS

4.1 Hadoop System Architecture:

Each cluster has one master node with multiple slave nodes. The master node runs

Name Node and Job Tracker functions and coordinates with the slave nodes to get

the job done. The slaves run Task Tracker, HDFS to store data, and map and reduce

functions for data computation. The basic stack includes Hive and Pig for language

and compilers, HBase for NoSQL database management, and Scribe and Flume for log

collection. ZooKeeper provides centralized coordination for the stack.

Figure 4.1: Hadoop System Architecture


4.1.1 Operating a Server Cluster:

A client submits a job to the master node, which orchestrates with the slaves in the

cluster. Job Tracker controls the Map Reduce job, with Task Trackers reporting to it. In the

event of a failure, Job Tracker reschedules the task on the same or a different slave node,

whichever is most efficient. HDFS is location-aware or rack-aware and manages data

within the cluster, replicating the data on various nodes for data reliability. If one of

the data replicas on HDFS is corrupted, Job Tracker, aware of where other replicas are

located, can reschedule the task right where it resides, decreasing the need to move data

back from one node to another. This saves network bandwidth and keeps performance

and availability high. Once the job is mapped, the output is sorted and divided into

several groups, which are distributed to reducers. Reducers may be located on the same

node as the mappers or on another node.

Figure 4.2: Operation Of Hadoop Cluster

4.2 Data Storage and Analysis

The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed

is around 100 MB/s, so it takes more than two and a half hours to read all the data off

the disk.

This is a long time to read all data on a single drive and writing is even slower. The

obvious way to reduce the time is to read from multiple disks at once. Imagine if we

had 100 drives, each holding one hundredth of the data. Working in parallel, we could

read the data in less than two minutes. Only using one hundredth of a disk may seem

wasteful. But we can store one hundred datasets, each of which is one terabyte, and

provide shared access to them. We can imagine that the users of such a system would be

happy to share access in return for shorter analysis times, and, statistically, that their

analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much. There's more to being able to read and write data in parallel to or from

multiple disks, though. The first problem to solve is hardware failure: as soon as you

start using many pieces of hardware, the chance that one will fail is fairly high.
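As a rough check using the figures above: one terabyte read at 100 MB/s takes about 10,000 seconds, i.e. close to three hours for a single drive, whereas 100 drives each reading their 10 GB share in parallel need only about 100 seconds.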

A common way of avoiding data loss is through replication: redundant copies of

the data are kept by the system so that in the event of failure, there is another copy

available. This is how RAID works, for instance, although Hadoop's file system, the Hadoop Distributed File System (HDFS), takes a slightly different approach, as you shall

see later. The second problem is that most analysis tasks need to be able to combine

the data in some way; data read from one disk may need to be combined with the data

from any of the other 99 disks. Various distributed systems allow data to be combined

from multiple sources, but doing this correctly is notoriously challenging. Map Reduce

provides a programming model that abstracts the problem from disk reads and writes,

transforming it into a computation over sets of keys and values.

This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis

system. The storage is provided by HDFS and analysis by Map Reduce. There are other

parts to Hadoop, but these capabilities are its kernel.

4.3 Comparison with Other Systems:

The approach taken by Map Reduce may seem like a brute-force approach. The

premise is that the entire dataset or at least a good portion of it is processed for each

query. But this is its power. Map Reduce is a batch query processor, and the ability to

run an ad hoc query against your whole dataset and get the results in a reasonable time

is transformative. It changes the way you think about data, and unlocks data that was


previously archived on tape or disk. It gives people the opportunity to innovate with

data. Questions that took too long to get answered before can now be answered, which

in turn leads to new questions and new insights. For example, Mailtrust, Rackspace's mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words: this data was so useful that we've scheduled the Map Reduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow. By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they

otherwise would never have had, and, furthermore, they were able to use what they had

learned to improve the service for their customers.

4.3.1 RDBMS:

Why can't we use databases with lots of disks to do large-scale batch analysis? Why is Map Reduce needed? The answer to these questions comes from another trend in disk

drives: seek time is improving more slowly than transfer rate. Seeking is the process

of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth. If the data access pattern is dominated by seeks, it will take longer

to read or write large portions of the dataset than streaming through it, which operates

at the transfer rate. On the other hand, for updating a small proportion of records in

a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a

database, a B-Tree is less efficient than Map Reduce, which uses Sort/Merge to rebuild

the database. In many ways, Map Reduce can be seen as a complement to an RDBMS. Map Reduce is a good fit for problems that need to analyze the whole dataset, in a

batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or

updates, where the dataset has been indexed to deliver low-latency retrieval and update

times of a relatively small amount of data.

Map Reduce suits applications where the data is written once, and read many times,

whereas a relational database is good for datasets that are continually updated. Another

difference between Map Reduce and an RDBMS is the amount of structure in the datasets

that they operate on. Structured data is data that is organized into entities that have a

defined format, such as XML documents or database tables that conform to a particular


predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other

hand, is looser, and though there may be a schema, it is often ignored, so it may be used

only as a guide to the structure of the data: for example, a spreadsheet, in which the

structure is the grid of cells, although the cells themselves may hold any form of data.

Unstructured data does not have any particular internal structure: for example, plain

text or image data. Map Reduce works well on unstructured or semi structured data,

since it is designed to interpret the data at processing time. In other words, the input

keys and values for Map Reduce are not an intrinsic property of the data, but they are

chosen by the person analyzing the data.

Relational data is often normalized to retain its integrity and remove redundancy.

Normalization poses problems for Map Reduce, since it makes reading a record a nonlocal

operation, and one of the central assumptions that Map Reduce makes is that it is possible

to perform (high-speed) streaming reads and writes. A web server log is a good example

of a set of records that is not normalized (for example, the client hostnames are specified

in full each time, even though the same client may appear many times), and this is one

reason that log files of all kinds are particularly well-suited to analysis with Map Reduce.

Map Reduce is a linearly scalable programming model. The programmer writes two

functions a map function and a reduce function each of which defines a mapping from one

set of key-value pairs to another. These functions are oblivious to the size of the data or

the cluster that they are operating on, so they can be used unchanged for a small dataset

and for a massive one. More important, if you double the size of the input data, a job

will run twice as slow. But if you also double the size of the cluster, a job will run as fast

as the original one. This is not generally true of SQL queries. Over time, however, the

differences between relational databases and Map Reduce systems are likely to blur both

as relational databases start incorporating some of the ideas from Map Reduce (such as

Aster Data's and Greenplum's databases) and, from the other direction, as higher-level

query languages built on Map Reduce (such as Pig and Hive) make Map Reduce systems

more approachable to traditional database programmers.

4.4 Hadoop Distributed File System (HDFS):

When a dataset outgrows the storage capacity of a single physical machine, it becomes

necessary to partition it across a number of separate machines. File systems that manage

the storage across a network of machines are called distributed file systems. Since they


are network-based, all the complications of network programming kick in, thus making

distributed file systems more complex than regular disk file systems.

For example, one of the biggest challenges is making the file system tolerate node

failure without suffering data loss. Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System. HDFS is Hadoop's flagship file system and is the focus of this chapter, but Hadoop actually has a general-purpose file system abstraction, so we'll see along the way how Hadoop integrates with other storage systems (such as the local file system).

4.4.1 HDFS Federation:

The name node keeps a reference to every file and block in the file system in memory,

which means that on very large clusters with many files, memory becomes the limiting

factor for scaling. HDFS Federation addresses this by adding name nodes, each of which manages a portion of the file system

namespace. For example, one name node might manage all the files rooted under /user,

say, and a second name node might handle files under /share.

Under federation, each name node manages a namespace volume, which is made up

of the metadata for the namespace, and a block pool containing all the blocks for the files

in the namespace. Namespace volumes are independent of each other, which means name

nodes do not communicate with one another, and furthermore the failure of one name

node does not affect the availability of the namespaces managed by other name nodes.

Block pool storage is not partitioned, however, so data nodes register with each name

node in the cluster and store blocks from multiple block pools. To access a federated

HDFS cluster, clients use client-side mount tables to map file paths to name nodes. This

is managed in configuration using the ViewFileSystem and viewfs:// URIs.

4.4.2 HDFS High-Availability

The combination of replicating name node metadata on multiple file systems, and

using the secondary name node to create checkpoints protects against data loss, but does

not provide high-availability of the file system. The name node is still a single point

of failure (SPOF), since if it did fail, all clients including Map Reduce jobs would be

unable to retrieve the file-to-block mapping. In such an event, the whole Hadoop system would

effectively be out of service until a new name node could be brought online. To recover

from a failed name node in this situation, an administrator starts a new primary name


node with one of the file system metadata replicas, and configures data nodes and clients

to use this new name node. The new name node is not able to serve requests until it

has i) loaded its namespace image into memory, ii) replayed its edit log, and iii) received

enough block reports from the data nodes to leave safe mode. On large clusters with

many files and blocks, the time it takes for a name node to start from cold can be 30

minutes or more. The long recovery time is a problem for routine maintenance too. In

fact, since unexpected failure of the name node is so rare, the case for planned downtime

is actually more important in practice. The 0.23 release series of Hadoop remedies this

situation by adding support for HDFS high-availability (HA).

In this implementation there is a pair of name nodes in an active standby config-

uration. In the event of the failure of the active name node, the standby takes over

its duties to continue servicing client requests without a significant interruption. A few

architectural changes are needed to allow this to happen:

The name nodes must use highly-available shared storage to share the edit log. (In the

initial implementation of HA this will require an NFS filer, but in future releases more

options will be provided, such as a BookKeeper-based system built on ZooKeeper.)

When a standby name node comes up it reads up to the end of the shared edit log to

synchronize its state with the active name node, and then continues to read new entries

as they are written by the active name node.

Data nodes must send block reports to both name nodes since the block mappings are

stored in a name node's memory, and not on disk. Clients must be configured to handle

name node failover, which uses a mechanism that is transparent to users. If the active

name node fails, then the standby can take over very quickly (in a few tens of seconds)

since it has the latest state available in memory: both the latest edit log entries, and an

up-to-date block mapping.

The actual observed failover time will be longer in practice (around a minute or so),

since the system needs to be conservative in deciding that the active name node has

failed. In the unlikely event of the standby being down when the active fails, the admin-

istrator can still start the standby from cold. This is no worse than the non-HA case, and

from an operational point of view it's an improvement, since the process is a standard

operational procedure built into Hadoop.


4.5 Algorithm:

4.5.1 Algorithm For Word Count:

Example: Wordcount Mapper

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Mapper: emits (word, 1) for every token in the input line.
public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}

Example: Wordcount Reducer

// Reducer: sums the counts collected for each word.
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
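For completeness, a minimal driver for the two classes above could look as follows. It uses the same old org.apache.hadoop.mapred API as the example; the enclosing class name WordCount and the use of the reducer as a combiner are assumptions for illustration, not part of the report's design.

Example: Wordcount job driver (illustrative sketch)

// Assumes MapClass and Reduce are nested inside a class named WordCount.
// Also requires org.apache.hadoop.fs.Path in addition to the imports above.
public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(Reduce.class);   // local pre-aggregation, as in Section 2.1
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}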


Chapter 5

ADVANTAGES, FEATURES AND

APPLICATIONS

5.1 Advantages

The design has the following advantages as compared to the existing implementations surveyed above:

1. Either design allows Reducers to start processing as soon as data is made available by the Mappers. This allows parallelism between the Map and Reduce phases.

2. A downstream processing element can start processing as soon as some data is available from an upstream element.

3. The network is better utilized, as data is continuously pushed from one phase to the next.

4. The final output is computed incrementally.

5. Introduction of a pipeline between the Reduce phase of one job and the Map phase of the next job will support Cascaded MapReduce jobs and reduce the delays experienced between the jobs. A module that pushes data produced by one job into the InputQueue of the next job will pipeline and parallelise the operation of the two jobs. A typical application is database queries where, for example, the output of a join operation may be required as the input to a grouping operation.

5.2 Features

5.2.1 Time windows

The user can define a time-window for processing where the user can specify the range

of time values over which he wishes to calculate the output for the job. For example, the user

could specify that he wishes to run some query X on the click-stream data from 24 hrs


ago, till present time. This will use the timestamp metadata added to the input record

and process only those records that lie within the timestamp range.
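A minimal sketch of this filtering for the 24-hour example above, in Java; the class and method names are illustrative only:

Example: time-window filter (illustrative sketch)

// Keep only records whose Timestamp metadata falls inside the requested window.
class TimeWindow {
    final long windowStart;
    final long windowEnd;

    // e.g. new TimeWindow(24) selects the click-stream data from 24 hours ago until now.
    TimeWindow(long hoursBack) {
        this.windowEnd = System.currentTimeMillis();
        this.windowStart = windowEnd - hoursBack * 60L * 60 * 1000;
    }

    boolean contains(long recordTimestamp) {
        return recordTimestamp >= windowStart && recordTimestamp <= windowEnd;
    }
}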

5.2.2 Snapshots

Incremental processing of data can allow users to take ’snapshots’ of partial output

generated, based on the job running time and/or the percentage of job completed. For

example, in the word count example, the user could request a snapshot of the data processed till time t = 15 minutes from the start of the job. A snapshot will essentially be a snapshot of the

output folder of S3 where the output is being collected. Snapshots allow users to get an

approximate idea of the output without having to wait for all processing to be done till

the final output is available.

5.2.3 Cascaded MapReduce Jobs

Cascaded MapReduce jobs are those where the output of one MapReduce job is

to be used by the next MapReduce job. Pipelining between jobs further increases the throughput while decreasing the delays between jobs, which benefits popular and typical stream processing applications. With

accurate delay guarantees, this design can also be used to process real-time data.


Conclusion

As described above, the design fulfills a real need of processing streaming data using

MapReduce. It is also inherently scalable as it is cloud-based. This also gives it a light-weight nature, as the handling of distributed resources is done by the Cloud OS. The gains due to parallelism are expected to be similar to, or better than, those obtained in [3]. Currently, the authors are working on implementing the said design. Future work will include supporting rolling windows for obtaining outputs of arbitrary time-intervals of the input stream. Further work can be done in maintaining intermediate output information for supporting rolling windows. Also, reducing the delays to levels acceptable for real-time data is an important use case that needs to be supported. Future scope also includes

designing a generic system that is portable across several cloud operating systems.


Bibliography

[1] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, in OSDI, 2004.

[2] Huan Liu, Dan Orban: Cloud MapReduce: a MapReduce Implementation on top

of a Cloud Operating System in , Accenture Technology Labs, Cluster, Cloud and

Grid Computing (CCGrid), 2011 11th IEEE/ACM International Symposium.

[3] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy,

Russell Sears: MapReduce Online in 7th USENIX conference on Networked systems

design and implementation 2010

[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia: Above the Clouds: A Berkeley View of Cloud Computing, Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, Feb 2009.

[5] Daniel Warneke and Odej Kao: Exploiting Dynamic Resource Allocation for Effi-

cient Parallel Data Processing in the Cloud in IEEE Transactions on parallel and

distributed systems, 2011

[6] Shrideep Pallickara, Jaliya Ekanayake and Geoffrey Fox: Granules- A Lightweight,

Streaming Runtime for Cloud Computing With Support for MapReduce in Cluster

Computing and Workshops, CLUSTER’09, IEEE International Conference.

[7] Floreen P., Przybilski M., Nurmi P., Koolwaaij J., Tarlano A., Wagner M., Luther M., Bataille F., Boussard M., Mrohs B., et al. (2005): Towards a Context Management Framework for MobiLife, 14th IST Mobile Wireless Summit.

[8] Nathan Backman, Karthik Pattabiraman, Ugur Cetintemel: C-MR: A Continuous

MapReduce Processing Model for Low-Latency Stream Processing on Multi-Core

Architectures in Department of Computer Science, Brown University.


[9] Zikai Wang: A Distributed Implementation of Continuous- MapReduce Stream Pro-

cessing Framework in Department of Computer Science, Brown University.

[10] Huan Liu: Cutting MapReduce Cost with Spot Market In Proc. Of Usenix HotCloud

(2011).

[11] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica: Im-

proving MapReduce Performance in Heterogeneous Environments In proceedings of

8th Usenix Symposium on Operating Systems Design and Implementation.

[12] Michael Stonebraker, Ugur Cetintemel, Stan Zdonik: The 8 Requirements of Real-

Time Stream Processing in ACM SIGMOD Record Volume 34 Issue 4, 2005

[13] Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, G.

Twister: A runtime for iterative MapReduce In The First International Workshop

on MapReduce and its Applications (2010).

[14] Hadoop. http://hadoop.apache.org/

[15] Amazon Elastic MapReduce aws.amazon.com/elasticmapreduce/

[16] Gunho Leey, Byung-Gon Chunz, Randy H. Katzy: Heterogeneity- Aware Resource

Allocation and Scheduling in the Cloud in ,Usenix HotCloud 2011

[17] Amazon ec2. http://aws.amazon.com/ec2

[18] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S.

Shenker, and I. Stoica. Mesos: A platform for finegrained resource sharing in the data

center In USENIX symposium on Networked Systems Design and Implementation,

2011.

[19] J. Polo, D. Carrera, Y. Becerra, V. Beltran, and J. T. andEduard Ayguad. Perfor-

mance management of accelerated mapreduce workloads in heterogeneous clusters

In 39th International Conference on Parallel Processing (ICPP2010), 2010.

[20] Pramod Bhatotia, Alexander Wieder, Istemi Ekin Akkus, Rodrigo Rodrigues, Umut A. Acar: Large-scale Incremental Data Processing with Change Propagation, Max Planck Institute for Software Systems (MPI-SWS).


[21] Logothetis, D., Olston, C., Reed, B., Webb, K. C., and Yocum, K.: Stateful Bulk Processing for Incremental Analytics, in Proc. 1st Symp. on Cloud Computing (SoCC'10).

[22] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom: Models and Issues in Data Stream Systems, Department of Computer Science, Stanford University.