Map-Reduce examples 1

Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop.

Jan 19, 2016

Piers Ramsey

Map-Reduce examples

So, what is it?

• A two-phase process geared toward optimizing broad, widely distributed parallel computing platforms
• Apache Hadoop is an open-source framework that pairs a MapReduce engine with a distributed file system (HDFS).
• MapReduce is Google's version (and it is proprietary).
• Phases
• 1. Take a series of keys and transform them into a different series of values, generally ones that have some semantic context
• 2. Perform a second pass where the new series of values is compressed into far fewer values

In its strictest sense…

• Map-reduce is a two-phase operation
• First, convert a list of data into a list of a different kind of data
• Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step
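
The two phases in this strict sense can be sketched with Python's built-in `map` and `functools.reduce`. The word list below is made-up illustration data, not something from the slides.

```python
from functools import reduce

# Hypothetical input: a list of one kind of data (strings).
words = ["map", "reduce", "hadoop", "map"]

# Phase 1 (map): convert the list into a list of a different kind of data,
# here the length of each word.
lengths = list(map(len, words))

# Phase 2 (reduce): collapse the second list into a single scalar.
# Counting the items gives the cardinality of the mapped list.
cardinality = reduce(lambda count, _item: count + 1, lengths, 0)

# A different reduce over the same list: the sum of the mapped values.
total_length = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, cardinality, total_length)
```

The map step never looks at more than one item at a time, which is what makes it easy to spread across machines.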

Relevant computing/data application types

• Suited to aggregate database processing, less so to set-oriented querying, and certainly not to object-based querying

• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing

• Fits query patterns that calculate the cardinality of sets and remove duplicates

Strategy for M-R

• We try to do the computing on the machines where the data sits

• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations

The key bottom line concept

• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated

• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits

• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.
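
A minimal sketch of that idea, assuming a hypothetical three-server cluster held in a plain dictionary: each server summarizes its own rows, and only the small summaries travel to a coordinator. The server names and row values are invented for illustration.

```python
# Hypothetical cluster: each "server" holds its own partition of order amounts.
partitions = {
    "server_a": [10, 20, 30],
    "server_b": [5, 15],
    "server_c": [40],
}


def local_aggregate(rows):
    """Runs where the data sits; returns a small (sum, count) summary."""
    return (sum(rows), len(rows))


def merge(summaries):
    """Runs on the coordinator; combines the per-server summaries."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total, count


# Only one (sum, count) pair per server crosses the network,
# no matter how many rows each server holds.
summaries = [local_aggregate(rows) for rows in partitions.values()]
total, count = merge(summaries)
average = total / count

print(summaries, total, count, average)
```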

Another way of looking at this…

• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed data

• Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if it causes extra processing to have to happen

• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data

Example

• 1. We start with a set of person keys and map each of these to the names of the people.
• Key 1 -> Harry
• Key 2 -> Harry
• Key 3 -> Tommy

• 2. We aggregate the list of people names by counting how many unique names are in the list.
• Harry, Harry, Tommy -> 2
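
The example above can be written out directly. The keys and names come from the slide; the function names `map_phase` and `reduce_phase` are just labels chosen for this sketch.

```python
# Person keys and names taken from the example.
people = {1: "Harry", 2: "Harry", 3: "Tommy"}


def map_phase(keys, lookup):
    """Step 1: map each person key to that person's name."""
    return [lookup[k] for k in keys]


def reduce_phase(names):
    """Step 2: aggregate the names by counting the unique ones."""
    return len(set(names))


names = map_phase([1, 2, 3], people)
unique_count = reduce_phase(names)

print(names, unique_count)
```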

What actually happens?

• Informally:
• Each key leads to a name field.
• Then, the names are isolated.
• Then, each is passed to a “mapper”, which returns the name, along with a 1.
• Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
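
Here is one way to sketch those steps. The grouping of 1's is written as a separate `shuffle` helper, which is a structuring choice made for this sketch; the informal description above folds that step into the reducer.

```python
from collections import defaultdict


def mapper(name):
    """Return the name along with a 1."""
    return (name, 1)


def shuffle(pairs):
    """Group the 1's by name, giving each name its list of 1's."""
    groups = defaultdict(list)
    for name, one in pairs:
        groups[name].append(one)
    return groups


def reducer(name, ones):
    """Add up the 1's for one name."""
    return (name, sum(ones))


names = ["Harry", "Harry", "Tommy"]
pairs = [mapper(n) for n in names]
groups = shuffle(pairs)
counts = [reducer(name, ones) for name, ones in groups.items()]

print(counts)
```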

From Wikipedia

Imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age.

SELECT age AS Y, AVG(contacts) AS A
FROM social.person
GROUP BY age
ORDER BY age

function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record <Y,N>
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record <Y,N> do
        Accumulate in S the sum of N
        Accumulate in C the count of records so far
    repeat
    let A be S/C
    produce one output record <Y,A>
end function
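
A small runnable rendering of the Map and Reduce pseudocode, assuming two tiny in-memory batches stand in for the 1100 batches of social.person records. The sample ages and contact counts are invented.

```python
from collections import defaultdict

# Hypothetical stand-in for social.person: batches of (age, contacts) records.
batches = [
    [(25, 10), (30, 4), (25, 20)],
    [(30, 8), (41, 3)],
]


def map_batch(batch):
    """Emit one (Y, N) pair per person record in the batch."""
    for age, contacts in batch:
        yield (age, contacts)


def reduce_age(age, contact_counts):
    """Average the N values collected for one age Y."""
    s = sum(contact_counts)   # S: the sum of N
    c = len(contact_counts)   # C: the count of records
    return (age, s / c)


# Group the mapper output by age, as the framework would between phases.
grouped = defaultdict(list)
for batch in batches:
    for age, n in map_batch(batch):
        grouped[age].append(n)

# Sorting the keys plays the role of ORDER BY age.
averages = [reduce_age(age, ns) for age, ns in sorted(grouped.items())]

print(averages)
```

Because this reducer emits a plain average, it cannot be reused on partial results: averaging per-batch averages gives the wrong answer unless the counts travel along with the sums.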

From NoSQL Distilled:

1. Creating a list with a map

2. Aggregating with a reduce

3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers
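
The figure for this step is not part of the transcript, so the following is only a sketch of the idea in the caption. The product data, the reducer count, and the toy `partition` function are all assumptions made for illustration; real systems typically hash the key.

```python
from collections import defaultdict

# Hypothetical mapper output: (product, quantity) pairs.
mapped = [("tea", 2), ("coffee", 1), ("tea", 3), ("cocoa", 5), ("coffee", 4)]
NUM_REDUCERS = 2


def partition(key, num_reducers):
    """Toy deterministic partitioner: byte sum of the key, modulo reducers."""
    return sum(key.encode()) % num_reducers


def reduce_partition(pairs):
    """Total the quantities per product within one partition."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Route every pair to the partition that owns its key.
partitions = defaultdict(list)
for key, value in mapped:
    partitions[partition(key, NUM_REDUCERS)].append((key, value))

# Each partition could be reduced on a different machine in parallel.
partial_results = [reduce_partition(partitions[i]) for i in range(NUM_REDUCERS)]

# Merge phase: a key never spans partitions, so a plain union is enough.
merged = {}
for result in partial_results:
    merged.update(result)

print(partial_results, merged)
```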

4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format
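
A minimal sketch of a combiner, assuming hypothetical (product, quantity) pairs on two nodes. Because `combine` takes and returns the same pair format, the same function can run once per node and again as the final reduce.

```python
from collections import defaultdict


def combine(pairs):
    """Sum quantities per product; input and output are (product, qty) pairs."""
    totals = defaultdict(int)
    for product, qty in pairs:
        totals[product] += qty
    return list(totals.items())


# Hypothetical mapper output on two nodes.
node_1 = [("tea", 2), ("tea", 3), ("coffee", 1)]
node_2 = [("tea", 4), ("coffee", 4), ("coffee", 2)]

# The combiner runs on each node before anything crosses the network.
shipped = combine(node_1) + combine(node_2)

# The final reduce reuses the same function because the formats match.
final = dict(combine(shipped))

print(len(node_1) + len(node_2), len(shipped), final)
```

Here six raw pairs shrink to four shipped pairs; the saving grows with the number of repeated keys per node.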

5. A combiner that removes duplicate product-customer pairs

6. Concatenating a combining and a reduce (counting) operation
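
Steps 5 and 6 can be sketched together: a combiner drops duplicate product-customer pairs on each node, and a counting reduce then tallies distinct customers per product. The customer names and node contents are invented, and the function names are labels for this sketch only.

```python
# Hypothetical mapper output: (product, customer) pairs, with repeats when a
# customer ordered the same product more than once.
node_1 = [("tea", "ann"), ("tea", "ann"), ("tea", "bob")]
node_2 = [("tea", "ann"), ("coffee", "cy"), ("coffee", "cy")]


def combine_dedupe(pairs):
    """Drop duplicate pairs; output is still a list of (product, customer)."""
    return sorted(set(pairs))


def reduce_count_customers(pairs):
    """Count distinct customers per product across all nodes."""
    customers = {}
    for product, customer in set(pairs):
        customers.setdefault(product, set()).add(customer)
    return {product: len(names) for product, names in customers.items()}


shipped = combine_dedupe(node_1) + combine_dedupe(node_2)
customer_counts = reduce_count_customers(shipped)

print(shipped, customer_counts)
```

The reducer still has to deduplicate, since the same pair can arrive from two different nodes, as ("tea", "ann") does here.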

7. Maintaining the 1’s counts in the mapping phase

8. Adding temporal information to the map/reduce process

9. Using a reduce operator to create product per month totals
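
Steps 8 and 9 might look like the following, assuming order lines arrive as hypothetical (product, ISO date, quantity) tuples. The mapper adds the temporal part to the key, and the reducer totals quantities per product per month.

```python
from collections import defaultdict

# Hypothetical order lines: (product, ISO date, quantity).
order_lines = [
    ("tea", "2011-12-03", 2),
    ("tea", "2011-12-20", 3),
    ("tea", "2010-12-11", 4),
    ("coffee", "2011-12-05", 1),
]


def map_line(product, date, qty):
    """Key each quantity by (product, year, month)."""
    year, month, _day = date.split("-")
    return ((product, int(year), int(month)), qty)


def reduce_totals(pairs):
    """Sum the quantities that share a (product, year, month) key."""
    totals = defaultdict(int)
    for key, qty in pairs:
        totals[key] += qty
    return dict(totals)


monthly_totals = reduce_totals(map_line(*line) for line in order_lines)

print(monthly_totals)
```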

10. A second mapper that creates base records for year-by-year comparisons

11. A reduce operation combines records for a given year
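
The figures for steps 10 and 11 are missing from the transcript, so this is one plausible reading of the captions rather than the book's exact method. It assumes an earlier stage has already produced per-month totals; the second mapper re-keys them by (product, month), and the reduce merges the records for different years so they can be compared. The `change` helper is a hypothetical addition.

```python
from collections import defaultdict

# Hypothetical output of an earlier stage: (product, year, month) -> quantity.
monthly_totals = {
    ("tea", 2010, 12): 4,
    ("tea", 2011, 12): 5,
    ("coffee", 2011, 12): 1,
}


def map_to_comparison(key, qty):
    """Second mapper: re-key by (product, month), keep the year in the value."""
    product, year, month = key
    return ((product, month), {year: qty})


def reduce_years(records):
    """Merge the per-year records that share a (product, month) key."""
    merged = defaultdict(dict)
    for key, by_year in records:
        merged[key].update(by_year)
    return dict(merged)


def change(by_year, base_year, next_year):
    """Return next minus base, or None when either year is missing."""
    if base_year in by_year and next_year in by_year:
        return by_year[next_year] - by_year[base_year]
    return None


by_product_month = reduce_years(
    map_to_comparison(k, v) for k, v in monthly_totals.items()
)

print(by_product_month)
print(change(by_product_month[("tea", 12)], 2010, 2011))
```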

Complaints

• M-R is low level.
• It is rigid.
• It exists to optimize the distributed cluster model – only.
• It demands that an application fit perfectly into the paradigm.
• It takes careful planning and knowledge of exactly how the data will be used to structure the database to optimally serve a series of map/reduce operations
• It thus does not accommodate on-the-fly browsing