Map-Reduce examples 1. So, what is it? A two phase process geared toward optimizing broad, widely distributed parallel computing platforms Apache Hadoop.

1

Map-Reduce examples

2

So, what is it?

• A two-phase process geared toward optimizing broad, widely distributed parallel computing platforms

• Apache Hadoop is an open-source framework that pairs a MapReduce implementation with a distributed file system (HDFS).

• MapReduce originated at Google, and Google's own implementation is proprietary.

• Phases

• 1. Take a series of keys and transform them into a different series of values, generally ones that have some semantic context

• 2. Perform a second pass in which the new series of values is compressed into far fewer values

3

In its strictest sense…

• Map-reduce is a two-phase operation

• First, convert a list of data into a list of a different kind of data

• Second, turn the second list into a single scalar value or a list of scalar values, often the cardinality of the items created in the first step
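
The two phases can be sketched in a few lines of Python. This is only an illustration on a single machine, assuming a made-up list of strings as input; the names `words`, `lengths`, `total_length`, and `item_count` are hypothetical and do not come from the slides.

```python
from functools import reduce

# Hypothetical input: a list of one kind of data (raw strings).
words = ["map", "reduce", "hadoop", "map"]

# Phase 1 (map): convert each item into a different kind of data (its length).
lengths = list(map(len, words))

# Phase 2 (reduce): fold the second list into a single scalar value.
total_length = reduce(lambda acc, n: acc + n, lengths, 0)

# A cardinality-style reduce: count the items produced by the map phase.
item_count = reduce(lambda acc, _: acc + 1, lengths, 0)

print(lengths, total_length, item_count)
```

The reduce step here returns one scalar per question asked; returning a list of scalars would simply mean running one such fold per group of values.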

4

Relevant computing/data application types

• Suited to aggregate database processing, less so to set-oriented querying, and certainly not to object-based querying

• Fits well with cluster-based environments, where there are lots of opportunities for parallel processing

• Fits query patterns that calculate the cardinality of sets and remove duplicates

5

Strategy for M-R

• We try to do the computing on the machines where the data sits

• So we try to engineer the storage of data so that it accommodates the chaining of M-R operations

6

The key bottom line concept

• In a relational database, we try to minimize the I/O costs of moving large volumes of data from the server to the client, so that it can then be scanned and aggregated

• In a database that supports M-R, we try to screen (and sometimes aggregate) data on the server where it sits

• We also use parallel processing within cluster servers to minimize the cost of doing that aggregation if it cannot all be done on a single server housing the original data.

7

Another way of looking at this…

• We have seen the tradeoff between moving data and moving processing logic in the context of homogeneous distributed data

• Often, in distributed databases, it is far cheaper to ship processing logic instead of data, even if it causes extra processing to have to happen

• This is another context in which we often choose to send processing code to a server in order to minimize the movement of large volumes of data

8

Example

• 1. We start with a set of person keys and map each of these to the name of the person.

• Key 1 -> Harry

• Key 2 -> Harry

• Key 3 -> Tommy

• 2. We aggregate the list of names by counting how many unique names are in the list.

• Harry, Harry, Tommy -> 2
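
The example above can be written out as a small Python sketch. The `people` dictionary stands in for the person records, and the function names `map_key_to_name` and `count_unique` are hypothetical; a real M-R system would spread both steps across many machines.

```python
# Hypothetical lookup table standing in for the person records on the slide.
people = {1: "Harry", 2: "Harry", 3: "Tommy"}


def map_key_to_name(key):
    """Map phase: turn a person key into that person's name."""
    return people[key]


def count_unique(names):
    """Reduce phase: collapse the list of names into one scalar."""
    return len(set(names))


names = [map_key_to_name(k) for k in sorted(people)]
unique_count = count_unique(names)
print(names, "->", unique_count)
```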

9

What actually happens?

• Informally:

• Each key leads to a name field.

• Then, the names are isolated.

• Then, each is passed to a “mapper”, which returns the name, along with a 1.

• Then, a “reducer” takes each name and makes a list of 1’s. The reducer adds up the 1’s for each name and returns a list of (name, count) pairs.
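
A minimal sketch of those informal steps, assuming the names have already been isolated. The functions `mapper`, `shuffle`, and `reducer` are illustrative names, not an API from any framework; the grouping step is written out explicitly here, whereas a framework such as Hadoop typically does that grouping between the two phases.

```python
from collections import defaultdict


def mapper(name):
    """Return the name along with a 1."""
    return (name, 1)


def shuffle(pairs):
    """Group the 1's by name, giving each name its own list of 1's."""
    groups = defaultdict(list)
    for name, one in pairs:
        groups[name].append(one)
    return groups


def reducer(name, ones):
    """Add up the 1's for a single name."""
    return (name, sum(ones))


names = ["Harry", "Harry", "Tommy"]
pairs = [mapper(n) for n in names]
grouped = shuffle(pairs)
counts = [reducer(name, ones) for name, ones in grouped.items()]
print(counts)
```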

10

From Wikipedia

Imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age.

SELECT age AS Y, AVG(contacts) AS A
FROM social.person
GROUP BY age
ORDER BY age

function Map is
    input: integer K1 between 1 and 1100, representing a batch of 1 million social.person records
    for each social.person record in the K1 batch do
        let Y be the person's age
        let N be the number of contacts the person has
        produce one output record <Y,N>
    repeat
end function

function Reduce is
    input: age (in years) Y
    for each input record <Y,N> do
        Accumulate in S the sum of N
        Accumulate in C the count of records so far
    repeat
    let A be S/C
    produce one output record <Y,A>
end function
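
The pseudocode above can be approximated in Python as follows. This is a sketch under stated assumptions: the `batches` dictionary holds a handful of invented (age, contacts) records instead of 1,100 batches of a million rows, and the grouping of map output by age is done in a plain loop instead of by a cluster framework.

```python
from collections import defaultdict

# Hypothetical data: batch number K1 -> list of (age, contacts) records.
batches = {
    1: [(25, 10), (30, 4), (25, 20)],
    2: [(30, 6), (41, 7)],
}


def map_batch(k1):
    """Emit one (Y, N) pair per person record in batch k1."""
    for age, contacts in batches[k1]:
        yield (age, contacts)


def reduce_age(age, contact_counts):
    """Accumulate the sum S and count C for one age, then emit (Y, S/C)."""
    s = 0
    c = 0
    for n in contact_counts:
        s += n
        c += 1
    return (age, s / c)


# Group the map output by age, standing in for the framework's shuffle step.
grouped = defaultdict(list)
for k1 in batches:
    for age, n in map_batch(k1):
        grouped[age].append(n)

averages = [reduce_age(age, ns) for age, ns in sorted(grouped.items())]
print(averages)
```

Sorting the grouped keys plays the role of the ORDER BY clause in the SQL version.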

11

From NoSQL Distilled:

1. Creating a list with a map

12

2. Aggregating with a reduce

13

3. Partitioning the output of mappers: parallelism & adding a phase that merges the results of the reducers
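
One way to picture the partitioning step is the sketch below. Everything in it is an assumption made for illustration: the product names, the choice of two partitions, and the use of a CRC32 checksum as the partitioning function. Each key lands in exactly one partition, so each partition can be reduced independently and the results merged afterwards.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 2  # hypothetical number of reducers


def partition_for(key):
    """Pick a partition from a stable checksum of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


# Hypothetical mapper output: (product, quantity) pairs.
mapper_output = [("tea", 2), ("coffee", 1), ("tea", 3), ("cocoa", 4), ("coffee", 5)]

# Route every pair to its partition, grouping values by key on the way.
partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]
for key, value in mapper_output:
    partitions[partition_for(key)][key].append(value)


def reduce_partition(groups):
    """Reduce one partition on its own; this could run in parallel."""
    return {key: sum(values) for key, values in groups.items()}


partial_results = [reduce_partition(p) for p in partitions]

# Merge phase: keys never overlap across partitions, so a plain update works.
merged = {}
for result in partial_results:
    merged.update(result)
print(merged)
```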

14

4. Introducing a combiner operation to minimize the movement of redundant data – the output format must be the same as the input format
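
The requirement that a combiner's output format match its input format can be illustrated with a summing function that is reused as both combiner and reducer. The node names and quantities below are invented, and `sum_by_key` is a hypothetical helper, not part of any framework.

```python
from collections import defaultdict


def sum_by_key(pairs):
    """Take (key, quantity) pairs and return (key, quantity) pairs."""
    totals = defaultdict(int)
    for key, qty in pairs:
        totals[key] += qty
    return sorted(totals.items())


# Hypothetical mapper output sitting on two different nodes.
node_a = [("tea", 2), ("tea", 3), ("coffee", 1)]
node_b = [("tea", 4), ("coffee", 5), ("coffee", 1)]

# Combine locally on each node, so fewer pairs have to cross the network.
combined_a = sum_by_key(node_a)
combined_b = sum_by_key(node_b)

# The final reduce accepts the combiner output because the shape is unchanged.
final = sum_by_key(combined_a + combined_b)

# Same answer as reducing the raw mapper output directly.
direct = sum_by_key(node_a + node_b)
print(final, final == direct)
```

Summing works as a combiner because partial sums can themselves be summed; an operation such as averaging would need to carry extra state (for example a sum and a count) to be combined safely.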

15

5. A combiner that removes duplicate product-customer pairs

16

6. Concatenating a combining and a reduce (counting) operation
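
A small sketch of chaining a de-duplicating combiner with a counting reduce, assuming invented (product, customer) pairs. The helper names `dedupe` and `count_customers` are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapper output: one (product, customer) pair per order line.
pairs = [
    ("tea", "ann"), ("tea", "ann"), ("tea", "bob"),
    ("coffee", "ann"), ("coffee", "ann"),
]


def dedupe(pairs):
    """Combining step: drop repeats; output keeps the (product, customer) shape."""
    return sorted(set(pairs))


def count_customers(pairs):
    """Reduce step: count the distinct customers left for each product."""
    counts = defaultdict(int)
    for product, _customer in pairs:
        counts[product] += 1
    return dict(counts)


unique_pairs = dedupe(pairs)
customers_per_product = count_customers(unique_pairs)
print(customers_per_product)
```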

17

7. Maintaining the 1’s counts in the mapping phase

18

8. Adding temporal information to the map/reduce process

19

9. Using a reduce operator to create product per month totals

20

10. A second mapper that creates base year by year comparisons

21

11. A reduce operation combines records for a given year
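
Steps 8 through 11 describe two map/reduce stages chained together. The sketch below is one possible reading of that pipeline, with invented order lines and a hypothetical `second_mapper` function: the first stage totals quantities per product and month, and the second stage re-keys those totals so that the figures for the same month in different years end up in one record.

```python
from collections import defaultdict

# Hypothetical order lines: (product, year, month, quantity).
order_lines = [
    ("tea", 2010, 12, 5), ("tea", 2010, 12, 3),
    ("tea", 2011, 12, 10), ("coffee", 2011, 12, 4),
]

# Stage 1: map each line to ((product, year, month), qty), then reduce by summing.
monthly_totals = defaultdict(int)
for product, year, month, qty in order_lines:
    monthly_totals[(product, year, month)] += qty


def second_mapper(key, total):
    """Stage 2 map: re-key by (product, month) and tag the total with its year."""
    product, year, month = key
    return ((product, month), {year: total})


# Stage 2 reduce: merge the partial per-year records that share a key.
comparisons = defaultdict(dict)
for key, total in monthly_totals.items():
    new_key, partial = second_mapper(key, total)
    comparisons[new_key].update(partial)

print(dict(comparisons))
```

Because the second stage consumes the stored output of the first, the monthly totals can be reused by other jobs without rereading the raw order lines.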

22

Complaints

• M-R is low level.

• It is rigid.

• It exists to optimize the distributed cluster model – only.

• It demands that an application fit perfectly into the paradigm.

• It takes careful planning and knowledge of exactly how the data will be used to structure the database to optimally serve a series of map/reduce operations

• It thus does not accommodate on-the-fly browsing
