Document Classification. MapReduce. Software Transactional Memory
Parallel and Distributed Computing
Department of Computer Science and Engineering (DEI), Instituto Superior Técnico
December 6, 2010
José Monteiro & José Costa (DEI / IST) Parallel and Distributed Computing – 22 2010-12-06 1 / 48
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
What can you do with it?
Seems like a limited model.
But...
Many string processing problems fit naturally
Can be used iteratively
MapReduce libraries have been written in C++, C#, Erlang, Java, Python, F#, R and other programming languages.
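The model can be sketched sequentially in a few lines of Python (an illustrative sketch, not any particular library's API): apply the map function to every input record, group the intermediate values by key, then apply the reduce function to each group.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each (key, value) record may emit many intermediate pairs.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: combine all intermediate values for each key.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}
```

A real MapReduce library runs the map and reduce calls in parallel across many machines; this sketch captures only the programming model.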
Example: Counting Words in Web Pages
Input: files with one document per record
Specify a map function that takes a key/value pair where
key = document URL
value = document contents
Output of the map function is (potentially many) key/value pairs. In our case, output (word, “1”) once per word in the document.
Example:
If we have as input “document1” and “to be or not to be”,
we get as output the following key/value pairs:
“to”, “1”
“be”, “1”
“or”, “1”
“not”, “1”
“to”, “1”
“be”, “1”
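This map step can be written directly (a sketch; the function name is illustrative):

```python
# Emit (word, "1") once per word in the document; the key (the
# document URL) is not used by this particular map function.
def map_fn(url, contents):
    for word in contents.split():
        yield (word, "1")

pairs = list(map_fn("document1", "to be or not to be"))
# pairs == [("to", "1"), ("be", "1"), ("or", "1"),
#           ("not", "1"), ("to", "1"), ("be", "1")]
```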
Counting Words in Web Pages
MapReduce library gathers together all pairs with the same key.
We must specify a reduce function that combines the values for a key.
Example:
Compute the sum of the values for the different keys:
key = “be”, values = “1”, “1”
key = “not”, values = “1”
key = “or”, values = “1”
key = “to”, values = “1”, “1”
Output of reduce paired with key:
“be”, “2”
“not”, “1”
“or”, “1”
“to”, “2”
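A sketch of the grouping and reduce steps (names are illustrative; in a real library the grouping is done for you):

```python
from collections import defaultdict

def reduce_fn(key, values):
    # Sum the "1" strings emitted by the map function.
    return str(sum(int(v) for v in values))

pairs = [("to", "1"), ("be", "1"), ("or", "1"),
         ("not", "1"), ("to", "1"), ("be", "1")]

# The MapReduce library gathers all values with the same key...
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# ...and hands each group to reduce, pairing the result with the key.
counts = {word: reduce_fn(word, vs) for word, vs in groups.items()}
# counts == {"to": "2", "be": "2", "or": "1", "not": "1"}
```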
MapReduce Execution Overview
[Figure: MapReduce execution overview diagram]
MapReduce Examples
Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.
Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named “source”. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>.
Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
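The inverted-index example can be sketched sequentially (illustrative names throughout):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit a (word, document ID) pair for each distinct word in the document.
    for word in set(text.split()):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    # Sort the document IDs associated with the word.
    return sorted(doc_ids)

docs = {"doc1": "to be or not to be", "doc2": "to do"}

groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, d in map_fn(doc_id, text):
        groups[word].append(d)

index = {word: reduce_fn(word, ids) for word, ids in groups.items()}
# index["to"] == ["doc1", "doc2"]; index["be"] == ["doc1"]
```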
MapReduce Fault Tolerance
On worker failure:
Detect failure via periodic heartbeats
Re-execute completed and in-progress map tasks
Re-execute in-progress reduce tasks
Task completion committed through master
Master failure:
Could handle, but don’t yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine
MapReduce Use in Industry
Introduced by Google
Yahoo! is running on a 10k Linux cluster with 5 Petabytes of data
Amazon is leasing servers to run MapReduce computations
Microsoft is developing Dryad to supersede MapReduce
Facebook, Twitter and others are also using MapReduce
Hadoop is an open source implementation of MapReduce.
Hadoop
Hadoop is a software platform written in Java that lets one easily write and run applications that process vast amounts of data.
Hadoop is a sub-project of the Apache foundation and receives sponsorship from Google, Yahoo, Microsoft, HP and others.
Scalable: Hadoop can reliably store and process Petabytes
Economical: it distributes the data and processing across clusters of commonly available computers
these clusters can number into the thousands of nodes.
Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located
This makes it extremely efficient.
Reliable: Automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures
MapReduce Example using Hadoop
/**
 * Counts the words in each line.
 * For each line of input, break the line into words and emit them as (word, 1).
 */
Meanwhile in Google...
MapReduce would receive the epic amounts of webpage data collected by Google’s crawlers, and it would crunch this down to the links and metadata needed to actually search these pages.
The whole process would take 8 hours and then it had to be started all over again. In the age of the “real time” web that is too long...
In September 2010 Google switched its search infrastructure to Caffeine
indexes are updated by making direct changes to the web map already stored in the database; Caffeine is completely incremental
With Caffeine, Google moved its back-end indexing system away from MapReduce and onto BigTable, a fast, extremely large-scale, distributed DBMS developed by Google.
Critical Regions
Critical Region
Sections of the code that access a shared resource which must not be accessed concurrently by another thread.
all threads must check the state of the mutex before entering the critical region.
if the mutex is locked, then there is a thread in the critical section and this thread blocks, waiting on the mutex
if the mutex is unlocked, then no thread is currently in the critical section and this thread is allowed to enter, simultaneously locking the mutex
the thread unlocks the mutex when exiting the critical section, waking up any thread waiting on this mutex
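This protocol is exactly what a mutex implements; in code it reduces to acquiring the lock around the critical region (a minimal Python sketch):

```python
import threading

counter = 0
mutex = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with mutex:           # blocks while another thread is inside
            counter += 1      # critical region: one thread at a time
        # leaving the block unlocks the mutex, waking a waiting thread

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40000 (without the mutex, increments could be lost)
```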
Limitations of Mutual Exclusion
Deadlocks
Priority inversion
Relies on conventions
Conservative
Not Composable
Limitations of Mutual Exclusion
Priority inversion
a low-priority task may hold a shared resource
high-priority tasks get blocked if they request the same resource
intermediate-priority tasks preempt the low-priority task that holds the resource
Limitations of Mutual Exclusion
Relies on conventions
Relationship between lock and shared data is in programmer’s mind.
Actual comment from Linux kernel:
/*
 * When a locked buffer is visible to the I/O layer
 * BH_Launder is set. This means before unlocking
 * we must clear BH_Launder, mb() on alpha and then
 * clear BH_Lock, so no reader can see BH_Launder set
 * on an unlocked buffer and then risk to deadlock.
 */
Limitations of Mutual Exclusion
Conservative
There might be a conflict, so err on the safe side.
Limitations of Mutual Exclusion
Not Composable
Operation: move item from Hash Table T1 to Hash Table T2.
Implementation:
delete(T1, item);
add(T2, item);
Both delete and add may have been protected as critical sections; however, externally the state in which the item is in neither hash table will be visible and interruptible.
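The problem can be sketched in Python (the LockedTable class is hypothetical): each operation is individually atomic under its own lock, but the composite move is not.

```python
import threading

class LockedTable:
    """A hash table whose individual operations are critical sections."""
    def __init__(self):
        self.data = {}
        self.lock = threading.Lock()

    def delete(self, key):
        with self.lock:
            return self.data.pop(key)

    def add(self, key, value):
        with self.lock:
            self.data[key] = value

def move(t1, t2, key):
    value = t1.delete(key)
    # Window: the item is in neither table here; another thread
    # looking it up in both tables at this instant finds nothing.
    t2.add(key, value)
```

Fixing this with locks requires holding both table locks around the whole move, which exposes locking order to the caller and risks deadlock; transactions instead let the two operations compose into one atomic unit.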
Transactions
Transaction
Operations in a transaction either all occur or none occur.
Atomic operation:
Commit: takes effect
Abort: effects rolled back
Usually retried
Linearizable
Appear to happen in one-at-a-time order
Transactional Memory
A section of code with reads and writes to shared memory which logicallyoccur at a single instant in time.
Software Transactional Memory
Software Transactional Memory (STM) has been proposed as an alternative to lock-based synchronization.
Concurrency Unlocked
threads are not blocked when entering critical regions
if there are no memory access conflicts during the thread’s execution, the operations executed by the thread are accepted
in case of conflict, the program state is rolled back to the state it was in before the thread entered the critical region
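A minimal, illustrative STM sketch along these lines (not a production design; all names are hypothetical): each transactional variable carries a version number, a transaction buffers its writes and records the versions it read, and at commit time it validates its read set, rolling back and retrying on conflict.

```python
import threading

_commit_lock = threading.Lock()   # serializes only the commit step

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def atomically(tx):
    while True:
        reads = {}    # TVar -> version observed at first read
        writes = {}   # TVar -> buffered new value (not yet visible)

        def read(tv):
            if tv in writes:
                return writes[tv]
            reads.setdefault(tv, tv.version)
            return tv.value

        def write(tv, value):
            writes[tv] = value

        result = tx(read, write)
        with _commit_lock:
            # Validate: nothing we read has changed since we read it.
            if all(tv.version == v for tv, v in reads.items()):
                for tv, value in writes.items():
                    tv.value = value
                    tv.version += 1
                return result
        # Conflict: discard the buffered writes (roll back) and retry.

# Example: an atomic transfer between two transactional variables.
a, b = TVar(10), TVar(0)
atomically(lambda read, write: (write(a, read(a) - 5),
                                write(b, read(b) + 5)))
# a.value == 5, b.value == 5
```

Because no values become visible until a validated commit, a conflicting transaction simply discards its write buffer, which is exactly the roll-back described above.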
Software Transactional Memory
Benefits of STM
Optimistic: increased concurrency
Composable: define atomic set of operations
Conditional Critical Regions
Optimistic
Increased concurrency: threads are not blocked.
Conflicts only arise when more than one thread accesses the same memory location.
Conflicts are rare ⇒ small number of roll-backs.
Composable Atomic Operations
The keyword atomic allows the definition of the set of operations that make up the transaction.