As an example of the usefulness of parallelisation, consider Bayes' formula and its use for classification:
1. Joint probabilities and conditional probabilities: basics
   P(A & B) = P(A|B) * P(B) = P(B|A) * P(A)
   P(A|B) = ( P(B|A) * P(A) ) / P(B)   (Bayes' formula)
   P(A)   : prior probability of A (a hypothesis, e.g. that an object belongs to a certain class)
   P(A|B) : posterior probability of A (given the evidence B)
2. Estimation:
   Estimate P(A) by the frequency of A in the training set (i.e., the number of class-A instances divided by the total number of instances).
   Estimate P(B|A) by the frequency of B within the class-A instances (i.e., the number of class-A instances that have B divided by the total number of class-A instances).
3. Decision rule for classifying an instance:
   If there are two possible hypotheses/classes, A and ~A (where ~A is "not A"), choose the one that is more probable given the evidence:
   If P(A|B) > P(~A|B), choose A.
   Both posteriors share the denominator P(B), so this is equivalent to:
   If P(B|A) * P(A) > P(B|~A) * P(~A), choose A.
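A minimal Python sketch of steps 2 and 3, assuming a toy training set of (has_B, is_A) boolean pairs; all function and variable names are illustrative, not from the original material. The counting in step 2 is exactly the kind of work that parallelises well, which connects this example to the MapReduce discussion below.

def train(instances):
    """Estimate P(A), P(B|A), and P(B|~A) by counting, as in step 2.

    instances: list of (has_B, is_A) boolean pairs.
    """
    n = len(instances)
    n_a = sum(1 for _, is_a in instances if is_a)
    n_b_and_a = sum(1 for has_b, is_a in instances if is_a and has_b)
    n_b_and_not_a = sum(1 for has_b, is_a in instances if (not is_a) and has_b)

    p_a = n_a / n                                   # P(A): prior
    p_b_given_a = n_b_and_a / n_a                   # P(B|A)
    p_b_given_not_a = n_b_and_not_a / (n - n_a)     # P(B|~A)
    return p_a, p_b_given_a, p_b_given_not_a

def choose_a(p_a, p_b_given_a, p_b_given_not_a):
    """Decision rule of step 3, given that evidence B was observed:
    choose A iff P(B|A) * P(A) > P(B|~A) * P(~A).
    The shared denominator P(B) has been dropped."""
    return p_b_given_a * p_a > p_b_given_not_a * (1 - p_a)

# Toy data: 6 instances, 4 of class A; B is observed in 3 of the A instances
# and in 1 of the ~A instances.
data = [(True, True), (True, True), (True, True), (False, True),
        (True, False), (False, False)]
p_a, p_ba, p_bna = train(data)
print(choose_a(p_a, p_ba, p_bna))
# True: P(B|A)*P(A) = 0.75 * 2/3 = 0.5  >  P(B|~A)*P(~A) = 0.5 * 1/3 = 0.17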
D. J. DeWitt & M. Stonebraker (2008). MapReduce: A major step backwards. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html (no longer available). Online 1/11/09 at http://www.yjanboo.cn/?p=237 (reproduced at http://craig-henderson.blogspot.com/2009/11/dewitt-and-stonebrakers-mapreduce-major.html)
"MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:
- A giant step backward in the programming paradigm for large-scale data intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMS
- Incompatible with all of the tools DBMS users have come to depend on"
Two more publications by teams including these authors (2009 and 2010; see the references below).
As an aside: Can text mining / information retrieval help to learn about (or even solve) this controversy? ;-)
HCIR 2011 Challenge
"1) The Great MapReduce Debate
In 2004, Google introduced MapReduce as a software framework to support distributed computing on large data sets on clusters of computers. In the "Map" step, the master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. The worker node processes that smaller problem, and passes the answer back to its master node. In the "Reduce" step, the master node then takes the answers to all the sub-problems and combines them to obtain the final output.
In a blog post, David J. DeWitt and Michael Stonebraker asserted that MapReduce was not novel -- that the techniques employed by MapReduce are more than 20 years old. Use your [information retrieval] system to either support DeWitt and Stonebraker's case or to argue that a thorough search of the literature does not yield examples that support their case."
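As a single-process illustration of the Map and Reduce steps described in the challenge text, below is a minimal word-count sketch in Python. The function names and the in-memory grouping ("shuffle") step are assumptions for illustration; a real framework distributes the map and reduce calls across worker nodes.

from collections import defaultdict

def map_step(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]

def reduce_step(word, counts):
    """Reduce: combine all partial counts for one key into a final value."""
    return word, sum(counts)

documents = ["map reduce map", "reduce the data"]

# "Shuffle": group the mappers' intermediate pairs by key.
grouped = defaultdict(list)
for doc in documents:                  # each document plays the role of a sub-problem
    for word, count in map_step(doc):  # worker output collected by the master
        grouped[word].append(count)

result = dict(reduce_step(w, c) for w, c in grouped.items())
print(result)  # {'map': 2, 'reduce': 2, 'the': 1, 'data': 1}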
A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden & M. Stonebraker (2009). A comparison of approaches to large-scale data analysis. SIGMOD Conference 2009: 165-178. http://db.csail.mit.edu/pubs/benchmarks-sigmod09.pdf
M. Stonebraker, D. J. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo & A. Rasin (2010). MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1), 64-71. http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf
J. Dean & S. Ghemawat (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1), 72-77.