CSci 5707, Fall 2013 MapReduce vs. Parallel DBMS Hamid Safizadeh, Otelia Buffington University of Minnesota

Jan 16, 2016


Mervin Joseph
Page 1

CSci 5707, Fall 2013

MapReduce vs. Parallel DBMS

Hamid Safizadeh, Otelia Buffington

University of Minnesota

Page 2

MapReduce Idea

Mapping

map (k1, v1) → list (k2, v2)

Reducing

reduce (k2, list(v2)) → list (v2)

Pseudo-code for counting the number of occurrences of each word in a large collection of documents

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
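The pseudo-code figure for this slide did not survive in the transcript. As a stand-in, here is a minimal Python sketch of the word-count idea. It is not the original figure: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and the driver runs sequentially in one process rather than on a cluster.

```python
from collections import defaultdict
from typing import Iterator


def map_fn(doc_name: str, contents: str) -> Iterator[tuple[str, int]]:
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: list[int]) -> int:
    """reduce(k2, list(v2)): add up all the counts emitted for one word."""
    return sum(counts)


def run_mapreduce(documents: dict[str, str]) -> dict[str, int]:
    """Sequential stand-in for the framework: map, group by key, then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)  # the "shuffle": collect values per key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Illustrative input only; these documents are made up.
docs = {"d1": "the quick brown fox", "d2": "the lazy dog", "d3": "The fox"}
word_counts = run_mapreduce(docs)
```

Only map_fn and reduce_fn correspond to user code in the MapReduce model; the grouping step in run_mapreduce is what the framework would normally do across machines.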

Page 3

MapReduce Example

Calculation of the number of occurrences of each word

http://aimotion.blogspot.com/2010/08/mapreduce-with-mongodb-and-python.html

Page 4

MapReduce Architecture

Execution overview

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
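The execution-overview diagram is missing from the transcript. One part of it can be sketched in code: the input is divided into splits handled by map tasks, and each intermediate key is routed to one of R reduce tasks by a partitioning function such as hash(key) mod R. The sketch below is a single-process illustration under assumptions: R = 3, the CRC32 hash, and the names partition, map_task, and reduce_task are all chosen for the example.

```python
import zlib
from collections import defaultdict

R = 3  # number of reduce tasks (assumed for illustration)


def partition(key: str, num_reducers: int = R) -> int:
    """Route a key to a reduce task: hash(key) mod R, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_task(split: str) -> dict[int, list[tuple[str, int]]]:
    """Process one input split and bucket its (word, 1) pairs by reduce partition."""
    buckets: dict[int, list[tuple[str, int]]] = defaultdict(list)
    for word in split.split():
        buckets[partition(word)].append((word, 1))
    return buckets


def reduce_task(pairs: list[tuple[str, int]]) -> dict[str, int]:
    """Sort the pairs for one partition by key and sum the values per key."""
    totals: dict[str, int] = defaultdict(int)
    for key, value in sorted(pairs):
        totals[key] += value
    return dict(totals)


# Three input splits, so three map tasks (made-up data).
splits = ["a b a", "b c", "a c c"]
map_outputs = [map_task(s) for s in splits]

# Each reduce task reads its own partition from every map task's output.
reduce_outputs = []
for r in range(R):
    pairs = [p for out in map_outputs for p in out.get(r, [])]
    reduce_outputs.append(reduce_task(pairs))

merged = {k: v for part in reduce_outputs for k, v in part.items()}
```

Because the partition function depends only on the key, all values for a given key land in the same reduce task, which is why each reduce output can be written independently.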

Page 5

MapReduce or Parallel DBMS

Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M., “A comparison of approaches to large-scale data analysis”, ACM SIGMOD International Conference, 2009 (http://database.cs.brown.edu/projects/mapreduce-vs-dbms)

Dean, J., and Ghemawat, S., “MapReduce: A flexible data processing tool”, Communications of the ACM, Vol. 53, 2010 (DOI: 10.1145/1629175.1629198)

Page 6

MapReduce Design Properties

Heterogeneous Systems: Processing and combining data from a wide variety of storage systems (such as relational databases, file systems, etc.)

Fault Tolerance: Providing fine-grain fault tolerance for large jobs (a failure in the middle of a multi-hour execution does not require restarting the job from scratch)

Complex Functions: Simple Map and Reduce functions often have straightforward SQL equivalents, but MapReduce offers a better framework for some complicated tasks
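To illustrate the point about SQL equivalents, the sketch below computes the same word count twice: once with a GROUP BY query in an in-memory SQLite database, and once with hand-written map and reduce steps. The table name words and the sample data are assumptions made for the example.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data.
words = ["the", "fox", "the", "dog", "fox", "the"]

# SQL version: a GROUP BY aggregate over a one-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [(w,) for w in words])
sql_counts = dict(conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word"))
conn.close()

# MapReduce-style version: map emits (word, 1), reduce sums per key.
groups = defaultdict(list)
for w in words:
    groups[w].append(1)                                     # map + group by key
mr_counts = {w: sum(ones) for w, ones in groups.items()}    # reduce
```

For an aggregate this simple the two forms are interchangeable. The slide's claim concerns tasks where the per-record logic is too involved to express comfortably in a query.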

Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010
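The fault-tolerance property above can be pictured as task-level re-execution: when one task fails, only that task is run again, and results from completed tasks are kept. The following is a toy single-process sketch of that idea, not how any particular framework implements it. The names run_with_retries and flaky_worker, the retry limit, and the simulated failure are all hypothetical.

```python
def run_with_retries(task_ids, worker, max_attempts=3):
    """Run every task; on failure, re-queue only the failed task."""
    results = {}
    attempts = {}
    pending = list(task_ids)
    while pending:
        task_id = pending.pop(0)
        attempts[task_id] = attempts.get(task_id, 0) + 1
        try:
            results[task_id] = worker(task_id, attempts[task_id])
        except RuntimeError:
            if attempts[task_id] >= max_attempts:
                raise  # give up on the whole job only after repeated failures
            pending.append(task_id)  # completed tasks are not re-run
    return results, attempts


def flaky_worker(task_id, attempt):
    """Simulated task: task 2 crashes on its first attempt, then succeeds."""
    if task_id == 2 and attempt == 1:
        raise RuntimeError("simulated worker crash")
    return task_id * task_id


results, attempts = run_with_retries(range(4), flaky_worker)
```

In this run, tasks 0, 1, and 3 execute once and task 2 executes twice; the job as a whole never restarts.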

Page 7

MapReduce Design Properties

Performance

Loading data: Startup overhead for MapReduce

Reading data: Full scan over large data files

Merging results: Unnecessary when the next consumer is another MapReduce

Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010

Cost

Hardware: Networked workstations

Software: Open source (Hadoop)

Communication: Network system

Page 8

Companies Using Hadoop

Facebook, Yahoo!, Google, Amazon, Twitter