MapReduce VS Parallel DBMSs Presenter: Ran Ding

Feb 23, 2016

Anita Fedora

Transcript
Page 1: MapReduce  VS Parallel DBMSs

MapReduce VS Parallel DBMSs

Presenter: Ran Ding

Page 2: MapReduce  VS Parallel DBMSs

1. Introduction 2. Where the MR wins 3. DBMS “sweet spot” tests 4. Why the Parallel DBMS wins 5. Conclusion

Outline

Page 3: MapReduce  VS Parallel DBMSs

The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access.

Example: Hadoop, an open-source MR implementation.

Introduction: MR

Page 4: MapReduce  VS Parallel DBMSs

Parallel DBMSs appeared in the mid-1980s, when the Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.

Introduction: Parallel DBMS

Page 5: MapReduce  VS Parallel DBMSs

Horizontal partitioning distributes the rows of a relational table across the nodes of the cluster so that they can be processed in parallel.

Introduction: Horizontal partitioning

Page 6: MapReduce  VS Parallel DBMSs

One benefit is that the system automatically manages the various alternative partitioning strategies for the tables involved in a query.

Examples: hash, range, and round-robin partitioning.

Introduction: DBMS
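The three partitioning strategies named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `rows` sample, and the use of `crc32` as the hash are hypothetical choices, not how any particular parallel DBMS implements partitioning.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, n_nodes):
    """Send each row to the node picked by a stable hash of its key value."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[crc32(str(row[key]).encode()) % n_nodes].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Send each row to the node whose key range contains it.

    boundaries is a sorted list of split points: node 0 holds keys below
    boundaries[0], node i holds keys in [boundaries[i-1], boundaries[i]),
    and the last node holds everything at or above the final split point.
    """
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical sample table: six rows with an integer id.
rows = [{"id": i, "amount": i * 10} for i in range(6)]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "id", [2, 4])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning keep rows with equal keys on the same node, which matters for joins and grouping; round-robin only balances row counts.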

Page 7: MapReduce  VS Parallel DBMSs

It is not easy! UDFs (user-defined functions) help: per-record Map logic can be written as a UDF, and the Reduce step is like GROUP BY in SQL.

Introduction: Mapping parallel DBMS onto MapReduce
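To make the GROUP BY analogy concrete, the sketch below computes the same per-key sum twice: once with a hand-written map/shuffle/reduce loop and once with a SQL GROUP BY in an in-memory SQLite database. The `visits` data, the function names, and the choice of SQLite are assumptions made for illustration; SQLite is not a parallel DBMS and the loop is not Hadoop.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input: (url, revenue) records.
visits = [("a.com", 3), ("b.com", 5), ("a.com", 4)]


def map_fn(record):
    """Map: emit one (key, value) pair per input record."""
    url, revenue = record
    yield url, revenue


def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # shuffle: collect values by key
    return sorted(reduce_fn(k, v) for k, v in groups.items())


def run_sql(records):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT url, SUM(revenue) FROM visits GROUP BY url ORDER BY url"
    ).fetchall()
    conn.close()
    return rows


mr_result = run_mapreduce(visits)
sql_result = run_sql(visits)
```

Both calls return the same list of (url, total) pairs, which is the sense in which Reduce corresponds to GROUP BY.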

Page 8: MapReduce  VS Parallel DBMSs

1. ETL and “read once” data sets 2. Complex analytics 3. Semi-structured data 4. Quick-and-dirty analyses 5. Limited-budget operations

Where the MR wins

Page 9: MapReduce  VS Parallel DBMSs

ETL stands for extract-transform-load.

An MR system can be considered a general-purpose parallel ETL system.

DBMSs may perform the ETL

ETL and “read once” data sets

Page 10: MapReduce  VS Parallel DBMSs

Many complex analytics tasks cannot be structured as a single SQL aggregate query.

MR is a good candidate for such tasks.

Complex analytics

Page 11: MapReduce  VS Parallel DBMSs

MR systems are good at processing semi-structured data that is being prepared for loading into a back-end system.

A DBMS requires wide tables with many attributes to hold such data.

MR-style systems, by contrast, can store and process it easily.

Semi-structured data
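The wide-table point can be illustrated with a small sketch. The `records` below are hypothetical key-value records whose attributes vary from record to record; `to_wide_table` is an assumed helper that flattens them into one relational-style table, padding missing attributes with `None` (standing in for SQL NULL).

```python
# Hypothetical semi-structured records: each has a different set of attributes.
records = [
    {"user": "u1", "url": "a.com"},
    {"user": "u2", "query": "shoes", "clicks": 2},
    {"user": "u3", "url": "b.com", "referrer": "c.com"},
]


def to_wide_table(records):
    """Flatten the records into one table with a column per attribute seen."""
    columns = sorted({key for record in records for key in record})
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows


def count_nulls(rows):
    """Count the cells that exist only as padding."""
    return sum(value is None for row in rows for value in row)


def count_with_attribute(records, name):
    """MR-style processing: inspect each record as-is, with no schema."""
    return sum(1 for record in records if name in record)


columns, rows = to_wide_table(records)
padding = count_nulls(rows)
with_url = count_with_attribute(records, "url")
```

Here three records need five columns, and 7 of the 15 cells are padding; the schema-free version simply skips attributes a record does not have.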

Page 12: MapReduce  VS Parallel DBMSs

A DBMS needs the programmer to write a schema and then load the data.

With MR, the data is simply copied into the system.

Quick-and-dirty analyses

Page 13: MapReduce  VS Parallel DBMSs

MR implementations are mostly open source and free.

Parallel DBMSs are expensive.

Limited-budget operations

Page 14: MapReduce  VS Parallel DBMSs

DBMS “Sweet Spot” Test

Page 15: MapReduce  VS Parallel DBMSs

1. Repetitive record parsing 2. Compression 3. Pipelining 4. Scheduling 5. Column-oriented storage

Why the Parallel DBMS wins

Page 16: MapReduce  VS Parallel DBMSs

In an MR system, each Map and Reduce task must repeatedly parse records and convert string fields into the appropriate types.

In a DBMS, records are parsed once, when the data is initially loaded.

Repetitive record parsing
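The difference in parsing work can be counted directly. In the sketch below, `CountingParser`, the pipe-delimited `RAW_LINES`, and the two query functions are all hypothetical; the point is only that querying raw text parses every line on every pass, while a load-once table is parsed a single time.

```python
# Hypothetical raw input: id|date|amount, stored as text.
RAW_LINES = ["1|2016-02-23|19.5", "2|2016-02-24|5.0", "3|2016-02-23|7.25"]


class CountingParser:
    """Parses one text line into typed fields and counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, line):
        self.calls += 1
        rec_id, day, amount = line.split("|")
        return int(rec_id), day, float(amount)


def total_amount(records):
    return sum(record[2] for record in records)


def count_on_day(records, day):
    return sum(1 for record in records if record[1] == day)


# MR style: every job re-parses the raw text.
mr_parse = CountingParser()
mr_total = total_amount(mr_parse(line) for line in RAW_LINES)
mr_count = count_on_day((mr_parse(line) for line in RAW_LINES), "2016-02-23")

# DBMS style: parse once at load time, then query the typed table.
db_parse = CountingParser()
table = [db_parse(line) for line in RAW_LINES]
db_total = total_amount(table)
db_count = count_on_day(table, "2016-02-23")
```

Two queries over three lines cost six parses in the first style and three in the second; the gap grows with every additional query.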

Page 17: MapReduce  VS Parallel DBMSs

It is hard to say why the benefit of compression differs between the two. Commercial DBMSs may use carefully tuned compression algorithms.

Compression

Page 18: MapReduce  VS Parallel DBMSs

In a parallel DBMS, data is streamed from producer to consumer;

the intermediate data is never written to disk.

In an MR system, the producer writes its results to local data structures, and consumers then read from them.

Pipelining
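The two data-flow styles can be contrasted in a short sketch. Python generators stand in for a streaming producer and consumer, and a temporary file stands in for the local data structure that an MR producer writes; both are assumptions made for illustration, and all function names are hypothetical.

```python
import json
import os
import tempfile


def producer(n):
    """Produce rows one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * i}


def consumer(rows):
    """Consume rows as they arrive and aggregate them."""
    return sum(row["value"] for row in rows)


def pipelined(n):
    """DBMS style: rows flow straight from producer to consumer."""
    return consumer(producer(n))


def materialized(n):
    """MR style: the producer writes everything out, then the consumer reads it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.jsonl")
        with open(path, "w") as out:
            for row in producer(n):
                out.write(json.dumps(row) + "\n")
        with open(path) as inp:
            return consumer(json.loads(line) for line in inp)


streamed_total = pipelined(5)
staged_total = materialized(5)
```

Both versions return the same answer; the second pays for an extra write and read of the intermediate data, in exchange for a result that survives on disk if the consumer has to be restarted.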

Page 19: MapReduce  VS Parallel DBMSs

In a parallel DBMS, every node knows what it should do

In an MR system, each task is scheduled on processing nodes one storage block at a time.

Scheduling

Page 20: MapReduce  VS Parallel DBMSs

Vertica, a column store, reads only the attributes necessary for solving the user query.

DBMS-X and Hadoop are both row stores.

Column-oriented storage
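A rough way to see the row-store versus column-store difference is to count how many field values each layout touches for a single-attribute query. The `ROWS` data and the helper names below are hypothetical, and the sketch models only the access pattern, not the storage format of Vertica, DBMS-X, or Hadoop.

```python
# Hypothetical table with four attributes per row.
ROWS = [
    {"id": 1, "url": "a.com", "revenue": 3.0, "country": "US"},
    {"id": 2, "url": "b.com", "revenue": 5.0, "country": "DE"},
    {"id": 3, "url": "a.com", "revenue": 4.0, "country": "US"},
]


def to_columns(rows):
    """Re-lay the same data out as one list per attribute."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def row_store_sum(rows, attr):
    """Read whole rows; return the sum and the number of values touched."""
    touched = 0
    total = 0.0
    for row in rows:
        touched += len(row)  # the entire row is read to reach one field
        total += row[attr]
    return total, touched


def column_store_sum(columns, attr):
    """Read only the requested column; return the sum and values touched."""
    values = columns[attr]
    return sum(values), len(values)


row_total, row_touched = row_store_sum(ROWS, "revenue")
col_total, col_touched = column_store_sum(to_columns(ROWS), "revenue")
```

For this query the row layout touches 12 values and the column layout touches 3, which is why a column store can win on queries that need few attributes.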

Page 21: MapReduce  VS Parallel DBMSs

MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.

What should MR learn from Parallel DBMS

Page 22: MapReduce  VS Parallel DBMSs

MR systems are powerful tools for ETL-style applications and for complex analytics. If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.

Conclusion

Page 23: MapReduce  VS Parallel DBMSs

Thank you~~

Questions?