MapReduce With a SQL-MapReduce focus by Curt A. Monash, Ph.D. President, Monash Research Editor, DBMS2 contact @monash.com http://www.monash.com http://www.DBMS2.com

Dec 29, 2015


Sibyl Mills
Page 1:

MapReduce

With a SQL-MapReduce focus

by Curt A. Monash, Ph.D.

President, Monash Research
Editor, DBMS2

contact @monash.com
http://www.monash.com
http://www.DBMS2.com

Page 2:

Curt Monash

Analyst since 1981
  Covered DBMS since the pre-relational days
  Also analytics, search, etc.

Publicly available research
  Blogs, including DBMS2 (http://www.DBMS2.com)
  Feed at http://www.monash.com/blogs.html

User and vendor consulting

Page 3:

Agenda

Introduction and truisms
MapReduce overview
MapReduce specifics
SQL and MapReduce together

Page 4:

Monash’s First Law of Commercial Semantics

Bad jargon drives out good

For example: “Relational”, “Parallel”, “MapReduce”

Page 5:

Where to measure database technology

Language interpretation and execution capabilities
  Functionality
  Speed
Administrative capabilities
How well it all works
  Fit and finish
  Reliability
How much it all – really – costs

You can do anything in 0s and 1s … but how much effort will it actually take?

Page 6:

What’s hard about parallelization*

Getting the right data …
… to the right nodes …
… at the right time …
… while dealing with errors …
… and without overloading the network

Otherwise, programming a grid is a lot like programming a single node.

*in general -- not just for “database” technology

Page 7:

MPP DBMS are good at parallelization …

… under three assumptions, namely:

You can express the job nicely in SQL …
  … or whatever other automatically-parallel languages the DBMS offers
You don’t really need query fault-tolerance …
  … which is usually the case unless you have 1000s of nodes
There’s enough benefit to storing the data in tables to justify the overhead

Page 8:

SQL commonly gets frustrating …

… when you’re dealing with sequences of events or relationships, because:

Self-joins are expensive
Programming is hard when you’re not sure how long the sequence is

For example:
  Clickstreams
  Financial data time series
  Social network graph analysis

Page 9:

The pure MapReduce alternative

Lightweight approach to parallelization

The only absolute requirement is a certain simple programming model …
  … so simple that parallelization is “automatic” …
  … and very friendly to procedural languages

It doesn’t require a DBMS on the back end
  No SQL required!

Non-DBMS implementations commonly have query fault-tolerance

But you have to take care of optimizing data redistribution yourself

Page 10:

MapReduce evolution

Used under-the-covers for quite a while
Named and popularized by Google
Open-sourced in Hadoop
Widely adopted by big web companies
Integrated (at various levels) into MPP RDBMS
Adopted for social network analysis
Explored/investigated for data mining applications
???

Page 11:

M/R use cases -- large-scale ETL

Text indexing
  This is how Google introduced the MapReduce concept
Time series disaggregation
  Clickstream sessionization and analytics
  Stock trade pattern identification
Relationship graph traversal

Page 12:

M/R use cases – hardcore arithmetic

Statistical routines
Data “cooking”

Page 13:

The essence of MapReduce

“Map” steps
Data redistribution
“Reduce” steps
In strict alternation …
… or not-so-strict
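
The three phases named on this slide can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not how Hadoop or any DBMS implements it; the names `run_mapreduce`, `parse_reading`, and `max_reading`, and the temperature data, are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run one Map step, one redistribution, and one Reduce step in memory."""
    # Map step plus data redistribution: every emitted pair is routed
    # to the group that owns its key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce step: each key's group is processed independently of the others.
    results = []
    for out_key in sorted(groups):
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results

# Hypothetical job: highest temperature reading per city.
def parse_reading(line_no, line):
    city, temp = line.split(",")
    yield (city, int(temp))

def max_reading(city, temps):
    yield (city, max(temps))

readings = [(0, "paris,12"), (1, "oslo,3"), (2, "paris,15")]
result = run_mapreduce(readings, parse_reading, max_reading)
```

On a real cluster the grouping step is where data moves between nodes, which is why the redistribution phase gets its own line on the slide.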

Page 14:

“Map” step basics (reality)

Input = anything
  Set of data
  Output of previous Reduce step
Output = anything
  There’s an obvious key

Page 15:

Map step basics (formality)

Input = {<key, value> pairs}
Output = {<key, value> pairs}
Input and output key types don’t have to be the same

“Embarrassingly parallel” based on key

Page 16:

Map step examples

Word count
  Input format = document/text string
  Output format = <WordName, 1>
Text indexing
  Input format = document/text string
  Output format = <WordName, (DocumentID, Offset)>
Log parsing
  Input format = log file
  Output format = <Key, formatted event>
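
As a rough sketch, the three Map examples above could be written as Python generators. The function names, the tokenization rule, and the log line layout ("user timestamp url") are assumptions made for illustration; the slide does not specify them.

```python
import re

def word_count_map(doc_id, text):
    """Emit <WordName, 1> for every word in the document."""
    for match in re.finditer(r"\w+", text.lower()):
        yield (match.group(), 1)

def text_index_map(doc_id, text):
    """Emit <WordName, (DocumentID, Offset)> for every word in the document."""
    for match in re.finditer(r"\w+", text.lower()):
        yield (match.group(), (doc_id, match.start()))

def log_parse_map(line_no, line):
    """Emit <Key, formatted event>, assuming lines look like 'user timestamp url'."""
    user_id, timestamp, url = line.split(None, 2)
    yield (user_id, {"ts": int(timestamp), "url": url})

counts = list(word_count_map("d1", "To be or not to be"))
postings = list(text_index_map("d1", "to be"))
events = list(log_parse_map(0, "u1 100 /home"))
```

Note that the input key (a document ID or line number) and the output key (a word or user ID) have different types, as the formal definition allows.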

Page 17:

Reduce step basics

Input = {<key, value> pairs}, where all the keys are equal

Output = {<key, value> pairs}, where the set commonly has cardinality = 1

Input and output key types don’t have to be the same

Just like Map, “embarrassingly parallel” based on key

Page 18:

Reduce step examples

Word count
  Input format = <WordName, 1>
  Output format = <WordName, count>
Text indexing
  Input format = <WordName, (DocumentID, Offset)>
  Output format = <WordName, index file>
Log parsing
  E.g., input format = <UserID or EventID, event record>
  E.g., output format = <Same, reformatted event record>
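
The first two Reduce examples might look like this in Python. Each function receives one key together with all of the values emitted for it. The names are hypothetical, and a sorted list of postings stands in for the "index file", which the slide does not describe further.

```python
def word_count_reduce(word, ones):
    """Collapse <WordName, 1> pairs into a single <WordName, count> pair."""
    yield (word, sum(ones))

def text_index_reduce(word, postings):
    """Collapse <WordName, (DocumentID, Offset)> pairs into one posting list.

    A sorted list stands in for the per-word index file (an assumption).
    """
    yield (word, sorted(postings))

count_out = list(word_count_reduce("be", [1, 1]))
index_out = list(text_index_reduce("be", [("d2", 7), ("d1", 3), ("d1", 16)]))
```

Both outputs have cardinality 1 per key, which matches the common case described under "Reduce step basics".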

Page 19:

More honoured in the breach than in the observance!

Page 20:

Sometimes the Reduce step is trivial

MapReduce for data mining

Partition on some key
Calculate a single vector* for each whole partition
Aggregate the vectors
Hooray!

*Algorithm-dependent
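
One way to picture this pattern is computing a mean and variance. The per-partition vector here is (count, sum, sum of squares), chosen purely for illustration since the slide says the vector is algorithm-dependent; the function names and data are hypothetical.

```python
import math

def partition_vector(values):
    """Heavy step: summarize one whole partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def aggregate_vectors(vectors):
    """Trivial step: add the vectors and derive mean and population variance."""
    n = sum(v[0] for v in vectors)
    total = sum(v[1] for v in vectors)
    total_sq = sum(v[2] for v in vectors)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance

# Two partitions, as if the data had been split on some key.
partitions = [[1.0, 2.0], [3.0, 4.0]]
vectors = [partition_vector(p) for p in partitions]
mean, variance = aggregate_vectors(vectors)
```

Almost all of the work happens per partition; the final aggregation only adds a handful of small vectors, which is the sense in which the Reduce step is trivial.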

Page 21:

Sometimes Reduce doesn’t reduce

Tick stream data “cooking” can increase its size by one to two orders of magnitude

Sessionization might just add a column – SessionID – to records
  Or is that a Map step masquerading as a Reduce?
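
A minimal sketch of sessionization as a Reduce step, assuming events are (timestamp, url) tuples grouped by user and that a gap longer than a timeout starts a new session. The 30-minute default, the SessionID format, and the function name are assumptions, not anything the slide specifies.

```python
def sessionize_reduce(user_id, events, timeout=1800):
    """Emit every input event unchanged, plus a SessionID column.

    A new session starts when the gap since the user's previous event
    exceeds `timeout` seconds (hypothetical default: 30 minutes).
    """
    session_no = 0
    prev_ts = None
    for ts, url in sorted(events):
        if prev_ts is not None and ts - prev_ts > timeout:
            session_no += 1
        prev_ts = ts
        yield (user_id, (ts, url, f"{user_id}-{session_no}"))

clicks = [(5000, "/c"), (0, "/a"), (100, "/b")]
sessionized = list(sessionize_reduce("u1", clicks))
```

The output has exactly as many records as the input, so nothing is reduced. The step still needs all of a user's events together and in order, which is what the redistribution before a Reduce provides.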

Page 22:

Some reasons to integrate SQL and MapReduce

JOINs were invented for a reason
So was SQL 2003
It’s kind of traditional to keep data in an RDBMS

Page 23:

Some ways to integrate SQL and MapReduce

A SQL layer built on a MapReduce engine
  E.g., Facebook’s Hive over Hadoop
  But building a DBMS-equivalent is hard
MapReduce invoking SQL
SQL invoking MapReduce
  Aster’s SQL M/R

Page 24:

To materialize or not to materialize?

DBMS avoidance of intermediate materialization gives much better performance

Classic MapReduce intermediate materialization gives query fault-tolerance

How much does query fault-tolerance matter?
  (Query duration) x (Node count) vs. Node MTTF

DBMS-style materialization strategies usually win
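
The comparison of (Query duration) x (Node count) against Node MTTF can be turned into a back-of-the-envelope calculation. The sketch below assumes independent node failures at a constant rate, which is a simplification, and every number in it is hypothetical.

```python
import math

def expected_failures(query_hours, node_count, node_mttf_hours):
    """Expected node failures during one query: duration x nodes / MTTF."""
    return query_hours * node_count / node_mttf_hours

def failure_probability(query_hours, node_count, node_mttf_hours):
    """Chance of at least one failure, assuming a constant failure rate."""
    return 1.0 - math.exp(-expected_failures(query_hours, node_count, node_mttf_hours))

# Hypothetical clusters, both with a node MTTF of 10,000 hours.
small = failure_probability(query_hours=1, node_count=100, node_mttf_hours=10_000)
large = failure_probability(query_hours=10, node_count=5_000, node_mttf_hours=10_000)
```

Under these made-up figures the small cluster loses a node during about 1% of its queries, so restarting the occasional failed query is cheap. The large cluster expects several failures per query, and there mid-query recovery starts to justify the cost of materializing intermediate results.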

Page 25:

Other reasons to put your data in a real database

Query response time
General performance
Backup
Security
General administration
SQL syntax
General programmability and connectivity

Page 26:

Aspects of Aster’s approach to MapReduce

Data stored in a database
MapReduce execution managed by a DBMS
Flexible MapReduce syntax
MapReduce invoked via SQL

Page 27:

Further information

Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2

contact @monash.com
http://www.monash.com
http://www.DBMS2.com