MapReduce and Data Management Based on Jimmy Lin’s lecture slides (http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html) (licensed under the Creative Commons Attribution 3.0 License). I also used some ideas from chapter 2 of the book by Anand Rajaraman and Jeff Ullman: "Mining of Massive Datasets" (http://i.stanford.edu/~ullman/mmds.html)
MapReduce Algorithm Design
MapReduce: Recap
• Programmers must specify:
map (k, v) → list(<k’, v’>)
reduce (k’, list(v’)) → <k’’, v’’>
– All values with the same key are reduced together
• Optionally, also:
partition (k’, number of partitions) → partition for k’
– Often a simple hash of the key, e.g., hash(k’) mod n
– Divides up the key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
– Mini-reducers that run in memory after the map phase
– Used as an optimization to reduce network traffic
• The execution framework handles everything else…
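The map/reduce contract above can be illustrated with a minimal word-count simulation in plain Python. This is a sketch that emulates the framework in-process, not Hadoop itself; `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative names.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document id (unused), value: a line of text
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # All values with the same key are reduced together.
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Simulate the shuffle-and-sort phase: group intermediate values by key.
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    out = []
    for k2 in sorted(groups):          # reducers see keys in sorted order
        out.extend(reduce_fn(k2, groups[k2]))
    return out

print(run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn))
# -> [('a', 2), ('b', 2), ('c', 1)]
```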
[Figure: data flow of a MapReduce job. Mappers emit intermediate (key, value) pairs, optional combiners aggregate them locally, and partitioners assign keys to reducers. Shuffle and Sort: aggregate values by keys before the reduce phase produces the final output.]
“Everything Else”
• The execution framework handles everything else…
– Scheduling: assigns workers to map and reduce tasks
– “Data distribution”: moves processes to data
– Synchronization: gathers, sorts, and shuffles intermediate data
– Errors and faults: detects worker failures and restarts
• Limited control over data and execution flow
– All algorithms must be expressed in map, reduce, combine, and partition
• You don’t know:
– Where mappers and reducers run
– When a mapper or reducer begins or finishes
– Which input a particular mapper is processing
– Which intermediate key a particular reducer is processing
Tools for Synchronization
• Cleverly-constructed data structures
– Bring partial results together
• Sort order of intermediate keys
– Control the order in which reducers process keys
• Partitioner
– Control which reducer processes which keys
• Preserving state in mappers and reducers
– Capture dependencies across multiple keys and values
• One of each per slave machine:
– Tasktracker (TT)
– Datanode (DN)
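The partitioner idea above can be sketched in a few lines, assuming the common hash(key) mod n scheme; `partition` is an illustrative name, not a Hadoop API. `hashlib` is used instead of Python's built-in `hash()` so the assignment is stable across runs.

```python
import hashlib

def partition(key, num_partitions):
    # Hash the key to a stable integer, then take it modulo the
    # number of reducers: this decides which reducer sees the key.
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return h % num_partitions

# All values for the same key land in the same partition,
# so a single reducer processes them together.
assert partition("apple", 4) == partition("apple", 4)
print(partition("apple", 4), partition("banana", 4))
```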
Putting everything together…
[Figure: cluster architecture. The job submission node runs the jobtracker, the namenode runs the namenode daemon, and each slave node runs a tasktracker and a datanode daemon on top of the Linux file system.]
Anatomy of a Job
• MapReduce program in Hadoop = Hadoop job
– Jobs are divided into map and reduce tasks
– An instance of a running task is called a task attempt
– Multiple jobs can be composed into a workflow
• Job submission process
– Client (i.e., the driver program) creates a job, configures it, and submits it to the jobtracker
– JobClient computes input splits (on the client end)
– Job data (jar, configuration XML) are sent to the JobTracker
– JobTracker puts job data in a shared location, enqueues tasks
– TaskTrackers poll for tasks
– Off to the races…
[Figure: InputSplit]
Source: redrawn from a slide by Cloudera, cc-licensed
– What is the proper treatment of dangling nodes?
– How do we factor in the random jump factor?
• Solution:
– Second pass to redistribute “missing PageRank mass” and account for random jumps
– p is the PageRank value from before, p' is the updated PageRank value
– |G| is the number of nodes in the graph
– m is the missing PageRank mass
p' = α (1/|G|) + (1 − α) (m/|G| + p), where α is the random jump factor
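The second pass that redistributes missing mass can be sketched as follows, assuming the formula above; `redistribute` and the default value of `alpha` are illustrative assumptions.

```python
def redistribute(p, m, num_nodes, alpha=0.15):
    # p' = alpha * (1/|G|) + (1 - alpha) * (m/|G| + p)
    # alpha: random jump factor, m: missing PageRank mass,
    # num_nodes: |G|, p: PageRank value from the first pass.
    return alpha * (1.0 / num_nodes) + (1 - alpha) * (m / num_nodes + p)

# Sanity check: with no missing mass, total PageRank stays 1.
ranks = [0.5, 0.3, 0.2]
adjusted = [redistribute(p, 0.0, len(ranks)) for p in ranks]
print(round(sum(adjusted), 6))  # -> 1.0
```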
PageRank Convergence
• Alternative convergence criteria
– Iterate until PageRank values don’t change
– Iterate until PageRank rankings don’t change
– Fixed number of iterations
• Convergence for web graphs?
Beyond PageRank
• Link structure is important for web search
– PageRank is one of many link-based features: HITS, SALSA, etc.
– One of many thousands of features used in ranking…
• Adversarial nature of web search
– Link spamming
– Spider traps
– Keyword stuffing
– …
Efficient Graph Algorithms
• Sparse vs. dense graphs
• Graph topologies
Figure from: Newman, M. E. J. (2005) “Power laws, Pareto distributions and Zipf's law.” Contemporary Physics 46:323–351.
Power Laws are everywhere!
Local Aggregation
• Use combiners!
– The in-mapper combining design pattern is also applicable
• Maximize opportunities for local aggregation
– Simple tricks: sorting the dataset in specific ways
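The in-mapper combining pattern mentioned above can be sketched like this: instead of emitting (word, 1) for every token, the mapper accumulates partial counts in memory and emits them once at the end of its input split. The class and method names are illustrative, not Hadoop APIs.

```python
from collections import defaultdict

class WordCountMapper:
    def __init__(self):
        self.counts = defaultdict(int)   # state preserved across map() calls

    def map(self, key, value):
        for word in value.split():
            self.counts[word] += 1       # local aggregation, no emit yet

    def close(self):
        # Emit one (word, partial_count) pair per distinct word,
        # greatly reducing intermediate network traffic.
        return sorted(self.counts.items())

m = WordCountMapper()
m.map(0, "a b a")
m.map(1, "b c")
print(m.close())  # -> [('a', 2), ('b', 2), ('c', 1)]
```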
MapReduce and Databases
Relational Databases vs. MapReduce
• Relational databases:
– Multipurpose: analysis and transactions; batch and interactive
– Data integrity via ACID transactions
– Lots of tools in software ecosystem (for ingesting, reporting, etc.)
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Source: O’Reilly Blog post by Joseph Hellerstein (11/19/2008)
• OLTP (online transaction processing)
– Tasks: relatively small set of “standard” transactional queries
– Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• OLAP (online analytical processing)
– Typical applications: business intelligence, data mining
– Back-end processing: batch workloads, less concurrency
– Tasks: complex analytical queries, often ad hoc
– Data access pattern: table scans, large amounts of data involved per query
OLTP/OLAP Architecture
OLTP OLAP
ETL (Extract, Transform, and Load)
OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
– Extract records from source
– Transform: clean data, check integrity, aggregate, etc.
– Load into OLAP database
• OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services
OLTP/OLAP/Hadoop Architecture
OLTP OLAP
ETL (Extract, Transform, and Load)
Hadoop
Why does this make sense?
ETL Bottleneck
• Reporting is often a nightly task:
– ETL is often slow: why?
– What happens if processing 24 hours of data takes longer than 24 hours?
• Hadoop is perfect:
– Most likely, you already have some data warehousing solution
– Ingest is limited by the speed of HDFS
– Scales out with more nodes
– Massively parallel
– Ability to use any processing tool
– Much cheaper than parallel databases
– ETL is a batch process anyway!
MapReduce algorithms for processing relational data
Relational Algebra
• Primitives
– Projection (π)
– Selection (σ)
– Cartesian product (×)
– Set union (∪)
– Set difference (−)
– Rename (ρ)
• Other operations
– Join (⋈)
– Group by… aggregation
– …
Projection
[Figure: projection. Each input tuple R1…R5 maps to an output tuple R1…R5 containing only the selected attributes.]
Projection in MapReduce
• Easy!
– Map over tuples, emit new tuples with the appropriate attributes
– Reduce: take tuples that appear multiple times and emit only one version (duplicate elimination)
• For a tuple t in R: Map(t, t) → (t’, t’)
• Reduce: (t’, [t’, …, t’]) → (t’, t’)
• Basically limited by HDFS streaming speeds
– Speed of encoding/decoding tuples becomes important
– Relational databases take advantage of compression
– Semistructured data? No problem!
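The projection-with-duplicate-elimination idea above can be sketched as follows; `project` and the attribute-index scheme are illustrative assumptions, with the shuffle simulated in-process.

```python
from collections import defaultdict

def project(tuples, attrs):
    # Map phase: emit the projected tuple as both key and value.
    groups = defaultdict(list)
    for t in tuples:
        t_proj = tuple(t[i] for i in attrs)
        groups[t_proj].append(t_proj)    # shuffle groups duplicates by key
    # Reduce phase: emit one copy per distinct projected tuple.
    return sorted(groups)

rows = [("a", 1, "x"), ("a", 2, "x"), ("b", 1, "y")]
print(project(rows, [0, 2]))  # -> [('a', 'x'), ('b', 'y')]
```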
Selection
[Figure: selection. Input tuples R1…R5 are filtered down to the tuples that satisfy the predicate (here R1 and R3).]
Selection in MapReduce
• Easy!
– Map over tuples, emit only tuples that meet criteria
– No reducers, unless for regrouping or resorting tuples (reducers are the identity function)
– Alternatively: perform in reducer, after some other processing
• But very expensive!!! It has to scan the entire database
– Better approaches?
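A minimal sketch of map-side selection, assuming a predicate over illustrative tuples (`select_map` and the field layout are assumptions): the map emits only tuples meeting the criteria, and no reducer is needed.

```python
def select_map(tuples, predicate):
    # Map phase: identity emit for tuples satisfying the predicate;
    # all other tuples are simply dropped.
    for t in tuples:
        if predicate(t):
            yield t

rows = [("url1", 30), ("url2", 120), ("url3", 45)]
print(list(select_map(rows, lambda t: t[1] > 40)))
# -> [('url2', 120), ('url3', 45)]
```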
Union, Set Intersection and Set Difference
• Similar ideas: for each tuple t, the map outputs the pair (t, t). For union, the reducer outputs t once; for intersection, only when the reduce input is (t, [t, t])
• For Set difference?
Set Difference
- Map Function: For a tuple t in R, produce key-value pair (t, R), and for a tuple t in S, produce key-value pair (t, S).
- Reduce Function: For each key t, do the following.
1. If the associated value list is [R], then produce (t, t).
2. If the associated value list is anything else, which could only be [R, S], [S, R], or [S], produce (t, NULL).
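The set-difference map and reduce functions above can be sketched as follows, with the shuffle simulated in-process (`set_difference` is an illustrative name):

```python
from collections import defaultdict

def set_difference(R, S):
    # Map phase: tag each tuple with the name of its source relation.
    groups = defaultdict(list)
    for t in R:
        groups[t].append("R")
    for t in S:
        groups[t].append("S")
    # Reduce phase: emit t only when the value list is exactly [R];
    # any other list ([R, S], [S, R], or [S]) produces nothing.
    return sorted(t for t, tags in groups.items() if tags == ["R"])

print(set_difference([(1,), (2,), (3,)], [(2,)]))  # -> [(1,), (3,)]
```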
Group by… Aggregation
• Example: What is the average time spent per URL?
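The average-time-per-URL example can be sketched like this: map emits (url, time) pairs, and reduce averages the values grouped under each URL. The log layout and function name are illustrative assumptions.

```python
from collections import defaultdict

def average_time_per_url(log):
    # Map + shuffle: group the time values by URL key.
    groups = defaultdict(list)
    for url, t in log:
        groups[url].append(t)
    # Reduce: one (url, average) pair per distinct URL.
    return {url: sum(ts) / len(ts) for url, ts in groups.items()}

log = [("u1", 10), ("u1", 20), ("u2", 5)]
print(average_time_per_url(log))  # -> {'u1': 15.0, 'u2': 5.0}
```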