Anti-Skew: Single-Key Data Skew Mitigation for MapReduce Yue Chen Florida State University Advanced Database Systems

Dec 18, 2015


Claude George
Transcript
Page 1

Anti-Skew: Single-Key Data Skew Mitigation for MapReduce

Yue Chen, Florida State University

Advanced Database Systems

Page 2

Outline

• Background
• Data Skew
• Anti-skew Design
• Conclusion
• Related Work

Page 3

Background

Skip

Page 4

Big Data Trend

Page 5

Big Data Trend

Mike Olson is a co-founder and former CEO of Cloudera.

Page 6

Big Data Trend

Page 7

History

Page 8

Review of MapReduce Word Count

Page 9

Data Skew

Page 10

What is data skew?

[Figure: bar chart of Attribute1 per key, Key1 to Key8; axis from 0 to 7]

Page 11

What is data skew?

[Figure: bar chart of Attribute1 per key, Key1 to Key8; axis from 0 to 25]

Page 12

What is data skew?

[Figure: bar chart of Attribute1 per key, Key1 to Key8; axis from 0 to 350]

Page 13
Page 14

The Most Skewed Key?

NULL

Reported by the data team of

Page 15

Anti-skew Design

Page 16

Problem

[Diagram: Reducer1 receives Key1, Key2, and Key3; Reducer2 receives Key4, Key5, and Key6]

Page 17

Assumption 1

• The task can be divided into sub-tasks whose partial results can easily be reassembled into the final result.
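
As a minimal sketch of this assumption, consider counting: each sub-task produces a partial count, and the partial counts are reassembled by summation. The function names and the round-robin chunking below are illustrative assumptions, not part of the original design.

```python
from collections import Counter


def partial_counts(chunk):
    """Sub-task: count the occurrences of each key within one chunk."""
    return Counter(chunk)


def reassemble(partials):
    """Reassembly step: merge the partial counts by summation."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def split_into_chunks(records, n_chunks):
    """Hypothetical helper: deal records round-robin into n_chunks sub-tasks."""
    return [records[i::n_chunks] for i in range(n_chunks)]


records = ["the", "a", "the", "skew", "the", "a"]
chunks = split_into_chunks(records, 3)
result = reassemble(partial_counts(c) for c in chunks)
```

Counting satisfies the assumption because summation is associative and commutative, so the order in which partial results arrive does not matter. A task such as computing an exact median would not decompose this easily.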

Page 18

Assumption 2

• The key-value pairs in the input data are nearly evenly distributed, so sampling would be effective, although pre-execution sampling is not required.

Page 19

Basic Idea

[Diagram: Key1 is split into sub-keys Key1.1, Key1.2, Key1.3, Key1.4, Key1.5, and Key1.6]
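
The splitting of Key1 into Key1.1 through Key1.6 might be sketched as follows. The round-robin assignment, the `.N` suffix format, and the function names are assumptions made for illustration; the slides do not specify how records are assigned to sub-keys.

```python
import itertools
from collections import defaultdict

N_SPLITS = 6  # the slide shows Key1 split into Key1.1 ... Key1.6


def split_key(key, counter, n_splits=N_SPLITS):
    """Map a skewed key to one of its sub-keys in round-robin order."""
    return f"{key}.{next(counter) % n_splits + 1}"


def merge_sub_keys(sub_results):
    """Second pass: strip the suffix and sum partial results per original key."""
    merged = defaultdict(int)
    for sub_key, value in sub_results.items():
        original = sub_key.rsplit(".", 1)[0]
        merged[original] += value
    return dict(merged)


counter = itertools.count()
sub_counts = defaultdict(int)
for _ in range(12):  # 12 records that all carry the skewed key
    sub_counts[split_key("Key1", counter)] += 1

merged = merge_sub_keys(sub_counts)
```

Each sub-key can be processed by a different reducer, and the merge step relies on Assumption 1 to recombine the partial results.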

Page 20

Skew Perception

Needs visualization!

Page 21

Skew Detection

Page 22

Straggler Identification (tentative)

• A key is flagged as a straggler when its count is more than 50% (100%? 200%?) of the median key count.

Straggler
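
A minimal sketch of this tentative rule, assuming the threshold is expressed as a multiple of the median key count. The `ratio` parameter, the function name, and the sample counts are hypothetical; the slides leave the actual threshold open.

```python
from statistics import median


def find_stragglers(key_counts, ratio=2.0):
    """Return the keys whose count exceeds ratio * median count.

    ratio=2.0 corresponds to the "200% of the median" option; 0.5 and 1.0
    correspond to the other two candidates mentioned on the slide.
    """
    med = median(key_counts.values())
    return sorted(k for k, c in key_counts.items() if c > ratio * med)


# Made-up counts for illustration only.
counts = {"Key1": 5, "Key2": 4, "Key3": 6, "Key4": 5,
          "Key5": 300, "Key6": 5, "Key7": 4, "Key8": 6}
stragglers = find_stragglers(counts)
```

The median is used rather than the mean because a single heavily skewed key barely moves the median, whereas it can inflate the mean enough to hide itself.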

Page 23

Straggler Identification (tentative)

[Figure: bar chart of Attribute1 per key, Key1 to Key8; axis from 0 to 25]

Page 24

Key Splitting

the
  Hash → 8fc42c6ddf9966db3b09e84365034357
  Rehash → 6e9b31333e61aad015fa16a3a5fe8e0d
  Rehash → 2e20bfee9e4486f0ab651fc0bb988ffb

Page 25

Key Splitting

[Diagram: the special key "the" is split into three sub-keys]

the (Special Key)
  8fc42c6ddf9966db3b09e84365034357
  6e9b31333e61aad015fa16a3a5fe8e0d
  2e20bfee9e4486f0ab651fc0bb988ffb
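
The Hash and Rehash steps on the Key Splitting slides could be sketched as a hash chain. The digests shown are 32 hexadecimal characters, which is consistent with MD5, but the choice of MD5, the rehashing of the hexadecimal string (rather than the raw bytes), and the round-robin assignment of records to sub-keys are all assumptions here.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of a UTF-8 string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sub_keys(key, n_splits):
    """Derive n_splits sub-keys: hash the key, then rehash each digest."""
    keys = []
    current = key
    for _ in range(n_splits):
        current = md5_hex(current)
        keys.append(current)
    return keys


def assign_sub_key(key, record_index, n_splits):
    """Pick one sub-key per record in round-robin order (an assumption)."""
    return sub_keys(key, n_splits)[record_index % n_splits]


splits = sub_keys("the", 3)
```

Because the chain is deterministic, every mapper derives the same sub-keys for a given special key without any coordination.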

Page 26

Load Balancing (tentative)

• Can the hashing algorithm, combined with the platform's partitioning algorithm, distribute the keys evenly across reducers?

Partitioner Function
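
One way to explore this question is to simulate a hash partitioner over the generated sub-keys. The sketch below follows the general shape of Hadoop's default `HashPartitioner` (a non-negative hash taken modulo the number of reduce tasks), but `stable_hash` is a stand-in chosen for illustration and is not the hash function any particular platform uses.

```python
import hashlib
from collections import Counter

MAX_INT = 0x7FFFFFFF  # mask that keeps the hash non-negative


def stable_hash(key):
    """Stand-in for the platform's key hash (assumption): first 4 MD5 bytes."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


def partition(key, num_reducers):
    """Non-negative hash modulo the number of reducers."""
    return (stable_hash(key) & MAX_INT) % num_reducers


def reducer_load(sub_keys, num_reducers):
    """Count how many sub-keys land on each reducer."""
    load = Counter(partition(k, num_reducers) for k in sub_keys)
    return [load.get(r, 0) for r in range(num_reducers)]


sub_keys = [f"Key1.{i}" for i in range(1, 7)]
load = reducer_load(sub_keys, num_reducers=3)
```

With only a handful of sub-keys, nothing guarantees an even spread: several sub-keys may land on the same reducer, which is exactly the open question on this slide.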

Page 27

Privacy (if pre-processed)

Name             Net worth (USD)
Bill Gates       $79.2 billion
Carlos Slim      $77.1 billion
Warren Buffett   $72.7 billion
Amancio Ortega   $64.5 billion
Larry Ellison    $54.3 billion
Charles Koch     $42.9 billion
David Koch       $42.9 billion
Christy Walton   $41.7 billion

Page 28

Privacy (if pre-processed)

Name         Net worth (USD)
8adc1a86f7   $79.2 billion
8ea0bb9a8f   $77.1 billion
9e640e0fe9   $72.7 billion
abf803fe43   $64.5 billion
bce5c74f58   $54.3 billion
4f589f4867   $42.9 billion
4867dbd572   $42.9 billion
e9ca9f808c   $41.7 billion
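
The transformation between the two tables can be sketched as replacing each name with a truncated hash during pre-processing. The slides do not say which hash function or truncation produced the 10-character values shown, so SHA-256 and the `pseudonymize` helper below are assumptions.

```python
import hashlib


def pseudonymize(name, length=10):
    """Replace a name with the first `length` hex characters of its SHA-256 digest."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:length]


rows = [("Bill Gates", "$79.2 billion"), ("Carlos Slim", "$77.1 billion")]
hashed_rows = [(pseudonymize(name), worth) for name, worth in rows]
```

The values are left untouched; only the key column is hashed, so any aggregation keyed on the name still groups the same records together.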

Page 29

Conclusion

• A simple way to handle single-key skew in the MapReduce programming model

• No extra OS-level resources needed

• Implemented as a wrapper: no need to modify the platform's source code, so it can be used on online platforms (there are so many Hadoop distributions, versions, and patches!)

Page 30

Hadoop Distributions

Page 31
Page 32
Page 33

Related Work

Skip

Page 34

SkewReduce

[Figure: data space divided into partitions of varying sizes, numbered 1 to 15]

• Varying granularities of partitions

• Can we automatically find a good partition plan and schedule?

Page 35

SkewReduce

• Goal: minimize expected total runtime

[Diagram: a sample, the cluster configuration, and the cost functions feed the SkewReduce Optimizer, shown alongside the partitions numbered 1 to 15]

Page 36

SkewTune

• Does what SkewReduce does, but while the program is running.

• Skew detected -> Stop -> Repartition -> Continue

Page 37

SpongeFiles

Page 38

Q&A

Suggestions?