Dataset Skew Resilience in Main Memory Parallel Hash Joins

Examples of Dataset Skew

▪ Joins on non-primary key columns (duplicate keys)
▪ Parallel multiway joins
▪ Compound queries (multiple joins and aggregations)
▪ Many examples in real-world data
  ▪ The size of cities and the length and frequency of words can be modeled with Zipfian distributions
  ▪ Measurement errors and IQ scores often follow Gaussian distributions
▪ Dataset skew can occur in attributes as well as in the correlation between relations
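As a rough illustration of attribute skew, the sketch below draws join keys from a Zipf-like distribution and reports how much of the table the most frequent key occupies. It is not the generator used for the experiments; the function names, the exponent parameter `s`, and the sizes are assumptions chosen for the example.

```python
import random
from collections import Counter
from itertools import accumulate


def zipf_keys(n, num_distinct, s=1.0, seed=0):
    """Draw n keys from 1..num_distinct with P(k) proportional to 1 / k**s."""
    rng = random.Random(seed)
    weights = [1.0 / (k ** s) for k in range(1, num_distinct + 1)]
    cum_weights = list(accumulate(weights))
    return rng.choices(range(1, num_distinct + 1), cum_weights=cum_weights, k=n)


def top_key_share(keys):
    """Fraction of rows that carry the single most frequent key."""
    return Counter(keys).most_common(1)[0][1] / len(keys)


# s = 0 makes every weight equal, so the draw is uniform (no skew).
uniform = zipf_keys(100_000, 1_000, s=0.0)
# s = 1 is an assumed, commonly used Zipf exponent.
skewed = zipf_keys(100_000, 1_000, s=1.0)

print(f"uniform top-key share: {top_key_share(uniform):.4f}")
print(f"skewed  top-key share: {top_key_share(skewed):.4f}")
```

In the skewed draw a single key accounts for a large share of all rows, which is the duplicate-key situation listed above.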

Experiments

▪ We generated large datasets (hundreds of millions of records) that vary in terms of distribution, correlation, shuffling, and skew on the build and probe tables
▪ We measured hash join execution time, with three different hash table implementations
  ▪ Separate Chaining hash table (SC) is the original baseline
  ▪ Modified Separate Chaining hash table (MSC)
  ▪ Custom Cuckoo Hashing implementation (CCH)
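For readers unfamiliar with cuckoo hashing, the following is a minimal textbook-style sketch, not the CCH implementation evaluated here. The class name, the two hash functions derived from Python's `hash()`, the eviction limit, and the choice to store duplicate keys as a list under one entry are all assumptions made for illustration.

```python
class CuckooTable:
    """Two-table cuckoo hash map: each key lives in one of two candidate slots."""

    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Illustrative hash functions: one per table, derived from hash().
        return hash((which, key)) % self.capacity

    def get(self, key):
        # A lookup inspects at most two slots.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        # Duplicate keys append to the existing entry instead of taking a slot.
        existing = self.get(key)
        if existing is not None:
            existing.append(value)
            return
        entry = (key, [value])
        which = 0
        for _ in range(self.max_kicks):
            idx = self._slot(which, entry[0])
            # Place the entry and pick up whatever occupied the slot before.
            entry, self.tables[which][idx] = self.tables[which][idx], entry
            if entry is None:
                return
            which = 1 - which  # the evicted entry moves to its other table
        self._grow(entry)

    def _grow(self, pending):
        # Too many evictions: double the capacity and reinsert everything.
        old = [e for t in self.tables for e in t if e is not None] + [pending]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for key, values in old:
            for value in values:
                self.insert(key, value)


table = CuckooTable()
for deptno, desc in [(10, "Marketing"), (20, "IT"), (30, "HR"), (40, "Financial")]:
    table.insert(deptno, desc)
print(table.get(30))
```

The property of interest for joins is that a probe touches a bounded number of slots, whereas a chained table may have to walk a long chain.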

Puya Memarzia, Virendra Bhavsar, and Suprio Ray

Faculty of Computer Science, University of New Brunswick, Fredericton, New Brunswick, Canada

ABSTRACT

The rapid rise of big data and data analytics is driving the need for greater algorithmic efficiency. Parallel main memory hash joins are prescribed to accelerate database joins. However, the performance of these joins can be hindered by dataset skew. We tested four popular hash join algorithms on an extensive array of datasets. Our results demonstrate that hash joins are acutely affected by dataset skew, and performance is further hampered if the data is unordered. To address these issues, we propose two different hash tables. First, we use a separate chaining hash table that is based on an existing implementation that we have modified. This version outperforms the original implementation on skewed datasets by up to three orders of magnitude. Second, we propose a novel hash table that can further improve performance by up to 17.3x.

References:
[1] Image from Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Zipf_distribution_CMF.png#/media/File:Zipf_distribution_CMF.png, CC-BY-SA 3.0
[2] By Inductiveload (self-made, Mathematica, Inkscape) [Public domain], via Wikimedia Commons, https://commons.wikimedia.org/w/index.php?curid=3817960

(a) Zipfian distribution [1]

Motivation

In-memory hash joins on datasets that are skewed and/or shuffled are significantly slower than joins on ordered, non-skewed datasets. The performance penalty can be several orders of magnitude.

▪ Throwing more hardware at the problem is not always an option.
▪ The design and implementation of the hash table is one of the main bottlenecks.
▪ Join time deteriorates due to increased memory lookups, lock contention, and CPU cache and TLB misses.

Hash Join Configurations

Hash joins can leverage data parallelism to scale up on multicore processors. Hash join configurations can be categorized based on their partitioning scheme (or lack thereof) and parallel implementation.

1. No Partitioning (Nopart): one large lock-protected hash table shared by all threads
2. Shared Partitioning (Part-Share): multiple lock-protected partitions shared by all threads
3. Independent Partitioning (Part-Indep): private lock-less partitions for each thread
4. Radix Partitioning (Part-Radix): a hierarchy of partitions with dynamic load balancing
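The idea of giving each worker a private, lock-free table can be sketched as follows. This is loosely in the spirit of configuration 3 and is not the evaluated implementation: the function names and row layout `(payload, key)` are assumptions, and Python threads only illustrate the structure rather than real multicore scaling.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_parts):
    """Split (payload, key) rows into num_parts lists by hashing the join key."""
    parts = [[] for _ in range(num_parts)]
    for row in rows:
        parts[hash(row[1]) % num_parts].append(row)
    return parts


def join_partition(build_part, probe_part):
    """Build and probe one partition pair; the table is private, so no locks."""
    table = defaultdict(list)
    for payload, key in build_part:
        table[key].append(payload)
    return [(probe_payload, build_payload)
            for probe_payload, key in probe_part
            for build_payload in table.get(key, ())]


def partitioned_join(build, probe, num_parts=4):
    """Partition both relations the same way, then join each pair in a worker."""
    build_parts = partition(build, num_parts)
    probe_parts = partition(probe, num_parts)
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        results = pool.map(join_partition, build_parts, probe_parts)
    return [pair for part in results for pair in part]


dept = [("Marketing", 10), ("IT", 20), ("HR", 30), ("Financial", 40)]
emp = [("Gary", 10), ("Dulley", 10), ("Reiter", 30)]
print(sorted(partitioned_join(dept, emp)))
```

Because rows are routed by key hash, every row carrying a heavily duplicated key lands in the same partition, so one worker can end up with far more work than the others.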

Figure 2. The performance impact of dataset skew and shuffling on four hash join algorithms (note that the y-axis is log scale)

[Plot for Figure 2: x-axis Hash Join Configuration (Nopart, Part-Share, Part-Indep, Part-Radix); y-axis Runtime (CPU Cycles), log10 scale with ticks from 1 to 10^14; series: Non-skewed Ordered, Non-skewed Shuffled, Skewed Shuffled]

Figure 1. Example of a hash join between two corresponding tables or partitions from each table – using a separate chaining hash table
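The build and probe flow that Figure 1 depicts can be sketched with a small separate chaining table. The table contents come from the figure; the function names, the bucket count, and the use of Python's `hash()` are assumptions for illustration and do not reflect the SC or MSC implementations.

```python
def build(rows, num_buckets=4):
    """Build phase: hash each build tuple's key and append it to that bucket's chain."""
    buckets = [[] for _ in range(num_buckets)]
    for desc, deptno in rows:
        buckets[hash(deptno) % num_buckets].append((deptno, desc))
    return buckets


def probe(buckets, rows):
    """Probe phase: hash each probe tuple's key and walk the matching chain."""
    out = []
    for ename, deptno in rows:
        for key, desc in buckets[hash(deptno) % len(buckets)]:
            if key == deptno:  # chains may mix keys, so compare explicitly
                out.append((ename, desc))
    return out


dept_part0 = [("Marketing", 10), ("IT", 20), ("HR", 30), ("Financial", 40)]
emp_part0 = [("Gary", 10), ("Dulley", 10), ("Reiter", 30),
             ("Taylor", 20), ("Prevatt", 30)]

result = probe(build(dept_part0), emp_part0)
for ename, desc in result:
    print(f"{ename}, {desc}")
```

Every duplicate key in the build relation lengthens a single chain, so under skew each probe of a hot key has to walk a long chain.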

Figure 4. Original (SC) versus modified hash table (MSC) on skewed dataset

(b) Gaussian distribution [2]

Figure 3. Cumulative distribution function of data distributions that mimic some skewed datasets

[Plot for Figure 5: x-axis CPU Architecture (Intel Skylake, Intel Harpertown, AMD K8); y-axis Runtime (CPU Cycles), in billions, ticks from 0 to 160; series: Nopart_CCH, Nopart_MSC, Part-Share_MSC, Part-Indep_MSC, Part-Radix_MSC]

[Plot for Figure 4: x-axis Hash Join Configuration (Nopart, Part-Share, Part-Indep, Part-Radix); y-axis Runtime (CPU Cycles), log10 scale with ticks from 1 to 10^14; series: SC, MSC]

Conclusion

• Hash table skew resilience is an important feature to consider when designing parallel hash join implementations
• Our approach improves hash join times across a variety of datasets and architectures, allowing for more work with the same resources
• The dataset and CPU can help determine the most efficient method
• Applications such as machine learning, analytics, graph processing, and data warehousing could benefit from skew-resilient algorithms

[Contents of Figure 1]

Build Relation (dept_part0)
desc        deptno
Marketing   10
IT          20
HR          30
Financial   40

Probe Relation (emp_part0)
ename     deptno
Gary      10
Dulley    10
Reiter    30
Taylor    20
Prevatt   30

Hash Table (filled from the build relation through the hash function), entries shown: Financial 40; IT 20; HR 30; Marketing 10; NULL

Output (probe relation through the hash function): Gary, Marketing; Dulley, Marketing; ...

Additional label in the figure: Original Data (optional)

Figure 6. Breakdown of hash join phases on Zipf-skewed datasets: (a) Ordered dataset, (b) Shuffled dataset

[Plots for Figure 6: two panels; x-axis Configuration; y-axis Cycles per output tuple (ticks from 0 to 400 in one panel and from 0 to 250 in the other); phases: Part, Build, Probe]

Figure 5. Evaluation of three different CPU architectures on shuffled dataset

