
Irregular Algorithms & Data Structures - Nvidia · 2009-11-24 · Parallel Hashing: The Problem • Hash tables are good for sparse data. • Input: Set of key-value pairs to place

Transcript
Page 1

Irregular Algorithms & Data Structures

John Owens, Associate Professor, Electrical and Computer Engineering

SciDAC Institute for Ultrascale Visualization, University of California, Davis

Page 2

Design Principles

• Data layouts that:

• Minimize memory traffic

• Maximize coalesced memory access

• Algorithms that:

• Exhibit data parallelism

• Keep the hardware busy

• Minimize divergence

Page 3

Dense Matrix Multiplication

• for all elements E in destination matrix P:

• Pr,c = Mr • Nc (dot product of row r of M with column c of N)

[Figure: input matrices M and N and destination matrix P, each WIDTH × WIDTH.]

Page 4

Dense Matrix Multiplication

• P = M * N of size WIDTH x WIDTH

• With blocking:

• One thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P

• M and N are only loaded WIDTH / BLOCK_SIZE times from global memory

• Great saving of memory bandwidth!

[Figure: M, N, and P partitioned into BLOCK_SIZE × BLOCK_SIZE tiles, with Psub highlighted inside P.]
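The blocking idea can be sketched in plain Python (a hedged model, not the talk's CUDA kernel): each pass over a BLOCK_SIZE × BLOCK_SIZE tile stands in for a thread block staging that tile in shared memory, and the `loads` counter stands in for global-memory tile fetches.

```python
def tiled_matmul(M, N, width, block):
    """Compute P = M * N tile by tile.

    `loads` counts tile fetches of M and N, the stand-in for
    global-memory reads; each tile of M is fetched width/block times.
    """
    P = [[0.0] * width for _ in range(width)]
    loads = 0
    for bi in range(0, width, block):          # tile row of P
        for bj in range(0, width, block):      # tile column of P
            for bk in range(0, width, block):  # tiles along the dot product
                loads += 2                     # one tile each of M and N
                for i in range(bi, bi + block):
                    for j in range(bj, bj + block):
                        for k in range(bk, bk + block):
                            P[i][j] += M[i][k] * N[k][j]
    return P, loads
```

With WIDTH = 4 and BLOCK_SIZE = 2, each input matrix is read WIDTH / BLOCK_SIZE = 2 times instead of WIDTH times, which is the bandwidth saving the slide points to.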

Page 5

Dense Matrix Multiplication

• Data layouts that:

• Minimize memory traffic

• Maximize coalesced memory access

• Algorithms that:

• Exhibit data parallelism

• Keep the hardware busy

• Minimize divergence

[Figure: the tiled-matrix diagram from the previous slide.]

Page 6

Sparse Matrix-Vector Multiply: What’s Hard?

• Dense approach is wasteful

• Unclear how to map work to parallel processors

• Irregular data access

Page 7

Go see the paper!

• “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors” by Nathan Bell and Michael Garland, NVIDIA Research

• Tuesday Nov 17, 2–2:30p, PB252

Page 8

Sparse Matrix Formats

Structured ↔ Unstructured

• (DIA) Diagonal

• (ELL) ELLPACK

• (COO) Coordinate

• (CSR) Compressed Row

• (HYB) Hybrid

Page 9

Diagonal Matrices

• Diagonals should be mostly populated

• Map one thread per row

• Good parallel efficiency

• Good memory behavior [column-major storage]

[Figure: a matrix with populated diagonals at offsets -2, 0, and 1.]
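The diagonal (DIA) layout can be modeled in Python (a hedged sketch of the one-thread-per-row loop; the offsets -2, 0, 1 echo the slide's figure, and slots of a stored diagonal that fall outside the matrix are simply ignored):

```python
def dia_spmv(offsets, data, x):
    """y = A @ x for a matrix stored by diagonals (DIA format).

    offsets[d] is the diagonal's offset (0 = main, +1 = super, -1 = sub);
    data[d][i] holds A[i][i + offsets[d]] (out-of-range slots unused).
    """
    n = len(x)
    y = [0.0] * n
    for d, off in enumerate(offsets):
        for i in range(n):          # on the GPU: one thread per row
            j = i + off
            if 0 <= j < n:
                y[i] += data[d][i] * x[j]
    return y
```

Because each diagonal is a contiguous column of `data`, consecutive rows touch consecutive elements, which is where the coalesced, column-major memory behavior comes from.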

Page 10

Irregular Matrices: ELL

• Assign one thread per row again

• But now:

• Load imbalance hurts parallel efficiency

[Figure: per-row column indices stored in a dense array, e.g. rows (0, 2), (0, 1, 2, 3, 4, 5), (0, 2, 3), (0, 1, 2), (1, 2, 3, 4, 5), (5); shorter rows are padded to the longest row's length.]
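The ELL layout can be modeled like this (a hedged Python sketch; `pad = -1` marks the padding entries, and every row iterates the full padded width, which is exactly the wasted work behind the load imbalance):

```python
def ell_spmv(indices, values, x, pad=-1):
    """y = A @ x with per-row column indices padded to equal length.

    Each row is one (simulated) thread; padded slots contribute nothing
    but are still iterated, modeling the imbalance on the slide.
    """
    y = []
    for row_idx, row_val in zip(indices, values):
        acc = 0.0
        for c, v in zip(row_idx, row_val):
            if c != pad:
                acc += v * x[c]
        y.append(acc)
    return y
```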

Page 11

Irregular Matrices: COO

• General format; insensitive to sparsity pattern, but ~3x slower than ELL

• Assign one thread per element, combine results from all elements in a row to get output element

• Requires segmented reduction and communication between threads

[Figure: each nonzero stored as an explicit (row, column) pair: 00, 02; 10, 11, 12, 13, 14, 15; 20, 22, 23; 30, 31, 32; 41, 42, 43, 44, 45; 55. No padding!]
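A hedged Python model of the COO pass: the product list is the thread-per-element step, and the per-row accumulation stands in for the segmented reduction.

```python
def coo_spmv(rows, cols, vals, x, n):
    """y = A @ x from explicit (row, col, val) triples; no padding needed."""
    prods = [v * x[c] for v, c in zip(vals, cols)]  # one thread per nonzero
    y = [0.0] * n                                   # segmented reduction:
    for r, p in zip(rows, prods):                   # combine within each row
        y[r] += p
    return y
```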

Page 12

Thread-per-{element,row}

Page 13

Irregular Matrices: HYB

• Combine regularity of ELL + flexibility of COO

[Figure: the "typical" first few entries of each row go into an ELL part; the "exceptional" overflow entries (13, 14, 15, 44, 45) go into a COO part.]
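One plausible way to build the hybrid layout (a sketch; the function name, the `k` cutoff, and the -1 padding convention are mine, not from the talk):

```python
def split_hyb(rows, k):
    """Split per-row (col, val) entries into an ELL part of width k and a
    COO part holding each row's overflow ("exceptional") entries."""
    ell_cols, ell_vals, coo = [], [], []
    for r, entries in enumerate(rows):
        head, tail = entries[:k], entries[k:]
        ell_cols.append([c for c, _ in head] + [-1] * (k - len(head)))
        ell_vals.append([v for _, v in head] + [0.0] * (k - len(head)))
        coo.extend((r, c, v) for c, v in tail)
    return ell_cols, ell_vals, coo
```

Choosing `k` near the typical row length keeps the ELL padding small while pushing only the rare long rows into the COO part.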

Page 14

SpMV: Summary

• Ample parallelism for large matrices

• Structured matrices (dense, diagonal): straightforward

• Sparse matrices: Issue: Parallel efficiency

• ELL format / one thread per row is efficient

• Sparse matrices: Issue: Load imbalance

• COO format / one thread per element is insensitive to matrix structure

• Conclusion: Hybrid structure gives best of both worlds

• Take-home message: Use data structure appropriate to your matrix

• Insight: Irregularity is manageable if you regularize the common case

Page 15

Hash Tables & Sparsity

• Lefebvre and Hoppe, SIGGRAPH 2006

Page 16

Scalar Hashing

[Figure (pages 16–19, animation): three serial collision-resolution schemes: linear probing (one hash function; on collision, probe forward from the hashed slot), double probing (a second hash function #2 supplies an alternate probe sequence), and chaining (colliding keys are linked from the hashed slot).]
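As a concrete serial baseline, here is a minimal linear-probing table in Python (a sketch with illustrative names; the slides only show the scheme, not code):

```python
def insert(table, key, value):
    """Linear probing: hash once, then scan forward until a free slot
    (or the key's existing slot) is found."""
    n = len(table)
    i = hash(key) % n
    for probe in range(n):
        j = (i + probe) % n
        if table[j] is None or table[j][0] == key:
            table[j] = (key, value)
            return j
    raise RuntimeError("table full")

def lookup(table, key):
    """Follow the same probe sequence; an empty slot ends the search."""
    n = len(table)
    i = hash(key) % n
    for probe in range(n):
        j = (i + probe) % n
        if table[j] is None:
            return None
        if table[j][0] == key:
            return table[j][1]
    return None
```

The variable-length probe sequences are the point: two keys can take very different numbers of steps, which is precisely the irregularity the next slide calls out.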
Page 20

Scalar Hashing: Parallel Problems

• Construction and Lookup

• Variable time/work per entry

• Construction

• Synchronization / shared access to data structure

Page 21

Parallel Hashing: The Problem

• Hash tables are good for sparse data.

• Input: Set of key-value pairs to place in the hash table

• Output: Data structure that allows:

• Determining if key has been placed in hash table

• Given the key, fetching its value

• Could also:

• Sort key-value pairs by key (construction)

• Binary-search sorted list (lookup)

• Recalculate at every change
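The sort-then-search alternative is easy to sketch with Python's bisect (O(n log n) construction, O(log n) per lookup, and any change to the key set forces a rebuild, the "recalculate at every change" cost):

```python
import bisect

def build(pairs):
    """Construction: sort the key-value pairs by key."""
    return sorted(pairs)

def lookup(table, key):
    """Lookup: binary search the sorted list."""
    i = bisect.bisect_left(table, (key,))   # (key,) sorts before (key, v)
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None
```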

Page 22

Parallel Hashing: What We Want

• Fast construction time

• Fast access time

• O(1) for any element, O(n) for n elements in parallel

• Reasonable memory usage

• Algorithms and data structures may sit at different places in this space

• Perfect spatial hashing has good lookup times and reasonable memory usage but is very slow to construct

Page 23

Level 1: Distribute into buckets

[Figure (pages 23–28, animation): each key is hashed by h to a bucket id; an atomic add per key yields local offsets (1, 3, 7, 0, 5, 2, 4, 6) and bucket sizes (8, 5, 6, 8); a prefix sum of the sizes gives global offsets (0, 8, 13, 19); global offset + local offset scatters each key, producing the data distributed into buckets.]
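The whole Level-1 pass fits in a few lines of (hedged, serial) Python: a running count plays the role of the atomic add, and an exclusive prefix sum of the bucket sizes gives the global offsets.

```python
def distribute(keys, num_buckets, h):
    """Hash keys to buckets, then scatter them into a contiguous array."""
    bucket_ids = [h(k) for k in keys]
    sizes = [0] * num_buckets
    local = []
    for b in bucket_ids:
        local.append(sizes[b])     # atomicAdd returns the old count
        sizes[b] += 1
    offsets = [0] * num_buckets    # exclusive prefix sum of sizes
    for b in range(1, num_buckets):
        offsets[b] = offsets[b - 1] + sizes[b - 1]
    out = [None] * len(keys)
    for k, b, l in zip(keys, bucket_ids, local):
        out[offsets[b] + l] = k    # global offset + local offset
    return out, sizes, offsets
```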
Page 29

Parallel Hashing: Level 1

• Good for a coarse categorization

• Possible performance issue: atomics

• Bad for a fine categorization

• Space requirements for n elements to (probabilistically) guarantee no collisions are O(n²)

Page 30

Hashing in Parallel

[Figure (pages 30–36, animation): keys 0–3 hashed in parallel into a 4-slot table; hash function ha places keys 0, 1, 2, 3 in slots 1, 3, 2, 0 without collision, while a second function hb maps them to slots 1, 3, 2, 1, so keys 0 and 3 collide.]
Page 37

Cuckoo Hashing Construction

[Figure (pages 37–41, animation): keys inserted into two 2-entry tables T1 and T2 using hash functions h1 and h2; a key that lands on an occupied slot evicts the occupant, which reinserts itself into the other table.]

• Lookup procedure: in parallel, for each element:

• Calculate h1 & look in T1;

• Calculate h2 & look in T2; still O(1) lookup
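A serial sketch of cuckoo insertion and the two-probe lookup (Python; the table sizes, hash functions, and eviction bound below are illustrative, not the 192-entry tables from the construction slides):

```python
def cuckoo_insert(T1, T2, h1, h2, key, max_evictions=32):
    """Place key via h1 into T1; if the slot is occupied, the evicted key
    moves to its slot in the other table, and so on. Too many evictions
    means the caller should pick new hash functions and start over."""
    tables, hashes = [T1, T2], [h1, h2]
    side = 0
    for _ in range(max_evictions):
        t, h = tables[side], hashes[side]
        slot = h(key)
        key, t[slot] = t[slot], key   # swap in; carry out the old occupant
        if key is None:
            return True
        side ^= 1
    return False

def cuckoo_lookup(T1, T2, h1, h2, key):
    """Exactly two probes, so lookup is O(1)."""
    if T1[h1(key)] == key:
        return True
    return T2[h2(key)] == key
```

The `False` return is the "what if it fails?" case on the next slide: rebuild with new hash functions.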
Page 42

Cuckoo Construction Mechanics

• Level 1 created buckets of no more than 512 items

• Average: 409; probability of overflow: < 10⁻⁶

• Level 2: Assign each bucket to a thread block, construct cuckoo hash per bucket entirely within shared memory

• Semantic: Multiple writes to same location must have one and only one winner

• Our implementation uses 3 tables of 192 elements each (load factor: 71%)

• What if it fails? New hash functions & start over.

Page 43

Timings on random voxel data

Page 44

Hashing: Big Ideas

• Classic serial hashing techniques are a poor fit for a GPU.

• Serialization, load balance

• Solving this problem required a different algorithm

• Both hashing algorithms were new to the parallel literature

• Hybrid algorithm was entirely new

Page 45

Trees: Motivation

• Query: Does object X intersect with anything in the scene?

• Difficulty: X and the scene are dynamic

• Goal: Data structure that makes this query efficient (in parallel)

Images from HPCCD: Hybrid Parallel Continuous Collision Detection, Kim et al., Pacific Graphics 2009

Page 46

k-d trees

Images from Wikipedia, “Kd-tree”

Page 47

Generating Trees

• Increased parallelism with depth

• Irregular work generation

Page 51

Tree Construction on a GPU

• At each stage, any node can generate 0, 1, or 2 new nodes

• Increased parallelism, but some threads wasted

• Compact after each step?

[Figure (pages 51–62, animation): root A generates B and C, which generate D, E, and F; each level's output array is shown with compacted indices (0, 1, 2, 3) and the left/right split lists (d0, f0, d1, e1, f1).]
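The per-step compact can be modeled like this (a hedged Python sketch: `split` writes a fixed two-slot output per node, with None standing in for a wasted thread's slot, and the list comprehension is the stream compact):

```python
def build_levels(root, split):
    """Level-by-level construction: every node writes exactly two child
    slots (some None); compacting drops the dead slots before the next
    simulated kernel launch."""
    levels, frontier = [[root]], [root]
    while True:
        slots = []
        for node in frontier:
            slots.extend(split(node))                   # 0, 1, or 2 children
        frontier = [c for c in slots if c is not None]  # stream compact
        if not frontier:
            return levels
        levels.append(frontier)
```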
Page 63

Tree Construction on a GPU

• Compact reduces overwork, but …

• … requires global compact operation per step

• Also requires worst-case storage allocation

[Figure (pages 63–69, animation): the same A, B/C, D/E/F construction, showing the per-step compact and the worst-case pre-allocated child arrays.]
Page 70

Assumptions of Approach

• Fairly high computation cost per step

• Smaller cost -> runtime dominated by overhead

• Small branching factor

• Makes pre-allocation tractable

• Fairly uniform computation per step

• Otherwise, load imbalance

• No communication between threads at all

Page 71

Work Queue Approach

• Allocate private work queue of tasks per core

• Each core can add to or remove work from its local queue

• Cores mark themselves idle if their queue exhausts its storage or is empty

• Cores periodically check global idle counter

• If global idle counter reaches threshold, rebalance work

Fast Hierarchy Operations on GPU Architectures, Lauterbach et al.

Page 72

Static Task List

[Figure: each SM consumes its own static input list and appends new tasks to a shared output list through an atomic pointer; the kernel restarts with the output as the next input.]

Next 4 slides: Daniel Cederman and Philippas Tsigas, On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.

Page 73

Blocking Dynamic Task Queue

[Figure: all SMs share one task queue guarded by locks.]

• Poor performance

• Scales poorly with # of blocks

Page 74

Non-Blocking Dynamic Task Queue

[Figure: a shared queue with atomic head and tail pointers; pointer updates are lazy.]

• Better performance

• Scales well with small # of blocks, but poorer with large

Page 75

Work Stealing

[Figure: one lock-protected I/O deque per SM; an idle SM steals work from the opposite end of another SM's deque.]

• Best performance and scalability
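A sequential Python model of the deque-per-SM scheme (a sketch only; the real GPU version needs the atomic and locking details from the Cederman-Tsigas paper): a worker pops from the back of its own deque and, when empty, steals from the front of another worker's deque.

```python
import collections
import random

def run(tasks, num_workers, expand):
    """Process all tasks with per-worker deques and front-end stealing;
    expand(task) may spawn further tasks. Returns tasks in completion order."""
    deques = [collections.deque() for _ in range(num_workers)]
    for i, t in enumerate(tasks):
        deques[i % num_workers].append(t)
    done = []
    while any(deques):
        for w in range(num_workers):
            if not deques[w]:  # empty: steal from a random non-empty victim
                victims = [v for v in range(num_workers) if deques[v]]
                if victims:
                    deques[w].append(deques[random.choice(victims)].popleft())
            if deques[w]:
                task = deques[w].pop()          # own work: take from the back
                done.append(task)
                deques[w].extend(expand(task))  # spawned work stays local
    return done
```

Taking own work from the back and stolen work from the front keeps owners and thieves on opposite ends of each deque, which is what makes the scheme scale.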

Page 76

Big-Picture Questions

• Relative cost of computation vs. overhead

• Frequency of global communication

• Cost of global communication

• Need for communication between GPU cores?

• Would permit efficient in-kernel work stealing

Page 77

Thanks to ...

• Nathan Bell, Michael Garland, David Luebke, and Dan Alcantara for helpful comments and slide material.

• Funding agencies: Department of Energy (SciDAC Institute for Ultrascale Visualization, Early Career Principal Investigator Award), NSF, BMW, NVIDIA, HP, Intel, UC MICRO, Rambus

Page 78

Bibliography

• Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Real-Time Parallel Hashing on the GPU. ACM Transactions on Graphics, 28(5), December 2009.

• Nathan Bell and Michael Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. Proceedings of IEEE/ACM Supercomputing, November 2009.

• Daniel Cederman and Philippas Tsigas. On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.

• C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha. Fast BVH Construction on GPUs. Computer Graphics Forum (Proceedings of Eurographics 2009), 28(2), April 2009.

• Christian Lauterbach, Qi Mo, and Dinesh Manocha. Fast Hierarchy Operations on GPU Architectures. Tech report, April 2009, UNC Chapel Hill.

• Kun Zhou, Qiming Hou, Rui Wang, and Baining Guo. Real-Time KD-Tree Construction on Graphics Hardware. ACM Transactions on Graphics, 27(5), December 2008.