Multi-dimensional Range Query Processing on the GPU
Beomseok Nam
Data Intensive Computing Lab
School of Electrical and Computer Engineering
Ulsan National Institute of Science and Technology, Korea
Feb 03, 2016
Multi-dimensional Indexing
• One of the core technologies in GIS, scientific databases, computer graphics, etc.
• Access pattern into scientific datasets
– Multi-dimensional range query
• Retrieves data that overlaps a given range of values
• Ex) SELECT temperature FROM dataset WHERE latitude BETWEEN 20 AND 30 AND longitude BETWEEN 50 AND 60
– Multi-dimensional indexing trees
• KD-Trees, KDB-Trees, R-Trees, R*-Trees
• Bitmap index
– Multi-dimensional indexing is one of the things that do not parallelize well.
Multi-dimensional Indexing Trees: R-Tree
• Proposed by Antonin Guttman (1984)
• Data is stored and indexed via nested MBRs (Minimum Bounding Rectangles)
• Resembles a height-balanced B+-tree
Multi-dimensional Indexing Trees: R-Tree
An Example Structure of an R-Tree
Source: http://en.wikipedia.org/wiki/Image:R-tree.jpg
Motivation
• GPGPU has emerged as a new parallel computing paradigm in HPC.
• Scientific data analysis applications are major applications in the HPC market.
• A common access pattern into scientific datasets is multi-dimensional range query.
• Q: How to parallelize multi-dimensional range query on the GPU?
MPES (Massively Parallel Exhaustive Scan)
• This is how GPGPU is currently utilized.
• Achieves maximum utilization of the GPU.
• Simple, BUT it must access the ENTIRE dataset.
(Figure: the total dataset is divided evenly among threads thread[0] … thread[K-1]; each thread scans its own partition.)
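The partition-and-scan idea above can be sketched on the CPU. This is only a behavioral model of MPES, not the talk's CUDA code; the names `in_range` and `mpes_scan` are ours, invented for illustration.

```python
# Behavioral sketch of MPES: split the dataset evenly across "threads",
# every element is compared against the query range (exhaustive scan).
from concurrent.futures import ThreadPoolExecutor

def in_range(point, query):
    # query: one (lo, hi) pair per dimension
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, query))

def mpes_scan(points, query, num_threads=4):
    chunk = (len(points) + num_threads - 1) // num_threads
    def scan(t):
        # thread t exhaustively scans only its own partition
        return [p for p in points[t * chunk:(t + 1) * chunk]
                if in_range(p, query)]
    with ThreadPoolExecutor(num_threads) as pool:
        parts = pool.map(scan, range(num_threads))
    return [p for part in parts for p in part]

pts = [(10, 55), (25, 52), (40, 70), (28, 58)]
print(mpes_scan(pts, [(20, 30), (50, 60)]))  # [(25, 52), (28, 58)]
```

Note that every point is touched regardless of the query selectivity, which is exactly the drawback the slide points out.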
Parallel R-Tree Search
• Basic idea
– Compare a given query range with the MBRs of multiple child nodes in parallel
(Figure: nodes A–G reside in GPU global memory; an SMP loads node E, and each SP compares one MBB against the i-th query.)
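The per-node comparison can be sketched as a simple rectangle-intersection test applied to every child MBR of one node; a comprehension stands in for the SPs working in parallel. The function and node names are illustrative, not from the talk.

```python
# Test each child MBR of a node against the query rectangle.
def overlaps(mbr, query):
    # mbr, query: one (lo, hi) pair per dimension; boxes overlap iff
    # their intervals intersect in every dimension
    return all(lo <= qhi and qlo <= hi
               for (lo, hi), (qlo, qhi) in zip(mbr, query))

children = {
    "D": ((0, 10), (0, 10)),
    "E": ((20, 35), (45, 65)),
    "F": ((70, 90), (70, 90)),
}
query = ((20, 30), (50, 60))
hits = sorted(name for name, mbr in children.items() if overlaps(mbr, query))
print(hits)  # ['E']
```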
Recursive Search on GPU simply does not work
• Inherently spatial indexing structures such as R-Trees or KDB-Trees are not well suited for CUDA environment.
• irregular search path and recursion make it hard to maximize the utilization of GPU
– 48K shared memory will overflow when tree height is > 5
• Leftmost search– Choose the leftmost child node no matter how many child nodes
overlap
• Rightmost search– Choose the rightmost child node no matter how many child nodes
overlap
• Parallel Scanning– In between two leaf nodes, perform massively parallel scanning
to filter out non-overlapping data elements.
MPTS (Massively Parallel 3 Phase Scan)
(Figure: subtrees outside the leftmost and rightmost search paths are pruned out.)
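The three phases can be modeled over a flat leaf level. In this sketch the leftmost and rightmost searches are emulated by taking the first and last overlapping leaves (the real root-to-leaf descent is omitted), and the scan phase runs serially; all names are illustrative.

```python
# Behavioral model of MPTS over the leaf level of an R-tree.
def overlaps(mbr, query):
    return all(lo <= qhi and qlo <= hi
               for (lo, hi), (qlo, qhi) in zip(mbr, query))

def in_range(point, query):
    return all(qlo <= x <= qhi for x, (qlo, qhi) in zip(point, query))

def mpts(leaves, query):
    # leaves: list of (mbr, points)
    # Phases 1 and 2: find leftmost and rightmost overlapping leaves.
    hits = [i for i, (mbr, _) in enumerate(leaves) if overlaps(mbr, query)]
    if not hits:
        return []
    left, right = hits[0], hits[-1]
    # Phase 3: massively parallel scan of everything in between
    # (serial here), filtering out non-overlapping elements.
    return [p for mbr, pts in leaves[left:right + 1]
            for p in pts if in_range(p, query)]

leaves = [(((0, 1), (0, 1)), [(0, 0)]),
          (((2, 3), (2, 3)), [(2, 2), (3, 3)]),
          (((4, 5), (4, 5)), [(5, 5)])]
print(mpts(leaves, ((2, 5), (2, 5))))  # [(2, 2), (3, 3), (5, 5)]
```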
MPTS Improvement Using the Hilbert Curve
• Hilbert curve: a continuous fractal space-filling curve
– Maps multi-dimensional points onto a 1D curve
• Recursively defined curve
– A Hilbert curve of order n is constructed from four copies of the Hilbert curve of order n-1, properly oriented and connected.
• Spatial locality preserving method
– Points that are nearby in 2D are also close on the 1D curve
Image source: Wikipedia (first-order, 2nd-order, and 3rd-order curves)
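The 2D point-to-Hilbert-index mapping can be computed iteratively with the standard textbook algorithm; the function name `xy2d` is ours, not the talk's, and the talk's actual 4D implementation would generalize this.

```python
# Map a 2D point (x, y) on an n-by-n grid (n a power of two) to its
# distance d along the Hilbert curve.
def xy2d(n, x, y):
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the sub-curve is properly oriented
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# first-order curve visits (0,0), (0,1), (1,1), (1,0) in that order
print([xy2d(2, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])
# [0, 1, 2, 3]
```

The key property used by MPTS is that the mapping is a bijection, so sorting data by `xy2d` yields a total order that preserves spatial locality.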
MPTS Improvement Using the Hilbert Curve
• The Hilbert curve is well known for its spatial clustering property.
– Sort the data along the Hilbert curve
– Similar data is clustered nearby
– The gap between the leftmost and rightmost leaf nodes is reduced
– The number of visited nodes decreases
(Figure: with Hilbert ordering, more subtrees are pruned out.)
Drawback of MPTS
• MPTS reduces the number of leaf nodes to be accessed, but it still accesses many leaf nodes that do not contain the requested data.
• Hence we designed a variant of R-trees that works on the GPU without the stack problem and does not access leaf nodes that lack the requested data.
– MPHR-Trees (Massively Parallel Hilbert R-Trees)
MPHR-tree (Massively Parallel Hilbert R-tree): Bottom-up Construction on the GPU
1. Sort data using the Hilbert curve index
MPHR-tree (Massively Parallel Hilbert R-tree): Bottom-up Construction on the GPU
2. Build the R-tree in a bottom-up fashion
Store the maximum Hilbert value along with each MBR
MPHR-tree (Massively Parallel Hilbert R-tree): Bottom-up Construction on the GPU
• Basic idea
– Use parallel reduction to generate the MBR of a parent node and to get the maximum Hilbert value of its children.
(Figure: SMP0–SMP2, each running threads thread[0] … thread[K-1], reduce the level-n nodes R4–R12 (maximum Hilbert values 6, 26, 44, 47, 67, 96, 105, 130, 159) into the level-(n+1) parents R1, R2, R3 (maximum Hilbert values 44, 96, 159); the tree is built bottom-up in parallel.)
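One bottom-up construction step can be sketched serially: children are taken B at a time (B = fanout) and reduced to a parent holding the union MBR and the maximum Hilbert value of its children. On the GPU this reduction runs in parallel per SMP; the name `build_level` is ours, for illustration.

```python
# Reduce one level of (mbr, max_hilbert) children into parents.
def build_level(children, B):
    # children: list of (mbr, max_hilbert), already in Hilbert order
    parents = []
    for i in range(0, len(children), B):
        grp = children[i:i + B]
        dims = len(grp[0][0])
        # union MBR: per-dimension min of lows and max of highs
        mbr = tuple((min(c[0][d][0] for c in grp),
                     max(c[0][d][1] for c in grp)) for d in range(dims))
        parents.append((mbr, max(h for _, h in grp)))
    return parents

leaves = [(((0, 1), (0, 1)), 6), (((2, 3), (0, 1)), 26),
          (((0, 1), (2, 3)), 44), (((2, 3), (2, 3)), 67)]
print(build_level(leaves, 2))
# [(((0, 3), (0, 1)), 26), (((0, 3), (2, 3)), 67)]
```

Repeating `build_level` until a single node remains yields the root, with no top-down insertion or node splitting at all.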
MPHR-tree (Massively Parallel Hilbert R-tree): Searching on the GPU
• Iterate leftmost search and parallel scan using the Hilbert curve index
– leftmostSearch() visits the leftmost search path whose Hilbert index is greater than the given Hilbert index
(Figure: a two-level MPHR-tree. Level-0 root entries R1, R2 hold maximum Hilbert values 159, 231; level-1 entries R3–R7 hold 44, 96, 159, 210, 231; leaf entries D1–D14 hold Hilbert values 6, 26, 44, 47, 67, 96, 105, 130, 159, 182, 200, 210, 224, 231. Steps 1–7 alternate leftmost search, which finds a leaf node, with parallel scanning, which keeps going while overlapping leaf nodes remain.)
lastHilbertIndex = 0;
while (1) {
    leftmostLeaf = leftmostSearch(lastHilbertIndex, QueryMBR);
    if (leftmostLeaf < 0) break;
    lastHilbertIndex = parallelScan(leftmostLeaf);
}
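The loop above can be modeled in plain Python over a flat, Hilbert-sorted leaf level. Here leftmostSearch is emulated by a linear probe rather than a root-to-leaf descent, and parallelScan runs serially; names beyond those in the pseudocode are illustrative.

```python
# Behavioral model of the MPHR-tree search loop.
def overlaps(mbr, query):
    return all(lo <= qhi and qlo <= hi
               for (lo, hi), (qlo, qhi) in zip(mbr, query))

def in_range(point, query):
    return all(qlo <= x <= qhi for x, (qlo, qhi) in zip(point, query))

def mphr_search(leaves, query):
    # leaves: (max_hilbert, mbr, points), sorted by strictly
    # increasing max_hilbert
    results, last_h = [], -1
    while True:
        # leftmostSearch: first overlapping leaf with Hilbert value > last_h
        start = next((i for i, (h, mbr, _) in enumerate(leaves)
                      if h > last_h and overlaps(mbr, query)), -1)
        if start < 0:
            break
        # parallelScan: scan consecutive leaves while they overlap
        i = start
        while i < len(leaves) and overlaps(leaves[i][1], query):
            results += [p for p in leaves[i][2] if in_range(p, query)]
            i += 1
        last_h = leaves[i - 1][0]   # resume after the last scanned leaf
        if i >= len(leaves):
            break
    return results
```

Because `last_h` strictly increases on every iteration, the loop needs no stack and terminates after visiting each run of overlapping leaves exactly once.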
MPTS vs MPHR-Tree
• Search complexity of the MPHR-Tree: O(k · log_B N)
– k is the number of leaf nodes that contain the requested data, B is the tree fanout, and N is the number of data elements
(Figure: in MPTS, only subtrees outside the leftmost and rightmost search paths are pruned out; in MPHR-Trees, every subtree without requested data is pruned out.)
Braided Parallelism vs Data Parallelism
• Braided Parallel Indexing
– Multiple queries can be processed in parallel.
• Data Parallel Indexing (Partitioned Indexing)
– A single query is processed by all the CUDA SMPs over partitioned R-trees.
(Figure: braided parallel indexing vs. data parallel indexing.)
Performance Evaluation: Experimental Setup (MPTS vs MPHR-tree)
• CUDA Toolkit 5.0
• Tesla Fermi M2090 GPU card
– 16 SMPs
– Each SMP has 32 CUDA cores, which enables 512 (16×32) threads to run concurrently.
• Datasets
– 40 million 4D points in uniform, normal, and Zipf distributions
Performance Evaluation: MPHR-tree Construction
• 12 KB page (fanout = 256), 128 CUDA blocks × 64 threads per block
• It takes only 4 seconds to build an R-tree over 40 million data elements, while the CPU takes more than 40 seconds (10× speedup).
– Excluding memory transfer time, it takes only 50 msec (800× speedup).
Performance Evaluation: MPTS Search vs MPES Search
• 12 KB page (fanout = 256), 128 CUDA blocks × 64 threads per block, selection ratio = 1%
• MPTS outperforms MPES and R-trees on a Xeon E5506 (8 cores)
– In high dimensions, MPTS accesses more memory blocks, but the number of instructions executed by a warp is smaller than in MPES
Performance Evaluation: MPHR-tree Search
• 12 KB page (fanout = 256), 128 CUDA blocks × 64 threads per block
• The MPHR-tree consistently outperforms the other indexing methods
– In terms of throughput, braided MPHR-Trees show an order of magnitude higher performance than multi-core R-trees and MPES.
– In terms of query response time, partitioned MPHR-trees are an order of magnitude faster than multi-core R-trees and MPES.
Performance Evaluation: MPHR-tree Search in a Cluster
• In a cluster environment, MPHR-Trees show an order of magnitude higher throughput than the LBNL FastQuery library.
– LBNL FastQuery is a parallel bitmap indexing library for multi-core architectures.
Summary
• Brute-force parallel methods can be refined with more sophisticated parallel algorithms.
• We proposed new parallel tree traversal algorithms and showed they significantly outperform the traditional recursive access to hierarchical tree structures.
Q&A
• Thank You
MPTS Improvement Using Sibling Check
• When the current node doesn't have any overlapping children, check its sibling nodes!
– It is always better to prune out tree nodes at upper levels.
CUDA
• GPGPU (General-Purpose computing on Graphics Processing Units)
– CUDA is a set of development tools for creating applications that execute on the GPU.
– GPUs allow the creation of a very large number of concurrently executing threads at very low system resource cost.
– CUDA also exposes fast shared memory (48 KB) that can be shared between threads.
Image source: Wikipedia
Tesla M2090: 16 × 32 = 512 cores
Grids and Blocks of CUDA Threads
• A kernel is executed as a grid of thread blocks
– All threads share the data memory space
• A thread block is a batch of threads that can cooperate with each other by:
– Synchronizing their execution
• For hazard-free shared memory accesses
– Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
(Figure: the host launches Kernel 1 on Grid 1, a 3×2 arrangement of blocks Block (0, 0) … Block (2, 1), and Kernel 2 on Grid 2; each block, e.g. Block (1, 1), contains a 5×3 arrangement of threads Thread (0, 0) … Thread (4, 2).)
Courtesy: NVIDIA
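The grid/block/thread hierarchy above can be modeled with a toy 1D launcher: each thread derives a unique global index from its block and thread coordinates, following the usual CUDA pattern `blockIdx.x * blockDim.x + threadIdx.x`. The function `launch_kernel` is our stand-in for a real CUDA launch, not an actual API.

```python
# Toy 1D model of a CUDA kernel launch over a grid of thread blocks.
def launch_kernel(grid_dim, block_dim, kernel, *args):
    # every (block, thread) pair runs the kernel with its global id,
    # mimicking blockIdx.x * blockDim.x + threadIdx.x
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx * block_dim + thread_idx, *args)

out = [0] * 8
launch_kernel(4, 2, lambda gid, buf: buf.__setitem__(gid, gid * gid), out)
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49]
```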