FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs N. Satish, C. Kim, J. Chhugani, A. Nguyen, V. Lee, D. Kim, P. Dubey SIGMOD 2010 Presented by: Andy Hwang

Motivation

• Index trees are not optimized for architecture

• Only one node is accessed per tree level, so cache lines are used ineffectively

• Prefetching cannot be used (the next node depends on comparing the search key to the parent)

• Nodes in different pages, causing TLB misses

• Previous work optimized for page, cache, SIMD separately, not together

• Compression can be used to save memory bandwidth


Motivation: Index Tree Layout

Bad for traversal


Outline

• Motivation

• Hierarchical Blocking

• CPU/GPU Implementation

• Compression

• Throughput/Response Time

• Summary/Discussion


Hierarchical Blocking

Optimize for accesses (SIMD/cache/memory)


Tree Construction

• Assumes 4-byte (32-bit) keys

• Block size depends on SIMD instruction width, cache line size, and page size

• Use one SIMD instruction to calculate multiple indices

• Parallelize output among CPU cores / GPU streaming multiprocessors
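The blocking idea can be sketched in Python. This is only an illustration under stated assumptions: a complete binary tree, a single blocking level of `depth` tree levels, and blocks emitted depth-first. The slides describe nested SIMD, cache line, and page blocks, and the paper's block ordering may differ. The names `blocked_layout` and `blocked_search` are hypothetical.

```python
def blocked_layout(sorted_keys, depth):
    """Reorder sorted keys so every subtree of `depth` levels is contiguous."""
    n = len(sorted_keys)
    height = (n + 1).bit_length() - 1
    # Assumption: complete tree (n = 2**height - 1), height a multiple of depth.
    assert (1 << height) - 1 == n and height % depth == 0
    out = []

    def lay(lo, hi):
        if lo >= hi:
            return
        level = [(lo, hi)]
        for _ in range(depth):            # emit one block, level by level
            nxt = []
            for a, b in level:
                mid = (a + b) // 2
                out.append(sorted_keys[mid])
                nxt += [(a, mid), (mid + 1, b)]
            level = nxt
        for a, b in level:                # then each child subtree, left to right
            lay(a, b)

    lay(0, n)
    return out


def blocked_search(layout, depth, query):
    """Return True if `query` is stored in a layout built by blocked_layout."""
    height = (len(layout) + 1).bit_length() - 1
    block = (1 << depth) - 1              # keys per block
    base = 0                              # offset of the current block
    while height > 0:
        i = 0                             # BFS index inside the block
        for _ in range(depth):
            key = layout[base + i]
            if query == key:
                return True
            i = 2 * i + 1 + (query > key)
        child = i - block                 # which of the 2**depth children
        height -= depth
        base += block + child * ((1 << height) - 1)
    return False


keys = list(range(0, 30, 2))              # 15 keys: a 4-level tree
layout = blocked_layout(keys, depth=2)
print(layout)
print(blocked_search(layout, 2, 8), blocked_search(layout, 2, 9))
```

With `depth=2` each block holds 3 keys, matching the 3-node SIMD block on the CPU slide; a search touches one contiguous block per two tree levels instead of one scattered node per level.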


Tree Construction: CPU

• 128-bit SIMD = max 4 nodes at once

• SIMD block = 2 tree levels (3 nodes)

• 64-byte cache line = max 16 nodes

• Cache line block = 4 levels (15 nodes)

• 2MB page size

• Page block = 19 levels

• 4KB page = 10 levels
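The block depths above follow from one piece of arithmetic: a complete subtree of d levels holds 2^d - 1 keys, so the depth is the largest d whose subtree fits in the unit. A small Python check, assuming the 4-byte keys stated earlier (the function name `block_depth` is hypothetical):

```python
def block_depth(capacity_bytes, key_bytes=4):
    """Largest d such that a complete subtree of 2**d - 1 keys fits."""
    max_keys = capacity_bytes // key_bytes
    return (max_keys + 1).bit_length() - 1


units = {
    "128-bit SIMD register": 16,
    "64-byte cache line": 64,
    "4KB page": 4 * 1024,
    "2MB page": 2 * 1024 * 1024,
}
for name, size in units.items():
    d = block_depth(size)
    print(f"{name}: depth {d} ({2**d - 1} nodes)")
```

This reproduces the depths on the slide: 2 for the SIMD block, 4 for the cache line block, 10 for a 4KB page, and 19 for a 2MB page.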


Tree Construction: GPU

• 32 data elements (thread warp)

• Various SIMD block sizes possible (up to 32)

• Set SIMD block depth to 4 (15 nodes) to match the half-warp (16-thread) instruction granularity

• No cache exposed – cache line block size set equal to SIMD block size


Tree Traversal: CPU


Tree Traversal: GPU


Simultaneous Queries

• Issue queries in parallel on the hardware

• Software pipelining used to hide cache/TLB miss or GPU memory latency

• CPU: 8 concurrent queries per thread, 64 total

• GPU: 2 concurrent queries per thread warp, 960 total
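The per-thread batching can be illustrated with a Python sketch that advances a group of queries one tree level at a time, so that on real hardware the memory accesses of different queries can overlap. Python does not expose that overlap; the sketch only shows the control structure. It assumes a complete tree stored in plain breadth-first order rather than the blocked layout, and the names `to_bfs`, `batched_rank`, and `lookup_all` are hypothetical.

```python
QUERIES_PER_THREAD = 8  # from the slide: 8 concurrent queries per CPU thread


def to_bfs(sorted_keys):
    """Store sorted keys as an implicit binary tree in breadth-first order."""
    n = len(sorted_keys)
    out = [None] * n
    it = iter(sorted_keys)

    def fill(i):
        if i < n:
            fill(2 * i + 1)               # left subtree gets the smaller keys
            out[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return out


def batched_rank(bfs_keys, queries):
    """For each query, count the keys smaller than it, one level per round."""
    n = len(bfs_keys)
    assert ((n + 1) & n) == 0             # assumption: n = 2**h - 1
    pos = [0] * len(queries)
    while pos and pos[0] < n:             # all queries sit on the same level
        for i, q in enumerate(queries):   # interleave the queries
            pos[i] = 2 * pos[i] + 1 + (q > bfs_keys[pos[i]])
    return [p - n for p in pos]


def lookup_all(bfs_keys, queries):
    """Process queries in groups of QUERIES_PER_THREAD."""
    ranks = []
    for start in range(0, len(queries), QUERIES_PER_THREAD):
        ranks += batched_rank(bfs_keys, queries[start:start + QUERIES_PER_THREAD])
    return ranks


keys = list(range(0, 30, 2))
print(lookup_all(to_bfs(keys), [-1, 0, 7, 14, 29, 100, 1, 2, 3, 28]))
```

Each returned value is the number of stored keys smaller than the query, which is the position a conventional binary search would report.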


Optimization Speedup


CPU vs GPU Search Throughput


Tree Traversal: MICA

• Intel Many-Core Architecture Platform

• Intel GPGPU effort

• 32KB L1, 256KB L2 (partitioned)

• 4 threads/core

• Traversal code similar to CPU

• 16-wide SIMD

• SIMD block depth = 4 (15 nodes at once)


Tree Traversal: MICA

Throughput (million queries / sec)

        Small Tree (64K keys)    Large Tree (16M keys)
CPU     280                      60
GPU     150                      100
MICA    667                      183

Benefits of both CPU and GPU!


Compression

• Key sizes vary in practice

• Key size affects cache line and page usage

• Non-Contiguous Common Prefix (NCCP)

• Hashing keys based on their difference (partial keys)

• 4-bit blocks as unit of compression

• SIMD instruction to find similarity and compress
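One way to read the bullets above is sketched below in Python: find the 4-bit blocks in which the keys differ, and keep only those blocks as the partial key. This is an assumption-laden illustration, not the paper's implementation. It uses fixed-width integer keys, scalar code in place of SIMD, and no cap on the partial key size; the names `differing_nibbles` and `compress` are hypothetical.

```python
NIBBLE_BITS = 4  # from the slide: 4-bit blocks as the unit of compression


def differing_nibbles(keys, key_bits):
    """Bit offsets (most significant first) of nibbles where keys differ."""
    diff = 0
    for k in keys:
        diff |= k ^ keys[0]               # set bits mark any disagreement
    return [
        shift
        for shift in range(key_bits - NIBBLE_BITS, -1, -NIBBLE_BITS)
        if (diff >> shift) & 0xF
    ]


def compress(key, positions):
    """Concatenate the selected nibbles of `key` into a partial key."""
    out = 0
    for shift in positions:
        out = (out << NIBBLE_BITS) | ((key >> shift) & 0xF)
    return out


keys = [0xAB10CD01, 0xAB20CD05, 0xAB20CD0F, 0xAB70CD00]
positions = differing_nibbles(keys, key_bits=32)
partial = [compress(k, positions) for k in keys]
print(positions, [hex(p) for p in partial])

# A query key is compressed with the same positions. A key that is not in
# the set can map to a stored partial key, which is a false positive.
print(hex(compress(0xFF20CD05, positions)))
```

Because every stored key agrees on the dropped nibbles, comparing partial keys orders the stored keys the same way as comparing full keys. The common nibbles need not be adjacent, which is where the "non-contiguous" in the name comes from.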


Compression

• First page uses a larger partial key size (128 bits) to reduce false positives

• Subsequent pages use a partial key size of 32 bits

• Construction overhead increases

• +75% for variable-size keys, +30% for integer keys

• During traversal, the query key is compressed


Compression: Alphabet Size


Compression: Throughput


Query Batching/Buffering


Summary

• Hierarchical blocking to optimize search tree for page, cache, SIMD instructions

• Architecture-aware block depths

• CPU/GPU/MICA implementations

• Fast construction, search, and parallel queries for varying tree sizes

• Hide memory latency wherever possible

• NCCP compression for integer and variable-length keys

• Throughput/response time for different query batching schemes


Discussion

• Focus on throughput

• Assumes large number of queries

• Not much info on latency

• Updates

• Full reconstruction? Flushed from cache?

• Synthetic workloads

• Deployment
