Floating-Point Data Compression at 75 Gb/s on a GPU Molly A. O’Neil and Martin Burtscher Department of Computer Science

Apr 01, 2015

Page 1

Floating-Point Data Compression at 75 Gb/s on a GPU

Molly A. O’Neil and Martin Burtscher
Department of Computer Science

Page 2: Introduction

- Scientific simulations on HPC clusters
  - Run on interconnected compute nodes
  - Produce and transfer lots of floating-point data
  - Data storage and transfer are expensive and slow
  - Compute nodes have multiple cores but only one link
- Interconnects are getting faster
  - Lonestar: 40 Gb/s InfiniBand
  - Speeds of up to 100 Gb/s soon

(Image credit: Texas Advanced Computing Center)

March 2011

Page 3: Introduction (cont.)

- Compression
  - Reduced storage, faster transfer
  - Only useful when done in real time
    - Saturate network with compressed data
    - Requires compressor tailored to hardware capabilities
- GFC algorithm for IEEE 754 double-precision data
  - Designed specifically for GPU hardware (CUDA)
  - Provides reasonable compression ratio and operates above throughput of emerging networks

(Image credit: Charles Trevelyan for http://plus.maths.org/)
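To make "saturate network with compressed data" concrete: if a link carries L Gb/s and the compressor achieves ratio r, the compressor has to consume uncompressed input at r × L Gb/s to keep the link full. The Python sketch below works through that arithmetic. The function name is illustrative only; the 40 Gb/s and 100 Gb/s link speeds and the 1.188 ratio are the figures quoted elsewhere in these slides.

```python
def required_input_rate(link_gbps: float, ratio: float) -> float:
    """Uncompressed input rate (Gb/s) needed to keep a link full of compressed data."""
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    # Each compressed bit on the wire stands for `ratio` uncompressed bits.
    return link_gbps * ratio


if __name__ == "__main__":
    for link in (40.0, 100.0):
        rate = required_input_rate(link, 1.188)
        print(f"{link:.0f} Gb/s link needs {rate:.1f} Gb/s of uncompressed input")
```

A compressor slower than this rate leaves part of the link idle, which is why throughput matters as much as ratio here.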

Page 4: Lossless Data Compression

- Dictionary-based (Lempel-Ziv family) [gzip, lzop]
- Variable-length entropy coders (Huffman, AC)
- Run-length encoding [fax]
- Transforms (Burrows-Wheeler) [bzip2]
- Special-purpose FP compressors [FPC, FSD, PLMI]
  - Prediction and leading-zero suppression
- None of these offer real-time speeds for state-of-the-art networks
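"Prediction and leading-zero suppression" can be illustrated with a small CPU-side Python sketch. It assumes an XOR residual between the bit patterns of the value and its prediction (other compressors may use subtraction instead), and it drops whole leading zero bytes. The function names and the (count, payload) layout are hypothetical, not taken from any of the compressors listed above.

```python
import struct
from typing import Tuple


def double_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_double(b: int) -> float:
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def encode(value: float, prediction: float) -> Tuple[int, bytes]:
    """Return (number of leading zero bytes, remaining residual bytes)."""
    # A good prediction shares the high-order bits, so the XOR starts with zeros.
    residual = double_to_bits(value) ^ double_to_bits(prediction)
    nbytes = (residual.bit_length() + 7) // 8  # bytes that must be stored
    return 8 - nbytes, residual.to_bytes(nbytes, "big")


def decode(payload: bytes, prediction: float) -> float:
    """Invert encode(); the zero-byte count is only needed to frame the payload."""
    residual = int.from_bytes(payload, "big")
    return bits_to_double(residual ^ double_to_bits(prediction))


if __name__ == "__main__":
    zeros, payload = encode(100.25, 100.0)
    print(zeros, payload.hex(), decode(payload, 100.0))
```

The closer the prediction, the more leading bytes vanish; an exact prediction stores no payload at all.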

Page 5: GFC Algorithm

- GPUs require 1000s of parallel activities, but compression is a generally serial operation
- Divide data into n chunks, processed in parallel
  - Best perf: choose n to match max number of resident warps
- Each chunk composed of 32-word subchunks
  - One double per warp thread
  - Use previous subchunk to provide prediction values
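The chunk and subchunk layout described on this slide can be sketched on the CPU in Python. This is only an illustration under stated assumptions: the helper names are hypothetical, chunk sizes are rounded up to whole 32-word subchunks, the first subchunk of each chunk is predicted from zeros, and lane i is predicted by lane i of the previous subchunk. The slides do not spell out these details.

```python
WARP_SIZE = 32  # one double per warp thread, per the slide


def chunk_bounds(total_words, n_chunks):
    """Split [0, total_words) into at most n_chunks ranges of whole subchunks."""
    if total_words <= 0 or n_chunks <= 0:
        return []
    subchunks = -(-total_words // WARP_SIZE)            # ceiling division
    per_chunk = -(-subchunks // n_chunks) * WARP_SIZE   # words per chunk
    return [(start, min(start + per_chunk, total_words))
            for start in range(0, total_words, per_chunk)]


def subchunk_predictions(chunk):
    """Yield (subchunk, predictions) pairs for one chunk.

    Assumption: lane i is predicted by lane i of the previous subchunk,
    and the first subchunk is predicted from zeros.
    """
    prev = [0.0] * WARP_SIZE
    for start in range(0, len(chunk), WARP_SIZE):
        sub = chunk[start:start + WARP_SIZE]
        yield sub, prev[:len(sub)]
        prev = sub


if __name__ == "__main__":
    data = [float(i) for i in range(100)]
    for lo, hi in chunk_bounds(len(data), 2):
        n_sub = sum(1 for _ in subchunk_predictions(data[lo:hi]))
        print(f"chunk [{lo}, {hi}) holds {n_sub} subchunks")
```

Because each chunk carries its own prediction state, chunks never need to exchange data, which is what lets them run in parallel.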

Page 6: Dimensionality

- Many scientific data sets display dimensionality
  - Interleaved coordinates from multiple dimensions
- Optional dimensionality parameter to GFC
  - Determines index of previous subchunk to use as the prediction
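Why dimensionality helps can be seen with a generic stride experiment in Python. This is not GFC's actual indexing, which the slide only summarizes; it simply compares predicting each value from its immediate neighbour with predicting it from the value one full record earlier. The sample data are made up for illustration.

```python
def total_abs_residual(values, stride):
    """Sum of |v[i] - v[i - stride]| over all i >= stride."""
    return sum(abs(values[i] - values[i - stride])
               for i in range(stride, len(values)))


# Hypothetical interleaved (x, y, z) samples; each coordinate drifts slowly.
points = [(1000.0 + 0.5 * t, 5.0 + 0.01 * t, -300.0 - 0.2 * t) for t in range(8)]
interleaved = [coord for point in points for coord in point]

if __name__ == "__main__":
    for stride in (1, 3):
        print(f"stride {stride}: {total_abs_residual(interleaved, stride):.2f}")
```

With stride 3, every x is compared with the previous x, every y with the previous y, and so on, so the residuals are far smaller than with stride 1, where unrelated coordinates are compared.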

Page 7: GFC Algorithm (cont.)

[Figure: GFC algorithm diagram, not preserved in this transcript]

Page 8: GPU Optimizations

- Low thread divergence (few if statements)
  - Some short enough to be predicated
- Coalesce memory accesses by packing/unpacking data in shared memory (for CC < 2.0)
- Very little inter-thread communication and synchronization
  - Prefix sum only
- Warp-based implementation

(Image credit: gamedsforum.ca)
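The role of the prefix sum can be sketched in Python. When each thread produces a variable number of residual bytes, an exclusive prefix sum over the byte counts tells every thread where to write, so no further coordination is needed. The function names are hypothetical, and the loop stands in for the threads of a warp; this is a sequential illustration, not the CUDA implementation.

```python
from itertools import accumulate


def output_offsets(byte_counts):
    """Exclusive prefix sum: offsets[i] == sum(byte_counts[:i])."""
    return list(accumulate(byte_counts, initial=0))[:-1]


def pack(residuals):
    """Concatenate variable-length residuals using precomputed offsets."""
    counts = [len(r) for r in residuals]
    offsets = output_offsets(counts)
    out = bytearray(sum(counts))
    # Each iteration stands in for one thread writing to its own slot.
    for residual, offset in zip(residuals, offsets):
        out[offset:offset + len(residual)] = residual
    return bytes(out)


if __name__ == "__main__":
    print(output_offsets([3, 0, 8, 1]))
    print(pack([b"ab", b"", b"cde"]))
```

Once the offsets are known, the writes are independent of one another, which fits the slide's point that the prefix sum is the only step requiring inter-thread communication.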

Page 9: Evaluation Method

- Systems
  - Two quad-core 2.53 GHz Xeons
  - NVIDIA FX 5800 GPU (CC 1.3)
- 13 datasets: real-world data (19 – 277 MB)
  - Observational data, simulation results, MPI messages
- Comparisons
  - Compression ratio vs. 5 compressors in common use
  - Throughput vs. pFPC (fastest known CPU compressor)

Page 10: Compression Ratio

- 1.188 (range: 1.01 – 3.53)
  - Low (FP data), but in line with other algos
- Largely independent of number of chunks
- When done in real time, compression at this ratio can greatly speed up MPI apps
  - 3% – 98% speed-up [Ke et al., SC’04]

[Figure: harmonic mean compression ratio (axis from 1.00 to 1.25) for bzip2, gzip, lzop, FPC, pFPC, and GFC]
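The results here are reported as harmonic means. A short Python sketch shows one reason that choice suits compression ratios: when every dataset has the same uncompressed size, the harmonic mean of the per-dataset ratios equals the overall ratio of total input to total output. The ratios below are made-up illustration values, not measurements from this work.

```python
from statistics import harmonic_mean


def aggregate_ratio(ratios, size=1.0):
    """Overall ratio when every dataset has the same uncompressed size."""
    total_in = size * len(ratios)
    total_out = sum(size / r for r in ratios)
    return total_in / total_out


# Hypothetical per-dataset compression ratios (illustration only).
ratios = [1.02, 1.10, 3.50]

if __name__ == "__main__":
    print(f"arithmetic mean: {sum(ratios) / len(ratios):.3f}")
    print(f"harmonic mean:   {harmonic_mean(ratios):.3f}")
    print(f"aggregate ratio: {aggregate_ratio(ratios):.3f}")
```

The arithmetic mean is pulled up by a single highly compressible dataset, whereas the harmonic mean stays close to what most datasets achieve.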

Page 11: Throughput

- Compression (C): 75 – 87 Gb/s
  - Mean: 77.9 Gb/s
- Decompression (D): 90 – 121 Gb/s
  - Mean: 96.6 Gb/s
- 4x faster than pFPC on 8 cores (2 CPUs)
- Improvement over pFPC’s compression ratio vs. performance trend

[Figure: harmonic mean throughput (Gb/s, axis from 0 to 100) vs. harmonic mean compression ratio (axis from 1.15 to 1.40); compression and decompression series for GFC and pFPC]

Page 12: NEW: Fermi Throughput

- Fermi improvements:
  - Faster, simpler memory accesses
  - Hardware support for count-leading-zeros op
- Compression ratio: 1.187
- C: 119 – 219 Gb/s (HM: 167.5 Gb/s)
- D: 169 – 219 Gb/s (HM: 180.3 Gb/s)
- Compresses over 9.5x faster than pFPC on 8 x86 cores

[Figure: throughput (Gb/s, axis from 0 to 200) for compression and decompression]

Page 13: Summary

- GFC algorithm
  - Chunks up data; each warp processes a chunk iteratively by 32-word subchunks
  - No communication required between warps
- Minimum throughput of 75 Gb/s (compression) and 90 Gb/s (decompression) on GTX-285, and 119 Gb/s and 169 Gb/s on Fermi, with a compression ratio of 1.19
- CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/

Page 14: Conclusions

- GPU can compress much faster than PCIe bus can transfer the data
- But…
  - PCIe bus will become faster
  - CPU-GPU increasingly on single die
  - GPU-to-GPU, GPU-to-NIC transfers coming?
- GFC is the first compressor with the potential to deliver real-time FP data compression for current and emerging network speeds

(Image credits: AMD, NVIDIA)