Floating-Point Data Compression at 75 Gb/s on a GPU

Molly A. O’Neil and Martin Burtscher
Department of Computer Science

Introduction

Scientific simulations on HPC clusters
  Run on interconnected compute nodes
  Produce and transfer lots of floating-point data
  Data storage and transfer are expensive and slow
  Compute nodes have multiple cores but only one link

Interconnects are getting faster
  Lonestar: 40 Gb/s InfiniBand
  Speeds of up to 100 Gb/s soon


Texas Advanced Computing Center

March 2011

Introduction (cont.)

Compression
  Reduced storage, faster transfer
  Only useful when done in real time
    Saturate network with compressed data
    Requires compressor tailored to hardware capabilities

GFC algorithm for IEEE 754 double-precision data
  Designed specifically for GPU hardware (CUDA)
  Provides reasonable compression ratio and operates above throughput of emerging networks


Charles Trevelyan for http://plus.maths.org/


Lossless Data Compression

Dictionary-based (Lempel-Ziv family) [gzip, lzop]
Variable-length entropy coders (Huffman, arithmetic coding)
Run-length encoding [fax]
Transforms (Burrows-Wheeler) [bzip2]
Special-purpose FP compressors [FPC, FSD, PLMI]
  Prediction and leading-zero suppression

None of these offer real-time speeds for state-of-the-art networks
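
As a rough illustration of the "prediction and leading-zero suppression" idea listed above, the following Python sketch encodes one double against a predicted value. The XOR residual, the byte-granular zero count, and all function names are assumptions made for this example; they are not taken from FPC, FSD, PLMI, or GFC.

```python
import struct

def double_to_bits(x: float) -> int:
    """Reinterpret an IEEE 754 double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(b: int) -> float:
    """Inverse of double_to_bits."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def leading_zero_bytes(residual: int) -> int:
    """Number of all-zero bytes at the top of a 64-bit value (0 to 8)."""
    return 8 - (residual.bit_length() + 7) // 8

def encode(value: float, prediction: float):
    """Return (zero-byte count, remaining residual bytes).

    Assumption: the residual is the XOR of the two bit patterns, so a
    good prediction cancels the sign, exponent, and upper mantissa bits.
    """
    residual = double_to_bits(value) ^ double_to_bits(prediction)
    lzb = leading_zero_bytes(residual)
    return lzb, residual.to_bytes(8 - lzb, "big")

def decode(lzb: int, payload: bytes, prediction: float) -> float:
    """Rebuild the original double from the stored bytes and the prediction."""
    assert len(payload) == 8 - lzb
    residual = int.from_bytes(payload, "big")
    return bits_to_double(residual ^ double_to_bits(prediction))
```

A perfect prediction leaves no payload bytes at all, while a poor one stores all eight bytes plus the count, which is one way to see why floating-point compression ratios tend to be modest.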


GFC Algorithm

GPUs require 1000s of parallel activities, but compression is a generally serial operation

Divide data into n chunks, processed in parallel
  Best performance: choose n to match the maximum number of resident warps

Each chunk composed of 32-word subchunks
  One double per warp thread
  Use previous subchunk to provide prediction values
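
The chunk and subchunk layout described above can be sketched on the CPU as follows. This is only an illustration in Python: the function names are hypothetical, and the choice to give every chunk a whole number of subchunks (with a possibly short final subchunk) is an assumption, since the slides do not say how the GFC code handles sizes that do not divide evenly.

```python
SUBCHUNK_WORDS = 32  # one double per thread of a 32-thread warp

def chunk_ranges(num_words: int, n_chunks: int):
    """Split [0, num_words) into n_chunks contiguous (start, end) ranges.

    Assumption: each chunk holds the same whole number of 32-word
    subchunks, so trailing chunks may be short or empty.
    """
    total_subchunks = -(-num_words // SUBCHUNK_WORDS)   # ceiling division
    per_chunk = -(-total_subchunks // n_chunks)         # ceiling division
    words_per_chunk = per_chunk * SUBCHUNK_WORDS
    ranges = []
    for c in range(n_chunks):
        start = min(c * words_per_chunk, num_words)
        end = min(start + words_per_chunk, num_words)
        ranges.append((start, end))
    return ranges

def subchunk_ranges(start: int, end: int):
    """Split one chunk into consecutive subchunks of up to 32 words."""
    return [(s, min(s + SUBCHUNK_WORDS, end))
            for s in range(start, end, SUBCHUNK_WORDS)]
```

In this picture each chunk would map to one warp, and the warp would walk its subchunks in order, so only the subchunks within a chunk depend on one another.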


Dimensionality

Many scientific data sets display dimensionality
  Interleaved coordinates from multiple dimensions

Optional dimensionality parameter to GFC
  Determines index of previous subchunk to use as the prediction
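
One plausible reading of the dimensionality parameter is sketched below in Python. The exact index rule is an assumption: here each lane takes the most recent value in the previous subchunk that belongs to the same coordinate, which is index 32 - dim + (lane mod dim). The names `prediction_index` and `predictions` are hypothetical.

```python
SUBCHUNK_WORDS = 32

def prediction_index(lane: int, dim: int) -> int:
    """Index into the previous subchunk used to predict this lane's value.

    Assumed rule: take the last `dim` values of the previous subchunk and
    pick the one holding the same coordinate as this lane.
    """
    if not 1 <= dim <= SUBCHUNK_WORDS:
        raise ValueError("dim must be between 1 and 32")
    return SUBCHUNK_WORDS - dim + (lane % dim)

def predictions(prev_subchunk, dim: int):
    """Prediction for each of the 32 lanes, drawn from the previous subchunk."""
    return [prev_subchunk[prediction_index(lane, dim)]
            for lane in range(SUBCHUNK_WORDS)]
```

With dim = 1 every lane is predicted by the last value of the previous subchunk. With dim = 3 and interleaved x, y, z data, x values are predicted by an x value, y by y, and z by z, which should keep the residuals small when each coordinate varies smoothly.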


GFC Algorithm (cont.)


GPU Optimizations

Low thread divergence (few if statements)
  Some short enough to be predicated

Coalesce memory accesses by packing/unpacking data in shared memory (for CC < 2.0)

Very little inter-thread communication and synchronization
  Prefix sum only

Warp-based implementation
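
To show why a prefix sum can be enough communication, the Python sketch below assumes that each thread produces a variable number of output bytes and needs to know where to write them. It emulates a step-doubling (Hillis-Steele style) scan across the lanes of one warp; the function names are hypothetical, and the actual GFC kernel may compute its offsets differently.

```python
def warp_exclusive_scan(counts):
    """Exclusive prefix sum computed in log2(n) lock-step rounds.

    Each round, lane i adds the value held by lane i - offset, which is
    how a warp can cooperate without any other synchronization.
    """
    n = len(counts)
    incl = list(counts)
    offset = 1
    while offset < n:
        # Build the new list from the old one, as if all lanes read first.
        incl = [incl[i] + (incl[i - offset] if i >= offset else 0)
                for i in range(n)]
        offset *= 2
    return [0] + incl[:-1] if n else []

def pack(payloads):
    """Concatenate per-lane byte strings using the scanned write offsets."""
    counts = [len(p) for p in payloads]
    offsets = warp_exclusive_scan(counts)
    total = offsets[-1] + counts[-1] if counts else 0
    out = bytearray(total)
    for off, p in zip(offsets, payloads):
        out[off:off + len(p)] = p   # each lane writes independently
    return bytes(out)
```

Once the offsets are known, every lane can write its bytes without waiting on any other lane.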




Evaluation Method

Systems
  Two quad-core 2.53 GHz Xeons
  NVIDIA FX 5800 GPU (CC 1.3)

13 datasets: real-world data (19 – 277 MB)
  Observational data, simulation results, MPI messages

Comparisons
  Compression ratio vs. 5 compressors in common use
  Throughput vs. pFPC (fastest known CPU compressor)


Compression Ratio

1.188 (range: 1.01 – 3.53)
  Low (FP data), but in line with other algorithms
  Largely independent of number of chunks

When done in real time, compression at this ratio can greatly speed up MPI apps
  3% – 98% speed-up [Ke et al., SC’04]
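
The real-time requirement can be made concrete with a little arithmetic, sketched here in Python. The model is an assumption: to keep a link saturated with compressed data, the compressor must consume uncompressed input at the link rate multiplied by the compression ratio, and PCIe or other transfer costs are ignored. The function names are hypothetical.

```python
def required_input_rate(link_gbps: float, ratio: float) -> float:
    """Uncompressed Gb/s the compressor must consume to fill the link."""
    return link_gbps * ratio

def keeps_up(compressor_gbps: float, link_gbps: float, ratio: float) -> bool:
    """True if the compressor is fast enough to saturate the link."""
    return compressor_gbps >= required_input_rate(link_gbps, ratio)

# 40 Gb/s InfiniBand at a ratio of 1.188 needs about 47.5 Gb/s of input,
# so a 75 Gb/s compressor keeps up; a 100 Gb/s link would need about 119 Gb/s.
print(keeps_up(75.0, 40.0, 1.188), keeps_up(75.0, 100.0, 1.188))
```

Under this model, even a modest ratio raises the effective bandwidth of the link by the same factor, provided the compressor is never the bottleneck.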


[Figure: harmonic mean compression ratio (axis 1.00 – 1.25) for bzip2, gzip, lzop, FPC, pFPC, and GFC]


Throughput

Compression: 75 – 87 Gb/s
  Mean: 77.9 Gb/s
Decompression: 90 – 121 Gb/s
  Mean: 96.6 Gb/s

4x faster than pFPC on 8 cores (2 CPUs)

Improvement over pFPC’s compression ratio vs. performance trend


[Figure: harmonic mean throughput (Gb/s, 0 – 100) vs. harmonic mean compression ratio (1.15 – 1.40), compression and decompression, for GFC and pFPC]


NEW: Fermi Throughput

Fermi improvements:
  Faster, simpler memory accesses
  Hardware support for count-leading-zeros op

Compression ratio: 1.187

Compression: 119 – 219 Gb/s (HM: 167.5 Gb/s)
Decompression: 169 – 219 Gb/s (HM: 180.3 Gb/s)

Compresses over 9.5x faster than pFPC on 8 x86 cores


[Figure: Fermi throughput (Gb/s, 0 – 200), compression and decompression]


Summary

GFC algorithm
  Chunks up data, each warp processes a chunk iteratively by 32-word subchunks
  No communication required between warps

Minimum throughput of 75 Gb/s (compression) and 90 Gb/s (decompression) on GTX-285, and 119 Gb/s and 169 Gb/s on Fermi, with a compression ratio of 1.19

CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/


Conclusions

GPU can compress much faster than PCIe bus can transfer the data

But…
  PCIe bus will become faster
  CPU-GPU increasingly on single die
  GPU-to-GPU, GPU-to-NIC transfers coming?

GFC is the first compressor with the potential to deliver real-time FP data compression for current and emerging network speeds



