Floating-Point Data Compression at 75 Gb/s on a GPU
Molly A. O’Neil and Martin Burtscher
Department of Computer Science
Introduction
Scientific simulations on HPC clusters
  Run on interconnected compute nodes
  Produce and transfer lots of floating-point data
Data storage and transfer are expensive and slow
  Compute nodes have multiple cores but only one link
Interconnects are getting faster
  Lonestar: 40 Gb/s InfiniBand
  Speeds of up to 100 Gb/s soon
March 2011
Introduction (cont.)
Compression
  Reduced storage, faster transfer
  Only useful when done in real time
    Saturate network with compressed data
    Requires compressor tailored to hardware capabilities
GFC algorithm for IEEE 754 double-precision data
  Designed specifically for GPU hardware (CUDA)
  Provides reasonable compression ratio and operates above throughput of emerging networks
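The real-time requirement can be made concrete with a little arithmetic. The sketch below assumes a simplified model that is not taken from the slides: compression overlaps perfectly with transfer, so the delivered rate of original data is the smaller of the compressor's throughput and the link rate multiplied by the compression ratio. The function and parameter names are illustrative.

```python
def effective_gbps(link_gbps, ratio, compressor_gbps):
    """Rate of original (uncompressed) data delivered, under a simple model.

    Assumption: compression and transfer are fully overlapped, so the
    slower of the two stages sets the pace.
    """
    # The link carries link_gbps of compressed bits, which corresponds to
    # link_gbps * ratio of original bits per second.
    link_limit = link_gbps * ratio
    return min(compressor_gbps, link_limit)


def is_worthwhile(link_gbps, ratio, compressor_gbps):
    """Compression only helps if it beats sending the raw data."""
    return effective_gbps(link_gbps, ratio, compressor_gbps) > link_gbps


if __name__ == "__main__":
    # 40 Gb/s link, ratio 1.188, 75 Gb/s compressor (figures quoted in the talk).
    print(effective_gbps(40.0, 1.188, 75.0))
    # A hypothetical 10 Gb/s compressor would slow the same link down.
    print(is_worthwhile(40.0, 1.188, 10.0))
```

Under this model a compressor slower than the link makes things worse, which is why throughput, and not only compression ratio, decides whether compression pays off.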
Lossless Data Compression
Dictionary-based (Lempel-Ziv family) [gzip, lzop]
Variable-length entropy coders (Huffman, arithmetic coding)
Run-length encoding [fax]
Transforms (Burrows-Wheeler) [bzip2]
Special-purpose FP compressors [FPC, FSD, PLMI]
  Prediction and leading-zero suppression
None of these offer real-time speeds for state-of-the-art networks
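To illustrate what "prediction and leading-zero suppression" means for a single value, here is a minimal sketch. It assumes an XOR residual, which is one common choice; the slides do not say which operation GFC itself uses, and the function names are illustrative. A good prediction shares its high-order bits with the true value, so the residual starts with zero bytes that need not be stored.

```python
import struct


def double_to_bits(x):
    """Reinterpret a float as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_double(bits):
    """Inverse of double_to_bits."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def encode(value, prediction):
    """Return (leading_zero_bytes, payload) for one double.

    Assumption: the residual is value XOR prediction.
    """
    residual = double_to_bits(value) ^ double_to_bits(prediction)
    # Drop the all-zero high-order bytes; only the rest is stored.
    payload = residual.to_bytes(8, "big").lstrip(b"\x00")
    return 8 - len(payload), payload


def decode(leading_zero_bytes, payload, prediction):
    """Rebuild the original double from the stored bytes and the prediction."""
    assert len(payload) == 8 - leading_zero_bytes
    residual = int.from_bytes(payload, "big")
    return bits_to_double(residual ^ double_to_bits(prediction))


if __name__ == "__main__":
    count, payload = encode(1.0000001, 1.0)
    print(count, payload.hex())
    print(decode(count, payload, 1.0))
```

The decoder needs the zero-byte count to know how many payload bytes belong to each value, so a real format stores that count alongside the payload.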
GFC Algorithm
GPUs require 1000s of parallel activities, but compression is a generally serial operation
Divide data into n chunks, processed in parallel
  Best performance: choose n to match max number of resident warps
Each chunk composed of 32-word subchunks
  One double per warp thread
  Use previous subchunk to provide prediction values
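The chunk and subchunk layout can be sketched sequentially as follows. This is not the GFC source code: it assumes that each lane predicts from the same lane of the previous subchunk, that the first subchunk of a chunk is predicted from zero, and that the residual is a 64-bit integer difference. All names are illustrative.

```python
import struct

WARP_SIZE = 32          # one double per warp thread, as on the slides
MASK64 = (1 << 64) - 1  # keep arithmetic within 64 bits


def to_bits(x):
    """Reinterpret a float as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def to_double(bits):
    """Inverse of to_bits."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def split_into_chunks(values, n_chunks):
    """Split values into at most n_chunks contiguous chunks.

    Every chunk except possibly the last holds a whole number of subchunks.
    """
    if not values:
        return []
    n_subchunks = -(-len(values) // WARP_SIZE)           # ceiling division
    per_chunk = -(-n_subchunks // n_chunks) * WARP_SIZE  # values per chunk
    return [values[i:i + per_chunk] for i in range(0, len(values), per_chunk)]


def chunk_residuals(chunk):
    """Residuals for one chunk (the work one warp would do)."""
    out = []
    for i, v in enumerate(chunk):
        # Assumed predictor: same lane of the previous subchunk, or zero
        # for the first subchunk, which has no predecessor in this chunk.
        pred = to_bits(chunk[i - WARP_SIZE]) if i >= WARP_SIZE else 0
        out.append((to_bits(v) - pred) & MASK64)
    return out


def restore_chunk(residuals):
    """Invert chunk_residuals by adding each prediction back."""
    bits = []
    for i, r in enumerate(residuals):
        pred = bits[i - WARP_SIZE] if i >= WARP_SIZE else 0
        bits.append((r + pred) & MASK64)
    return [to_double(b) for b in bits]
```

Because predictions never cross a chunk boundary, every chunk can be compressed and decompressed on its own, which is what lets the chunks run in parallel.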
Dimensionality
Many scientific data sets display dimensionality
  Interleaved coordinates from multiple dimensions
Optional dimensionality parameter to GFC
  Determines index of previous subchunk to use as the prediction
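A small experiment shows why the prediction index should respect dimensionality. The sketch below is a simplified stand-in, not GFC's actual indexing into the previous subchunk: it predicts value i from value i - stride, uses an XOR residual, and counts the leading zero bytes saved. The sample data and all names are made up for illustration.

```python
import struct


def to_bits(x):
    """Reinterpret a float as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def leading_zero_bytes(residual):
    """Number of all-zero bytes at the top of a 64-bit residual."""
    return 8 - (residual.bit_length() + 7) // 8


def total_zero_bytes(values, stride):
    """Zero bytes saved when value i is predicted from value i - stride."""
    total = 0
    for i in range(stride, len(values)):
        residual = to_bits(values[i]) ^ to_bits(values[i - stride])
        total += leading_zero_bytes(residual)
    return total


def interleave(xs, ys):
    """Build x0, y0, x1, y1, ... from two coordinate series."""
    return [v for pair in zip(xs, ys) for v in pair]


if __name__ == "__main__":
    # Hypothetical two-dimensional data: each coordinate changes slowly,
    # but x and y have very different magnitudes.
    xs = [1.0 + k * 1e-6 for k in range(100)]
    ys = [1000.0 + k * 1e-3 for k in range(100)]
    data = interleave(xs, ys)
    print("stride 1:", total_zero_bytes(data, 1))
    print("stride 2:", total_zero_bytes(data, 2))
```

With stride 1 every x is predicted from a y and vice versa, so the residuals share no high-order bits; with a stride equal to the number of dimensions each value is predicted from the previous value of the same coordinate.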
GFC Algorithm (cont.)
GPU Optimizations
Low thread divergence (few if statements)
  Some short enough to be predicated
Coalesce memory accesses by packing/unpacking data in shared memory (for CC < 2.0)
Very little inter-thread communication and synchronization
  Prefix sum only
  Warp-based implementation
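The prefix sum is worth a short illustration. A plausible reading, stated here as an assumption and not confirmed by the slides, is that each thread produces a variable number of compressed bytes, so a thread's write position depends on the sizes produced by the threads before it. An exclusive prefix sum over the sizes yields those positions. The sketch below is sequential and its names are illustrative.

```python
def exclusive_prefix_sum(sizes):
    """offsets[i] is the sum of sizes[0..i-1]."""
    offsets = []
    running = 0
    for s in sizes:
        offsets.append(running)
        running += s
    return offsets


def pack(pieces):
    """Concatenate variable-length byte strings at prefix-sum offsets."""
    sizes = [len(p) for p in pieces]
    offsets = exclusive_prefix_sum(sizes)
    out = bytearray(sum(sizes))
    # Once the offsets are known, the writes are independent of each other;
    # on a GPU each thread could perform its own write.
    for piece, off in zip(pieces, offsets):
        out[off:off + len(piece)] = piece
    return bytes(out), offsets


if __name__ == "__main__":
    packed, offsets = pack([b"ab", b"", b"cde"])
    print(packed, offsets)
```

Computing the offsets is the only step in which threads need to know about each other's results.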
Evaluation Method
Systems
  Two quad-core 2.53 GHz Xeons
  NVIDIA FX 5800 GPU (CC 1.3)
13 datasets: real-world data (19 – 277 MB)
  Observational data, simulation results, MPI messages
Comparisons
  Compression ratio vs. 5 compressors in common use
  Throughput vs. pFPC (fastest known CPU compressor)
Compression Ratio
1.188 (range: 1.01 – 3.53)
  Low (FP data), but in line with other algorithms
  Largely independent of number of chunks
When done in real time, compression at this ratio can greatly speed up MPI apps
  3% – 98% speed-up [Ke et al., SC’04]
[Figure: harmonic mean compression ratio (axis 1.00 to 1.25) for bzip2, gzip, lzop, FPC, pFPC, and GFC]
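The charts report harmonic means. As a reminder of how that summary is computed and how it differs from an arithmetic mean, here is a minimal sketch. The input ratios in the usage lines are made up for illustration and are not the per-dataset results from the talk.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Ratio of original size to compressed size (higher is better)."""
    return original_bytes / compressed_bytes


def harmonic_mean(values):
    """n divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


def arithmetic_mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical per-dataset ratios, chosen only to show the effect of
    # one highly compressible dataset on each kind of mean.
    ratios = [1.05, 1.10, 3.50]
    print(harmonic_mean(ratios), arithmetic_mean(ratios))
```

A single highly compressible dataset pulls the arithmetic mean up much more than the harmonic mean, so the harmonic mean is the more conservative summary.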
Throughput
Compression: 75 – 87 Gb/s
  Mean: 77.9 Gb/s
Decompression: 90 – 121 Gb/s
  Mean: 96.6 Gb/s
4x faster than pFPC on 8 cores (2 CPUs)
Improvement over pFPC’s compression ratio vs. performance trend
[Figure: harmonic mean throughput (Gb/s, axis 0 to 100) vs. harmonic mean compression ratio (axis 1.15 to 1.40); compression and decompression, GFC and pFPC]
NEW: Fermi Throughput
Fermi improvements:
  Faster, simpler memory accesses
  Hardware support for count-leading-zeros op
Compression ratio: 1.187
Compression: 119 – 219 Gb/s (harmonic mean: 167.5 Gb/s)
Decompression: 169 – 219 Gb/s (harmonic mean: 180.3 Gb/s)
Compresses over 9.5x faster than pFPC on 8 x86 cores
[Figure: compression and decompression throughput on Fermi (Gb/s, axis 0 to 200)]
Summary
GFC algorithm
  Chunks up data; each warp processes one chunk iteratively, one 32-word subchunk at a time
  No communication required between warps
Minimum throughput of 75 Gb/s (compression) and 90 Gb/s (decompression) on GTX-285, and 119 Gb/s and 169 Gb/s on Fermi, with a compression ratio of 1.19
CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/
Conclusions
GPU can compress much faster than PCIe bus can transfer the data
But…
  PCIe bus will become faster
  CPU-GPU increasingly on single die
  GPU-to-GPU, GPU-to-NIC transfers coming?
GFC is the first compressor with the potential to deliver real-time FP data compression for current and emerging network speeds