Parallel White Noise Generation on a GPU via Cryptographic Hash
Stanley Tzeng, Li-Yi Wei
Microsoft Research Asia
Dec 19, 2015
What is White Noise?
Spatial domain: uniform random number
Frequency domain: white noise
[Figure panels: spatial domain, frequency domain]
Importance
Mother of all random numbers
Commonly used, e.g. rand() in C/C++
Major algorithms sequential
e.g. x_n = (a * x_{n-1} + b) mod c
Processors are becoming parallel
GPU, multi-core CPU, Cell
sequential algorithms cannot leverage that
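The recurrence above can be sketched in a few lines to show why it is hard to parallelize: each value depends on the one before it. The constants a, b, and c below are illustrative assumptions (values commonly quoted for C library rand() implementations), not taken from the slides.

```python
def lcg_sequence(seed, count, a=1103515245, b=12345, c=2**31):
    """Return `count` values of x_n = (a * x_{n-1} + b) mod c, starting from `seed`.

    The constants are illustrative defaults, not values from the talk.
    """
    values = []
    x = seed
    for _ in range(count):
        # Each step needs the previous value, so the loop is inherently serial.
        x = (a * x + b) % c
        values.append(x)
    return values


print(lcg_sequence(seed=1, count=5))
```

To obtain the millionth value, this generator has to step through the 999,999 values before it, which is the dependency a parallel processor cannot exploit.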
Contribution
Parallel algorithm for white noise
independent evaluation for every sample
easy implementation as a GPU pixel shader
speed faster than sequential algorithms
quality same or better
usage similar to texture mapping
PRNG (Pseudo Random Number Generator)
The main source of randomness in programs
Desirable properties
white noise statistics
repeatable
fast computation
low memory usage
Core Idea
1. input trivially prepared in parallel, e.g. linear ramp
2. feed input value into hash, independently and in parallel
3. output white noise
key idea:
borrow cryptographic hash!
[Diagram: input -> hash -> output]
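The three steps can be sketched on a CPU as follows. This is a minimal illustration, not the authors' pixel shader: it uses Python's hashlib.md5 as the hash, and the function name noise_sample, the seed parameter, and the byte packing are assumptions made for the example.

```python
import hashlib
import struct


def noise_sample(index, seed=0):
    """Hash one input value and return four floats in [0, 1).

    `seed` and the little-endian packing are hypothetical choices for this sketch.
    """
    message = struct.pack("<II", seed, index)
    digest = hashlib.md5(message).digest()   # 128-bit output
    words = struct.unpack("<4I", digest)     # four 32-bit words
    return tuple(w / 2**32 for w in words)


# Step 1: the input is a linear ramp, trivially prepared.
ramp = range(8)
# Step 2: every sample is hashed independently; no sample reads another's result,
# so this loop could be distributed across any number of processors.
noise = [noise_sample(i) for i in ramp]
# Step 3: the output is the noise.
for i, sample in zip(ramp, noise):
    print(i, sample)
```

Because no state is carried between iterations, noise_sample(5) can be evaluated without evaluating samples 0 through 4 first.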
Cryptographic Hash
A subclass of hash
Commonly used for security applications
e.g. password, digital signature
Properties
irreversible – cannot find input from hash output
decorrelating – similar inputs, dissimilar outputs
uniform probability – all outputs likely to occur
Cryptographic Hash - Example
irreversible, decorrelating, uniform probability
CHash("The quick brown fox jumps over the lazy dog")
= 9e107d9d372bb6826bd81d3542a419d6
CHash("The quick brown fox jumps over the lazy eog")
= ffd93f16876049265fbaef4da268dd0e
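The decorrelating property can be checked directly. The sketch below assumes that CHash on this slide is MD5 (the first digest shown is the well-known MD5 of that sentence) and counts how many of the 128 output bits change when a single input character changes. The helper names are hypothetical.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 digest of an ASCII string as 32 hex characters."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def differing_bits(hex_a, hex_b):
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


a = md5_hex("The quick brown fox jumps over the lazy dog")
b = md5_hex("The quick brown fox jumps over the lazy eog")
print(a)
print(b)
print(differing_bits(a, b), "of 128 bits differ")
```

For a well-mixed hash, roughly half of the output bits are expected to flip for any small input change.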
Cryptographic Hash as a PRNG
White noise statistics
CHash is cryptographically secure
Repeatable
CHash is invariant with same input
Fast computation
CHash is parallel + constant cost
Low memory usage
CHash maintains no state
Order-independent, i.e. randomly accessible
important for parallel GPU applications
Which Cryptographic Hash?
Many options
MD5, SHA, RIPEMD, Tiger, block ciphers, etc.
Desirable properties
white noise quality
fast computation
power-of-2 aligned (output & operations)
pure pixel shader, no state maintenance
Our Hash of Choice: MD5 [Rivest 1992]
128-bit outputs and 32-bit operation
Small number of constants fit entirely in shader
Fastest among those satisfying quality criteria
Not 100% secure [Wang and Yu 2005]
but good enough for our goal
MD5 Algorithm Overview
[Diagram: Input -> Scrambling (bit op, table, arithmetic) -> Output; the scrambling stage reads a shift table and a sin table and runs for 64 rounds]
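For readers who want to see what the scrambling stage does, here is a compact reference sketch of MD5 following the public RFC 1321 description. It is not the authors' shader code; the names SHIFTS, SINES, rotl, and md5 are choices made for this example. The two tables correspond to the shift table and sin table in the diagram, and the inner loop is the 64 rounds.

```python
import hashlib
import math
import struct

# Shift table: left-rotation amount for each of the 64 rounds.
SHIFTS = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
          [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)
# Sin table: floor(2^32 * |sin(i + 1)|) for i = 0..63.
SINES = [int(abs(math.sin(i + 1)) * 2**32) for i in range(64)]
MASK = 0xFFFFFFFF


def rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    x &= MASK
    return ((x << n) | (x >> (32 - n))) & MASK


def md5(message):
    """Return the 16-byte MD5 digest of `message` (bytes)."""
    # Pad to a multiple of 64 bytes: 0x80, zeros, then the bit length.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack("<Q", (8 * len(message)) & 0xFFFFFFFFFFFFFFFF)

    a0, b0, c0, d0 = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
    for offset in range(0, len(padded), 64):
        m = struct.unpack("<16I", padded[offset:offset + 64])
        a, b, c, d = a0, b0, c0, d0
        for i in range(64):  # the 64 rounds
            if i < 16:
                f, g = (b & c) | (~b & d), i
            elif i < 32:
                f, g = (d & b) | (~d & c), (5 * i + 1) % 16
            elif i < 48:
                f, g = b ^ c ^ d, (3 * i + 5) % 16
            else:
                f, g = c ^ (b | ~d), (7 * i) % 16
            f = (f + a + SINES[i] + m[g]) & MASK        # arithmetic + table lookup
            a, d, c = d, c, b
            b = (b + rotl(f, SHIFTS[i])) & MASK          # bit rotation
        a0, b0 = (a0 + a) & MASK, (b0 + b) & MASK
        c0, d0 = (c0 + c) & MASK, (d0 + d) & MASK
    return struct.pack("<4I", a0, b0, c0, d0)


# Cross-check against the standard library implementation.
text = b"The quick brown fox jumps over the lazy dog"
print(md5(text).hex(), md5(text) == hashlib.md5(text).digest())
```

Each round uses only 32-bit bitwise operations, additions, and two table lookups, which is consistent with the slide's point that MD5 needs just 32-bit operations and a small number of constants.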
Performance Bottlenecks for Pixel Shader
[Diagram: Input -> Scrambling (bit op, table, arithmetic) -> Output, with shift table, sin table, and 64 rounds]
Our Optimization
[Diagram: Input -> Scrambling (bit op, table, arithmetic) -> Output, with shift table, sin table, and 64 rounds]
sin function
reduced shift table
loop unrolling
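One plausible reading of the first two labels is sketched below; the slides do not spell out the details, so treat this as an assumption. In standard MD5 the 64 sine-derived constants can be computed from the sine function instead of being stored, and the 64-entry shift table repeats with period 4 inside each group of 16 rounds, so 16 entries are enough. The names are hypothetical.

```python
import math

# The full 64-entry shift table from the MD5 specification.
FULL_SHIFT_TABLE = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
                    [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)

# Reduced table: four entries for each of the four groups of 16 rounds.
REDUCED_SHIFT_TABLE = [7, 12, 17, 22,
                       5, 9, 14, 20,
                       4, 11, 16, 23,
                       6, 10, 15, 21]


def shift_amount(i):
    """Rotation amount for round i, looked up in the 16-entry table."""
    return REDUCED_SHIFT_TABLE[4 * (i // 16) + (i % 4)]


def sine_constant(i):
    """Additive constant for round i, computed on the fly instead of stored."""
    return int(abs(math.sin(i + 1)) * 2**32)


print(all(shift_amount(i) == FULL_SHIFT_TABLE[i] for i in range(64)))
print(hex(sine_constant(0)), hex(sine_constant(63)))
```

Python evaluates the sine in double precision, which is enough to reproduce the constants exactly; a shader limited to 32-bit floats may need extra care to obtain the same bits.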
Previous PRNG
GPU
BBS [Blum et al. 1986, Olano 2005]
O extremely fast
X not good quality
CEICG [Entacher et al. 1998, Sussman et al. 2006]
O decent quality
X processing time varies
AES [NIST 2001, Yamanouchi 2007]
O invertible (not hash)
X not good quality
CPU
rand
O commonly used
X not good quality
drand48
O better quality
X slower
Mersenne Twister [Matsumoto and Nishimura 1998]
O high quality and fast
X not randomly accessible
Assessing Quality: DIEHARD [Marsaglia 1995]
De facto standard on measuring PRNG quality
Runs 15 different tests on the bits generated
Each test outputs a p-value; a p-value of exactly 0 or 1 counts as a failure.
BIRTHDAY SPACINGS TEST, M= 512 N=2**24 LAMBDA= 2.0000
Results for aes.bin
For a sample of size 500, aes.bin using bits 1 to 24: mean 2.036
duplicate spacings    number observed    number expected
0                     66.                67.668
1                     130.               135.335
2                     148.               135.335
3                     80.                90.224
4                     44.                45.112
5                     20.                18.045
6 to INF              12.                8.282
Chisquare with 6 d.o.f. = 4.50 p-value= .391147
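The chi-square statistic and p-value in this listing can be recomputed from the observed and expected counts. The sketch below assumes that the reported p-value is the chi-square cumulative distribution function evaluated at the statistic, which is consistent with the numbers shown; the function names are hypothetical.

```python
import math

# Counts copied from the birthday spacings listing.
observed = [66, 130, 148, 80, 44, 20, 12]
expected = [67.668, 135.335, 135.335, 90.224, 45.112, 18.045, 8.282]


def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_cdf_even_dof(x, dof):
    """Chi-square CDF for an even number of degrees of freedom (closed form)."""
    half = x / 2
    series = sum(half ** k / math.factorial(k) for k in range(dof // 2))
    return 1 - math.exp(-half) * series


statistic = chi_square(observed, expected)
p_value = chi_square_cdf_even_dof(statistic, 6)
print(round(statistic, 2), round(p_value, 4))
```

The result agrees with the listing to its printed precision (about 4.50 and 0.391). A p-value near the middle of [0, 1], as here, gives no evidence against uniformity for this one test.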
Cumulative Distribution Function
Shows how data is distributed within set
Given x in data, what % of data values are ≤ x
[Figure: cumulative distribution functions of a normal distribution and of a uniform distribution; horizontal axis x from 0 to 1, vertical axis from 0% to 100%]
Kolmogorov-Smirnov Test
Measures how alike two sets of data are
Looks at max difference D between distribution functions
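[Figure: two pairs of cumulative distribution functions, x from 0 to 1 and 0% to 100%; left pair "not alike" with a large D, right pair "alike" with a small D]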
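The distance D can be computed in a few lines. This sketch is the one-sample form of the test: it compares the empirical cumulative distribution of a data set against the uniform distribution on [0, 1], which is the comparison applied to p-values. The function name and the two sample sets are made up for illustration.

```python
def ks_distance_to_uniform(samples):
    """Maximum gap D between the empirical CDF of `samples` and the uniform CDF.

    Assumes every sample lies in [0, 1].
    """
    ordered = sorted(samples)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        # At x the empirical CDF steps from i/n up to (i + 1)/n,
        # while the uniform CDF equals x; check the gap on both sides.
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d


evenly_spread = [(i + 0.5) / 10 for i in range(10)]   # resembles uniform
clustered = [0.9 + i / 100 for i in range(10)]        # bunched near 1

print(ks_distance_to_uniform(evenly_spread))   # small D: alike
print(ks_distance_to_uniform(clustered))       # large D: not alike
```

The evenly spread set gives D of about 0.05 and the clustered set gives D of about 0.9, matching the "alike" and "not alike" cases.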
Assessing Quality: DIEHARD
Run the results of the DIEHARD test (p-value) through a KS-test. Look at D-value.
[Figure: cumulative distribution function plot, 0 to 100%, showing the uniform distribution curve, the p-value curve, and the D-value as the maximum gap between them]
Smaller D is better quality!
Assessing Quality: Power Spectrum
Radial mean: should be uniform
Radial variance: should be low & uniform
[Figure panels: power spectrum density, radial mean, radial variance (anisotropy)]
Assessing Speed: Batch Rendering
Clock time to generate random bits
n^2 x 128-bit image, n = 512, 1024, 2048, and 4096
[Figure: an n x n image]
Assessing Speed: Texture Subset (for random accessibility)
A huge virtual texture
Clock time to access subsets A and B
Measure the difference (smaller is better)
[Figure: a 2^20 x 2^20 virtual texture containing two subsets, A and B]
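The idea behind this test can be sketched as follows, assuming the virtual texture is 2^20 texels on each side. With a hash-based generator, a texel's value depends only on its coordinates, so a subset far from the origin is evaluated directly and costs the same as one near the origin. The function names, the seed parameter, and the coordinate packing are hypothetical, and hashlib.md5 stands in for the shader.

```python
import hashlib
import struct

TEXTURE_SIZE = 2**20  # assumed side length of the virtual texture


def texel(x, y, seed=0):
    """Value of one texel: four 32-bit words hashed from its coordinates."""
    message = struct.pack("<III", seed, x, y)
    return struct.unpack("<4I", hashlib.md5(message).digest())


def subset(x0, y0, width, height):
    """Evaluate a rectangular subset without touching any other texel."""
    return [[texel(x, y) for x in range(x0, x0 + width)]
            for y in range(y0, y0 + height)]


# Subset A sits at the origin, subset B at the far corner.
A = subset(0, 0, 4, 4)
B = subset(TEXTURE_SIZE - 4, TEXTURE_SIZE - 4, 4, 4)
print(A[0][0])
print(B[3][3])
```

A stateful sequential generator would have to step through every preceding value to reach subset B, which is what the difference in access time is meant to expose.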
Test Results: DIEHARD Results
[Chart: DIEHARD tests passed (axis 0 to 15, the higher the better) and DIEHARD test D-value (axis 0 to 1, the lower the better) for MD5GPU, GPU CEICG, GPU BBS, GPU AES, rand, drand48, M. Twister]
Test Results: Batch Render Speed
[Chart: frames per second (axis 0 to 60) at n = 512, 1024, 2048, 4096 for MD5CPU, MD5GPUref, MD5GPUopt, GPU CEICG, GPU BBS, GPU AES, rand, drand48, M. Twister]
Test Results: Texture Subset Speed
[Chart: texture subset difference in ms (log axis, 1 to 1,000,000) for MD5CPU, MD5GPUref, MD5GPUopt, GPU CEICG, GPU BBS, GPU AES, rand, drand48, M. Twister. Data labels as extracted: 3.1, 0, 0, 4.8, 0, 0, 362776257001.9 (the last label appears to be several values fused together)]
Trading Quality for Speed
Reducing # of rounds
O faster speed
X lower quality
Rounds    Time (ms)    DIEHARD tests passed    KS D-value
64        6.3          15/15                   0.2029
48        4.7          14/15                   0.2042
32        3.1          13/15                   0.2295
16        1.6          13/15                   0.253
Future Work
Implement our method in hardware
very similar to texture unit but much smaller
(no need for cache)
Alternative hashes
benefit from future advances in cryptographic hashes