Introduction CUDA Error Free Transformations Implementation Experimental Results
Correctly Rounded Dot Product in CUDA
Alexander Dallmann, Fabian Jörg, Marco Nehmeier and Jürgen Wolff von Gudenberg
Chair of Computer Science II, Software Engineering, Universität Würzburg
Germany
SWIM 2014
Marco Nehmeier — Correctly Rounded Dot Product in CUDA 1/38
Outline
1 Introduction
2 CUDA
3 Error Free Transformations
4 Implementation
5 Experimental Results
Motivation: Computer Architecture
A change in the design of computer architecture
Single-core ⇒ multi-core
Circumvent physical constraints
Parallelize the computation
Motivation: Graphics Devices
GPGPU
Performance
Highly parallel
CUDA (Compute Unified Device Architecture)
NVIDIA GPUs
OpenCL (Open Computing Language)
Open standard
C-like programming language
CPUs, GPUs, and many others
CUDA: Compute Unified Device Architecture
GPU architecture for General Purpose Computing
PCI Express add-in boards
Highly parallel
Several streaming multiprocessors (SM)
Each consisting of several cores
Compute capability = classification
Usable like a many-core CPU
CUDA C
CUDA Thread Hierarchy
Grid = group of thread blocks
Thread block = group of threads running on one SM
Warp = 32 threads of a thread block
SIMT (Single Instruction Multiple Threads)
Fundamental unit
[Figure: a grid of six thread blocks, Block (0,0) to Block (2,1); Block (1,1) is expanded into a 4 × 4 array of threads, Thread (0,0) to Thread (3,3)]
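As a rough illustration of the hierarchy above, the following host-side C++ sketch mimics how a 2D thread coordinate inside a 2D grid can be flattened to a linear index and assigned to a warp. The `Dim2` type and all function names are hypothetical stand-ins for CUDA's built-in `threadIdx`, `blockIdx`, `blockDim` and `gridDim`. The global numbering is only one common convention and is not taken from the slides.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's built-in 2D index and dimension variables.
struct Dim2 {
    std::uint32_t x;
    std::uint32_t y;
};

const std::uint32_t kWarpSize = 32;  // warp = 32 threads of a thread block

// Linear thread id inside its block: x runs fastest, as in CUDA.
inline std::uint32_t threadIdInBlock(Dim2 threadIdx, Dim2 blockDim) {
    return threadIdx.x + threadIdx.y * blockDim.x;
}

// Index of the warp a thread belongs to within its block.
inline std::uint32_t warpIdInBlock(Dim2 threadIdx, Dim2 blockDim) {
    return threadIdInBlock(threadIdx, blockDim) / kWarpSize;
}

// One possible global numbering (an assumption): blocks are numbered row by
// row, and threads are numbered consecutively within each block.
inline std::uint32_t globalThreadId(Dim2 blockIdx, Dim2 gridDim,
                                    Dim2 threadIdx, Dim2 blockDim) {
    std::uint32_t block = blockIdx.x + blockIdx.y * gridDim.x;
    return block * (blockDim.x * blockDim.y) + threadIdInBlock(threadIdx, blockDim);
}
```

For a 3 × 2 grid of 4 × 4 blocks, Thread (2,3) in Block (1,1) gets in-block id 14, warp 0, and global id 78 under this numbering.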
CUDA Memory Model
Memory    Access  Scope                 Lifetime
Register  R/W     1 thread              Thread
Local     R/W     1 thread              Thread
Shared    R/W     All threads in block  Block
Global    R/W     All threads + host    Host allocation
Constant  R       All threads + host    Host allocation
Texture   R       All threads + host    Host allocation
CUDA Processing flow
Error Free Transformations
[Figure: the exact sum a + b of two floating point numbers splits into the rounded sum s = fl(a + b) and the remainder r]
The error of the summation of two floating point numbers is also a floating point number
Computable by a simple algorithm
(s, r) = twoSum(a, b):
s = a + b
a’ = s - b
b’ = s - a’
δa = a - a’
δb = b - b’
r = δa + δb
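The twoSum steps above translate almost line by line into C++. The following is a minimal host-side sketch, assuming IEEE 754 arithmetic in round-to-nearest and a build without value-changing floating point optimizations (no `-ffast-math`). The template form and the variable names are illustrative choices, not taken from the slides.

```cpp
#include <utility>

// twoSum: returns (s, r) with s = fl(a + b) and a + b = s + r exactly,
// provided no overflow occurs.
template <typename T>
std::pair<T, T> twoSum(T a, T b) {
    T s  = a + b;
    T a1 = s - b;    // a' : the part of s contributed by a
    T b1 = s - a1;   // b' : the part of s contributed by b
    T da = a - a1;   // δa : what was lost from a
    T db = b - b1;   // δb : what was lost from b
    return {s, da + db};
}
```

For example, `twoSum(1e100, 1.0)` yields `s = 1e100` and `r = 1.0`: the summand that vanished in the rounded sum is recovered in the error term.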
twoProduct
(s, r) = twoProduct (a, b):
s = a · b
(a1, a2) = split (a)
(b1, b2) = split (b)
r = a2 · b2 - (((s - a1 · b1) - a2 · b1 ) - a1 · b2)
(x,y) = split (a):
factor = 2^s + 1
c = factor · a
x = c - (c - a)
y = a - x
(s, r) = twoProduct_fast (a, b):
s = a · b
r = fma(a, b, -s)
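A minimal C++ sketch of the three routines above for `double`. It assumes the split exponent s = 27 (half of the 53-bit significand, rounded up), which the slides leave open. The camel-case name `twoProductFast` stands in for `twoProduct_fast`. Overflow and underflow are not handled.

```cpp
#include <cmath>
#include <utility>

// split: a = x + y, where each half has few enough significant bits that
// products of halves are exact. Assumption: s = 27 for double.
inline std::pair<double, double> split(double a) {
    const double factor = 134217729.0;  // 2^27 + 1
    double c = factor * a;
    double x = c - (c - a);
    double y = a - x;
    return {x, y};
}

// twoProduct: s = fl(a * b) and a * b = s + r exactly.
inline std::pair<double, double> twoProduct(double a, double b) {
    double s = a * b;
    std::pair<double, double> as = split(a);  // (a1, a2)
    std::pair<double, double> bs = split(b);  // (b1, b2)
    double r = as.second * bs.second
             - (((s - as.first * bs.first) - as.second * bs.first)
                - as.first * bs.second);
    return {s, r};
}

// twoProductFast: same result using one fused multiply-add, which computes
// a * b - s with a single rounding.
inline std::pair<double, double> twoProductFast(double a, double b) {
    double s = a * b;
    double r = std::fma(a, b, -s);
    return {s, r};
}
```

With a = b = 1 + 2^-30, both variants return s = 1 + 2^-29 and r = 2^-60, the part of the exact product that does not fit into s.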
Long Accumulator
Array of 67 × 64-bit words (unsigned long long int)
32 bits data
32 bits overflow
2 accumulators
positive
negative
Long Accumulator
addFloatToAccuDev
Device function
Atomic add
Memory accesses
double: max. 3
float: max. 2
SIMT
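To make the "max. 3 memory accesses" figure concrete, here is a sequential host-side sketch of adding one `double` to such an accumulator. It is not the authors' `addFloatToAccuDev`. The names `LongAccu` and `addToAccu` are hypothetical, and so is the layout assumption that bit 0 of word 0 has weight 2^-1074. On the device, the three `+=` updates would be atomic adds.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// 67 words of 64 bits: the low 32 bits of each word hold data, the high
// 32 bits collect overflow until the carries are propagated.
// Assumption: bit 0 of word 0 has weight 2^-1074.
using LongAccu = std::array<std::uint64_t, 67>;

// Adds |x| to the accumulator; the caller picks the positive or the negative
// accumulator according to the sign of x. x must be finite.
inline void addToAccu(LongAccu& acc, double x) {
    if (x == 0.0) return;
    int exp = 0;
    double frac = std::frexp(std::fabs(x), &exp);  // |x| = frac * 2^exp
    std::uint64_t m = static_cast<std::uint64_t>(std::ldexp(frac, 53));
    int e = exp - 53;                              // |x| = m * 2^e, m < 2^53
    if (e < -1074) {                               // subnormal input:
        m >>= (-1074 - e);                         // drop trailing zero bits
        e = -1074;
    }
    int pos   = e + 1074;   // accumulator bit that receives bit 0 of m
    int word  = pos / 32;
    int shift = pos % 32;
    std::uint64_t lo = m << shift;                       // low 64 bits
    std::uint64_t hi = shift ? (m >> (64 - shift)) : 0;  // bits above 64
    // 53 significand bits shifted by at most 31 span at most three words.
    acc[word] += lo & 0xFFFFFFFFull;
    if (lo >> 32) acc[word + 1] += lo >> 32;
    if (hi)       acc[word + 2] += hi;
}
```

A `float` has a 24-bit significand, so the shifted value spans at most 55 bits and touches at most two words, which matches the counts on the slide.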
Long Accumulator
addFloatsToAccuKernel
1 Allocate memory on device
2 Add all floats to the accumulators (positive and negative)
addFloatToAccuDev
3 Propagate carry bits
4 Positive accu + negative accu
5 Determine exactly rounded floating point number
Long Accumulator
Propagate carry bits
1100101010110101
1100101010110101
1100101101111111
0001000010000000
0001000101001011
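Carry propagation over 64-bit words with 32 data bits and 32 overflow bits can be sketched as a single sequential sweep. This is only an illustration of the idea: the slides do not show how the GPU version is organised, and the names `LongAccu` and `propagateCarries` are hypothetical.

```cpp
#include <array>
#include <cstdint>

// 32 data bits (low half) plus 32 overflow bits (high half) per word.
using LongAccu = std::array<std::uint64_t, 67>;

// Moves the overflow half of every word into the next higher word, so that
// afterwards each word holds only its 32 data bits. Returns the carry that
// leaves the topmost word (0 if the accumulator did not overflow).
inline std::uint64_t propagateCarries(LongAccu& acc) {
    std::uint64_t carry = 0;
    for (auto& w : acc) {
        w += carry;          // assumes the word still has headroom left
        carry = w >> 32;     // overflow half becomes the next carry
        w &= 0xFFFFFFFFull;  // keep the data half
    }
    return carry;
}
```

A carry can ripple through several words: if word 0 holds `0x100000005` and word 1 holds `0xFFFFFFFF`, the sweep leaves 5 in word 0, 0 in word 1, and 1 in word 2.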
Long Accumulator
Exactly rounded floating point number
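One way to obtain the rounded result from a carry-free, non-negative accumulator is to take the 53 leading bits and round to nearest, ties to even, using a guard bit and a sticky bit. The sketch below shows that approach under the assumption that bit 0 of word 0 has weight 2^-1074. The names `LongAccu`, `bitAt` and `roundAccu` are hypothetical, and the bit-by-bit loops favour clarity over speed.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

using LongAccu = std::array<std::uint64_t, 67>;

// Reads bit i of a carry-free accumulator (32 data bits per word).
inline bool bitAt(const LongAccu& acc, int i) {
    return (acc[i / 32] >> (i % 32)) & 1u;
}

// Rounds the accumulator value to the nearest double, ties to even.
inline double roundAccu(const LongAccu& acc) {
    int top = -1;  // position of the most significant set bit
    for (int i = 67 * 32 - 1; i >= 0; --i) {
        if (bitAt(acc, i)) { top = i; break; }
    }
    if (top < 0) return 0.0;

    int low = top >= 52 ? top - 52 : 0;  // lowest bit kept in the significand
    std::uint64_t m = 0;
    for (int i = top; i >= low; --i) m = (m << 1) | bitAt(acc, i);

    if (low > 0) {
        bool guard = bitAt(acc, low - 1);  // first discarded bit
        bool sticky = false;               // any lower discarded bit set?
        for (int i = low - 2; i >= 0 && !sticky; --i) sticky = bitAt(acc, i);
        if (guard && (sticky || (m & 1u))) ++m;
    }
    // m has at most 54 bits with trailing zero if it grew, so the
    // conversion is exact; ldexp returns infinity on overflow.
    return std::ldexp(static_cast<double>(m), low - 1074);
}
```

The sticky bit is what separates a true tie from a value just above it: 1 + 2^-53 rounds to 1.0, while 1 + 2^-53 + 2^-1074 rounds up to 1 + 2^-52.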
EFT accumulation
addArrayWithCuda
Compute sum of n floating point numbers
Iterative execution
Repetitive kernel executions
Tree-like
Parameters
inputs
errors
maxExp
numErrNonZero
EFT accumulation: error handling
errorFreeSum
1 Compute the sum S and store the result in inputs. Store the errors in errors.
2 Compute the sum E of all errors and add E to S. Store the new errors in errors.
3 Estimate the remaining error:
$\mathrm{numErrNonZero} \cdot 2^{\mathrm{maxExp}+1} \ge \left| \sum_i e_i \right|, \quad e_i \in \mathrm{errors}$
4 Go to 2 if the remaining error has an influence on the sum.
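The loop above can be sketched sequentially in C++. This is an illustration under several assumptions, not the tree-like kernel from the talk: the sums are computed left to right with twoSum, "has an influence" is approximated by checking whether adding or subtracting the bound changes S, and a hypothetical `maxPasses` cap guarantees termination. The early return when E is exact is an addition of this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

inline std::pair<double, double> twoSum(double a, double b) {
    double s = a + b;
    double a1 = s - b, b1 = s - a1;
    return {s, (a - a1) + (b - b1)};
}

inline double errorFreeSum(const std::vector<double>& inputs, int maxPasses = 64) {
    double S = 0.0;
    std::vector<double> errors;
    for (double x : inputs) {                       // step 1
        std::pair<double, double> sr = twoSum(S, x);
        S = sr.first;
        if (sr.second != 0.0) errors.push_back(sr.second);
    }
    for (int pass = 0; pass < maxPasses && !errors.empty(); ++pass) {
        double E = 0.0;                             // step 2
        std::vector<double> next;
        for (double e : errors) {
            std::pair<double, double> sr = twoSum(E, e);
            E = sr.first;
            if (sr.second != 0.0) next.push_back(sr.second);
        }
        bool exactE = next.empty();                 // E is the exact error sum
        std::pair<double, double> sr = twoSum(S, E);
        S = sr.first;
        if (exactE) return S;                       // S = fl(exact total)
        if (sr.second != 0.0) next.push_back(sr.second);
        errors.swap(next);

        int maxExp = std::ilogb(errors[0]);         // step 3
        for (double e : errors) maxExp = std::max(maxExp, std::ilogb(e));
        double bound = std::ldexp(static_cast<double>(errors.size()), maxExp + 1);
        if (S + bound == S && S - bound == S) break;  // step 4: no influence
    }
    return S;
}
```

For the inputs {1e100, 1.0, -1e100}, a plain left-to-right sum returns 0, while this sketch recovers 1.0 from the stored error.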
Dot product
$a \cdot b = \sum_{i=0}^{n-1} a_i \cdot b_i$
1 Compute the products a_i · b_i using
twoProduct
twoProduct_fast
2 Accumulate the products using
long accumulator
errorFreeSum
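Step 1 can be sketched as a function that expands the n products into 2n floating point terms whose exact sum equals the dot product. Those terms can then be handed to whichever accumulation scheme is chosen in step 2. The function name `dotProductTerms` is hypothetical, the FMA-based product is used, and overflow and underflow are not handled.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Expands a · b into 2n terms (p_0, r_0, p_1, r_1, ...) with
// a[i] * b[i] = p_i + r_i exactly, so the terms sum exactly to a · b.
inline std::vector<double> dotProductTerms(const std::vector<double>& a,
                                           const std::vector<double>& b) {
    std::size_t n = a.size() < b.size() ? a.size() : b.size();
    std::vector<double> terms;
    terms.reserve(2 * n);
    for (std::size_t i = 0; i < n; ++i) {
        double p = a[i] * b[i];               // rounded product
        double r = std::fma(a[i], b[i], -p);  // its rounding error
        terms.push_back(p);
        terms.push_back(r);
    }
    return terms;
}
```

With x = 1 + 2^-30, a = (x, -1) and b = (x, 1 + 2^-29), the rounded products cancel to 0, while the terms contain the error 2^-60, which is the exact dot product.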
Test System
Intel Core i5-2500K 3.30 GHz
NVIDIA GeForce GTX 760
Compute capability 3.0
1152 CUDA cores
Windows 7 Professional (64 Bit)
Inc Input const Threads: Float
512 Threads * 512 Blocks = 262144 Threads
Inc Input const Threads: Double
512 Threads * 512 Blocks = 262144 Threads
Inc Input inc Threads: Float
Threads / Input = 8
Inc Input inc Threads: Double
Threads / Input = 8
Inc pos Input inc Threads: Float
Inc pos Input inc Threads: Double
Inc small pos Input inc Threads: Float
Inc small pos Input inc Threads: Double
Dot Product: Float
Dot Product: Double
Dot Product Fast: Float
Dot Product Fast: Double
Questions?
Marco Nehmeier [email protected]