Accuracy
Robert Strzodka
Jan 15, 2016
Overview
• Precision and Accuracy
• Hardware Resources
• Mixed Precision Iterative Refinement
Roundoff and Cancellation
Roundoff examples for the float s23e8 format
additive roundoff:        a = 1 + 0.00000004 =fl 1
multiplicative roundoff:  b = 1.0002 * 0.9998 =fl 1
cancellation (c = a, b):  (c - 1) * 10^8 =fl 0
Cancellation promotes the small error 0.00000004 to the absolute error 4 and a relative error 1.
Order of operations can be crucial:
1 + 0.00000004 - 1 =fl 0
1 - 1 + 0.00000004 =fl 0.00000004
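The roundoff examples above can be replayed in a short sketch. Python has no native float type in the s23e8 format, so the helper `f32` (a name chosen for this sketch) emulates it by rounding every intermediate result to float32 through the standard `struct` module. That emulation is an assumption of the sketch, not something the slides prescribe.

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest float32 (s23e8) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add32(x, y):
    """Float32 addition: round the operands, add, round the result."""
    return f32(f32(x) + f32(y))

def mul32(x, y):
    """Float32 multiplication: round the operands, multiply, round the result."""
    return f32(f32(x) * f32(y))

a = add32(1.0, 0.00000004)        # additive roundoff: the small addend is absorbed
b = mul32(1.0002, 0.9998)         # multiplicative roundoff
c = mul32(add32(a, -1.0), 1e8)    # cancellation: the exact value would be 4

left_to_right = add32(add32(1.0, 0.00000004), -1.0)   # (1 + 4e-8) - 1
reordered = add32(add32(1.0, -1.0), 0.00000004)       # (1 - 1) + 4e-8

print(a, b, c, left_to_right, reordered)
```

Only the reordered sum keeps the small addend, because it is never added to a number that is large enough to absorb it.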
More Precision
Evaluating (with powers as multiplications) [S.M. Rump, 1988]
$f(x, y) = (333.75 - x^2)\,y^6 + x^2\,(11 x^2 y^2 - 121 y^4 - 2) + 5.5\,y^8 + x/(2y)$
for $x_0 = 77617$, $y_0 = 33096$ gives
float        s23e8    1.1726
double       s52e11   1.17260394005318
long double  s63e15   1.172603940053178631
This is all wrong, even the sign is wrong!!
The correct result is
-0.82739605994682136814116509547981629…
Lesson learnt: Computational Precision ≠ Accuracy of Result
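As a check on Rump's example, the following sketch evaluates the same expression once in double precision and once in exact rational arithmetic with Python's `fractions.Fraction`. The function name `rump` and the choice to pass the constants 333.75 and 5.5 as arguments are assumptions of this sketch. The double result depends on the platform and the evaluation order, so no particular value is claimed for it here.

```python
from fractions import Fraction

def rump(x, y, c1, c2):
    """Rump's expression with powers written as multiplications.

    c1 stands for 333.75 and c2 for 5.5; they are parameters so that the
    same code runs with doubles or with exact rationals.
    """
    x2 = x * x
    y2 = y * y
    y4 = y2 * y2
    y6 = y4 * y2
    y8 = y4 * y4
    return (c1 - x2) * y6 + x2 * (11 * x2 * y2 - 121 * y4 - 2) + c2 * y8 + x / (2 * y)

# Double precision evaluation (the printed value is not reliable).
double_result = rump(77617.0, 33096.0, 333.75, 5.5)

# Exact rational evaluation: 333.75 = 1335/4 and 5.5 = 11/2.
exact_result = rump(Fraction(77617), Fraction(33096), Fraction(1335, 4), Fraction(11, 2))

print("double:", double_result)
print("exact: ", exact_result, "=", float(exact_result))
```

In exact arithmetic the polynomial part collapses to -2, because $x_0^2 = 5.5\,y_0^2 + 1$, so the result is $x_0/(2 y_0) - 2$. The leading digits 1.1726… reported above are those of $x_0/(2 y_0)$ alone.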
Precision and Accuracy
• There is no monotonic relation between the computational precision and the accuracy of the final result.
• Increasing precision can decrease accuracy!
• Even when one can prove positive effects of increased precision, it is very difficult to quantify them.
• We often simply rely on the experience that increased precision helps in common cases.
• But for common cases we need high precision only in very few places to obtain the desired accuracy.
Resources for Signed Integer Operations
Operation                      Area       Latency
min(r,0), max(r,0)             b+1        2
add(r1,r2), sub(r1,r2)         2b         b
add(r1,r2,r3) → add(r4,r5)     2b         1
mult(r1,r2), sqr(r)            b(b-2)     b ld(b)
sqrt(r)                        2c(c-5)    c(c+3)
b: bitlength of argument, c: bitlength of result
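To get a feeling for the table, the sketch below plugs two bit lengths into the area formulas for the adder and the multiplier. The widths 24 and 53 are an assumption made for illustration (float and double mantissas including the hidden bit), and the function names are hypothetical.

```python
def adder_area(b):
    """Area of add(r1, r2) for b-bit arguments, as listed in the table: 2b."""
    return 2 * b

def multiplier_area(b):
    """Area of mult(r1, r2) for b-bit arguments, as listed in the table: b(b-2)."""
    return b * (b - 2)

# Assumed example widths: 24 bits (float mantissa), 53 bits (double mantissa).
for bits in (24, 53):
    print(bits, "bits: adder", adder_area(bits), "multiplier", multiplier_area(bits))

adder_ratio = adder_area(53) / adder_area(24)
multiplier_ratio = multiplier_area(53) / multiplier_area(24)
print("adder ratio", adder_ratio, "multiplier ratio", multiplier_ratio)
```

Under these formulas the adder grows linearly with the bit length while the multiplier grows roughly quadratically, so a little more than doubling the width costs about five times the multiplier area.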
Arithmetic Area Consumption on an FPGA
[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG). x-axis: bits of mantissa (20 to 50); y-axis: number of slices (0 to 1600); curves: Adder, Multiplier, CG kernel normalized (1/30).]
Higher Precision Emulation
• Given an m × m bit unsigned integer multiplier we want to build an n × n multiplier with an n = k·m bit result
$\sum_{i=1}^{k} 2^{im} a_i \cdot \sum_{j=1}^{k} 2^{jm} b_j = \sum_{i,j=1,\; i+j \ge k+1}^{k} 2^{(i+j)m} a_i b_j + \sum_{i,j=1,\; i+j < k+1}^{k} 2^{(i+j)m} a_i b_j$
• The evaluation of the first sum requires k(k+1)/2 multiplications, the evaluation of the second depends on the rounding mode
• For floating point numbers additional operations for the correct handling of the exponent are necessary
• A float-float emulation is less complex than an exact double emulation, but typically still requires 10 times more operations
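The construction of a wide multiplier from narrow ones can be sketched with Python integers. The code below splits two k·m bit operands into k limbs of m bits, forms all m × m bit limb products, and separates them into a high-order group (1-based limb indices with i + j ≥ k + 1) and a low-order group. The function names, the limb weights 2^((i-1)m), and the exact split condition are assumptions of this sketch.

```python
def split_limbs(value, m, k):
    """Split a non-negative integer below 2**(k*m) into k limbs of m bits,
    least significant limb first."""
    mask = (1 << m) - 1
    return [(value >> (i * m)) & mask for i in range(k)]

def multiply_limbs(a, b, m, k):
    """Multiply two k*m-bit integers using only m x m bit limb products.

    Returns (high, low, high_count): 'high' sums the products with
    i + j >= k + 1 (1-based limb indices), 'low' sums the remaining ones,
    and 'high_count' is the number of multiplications in the high group.
    """
    a_limbs = split_limbs(a, m, k)
    b_limbs = split_limbs(b, m, k)
    high = low = 0
    high_count = 0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            term = (a_limbs[i - 1] * b_limbs[j - 1]) << ((i + j - 2) * m)
            if i + j >= k + 1:
                high += term
                high_count += 1
            else:
                low += term
    return high, low, high_count

m, k = 16, 4                      # build a 64 x 64 multiplier from 16 x 16 ones
a = 0xFEDCBA9876543210
b = 0x0123456789ABCDEF
high, low, high_count = multiply_limbs(a, b, m, k)
print(high + low == a * b, high_count)
```

Both groups together reproduce the full product. If only a k·m bit result is kept, the low-order group matters only for rounding, which is why its cost depends on the rounding mode.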
Generalized Iterative Refinement
For a function $F: \mathbb{R}^{N+M} \to \mathbb{R}^N$ and parameters $Q_0$ find $X \in \mathbb{R}^N$ with
$F(X; Q_0) = 0$
As we cannot solve $F$ exactly, starting with some $X_0$ we iterate
$Q_{k+1} := H(X_k, Q_k, Q_0), \quad F(\tilde{X}_{k+1}; Q_{k+1}) = 0, \quad X_{k+1} := X_k + \tilde{X}_{k+1},$
i.e. we repeatedly solve $F$ with different parameters $Q_k$.
This is typically used to solve a linear system of equations $AX = B$:
$B_{k+1} := B - A X_k, \quad A \tilde{X}_{k+1} - B_{k+1} = 0, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$
Now we distinguish two cases:
1) We can find an approximate solution to $F$ directly
2) The approximate solution to $F$ itself requires an iterative process
Direct Scheme Example: LU Solver
We solve $AX = B$ with iterative refinement
$B_{k+1} := B - A X_k, \quad A \tilde{X}_{k+1} = B_{k+1}, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$
For higher performance $A \tilde{X}_{k+1} = B_{k+1}$ is solved in single precision via
$L Y_{k+1} = P B_{k+1}, \quad U \tilde{X}_{k+1} = Y_{k+1},$
where the decomposition $PA = LU$ is computed once in single precision.
The main point is that we increase the accuracy by computing the residual and the accumulation in double precision, while spending far less time in single precision with the decomposition $PA = LU$, which has order $O(N^3)$.
An upper bound on the iterations is given by $\mathrm{ceil}(t_d / (t_s - K))$, where $K$, $t_d$, $t_s$ are the $\log_{10}$ of the matrix condition, the double and the single precision.
[J. Dongarra et al., 2006]
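The direct scheme can be sketched in plain Python. Single precision is emulated by rounding every intermediate result to float32 with the helper `f32`, while the residual and the accumulation use Python's native doubles. All function names, the 3 × 3 test matrix, and the iteration count are assumptions of this sketch. `iteration_bound` evaluates the bound ceil(t_d / (t_s - K)) as read from the slide.

```python
import math
import struct

def f32(x):
    """Round a double to the nearest float32 value (emulated single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_decompose_f32(A):
    """PA = LU with partial pivoting; every operation is rounded to float32.
    L (unit diagonal) and U share one matrix; perm stores the row order."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(LU[r][c]))
        LU[c], LU[p] = LU[p], LU[c]
        perm[c], perm[p] = perm[p], perm[c]
        for r in range(c + 1, n):
            LU[r][c] = f32(LU[r][c] / LU[c][c])
            for j in range(c + 1, n):
                LU[r][j] = f32(LU[r][j] - f32(LU[r][c] * LU[c][j]))
    return LU, perm

def lu_solve_f32(LU, perm, b):
    """Solve L Y = P b, then U X = Y, in emulated float32."""
    n = len(LU)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] = f32(x[i] - f32(LU[i][j] * x[j]))
        x[i] = f32(x[i] / LU[i][i])
    return x

def refine(A, B, iterations=5):
    """Iterative refinement: residual and accumulation in double,
    correction solves in float32 with one shared decomposition."""
    n = len(A)
    LU, perm = lu_decompose_f32(A)          # computed once
    X = [0.0] * n
    for _ in range(iterations):
        residual = [B[i] - sum(A[i][j] * X[j] for j in range(n)) for i in range(n)]
        correction = lu_solve_f32(LU, perm, residual)
        X = [X[i] + correction[i] for i in range(n)]
    return X

def iteration_bound(t_d, t_s, K):
    """Upper bound on the number of refinement iterations."""
    return math.ceil(t_d / (t_s - K))

A = [[4.1, 1.3, 2.2], [1.7, 3.9, 0.5], [2.3, 0.6, 5.2]]
x_true = [0.1, -0.2, 0.3]
B = [sum(a * x for a, x in zip(row, x_true)) for row in A]
X = refine(A, B, iterations=5)
error = max(abs(x - t) for x, t in zip(X, x_true))
print(X, error, iteration_bound(16, 7, 3))
```

Although every solve runs in emulated single precision, the final error is at the level of double precision, because each pass only has to correct the current double precision residual.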
Iterative Refinement: First and Second Step
We solve $AX = B$ with iterative refinement
$B_{k+1} := B - A X_k, \quad A \tilde{X}_{k+1} = B_{k+1}, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$
[Figure: diagram with the labels $X_0$, $X_1$, $X_2$, $\tilde{X}_1$, $\tilde{X}_2$ and $X^*$. Legend: high precision path through fine nodes; low precision path through coarse nodes.]
Iterative Scheme Example: Stationary Solver
We solve $AU = B$ with iterative refinement
$B_{l+1} := B - A U_l, \quad A \tilde{U}_{l+1} = B_{l+1}, \quad U_{l+1} := U_l + \tilde{U}_{l+1}$
For large, sparse $A$ we often use an iterative solver for $A \tilde{U}_{l+1} = B_{l+1}$, e.g.
$U_{k+1} = M U_k + C,$
where the stationary matrix $M$ and the vector $C$ depend on $A$, $B_{l+1}$.
To clarify the interaction of these two iterative schemes let us consider a general convergent iterative scheme
$U_0$: initial guess, $U_{k+1} := U_k + G(U_k)$
We obtain a convergent series: $U_0, U_1, U_2, \ldots, U_k \to U^*$ for $k \to \infty$
[D. Göddeke et al., 2005]
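A minimal sketch of the two nested iterations follows, assuming a Jacobi sweep as the stationary inner solver (the slide names no particular solver) and emulating single precision by rounding to float32 with `f32`. The function names, the diagonally dominant test matrix, and the sweep and iteration counts are assumptions of this sketch.

```python
import struct

def f32(x):
    """Round a double to the nearest float32 value (emulated single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def jacobi_f32(A, b, sweeps=30):
    """Inner stationary solver U_{k+1} = M U_k + C for A U = b in float32.
    Jacobi: M = -D^{-1}(A - D), C = D^{-1} b, with D the diagonal of A."""
    n = len(A)
    A32 = [[f32(v) for v in row] for row in A]
    b32 = [f32(v) for v in b]
    u = [0.0] * n
    for _ in range(sweeps):
        new_u = []
        for i in range(n):
            s = b32[i]
            for j in range(n):
                if j != i:
                    s = f32(s - f32(A32[i][j] * u[j]))
            new_u.append(f32(s / A32[i][i]))
        u = new_u
    return u

def refine_with_jacobi(A, B, outer_iterations=5):
    """Outer refinement loop in double precision around the float32 inner solver."""
    n = len(A)
    U = [0.0] * n
    for _ in range(outer_iterations):
        # B_{l+1} := B - A U_l, computed in double precision
        residual = [B[i] - sum(A[i][j] * U[j] for j in range(n)) for i in range(n)]
        # A U~_{l+1} = B_{l+1}, solved approximately in float32
        correction = jacobi_f32(A, residual)
        # U_{l+1} := U_l + U~_{l+1}, accumulated in double precision
        U = [U[i] + correction[i] for i in range(n)]
    return U

A = [[10.1, 1.3, 2.2], [1.7, 8.9, 0.5], [2.3, 0.6, 9.2]]
u_true = [0.1, -0.2, 0.3]
B = [sum(a * u for a, u in zip(row, u_true)) for row in A]
U = refine_with_jacobi(A, B)
error = max(abs(u - t) for u, t in zip(U, u_true))
print(U, error)
```

The inner index k counts Jacobi sweeps and the outer index l counts refinement steps, which mirrors the two index letters used in the formulas.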
Mixed Precision for Convergent Schemes
$U_{k+1} := U_k + G(U_k)$
Explicit solution representation
$U_{k_{\max}} := U_0 + \sum_{k=0}^{k_{\max}-1} G(U_k)$
Problem: Summation of addends with decreasing size.
Solution: Split the sum into a sum of partial sums (outer and inner loop).
$U_{k_{\max}} := U_0 + \sum_{l=0}^{L-1} \sum_{k=k^{l}_{\max}}^{k^{l+1}_{\max}-1} G(U_k)$
$U_0,\ U_{k^{1}_{\max}},\ U_{k^{2}_{\max}},\ \ldots,\ U_{k_{\max}}$
Precision reduction: Reduce the number range for G, e.g. G affine in U:
$(U_{k+1} - U_{k^{l}_{\max}}) := (U_k - U_{k^{l}_{\max}}) + G(U_k - U_{k^{l}_{\max}}) + G(U_{k^{l}_{\max}}) - G(0)$
Iterative refinement: this formulation is equivalent to the refinement step in the outer iteration scheme for $A \tilde{U}_{l+1} = B_{l+1}$
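The summation problem and the split into partial sums can be seen in a toy sketch. It adds 10000 small addends to the value 1, once with a single float32 accumulator and once with float32 inner sums that are accumulated in double in the outer loop. The constant addend 1e-8, the inner loop length, and the function names are assumptions chosen for illustration. In a real convergent scheme the addends $G(U_k)$ shrink over time and get absorbed in the same way.

```python
import struct

def f32(x):
    """Round a double to the nearest float32 value (emulated single precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def naive_sum_f32(u0, addends):
    """Accumulate all addends directly onto u0 in float32."""
    u = f32(u0)
    for g in addends:
        u = f32(u + f32(g))
    return u

def split_sum(u0, addends, inner_length):
    """Sum of partial sums: the inner loop runs in float32 starting from zero,
    the outer loop accumulates the partial sums in double precision."""
    u = u0
    for start in range(0, len(addends), inner_length):
        partial = 0.0
        for g in addends[start:start + inner_length]:
            partial = f32(partial + f32(g))   # low precision inner loop
        u += partial                          # high precision outer loop
    return u

addends = [1e-8] * 10000                      # exact total: 1e-4
naive = naive_sum_f32(1.0, addends)
split = split_sum(1.0, addends, inner_length=1000)
print(naive, split)
```

The naive float32 accumulator never moves away from 1, because each addend is below half a unit in the last place of 1. The inner partial sums start from zero, so the same float32 arithmetic resolves them well.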
Iterative Convergence: First Partial Sum
Convergent iterative scheme
$U_{k+1} := U_k + G(U_k), \quad U_k \to U^*, \quad G(U_k) \to 0 \quad (k \to \infty)$
[Figure: diagram with the iterates $U_0, \ldots, U_5$, the increments $G(U_0), \ldots, G(U_4)$, the nodes $V_1, \ldots, V_6$ and the limit $U^*$. Legend: high precision path through fine nodes; low precision path through coarse nodes.]
Iterative Convergence: Second Partial Sum
Convergent iterative scheme
$U_{k+1} := U_k + G(U_k), \quad U_k \to U^*, \quad G(U_k) \to 0 \quad (k \to \infty)$
[Figure: diagram with the iterates $U_5, \ldots, U_8$, the increments $G(U_5), \ldots, G(U_7)$, the nodes $V_6, \ldots, V_{11}$ and the limit $U^*$. Legend: high precision path through fine nodes; low precision path through coarse nodes.]
CPU Results: LU Solver
[Figure: chart courtesy of Jack Dongarra]
GPU Results: Conjugate Gradient and Multigrid
[Figure: Normalized CPU (double) and CPU-GPU (mixed precision) execution time. x-axis: domain size in grid nodes (1000 to 1e+07); y-axis: seconds per grid node (0 to 0.0003); curves: 1x1 CG: Opteron 250, 1x1 CG: GF7800GTX, 2x2 MG__MG: Opteron 250, 2x2 MG__MG: GF7800GTX.]
Conclusions
• The relation between computational precision and final accuracy is not monotonic
• Iterative refinement makes it possible to reduce the precision of many operations without a loss of final accuracy
• In multiplier-dominated designs the resulting savings grow quadratically (area or time)
• Area and time improvements benefit various architectures: FPGA, CPU, GPU, Cell, etc.