Accuracy. Robert Strzodka. Overview: Precision and Accuracy; Hardware Resources; Mixed Precision Iterative Refinement.

Accuracy

Robert Strzodka

2

Overview

• Precision and Accuracy

• Hardware Resources

• Mixed Precision Iterative Refinement

3

Roundoff and Cancellation

Roundoff examples for the float s23e8 format

additive roundoff: a = 1 + 0.00000004 =fl 1
multiplicative roundoff: b = 1.0002 * 0.9998 =fl 1
cancellation (c = a, b): (c - 1) * 10^8 =fl 0

Cancellation promotes the small error 0.00000004 to the absolute error 4 and a relative error 1.

Order of operations can be crucial:
1 + 0.00000004 - 1 =fl 0
1 - 1 + 0.00000004 =fl 0.00000004
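The roundoff examples above can be reproduced with a small sketch. It assumes that rounding a Python double to IEEE single precision via the standard `struct` module is an adequate stand-in for the float s23e8 format; the helper name `fl` is made up for this illustration.

```python
import struct

def fl(x):
    """Round a Python float (double) to the nearest float s23e8 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

tiny = fl(0.00000004)

a = fl(1.0 + tiny)                  # additive roundoff
b = fl(fl(1.0002) * fl(0.9998))     # multiplicative roundoff
c = fl(fl(a - 1.0) * 1e8)           # cancellation: the exact answer would be 4

left_to_right = fl(fl(1.0 + tiny) - 1.0)  # the small addend is rounded away first
reordered = fl(fl(1.0 - 1.0) + tiny)      # the small addend survives

print(a, b, c, left_to_right, reordered)
```

Every intermediate result is rounded with `fl`, so each line mimics one float operation rather than a double computation rounded once at the end.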

4

More Precision

Evaluating (with powers as multiplications) [S.M. Rump, 1988]

$f(x,y) = (333.75 - x^2)\,y^6 + x^2\,(11x^2y^2 - 121y^4 - 2) + 5.5\,y^8 + x/(2y)$

for $x_0 = 77617$, $y_0 = 33096$ gives

float s23e8 1.1726
double s52e11 1.17260394005318
long double s63e15 1.172603940053178631

This is all wrong, even the sign is wrong!! The correct result is

-0.82739605994682136814116509547981629…

Lesson learnt: Computational Precision ≠ Accuracy of Result
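A minimal sketch of Rump's example, assuming the function reads f(x,y) = (333.75 - x^2) y^6 + x^2 (11 x^2 y^2 - 121 y^4 - 2) + 5.5 y^8 + x/(2y). It evaluates the same expression once in IEEE double and once in exact rational arithmetic with Python's `fractions` module. The double result on a current machine need not match the figures quoted on the slide, which depend on the arithmetic of the original hardware; the point is only that it is far from the exact value. The function name `rump` and its coefficient arguments are chosen for this illustration.

```python
from fractions import Fraction

def rump(x, y, c1, c2):
    """Evaluate Rump's expression with powers written as multiplications."""
    x2 = x * x
    y2 = y * y
    y4 = y2 * y2
    y6 = y4 * y2
    y8 = y4 * y4
    return (c1 - x2) * y6 + x2 * (11 * x2 * y2 - 121 * y4 - 2) + c2 * y8 + x / (2 * y)

# IEEE double evaluation
double_result = rump(77617.0, 33096.0, 333.75, 5.5)

# Exact rational evaluation of the same expression
exact_result = rump(Fraction(77617), Fraction(33096),
                    Fraction("333.75"), Fraction("5.5"))

print("double:", double_result)
print("exact: ", exact_result, "=", float(exact_result))
```

The exact value is a rational number with a small numerator and denominator, because the large polynomial terms cancel to exactly -2 and only x/(2y) - 2 remains.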

5

Precision and Accuracy

• There is no monotonic relation between the computational precision and the accuracy of the final result.

• Increasing precision can decrease accuracy!

• Even when one can prove positive effects of increased precision, it is very difficult to quantify them.

• We often simply rely on the experience that increased precision helps in common cases.

• But for common cases we need high precision only in very few places to obtain the desired accuracy.

6

Overview




7

Resources for Signed Integer Operations

Operation                        Area      Latency
min(r,0), max(r,0)               b+1       2
add(r1,r2), sub(r1,r2)           2b        b
add(r1,r2,r3) → add(r4,r5)       2b        1
mult(r1,r2), sqr(r)              b(b-2)    b ld(b)
sqrt(r)                          2c(c-5)   c(c+3)

b: bitlength of argument, c: bitlength of result
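To get a feeling for these formulas, the sketch below plugs in two operand widths. It assumes the table is read as area 2b for an adder and b(b-2) for a multiplier, and it picks 24 and 53 bits (the significand widths of IEEE single and double, including the hidden bit) purely as illustrative inputs; the function names are hypothetical.

```python
def adder_area(b):
    """Adder area for b-bit arguments, read from the table as 2b."""
    return 2 * b

def multiplier_area(b):
    """Multiplier area for b-bit arguments, read from the table as b(b-2)."""
    return b * (b - 2)

narrow, wide = 24, 53  # assumed widths: single and double significands

adder_ratio = adder_area(wide) / adder_area(narrow)
multiplier_ratio = multiplier_area(wide) / multiplier_area(narrow)

print("adder area grows by a factor of", round(adder_ratio, 2))
print("multiplier area grows by a factor of", round(multiplier_ratio, 2))
```

The adder grows linearly with the bit width while the multiplier grows roughly quadratically, which is why reducing precision pays off most in multiplier dominated designs.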

8

Arithmetic Area Consumption on a FPGA

[Figure: Area of s??e11 float kernels on the xc2v500/xc2v8000 (CG). x-axis: bits of mantissa (20 to 50); y-axis: number of slices (0 to 1600); curves: Adder, Multiplier, CG kernel normalized (1/30).]

9

Higher Precision Emulation

• Given an m x m bit unsigned integer multiplier we want to build an n x n multiplier with an n = k*m bit result

$\sum_{i=1}^{k} 2^{im} a_i \cdot \sum_{j=1}^{k} 2^{jm} b_j = \sum_{i,j=1,\; i+j>k}^{k} 2^{(i+j)m} a_i b_j + \sum_{i,j=1,\; i+j\le k}^{k} 2^{(i+j)m} a_i b_j$

• The evaluation of the first sum requires k(k+1)/2 multiplications, the evaluation of the second depends on the rounding mode

• For floating point numbers additional operations for the correct handling of the exponent are necessary

• A float-float emulation is less complex than an exact double emulation, but typically still requires 10 times more operations
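The limb-wise multiplication can be sketched with Python integers. This is an illustration under assumptions, not the slide's exact scheme: limbs are indexed from 0 with the least significant limb first, and the "first sum" is taken to be the high-order partial products (0-based i + j >= k - 1), which is the reading that yields k(k+1)/2 multiplications. The names `split_limbs` and `limb_product` are made up here.

```python
def split_limbs(value, m, k):
    """Split an n = k*m bit unsigned integer into k limbs of m bits, least significant first."""
    mask = (1 << m) - 1
    return [(value >> (i * m)) & mask for i in range(k)]

def limb_product(a, b, m, k, high_only):
    """Multiply via m x m bit partial products; optionally keep only the high-order ones."""
    a_limbs = split_limbs(a, m, k)
    b_limbs = split_limbs(b, m, k)
    total, count = 0, 0
    for i in range(k):
        for j in range(k):
            # 0-based i + j >= k - 1 corresponds to i + j > k with 1-based indices
            if high_only and i + j < k - 1:
                continue
            total += (a_limbs[i] * b_limbs[j]) << ((i + j) * m)
            count += 1
    return total, count

m, k = 8, 4                      # build a 32 x 32 bit product from 8 x 8 bit products
a, b = 0xDEADBEEF, 0x12345678

full, full_count = limb_product(a, b, m, k, high_only=False)
high, high_count = limb_product(a, b, m, k, high_only=True)

print(full == a * b, full_count, high_count)
print("dropped low-order part:", full - high)
```

With all k*k partial products the result is the exact product. Keeping only the high-order ones saves k(k-1)/2 multiplications at the cost of a bounded error in the low-order bits, which is where the rounding mode comes in.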

10

Overview




11

Generalized Iterative Refinement

For a function $F: \mathbb{R}^N \times \mathbb{R}^M \to \mathbb{R}^N$ and parameters $Q_0$ find $X \in \mathbb{R}^N$ with

$F(X; Q_0) = 0$

As we cannot solve $F$ exactly, starting with some $X_0 \in \mathbb{R}^N$ we iterate

$Q_{k+1} := H(X_k, Q_0, \ldots, Q_k), \quad F(\tilde{X}_{k+1}; Q_{k+1}) = 0, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$

i.e. we repeatedly solve $F$ with different parameters $Q_k$.

This is typically used to solve a linear system of equations $AX = B$

$B_{k+1} := B - AX_k, \quad B_{k+1} - A\tilde{X}_{k+1} = 0, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$

Now we distinguish two cases:

1) We can find an approximate solution to $F$ directly

2) The approximate solution to $F$ itself requires an iterative process

12

Direct Scheme Example: LU Solver

We solve $AX = B$ with iterative refinement

$B_{k+1} := B - AX_k, \quad A\tilde{X}_{k+1} = B_{k+1}, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$

For higher performance $A\tilde{X}_{k+1} = B_{k+1}$ is solved in single precision via

$LY_{k+1} = PB_{k+1}, \quad U\tilde{X}_{k+1} = Y_{k+1}$

where the decomposition $PA = LU$ is computed once in single precision.

The main point is that we increase the accuracy by computing the residual and the accumulation in double precision, while spending far less time in single precision with the decomposition $PA = LU$, which has order $O(N^3)$.

An upper bound on the iterations is given by $\mathrm{ceil}(t_d/(t_s - K))$, where $K$, $t_d$, $t_s$ are the $\log_{10}$ of the matrix condition, double and single precision.

[J. Dongarra et al., 2006]
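The LU-based refinement can be sketched in pure Python. This is an illustration under assumptions rather than the implementation behind the cited results: single precision is emulated by rounding every intermediate value with the `struct` module, the matrix is a small made-up dense example, and the helper names (`fl32`, `lu_factor32`, `lu_solve32`, `refine`) are hypothetical.

```python
import struct

def fl32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor32(A):
    """PA = LU with partial pivoting, every operation rounded to single."""
    n = len(A)
    LU = [[fl32(v) for v in row] for row in A]
    perm = list(range(n))
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(LU[r][c]))
        LU[c], LU[p] = LU[p], LU[c]
        perm[c], perm[p] = perm[p], perm[c]
        for r in range(c + 1, n):
            LU[r][c] = fl32(LU[r][c] / LU[c][c])
            for j in range(c + 1, n):
                LU[r][j] = fl32(LU[r][j] - fl32(LU[r][c] * LU[c][j]))
    return LU, perm

def lu_solve32(LU, perm, b):
    """Solve L Y = P b, then U X = Y, in emulated single precision."""
    n = len(LU)
    y = [fl32(b[p]) for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] = fl32(y[i] - fl32(LU[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = fl32(y[i] - fl32(LU[i][j] * y[j]))
        y[i] = fl32(y[i] / LU[i][i])
    return y

def refine(A, b, steps):
    """Iterative refinement: residual and accumulation in double, solves in single."""
    n = len(A)
    LU, perm = lu_factor32(A)            # computed once
    x = [0.0] * n
    for _ in range(steps):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve32(LU, perm, r)
        x = [x[i] + d[i] for i in range(n)]
    return x

A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
x_true = [0.1, 0.2, 0.3]
B = [sum(A[i][j] * x_true[j] for j in range(3)) for i in range(3)]

def max_error(x):
    return max(abs(xi - ti) for xi, ti in zip(x, x_true))

err_single = max_error(refine(A, B, steps=1))   # plain single-precision solve
err_mixed = max_error(refine(A, B, steps=5))    # with refinement steps
print(err_single, err_mixed)
```

One step from a zero start is just the single-precision solve. The additional steps reuse the same factorization, so their cost is only a residual in double plus two triangular solves in single.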

13

Iterative Refinement: First and Second Step

We solve $AX = B$ with iterative refinement

$B_{k+1} := B - AX_k, \quad A\tilde{X}_{k+1} = B_{k+1}, \quad X_{k+1} := X_k + \tilde{X}_{k+1}$

[Figure: refinement path from $X_0$ towards $X^*$ via $\tilde{X}_1$, $X_1$, $\tilde{X}_2$, $X_2$. High precision path through fine nodes; low precision path through coarse nodes.]

14

Iterative Scheme Example: Stationary Solver

We solve $AU = B$ with iterative refinement

$B_{l+1} := B - AU_l, \quad A\tilde{U}_{l+1} = B_{l+1}, \quad U_{l+1} := U_l + \tilde{U}_{l+1}$

For large, sparse $A$ we often use an iterative solver for $A\tilde{U}_{l+1} = B_{l+1}$, e.g.

$U_{k+1} = MU_k + C$

where the matrix $M$ and the vector $C$ depend on $A$, $B_{l+1}$ and stay fixed during the iteration (stationary).

To clarify the interaction of these two iterative schemes let us consider a general convergent iterative scheme

$U_0$: initial guess, $U_{k+1} := U_k + G(U_k)$

We obtain a convergent series: $U_0, U_1, U_2, \ldots, \; U_k \to U^*$ for $k \to \infty$

[D. Göddeke et al., 2005]
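The nesting of the two iterations can be sketched as follows. The choice of Jacobi as the stationary inner solver is an assumption made for this illustration (the slide only says "e.g."), as are the small test matrix, the emulation of single precision via `struct`, and the helper names.

```python
import struct

def fl32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def jacobi32(A, b, sweeps):
    """Inner stationary solver U_{k+1} = M U_k + C (Jacobi), in emulated single precision."""
    n = len(A)
    u = [0.0] * n
    for _ in range(sweeps):
        new = []
        for i in range(n):
            s = fl32(b[i])
            for j in range(n):
                if j != i:
                    s = fl32(s - fl32(fl32(A[i][j]) * u[j]))
            new.append(fl32(s / fl32(A[i][i])))
        u = new
    return u

def refine(A, b, outer, sweeps):
    """Outer refinement loop: residual and update in double, inner solve in single."""
    n = len(A)
    u = [0.0] * n
    for _ in range(outer):
        r = [b[i] - sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
        d = jacobi32(A, r, sweeps)
        u = [u[i] + d[i] for i in range(n)]
    return u

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
u_true = [0.1, 0.2, 0.3]
B = [sum(A[i][j] * u_true[j] for j in range(3)) for i in range(3)]

def max_error(u):
    return max(abs(ui - ti) for ui, ti in zip(u, u_true))

err_single = max_error(jacobi32(A, B, sweeps=200))       # single precision only
err_mixed = max_error(refine(A, B, outer=8, sweeps=10))  # 8 outer steps of 10 sweeps
print(err_single, err_mixed)
```

The single-precision run stalls at the resolution of the float format no matter how many sweeps it performs, while the mixed scheme reaches double accuracy with fewer sweeps in total.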

15

Mixed Precision for Convergent Schemes

$U_{k+1} := U_k + G(U_k)$

Explicit solution representation

$U_{k_{\max}} := U_0 + \sum_{k=0}^{k_{\max}-1} G(U_k)$

Problem: Summation of addends with decreasing size.

Solution: Split the sum into a sum of partial sums (outer and inner loop).

$U_{k_{\max}} := U_0 + \sum_{l=0}^{L-1} \; \sum_{k=k^{l}_{\max}}^{k^{l+1}_{\max}-1} G(U_k)$

$U_0, \; U_{k^{1}_{\max}}, \; U_{k^{2}_{\max}}, \; \ldots, \; U_{k_{\max}}$

Precision reduction: Reduce the number range for G, e.g. G affine in U:

$(U_{k+1} - U_{k^{l}_{\max}}) := (U_k - U_{k^{l}_{\max}}) + G(U_k - U_{k^{l}_{\max}}) + G(U_{k^{l}_{\max}}) - G(0)$

Iterative refinement: this formulation is equivalent to the refinement step in the outer iteration scheme for $A\tilde{U}_{l+1} = B_{l+1}$
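The summation problem and the split into partial sums can be illustrated with scalar numbers. The sketch assumes made-up addends that shrink geometrically and stay below half a unit in the last place of the float value 1.0; single precision is again emulated with the `struct` module, and the block length of 100 is an arbitrary choice.

```python
import math
import struct

def fl32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

u0 = 1.0
addends = [3e-8 * 0.999 ** k for k in range(1000)]  # decreasing, hypothetical G(U_k)

# Direct accumulation in float: each addend is rounded away against the large U.
direct = fl32(u0)
for g in addends:
    direct = fl32(direct + fl32(g))

# Split sum: inner loops accumulate partial sums in float starting from zero,
# the outer loop adds each partial sum to U in double.
block = 100
mixed = u0
for start in range(0, len(addends), block):
    partial = 0.0
    for g in addends[start:start + block]:
        partial = fl32(partial + fl32(g))
    mixed += partial

exact = math.fsum([u0] + addends)
print(direct, mixed, exact)
```

The inner loop works on numbers of similar magnitude, so low precision suffices there. Only the few outer additions, where small partial sums meet the large accumulated value, need high precision.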

16

Iterative Convergence: First Partial Sum

Convergent iterative scheme

$U_{k+1} := U_k + G(U_k), \quad U_k \to U^*, \quad G(U_k) \to 0 \quad (k \to \infty)$

[Figure: iterates $U_0, U_1, \ldots, U_5$ approaching $U^*$ with increments $G(U_0), \ldots, G(U_4)$; nodes $V_1, \ldots, V_6$.]

17

Iterative Convergence: Second Partial Sum

Convergent iterative scheme

$U_{k+1} := U_k + G(U_k), \quad U_k \to U^*, \quad G(U_k) \to 0 \quad (k \to \infty)$

[Figure: iterates $U_5, U_6, U_7, U_8$ approaching $U^*$ with increments $G(U_5), G(U_6), G(U_7)$; nodes $V_6, \ldots, V_{11}$.]

18

CPU Results: LU Solver

[Figure: chart courtesy of Jack Dongarra]

19

GPU Results: Conjugate Gradient and Multigrid

[Figure: Normalized CPU (double) and CPU-GPU (mixed precision) execution time. x-axis: domain size in grid nodes (1000 to 1e+07); y-axis: seconds per grid node (0 to 0.0003); curves: 1x1 CG: Opteron 250; 1x1 CG: GF7800GTX; 2x2 MG__MG: Opteron 250; 2x2 MG__MG: GF7800GTX.]

20

Conclusions

• The relation between computational precision and final accuracy is not monotonic

• Iterative refinement makes it possible to reduce the precision of many operations without a loss of final accuracy

• In multiplier-dominated designs the resulting savings grow quadratically (area or time)

• Area and time improvements benefit various architectures: FPGA, CPU, GPU, Cell, etc.
