Silent error resilience in numerical time-stepping schemes Austin Benson* Institute for Computational and Mathematical Engineering Stanford University Sven Schmit* (ICME) and Rob Schreiber (HP Labs) SIAM PP 2014 * work done while interning at HP Labs February 19, 2014

Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Jul 13, 2015


Austin Benson
Transcript
Page 1: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Silent error resilience in numerical time-stepping schemes

Austin Benson*

Institute for Computational and Mathematical Engineering

Stanford University

Sven Schmit* (ICME) and Rob Schreiber (HP Labs)

SIAM PP 2014

* work done while interning at HP Labs

February 19, 2014

Page 2: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Illustrative example

[Figure: Crank-Nicolson solution plotted over $x \in [0, 1]$ and $i\Delta t \in [0, 2]$; differences $D_i$ versus step $i$ (0 to 200, log scale from $10^{-7}$ to $10^{-2}$) for the Richardson / Crank-Nicolson and forward / backward Euler pairs.]

$u_t = \frac{1}{100} u_{xx} + 0.1\,(\sin(2\pi t) + \cos(2\pi x))$

$t \in [0, 2]$, $x \in [0, 1]$

$u(x, 0) = x(x - 1)$

$\Delta x = 1/160$, $\Delta t = 1/100$

Page 3: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Illustrative example: what's at fault?

[Figure: Crank-Nicolson solution over $x$ and $i\Delta t$; differences $D_i$ versus step $i$ (log scale) for the Richardson / Crank-Nicolson and forward / backward Euler pairs.]

- At step 120, a single entry in the right-hand side (RHS) of the Crank-Nicolson and backward Euler linear solves was multiplied by 0.995

Page 4: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Main idea

[Figure: RK 4/5 differences, $D_i$ versus iteration $i$ (0 to 150, log scale from $10^{-10}$ to $10^{-2}$).]

- At each time step, the base method B generates $B_1, B_2, \ldots$
- The auxiliary method A "checks" with $A_1, A_2, \ldots$
- $D_i = \|B_i - A_i\|$ abnormal → possible error

Page 7: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

What are these things?

[Figure: RK 4/5 differences, $D_i$ versus iteration $i$ (log scale).]

- Base method B: higher-order scheme (Runge-Kutta 5)
- Auxiliary method A "checks": lower-order scheme (Runge-Kutta 4)
- A needs to be cheap: embedded pairs [Fehlberg, 1969], [Dormand and Prince, 1980]

Page 10: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

RK 1/2 A/B scheme

ODE: $u' = f(t, u)$.

$k_1^B = f(t_n, u_n^B)$

$u_{n+1}^B = u_n^B + h\,f(t_n + h/2,\; u_n^B + h k_1^B / 2)$

Re-use data!

$u_{n+1}^A = u_n^B + h k_1^B$

$D_{n+1} = \|u_{n+1}^A - u_{n+1}^B\|$
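The pair above can be sketched in a few lines of Python. This is a minimal illustration for a scalar ODE, not the authors' implementation; the function names `midpoint_euler_step` and `integrate` and the test problem u' = -u are assumptions made for the example.

```python
import math

def midpoint_euler_step(f, t, u, h):
    """One step of the RK 1/2 pair for a scalar ODE u' = f(t, u).

    Returns (u_base, u_aux, d): the midpoint (order 2) update, the forward
    Euler (order 1) update built from the same k1, and their difference.
    """
    k1 = f(t, u)                                   # evaluated once, shared by both
    u_base = u + h * f(t + h / 2, u + h * k1 / 2)  # base method B
    u_aux = u + h * k1                             # auxiliary method A, no new f call
    return u_base, u_aux, abs(u_aux - u_base)

def integrate(f, t0, u0, h, steps):
    """Advance with the base method and record D_n at every step."""
    t, u, diffs = t0, u0, []
    for _ in range(steps):
        u, _, d = midpoint_euler_step(f, t, u, h)
        diffs.append(d)
        t += h
    return u, diffs

# Hypothetical test problem: u' = -u, u(0) = 1, integrated to t = 1.
u_end, diffs = integrate(lambda t, u: -u, 0.0, 1.0, 0.01, 100)
print(u_end, math.exp(-1.0), diffs[0])
```

The auxiliary update costs one multiply-add per step because `k1` is already available, which is the "re-use data" point of the slide.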

Page 11: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Forward / Backward Euler A/B scheme

Want to solve: $u_t = k u_{xx}$ (1D)

$A\,U_{n+1}^B = U_n^B$

Re-use data!

$U_{n+1}^A = B\,U_n^B$

$D_{n+1} = \|U_{n+1}^B - U_{n+1}^A\|$
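A minimal sketch of this pair, under assumptions the slide does not state: a uniform grid with zero Dirichlet boundaries, the backward Euler matrix taken as I - rL and the forward Euler matrix as I + rL, where L is the standard second-difference stencil and r = kΔt/Δx². The helper names `thomas_solve` and `fe_be_step` are hypothetical.

```python
import math

def thomas_solve(lower, diag, upper, rhs):
    """Solve a constant-coefficient tridiagonal system (Thomas algorithm)."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - lower * c[i - 1]
        c[i] = upper / denom
        d[i] = (rhs[i] - lower * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def fe_be_step(u, r):
    """One step on interior values u with zero boundaries; r = k*dt/dx**2.

    Returns (u_be, u_fe, d): the backward Euler (base) update, the forward
    Euler (auxiliary) update from the same input, and their max-norm difference.
    """
    n = len(u)
    u_be = thomas_solve(-r, 1.0 + 2.0 * r, -r, u)   # implicit solve
    u_fe = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        u_fe.append(u[i] + r * (left - 2.0 * u[i] + right))  # explicit product
    d = max(abs(a - b) for a, b in zip(u_fe, u_be))
    return u_be, u_fe, d

# Hypothetical usage: 9 interior points on [0, 1], initial profile sin(pi x).
u0 = [math.sin(math.pi * (i + 1) * 0.1) for i in range(9)]
u_be, u_fe, d = fe_be_step(u0, 0.4)
print(d)
```

The forward Euler check needs only a stencil application to data the implicit solve already holds, so it adds no linear solve.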

Page 12: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Lots of these schemes

- Backward / Forward Euler, Richardson / Crank-Nicolson
- Runge-Kutta 2/3, 4/5
- Adams-Bashforth linear multistep method 2/3, 4/5
- Explicit check on implicit scheme
- Extrapolation
- Key idea: auxiliary method A re-uses data and communication from base method B

Page 14: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Detecting errors

[Figure: Crank-Nicolson solution over $x$ and $i\Delta t$; differences $D_i$ versus step $i$ (log scale) for the Richardson / Crank-Nicolson and forward / backward Euler pairs.]

- Exercise in step detection

Page 15: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Detecting errors

[Figure: Crank-Nicolson solution over $x$ and $i\Delta t$; differences $D_i$ versus step $i$ (log scale) for the Richardson / Crank-Nicolson and forward / backward Euler pairs.]

$D_{n+1} = \|A_{n+1} - B_{n+1}\|_\infty$

$J_{n+1} = \frac{D_{n+1} - D_n}{D_n}$, relative jump

$V_{n+1} = \frac{\mathrm{Var}(D_{n-p+1}, \ldots, D_{n+1})}{\mathrm{Var}(D_{n-p}, \ldots, D_n)}$, variance change

- $p = 10$ is usually good
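The two statistics can be computed from a running list of differences. The sketch below is illustrative only: the function names are hypothetical, and it uses the sample variance from Python's `statistics` module (the ratio is the same for sample or population variance because both windows hold p + 1 values).

```python
from statistics import variance

def relative_jump(diffs):
    """J_{n+1} = (D_{n+1} - D_n) / D_n from the two most recent differences."""
    return (diffs[-1] - diffs[-2]) / diffs[-2]

def variance_change(diffs, p=10):
    """V_{n+1}: variance of D_{n-p+1..n+1} divided by variance of D_{n-p..n}.

    Raises ZeroDivisionError if the older window is constant.
    """
    if len(diffs) < p + 2:
        raise ValueError("need at least p + 2 differences")
    new = diffs[-(p + 1):]       # D_{n-p+1}, ..., D_{n+1}
    old = diffs[-(p + 2):-1]     # D_{n-p},   ..., D_n
    return variance(new) / variance(old)

# Hypothetical history with a sudden jump in the last entry.
history = [1.0, 2.0, 3.0, 10.0]
print(relative_jump(history), variance_change(history, p=2))
```

A smooth run keeps both values near their usual range, while a fault that perturbs a single step shows up as a large jump and a large variance ratio at once.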

Page 16: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Error detection algorithm

input: tolerances $\tau_J$ and $\tau_V$, scaling parameters $\Gamma > 1$, $\gamma < 1$

for n = 1, 2, ... do
    $D_{n+1} := \|A_{n+1} - B_{n+1}\|$
    if $J_{n+1} > \tau_J$ and $V_{n+1} > \tau_V$ then
        FlagError()
        Move back in time
    end
    if $J_{n+1} > \tau_J$ then $\tau_J := \Gamma \tau_J$ else $\tau_J := \gamma \tau_J$
    if $V_{n+1} > \tau_V$ then $\tau_V := \Gamma \tau_V$ else $\tau_V := \gamma \tau_V$
end

- $\Gamma = 1.4$, $\gamma = 0.95$
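The loop above can be sketched in Python as an offline pass over a recorded sequence of differences. Several things here are assumptions rather than part of the slides: the function name `detect_faults`, the initial tolerances passed by the caller, the window length p = 10 taken from the previous slide, and the choice to only record a flagged step instead of rolling the solver back.

```python
from statistics import variance

def detect_faults(diffs, tau_j, tau_v, grow=1.4, shrink=0.95, p=10):
    """Return the indices n where both the relative jump and the variance
    change of diffs exceed their adaptive tolerances."""
    flagged = []
    for n in range(p + 1, len(diffs)):
        jump = (diffs[n] - diffs[n - 1]) / diffs[n - 1]
        new_var = variance(diffs[n - p:n + 1])      # latest p + 1 values
        old_var = variance(diffs[n - p - 1:n])      # previous p + 1 values
        if old_var > 0:
            ratio = new_var / old_var
        else:
            ratio = float("inf") if new_var > 0 else 1.0
        if jump > tau_j and ratio > tau_v:
            flagged.append(n)   # FlagError(); a real solver would step back here
        # Tolerances grow after an exceedance and decay otherwise.
        tau_j = tau_j * grow if jump > tau_j else tau_j * shrink
        tau_v = tau_v * grow if ratio > tau_v else tau_v * shrink
    return flagged

# Hypothetical data: smoothly decaying differences with one spike at step 25.
diffs = [1e-4 * 0.99 ** i for i in range(40)]
diffs[25] *= 3.0
print(detect_faults(diffs, tau_j=0.5, tau_v=5.0))
```

Requiring both tests to fire keeps the flag count low: after the spike the variance ratio stays unusual for several steps, but the relative jump does not.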

Page 17: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Which errors matter?

- $\hat{B}_n$ and $\hat{A}_n$ are the outputs of B and A when a fault is injected
- $B_n$ and $A_n$ are the outputs when no fault is injected

Local truncation error-normalized error:

$L_n = \frac{\|\hat{B}_n - B_n\|}{\|B_n - A_n\|} \approx \frac{\text{difference caused by error}}{\text{local truncation error}}$
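As a small worked illustration, the ratio can be computed from three vectors: the base output with the fault injected, the base output without it, and the fault-free auxiliary output. The function names are hypothetical and the max norm is an assumption, since the slide does not say which norm is used here.

```python
def max_norm(v):
    """Infinity norm of a sequence of numbers."""
    return max(abs(x) for x in v)

def lte_normalized_error(b_fault, b_clean, a_clean):
    """Size of the fault's effect on the base output, measured in units of
    the fault-free base/auxiliary difference (a local truncation error proxy)."""
    caused_by_fault = max_norm([x - y for x, y in zip(b_fault, b_clean)])
    lte_estimate = max_norm([x - y for x, y in zip(b_clean, a_clean)])
    return caused_by_fault / lte_estimate

# Hypothetical two-component example.
print(lte_normalized_error([1.0, 3.0], [1.0, 2.0], [1.0, 1.5]))
```

A value near 1 means the fault perturbs the solution by about as much as the discretization already does; larger values mark faults with more impact.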

Page 18: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Experimental setup

[Figure: Crank-Nicolson solution over $x$ and $i\Delta t$; differences $D_i$ versus step $i$ (log scale) for the Richardson / Crank-Nicolson and forward / backward Euler pairs.]

- Solve the equation and artificially inject an error at one time step
- Do this for many trials with different types of errors
- True positive rate: #(real errors detected) / #(trials)
- False positive rate: #(non-errors "detected") / #(time steps)
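The two rates can be tallied from per-trial records, as in the sketch below. The record layout, the function name `detection_rates`, the `slack` parameter (1 also accepts detection one step late), and counting time steps across all trials in the false positive denominator are assumptions made for illustration.

```python
def detection_rates(trials, steps_per_trial, slack=0):
    """Compute (true positive rate, false positive rate).

    trials: list of (fault_step, flagged_steps) pairs, one per run.
    slack:  number of steps after the fault that still count as a detection.
    """
    hits = 0
    false_flags = 0
    for fault_step, flagged in trials:
        accepted = {fault_step + k for k in range(slack + 1)}
        if any(step in accepted for step in flagged):
            hits += 1
        false_flags += sum(1 for step in flagged if step not in accepted)
    tpr = hits / len(trials)
    fpr = false_flags / (steps_per_trial * len(trials))
    return tpr, fpr

# Hypothetical results from four runs of 200 steps, fault injected at step 120.
trials = [(120, [120]), (120, [121]), (120, [40]), (120, [])]
print(detection_rates(trials, 200))
print(detection_rates(trials, 200, slack=1))
```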

Page 22: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Heat equation

- $u_t = 0.001\,u_{xx} + \frac{1 - \sqrt{1 - 4(t - t^2)}}{2 - 2t}$
- $u(x, 0) = 6\,|x - 1/2| - 3$
- Error: multiply an entry of the RHS in the linear solves by $z \sim N(1, 5 \times 10^{-5})$ at a single time step

[Figure: true positive rate (0 to 1) versus LTE-normalized error (1 to 6), with curves "Detected at step of fault" and "Detected at step or step after fault". Panel FE/BE, $\Delta x = 1/200$, $\Delta t = 1/100$: FPR = 0.000. Panel R/CN, $\Delta x = 1/200$, $\Delta t = 1/100$: FPR = 0.012.]

Page 23: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Heat equation

- $u_t = 0.01\,u_{xx} + q(x, t)$, $q(x, t) = x e^{-t/2}$
- $u(x, 0) = 4x(x - 1)(x - 2)$
- Error: multiply $q(x, t)$ at one discrete $x$ by $z \sim N(1, 0.1)$ at a single time step

[Figure: true positive rate (0 to 1) versus LTE-normalized error (0.5 to 4), with curves "Detected at step of fault" and "Detected at step or step after fault". Panel FE/BE, $\Delta x = 1/100$, $\Delta t = 1/60$: FPR = 0.000. Panel R/CN, $\Delta x = 1/100$, $\Delta t = 1/60$: FPR = 0.000.]

Page 24: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Adams-Bashforth

- $u''(t) - b\,(1 - u(t)^2)\,u'(t) + u(t) = 0$
- $u'(0) = 1$, $u(0) = 0$
- Error: multiply one derivative evaluation by $z \sim N(1, 0.1)$

[Figure: true positive rate (0 to 1) versus LTE-normalized error ($10^0$ to $10^3$, log scale), with curves "Detected at step of fault" and "Detected at step or step after fault". Panel AB23 on Van der Pol with $h = 1/20$, $b = 2$: FPR = 0.037. Panel AB45 on Van der Pol with $h = 1/20$, $b = 2$: FPR = 0.052.]
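To feed the Van der Pol equation to a time stepper, it is usually rewritten as a first-order system with v = u'. The sketch below shows that rewrite together with one possible way to inject the fault described above. The wrapper `faulty`, the choice to scale the whole derivative vector, and reading 0.1 as a standard deviation are assumptions; the slides do not specify them.

```python
import random

def van_der_pol(t, y, b=2.0):
    """First-order form of u'' - b(1 - u^2)u' + u = 0 with y = (u, u')."""
    u, v = y
    return (v, b * (1.0 - u * u) * v - u)

def faulty(f, fault_call, rng, sigma=0.1):
    """Wrap f so that its fault_call-th evaluation (0-based) is multiplied
    by z drawn from a normal distribution with mean 1 and std sigma."""
    calls = 0
    def wrapped(t, y):
        nonlocal calls
        out = f(t, y)
        if calls == fault_call:
            z = rng.gauss(1.0, sigma)
            out = tuple(z * c for c in out)
        calls += 1
        return out
    return wrapped

# Initial condition from the slide: u(0) = 0, u'(0) = 1.
rhs = faulty(van_der_pol, fault_call=1, rng=random.Random(0))
y0 = (0.0, 1.0)
print(rhs(0.0, y0))   # clean evaluation
print(rhs(0.0, y0))   # corrupted evaluation
print(rhs(0.0, y0))   # clean again
```

Because the corruption lives in the wrapper, the same stepper code runs with and without the fault, which makes it straightforward to obtain the fault-free reference outputs.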

Page 25: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Runge-Kutta

- $u''(t) - b\,(1 - u(t)^2)\,u'(t) + u(t) = 0$
- $u'(0) = 1$, $u(0) = 0$
- Error: multiply one derivative evaluation by $z \sim N(1, 0.1)$

[Figure: true positive rate (0 to 1) versus LTE-normalized error ($10^0$ to $10^3$, log scale). Panel RK23 on Van der Pol with $h = 1/10$, $b = 2$: FPR = 0.066. Panel RK45 on Van der Pol with $h = 1/10$, $b = 2$: FPR = 0.098.]

Page 26: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Key ideas

- Take advantage of "paired" solvers to check solutions
- High-impact error → easier to detect
- Simple detection schemes work pretty well

Page 27: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

End

Questions? Samples:

- What is the performance penalty?
- Why does detection occur one step after the fault?

Information:

- Austin Benson: [email protected]
- Pre-print: see http://stanford.edu/~arbenson

Page 28: Silent error detection in numerical time stepping schemes (SIAM PP 2014)

Tardy error detection

[Figure: "Tardy error detection on heat equation": FE/BE difference $D_i$ (about $2.8 \times 10^{-5}$ to $4 \times 10^{-5}$) versus time step $i$ (128 to 140), with the step of the fault marked.]

[Figure: "Component-wise absolute difference BE/FE": $|BE(i) - FE(i)|$ (0 to $4 \times 10^{-5}$) versus vector component $i$ (0 to 100), shown for the step before the fault, the step of the fault, and the step after the fault.]