Combining Algorithm-Based Fault Tolerance and ...

Mar 17, 2022

Page 1: Combining Algorithm-Based Fault Tolerance and ...

Draft

1/25

Introduction / Algorithm-Based Fault Tolerance / Model / Experiments / Conclusions

Combining Algorithm-Based Fault Tolerance

and Checkpointing for Iterative Solvers

Massimiliano Fasi

Advisors: Yves Robert and Bora Uçar

25 June 2014

Massimiliano Fasi Combining ABFT and Checkpointing for Iterative Solvers

1 Introduction
    Linear solvers
    Silent errors

2 Algorithm-Based Fault Tolerance

3 Model

4 Experiments

5 Conclusions

Selective reliability

High energy mode: reliable, energy wasting

Low energy mode: unreliable, energy efficient

[Figure: energy level (low or high) at each of computational steps 1 to 9, with computation and validation phases marked]

The Conjugate Gradient Method

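$Ax = b$

$A \in \mathbb{R}^{n \times n}$, $x, b \in \mathbb{R}^n$

Remarks on line 5

only matrix operation

A is never modified

Require: $A \in \mathbb{R}^{n \times n}$, $b, x_0 \in \mathbb{R}^n$, $\varepsilon \in \mathbb{R}$
Ensure: $x \in \mathbb{R}^n$ : $\|Ax - b\| \le \varepsilon$
1: $r_0 \leftarrow b - A x_0$;
2: $p_0 \leftarrow r_0$;
3: $i \leftarrow 0$;
4: while $\|r_i\| > \varepsilon \, (\|A\| \cdot \|r_0\| + \|b\|)$ do
5:     $q_i \leftarrow A p_i$;
6:     $\alpha_i \leftarrow \|r_i\|^2 / (p_i^{\mathsf{T}} q_i)$;
7:     $x_{i+1} \leftarrow x_i + \alpha_i \, p_i$;
8:     $r_{i+1} \leftarrow r_i - \alpha_i \, q_i$;
9:     $\beta_i \leftarrow \|r_{i+1}\|^2 / \|r_i\|^2$;
10:    $p_{i+1} \leftarrow r_{i+1} + \beta_i \, p_i$;
11:    $i \leftarrow i + 1$;
12: end while
13: return $x_i$;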
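
The listing above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the talk: it assumes a dense, symmetric positive definite A stored as a list of rows, uses the Frobenius norm for ‖A‖ (the slide does not say which norm is meant), and adds a hypothetical `max_iter` guard that the slide does not have.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, x0, eps=1e-12, max_iter=1000):
    """Conjugate gradient for a symmetric positive definite A."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]      # line 1
    p = list(r)                                           # line 2
    # Assumption: Frobenius norm stands in for ||A||.
    norm_A = math.sqrt(sum(a * a for row in A for a in row))
    threshold = eps * (norm_A * math.sqrt(dot(r, r)) + math.sqrt(dot(b, b)))
    rr = dot(r, r)
    iterations = 0                                        # line 3
    # max_iter is an added safety guard, not part of the slide.
    while math.sqrt(rr) > threshold and iterations < max_iter:  # line 4
        q = matvec(A, p)                                  # line 5: only use of A
        alpha = rr / dot(p, q)                            # line 6
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # line 7
        r = [ri - alpha * qi for ri, qi in zip(r, q)]     # line 8
        rr_new = dot(r, r)
        beta = rr_new / rr                                # line 9
        p = [ri + beta * pi for ri, pi in zip(r, p)]      # line 10
        rr = rr_new
        iterations += 1                                   # line 11
    return x                                              # line 13


solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```

Because A appears only in the product on line 5 and is never written, that single product is the natural place to attach a checksum test.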

Fail-stop errors

Easy to detect

Easy to localize and characterize

Expensive to correct

Checkpointing

Well suited for fail-stop errors

Cheaper than restarting from scratch

Trade-off: finding the best checkpointing interval
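
To make the trade-off concrete, the sketch below uses Young's classical first-order approximation for fail-stop errors, which is a well-known textbook result and not the model developed in this talk. The names `checkpoint_cost` and `mtbf` (mean time between failures), and the numbers in the example, are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """First-order optimal amount of work between two checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_waste(interval, checkpoint_cost, mtbf):
    """First-order fraction of time lost to checkpoints and re-execution.

    checkpoint_cost / interval : overhead of writing checkpoints
    interval / (2 * mtbf)      : expected work lost when a failure strikes
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical numbers: 60 s per checkpoint, one failure per day on average.
best = young_interval(60.0, 86400.0)
waste_at_best = expected_waste(best, 60.0, 86400.0)
```

Checkpointing more often than `best` pays too much overhead, and checkpointing less often loses too much work per failure; the waste function is minimized in between.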

Silent errors

Hard to detect

Hard to localize and characterize

Easy to correct (sometimes)

Checkpointing for silent errors

Is not always necessary
    the computation can continue
    small perturbations do not impact the solution
    iterative methods can compensate for some errors

Requires verification
    a validation mechanism has to be devised
    some overhead cannot be avoided
    finding a checkpointing interval becomes even more difficult
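
One possible validation mechanism, shown here only as an illustrative sketch and not as the one proposed in the talk, is to recompute the true residual b − Ax from time to time and compare it with the residual r that the solver maintains by recurrence. The function names and the tolerance `tol` are hypothetical.

```python
import math


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def residual_gap(A, b, x, r):
    """Norm of the difference between the true and the recurrence residual."""
    true_r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    return math.sqrt(sum((t - ri) ** 2 for t, ri in zip(true_r, r)))


def state_is_valid(A, b, x, r, tol=1e-8):
    """Accept the state if the gap is small relative to the norm of b."""
    scale = math.sqrt(sum(bi * bi for bi in b)) or 1.0
    return residual_gap(A, b, x, r) <= tol * scale


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
clean = state_is_valid(A, b, [0.0, 0.0], [1.0, 2.0])      # r matches b - A x
corrupted = state_is_valid(A, b, [0.5, 0.0], [1.0, 2.0])  # x was silently altered
```

The check costs one extra matrix-vector product, which illustrates why some verification overhead cannot be avoided.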

Silent error sources

[Figure: matrix-vector product A × x = y]

Arithmetic operations
    bit flip of the result

Memory read
    in A
        bit flip in one entry
        horizontal shift
        vertical shift (1 row)
    in x
        bit flip in one entry
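
A bit flip in one floating-point entry can be simulated as below. This is a minimal sketch assuming IEEE 754 double precision; the helper name `flip_bit` and the bit numbering (bit 0 is the least significant mantissa bit, bit 63 the sign) are conventions chosen for the example.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its 64-bit representation inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


sign_flip = flip_bit(1.0, 63)      # sign bit
exponent_flip = flip_bit(1.0, 52)  # lowest exponent bit
mantissa_flip = flip_bit(1.0, 0)   # lowest mantissa bit
```

The effect depends heavily on the position: a flip in the low mantissa bits is a tiny perturbation, while a flip in the exponent or sign changes the value drastically.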

No error

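[Figure: checksummed matrix-vector product A × x = y with a sparse 8 × 8 matrix A]

Column sums of A (checksum row): c = (-1, 0, 1, 7, 0, 6, 0, 11)

x = (1, 1, 1, 1, 1, 1, 1, 1)

y = (1, 2, 2, 1, 5, 3, 4, 6)

Checksums: c · x = 24 and the entries of y sum to 24, so the two values agree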
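
The checksum idea behind this slide can be sketched as follows. The 3 × 3 matrix is hypothetical, since the layout of the 8 × 8 matrix on the slide is not recoverable from the transcript; only the principle is the same: the column sums of A, applied to x, must equal the sum of the entries of y = Ax.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column_checksum(A):
    """Row vector of column sums, i.e. (1, ..., 1) times A."""
    return [sum(col) for col in zip(*A)]


# Hypothetical example matrix, not the one shown on the slide.
A = [[2, -1, 0],
     [-1, 2, -1],
     [0, -1, 2]]
x = [1, 1, 1]

c = column_checksum(A)   # computed once, since A never changes
y = matvec(A, x)
input_checksum = dot(c, x)
output_checksum = sum(y)
```

With integer data the two checksums match exactly; with floating-point data the comparison needs a tolerance that accounts for rounding.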

Error in the computation

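[Figure: checksummed matrix-vector product A × x = y, with one corrupted entry in y]

Column sums of A (checksum row): c = (-1, 0, 1, 7, 0, 6, 0, 11)

x = (1, 1, 1, 1, 1, 1, 1, 1)

y = (1, 2, 2, 1, 5, 5, 4, 6), where the sixth entry is 5 instead of 3

Checksums: c · x = 24, while the entries of y now sum to 26 instead of 24, so the error is detected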
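
Using the values on this slide, the detection test reduces to one comparison. The function name `checksum_mismatch` and the `tol` parameter are assumptions for illustration; a zero tolerance is only appropriate for exact integer data like this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def checksum_mismatch(c, x, y, tol=0):
    """True when the sum of y disagrees with the checksum c . x."""
    return abs(dot(c, x) - sum(y)) > tol


c = [-1, 0, 1, 7, 0, 6, 0, 11]        # column sums of A, from the slide
x = [1] * 8
y_correct = [1, 2, 2, 1, 5, 3, 4, 6]  # error-free result
y_faulty = [1, 2, 2, 1, 5, 5, 4, 6]   # sixth entry corrupted (3 became 5)

clean_run_flagged = checksum_mismatch(c, x, y_correct)
faulty_run_flagged = checksum_mismatch(c, x, y_faulty)
```

A single checksum detects this error but does not say which entry of y is wrong.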

Error in x

23 24

1

2

3

4

5

6

7

8

-2 2

2 -4 1

-1 -2

-1 1 -3

-3

-2

-1 0 1 7 0 6 0 11

×

2

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

=

2

2

2

1

5

3

4

4

Massimiliano Fasi Combining ABFT and Checkpointing for Iterative Solvers

Page 90: Combining Algorithm-Based Fault oleranceT and ...

Draft

12/25

IntroductionAlgorithm-Based Fault Tolerance

ModelExperimentsConclusions

Error in x

23 2423

1

2

3

4

5

6

7

8

-2 2

2 -4 1

-1 -2

-1 1 -3

-3

-2

-1 0 1 7 0 6 0 11

×

2

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

=

2

2

2

1

5

3

4

4

Massimiliano Fasi Combining ABFT and Checkpointing for Iterative Solvers

Error in x

[Figure, built up over six frames: the checksum test for y = Ax when the corrupted entry of x meets a zero in the checksum row. The matrix entries are only partly legible in this transcript.]

Checksum row: -1 0 1 7 0 6 0 11

Corrupted input: the second entry of x changes from 1 to 2

Result: y = (1 4 2 0 5 2 4 6) instead of (1 2 2 1 5 3 4 6); the entries still sum to 24

Checksums displayed: 24, 24 and 24. The second entry of the checksum row is 0, so the fault leaves every checksum unchanged.

How to overcome that issue

Random weight vector

Checksum shifting

Matrix splitting

Hierarchical partitioning

(cᵀA) x = cᵀ (Ax)

c = (1 1 1 ... 1), replaced in the second frame by c = (c1 c2 c3 ... cn)
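A minimal Python sketch of why a non-uniform weight vector helps. It rests on two assumptions that the slides do not spell out: the test compares (cᵀA) applied to a trusted copy of x against cᵀy computed from the possibly corrupted x, and the weights are fixed here for reproducibility instead of being drawn at random. The 3 × 3 matrix and all function names are hypothetical.

```python
def matvec(A, x):
    """Dense stand-in for the sparse product A x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def checksum_row(c, A):
    """Weighted checksum row c^T A."""
    return [sum(ci * a for ci, a in zip(c, col)) for col in zip(*A)]


def checksums_agree(c, A, x_trusted, y):
    """True when (c^T A) x_trusted equals c^T y, so no error is flagged."""
    expected = sum(w * xj for w, xj in zip(checksum_row(c, A), x_trusted))
    observed = sum(ci * yi for ci, yi in zip(c, y))
    return expected == observed


# Hypothetical matrix whose first column sums to zero.
A = [[ 2, 1, 0],
     [-1, 0, 3],
     [-1, 2, 1]]
x_trusted = [1, 1, 1]              # assumed reliable copy of x
x_faulty = [2, 1, 1]               # silent error in the first entry of x
y_faulty = matvec(A, x_faulty)

ones = [1, 1, 1]                   # c = (1 1 ... 1)
weights = [3, 1, 4]                # fixed stand-in for a random draw

missed_with_ones = checksums_agree(ones, A, x_trusted, y_faulty)
missed_with_weights = checksums_agree(weights, A, x_trusted, y_faulty)

print(missed_with_ones, missed_with_weights)
```

With c = (1 1 1) the first entry of cᵀA is zero and the fault passes the test; with the weighted c that entry becomes nonzero and the two checksums differ.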

Checksum shifting

[Figure, built up over eight frames: the checksum test for y = Ax with a shifted checksum row. The matrix entries are only partly legible in this transcript.]

Original checksum row: -1 0 1 7 0 6 0 11

Shift added to every entry: 2 2 2 2 2 2 2 2

Shifted checksum row: 1 2 3 9 2 8 2 13 (no zero entries; the entries sum to 40)

Corrupted input: the second entry of x changes from 1 to 2

Result: y = (1 4 2 0 5 2 4 6), whose entries sum to 24

Values displayed: 42, 40, 42 and 18. They are consistent with the following arithmetic: the shifted row applied to the error-free x gives 40 and applied to the corrupted x gives 42; the shift times the sum of the entries of the corrupted x is 2 × 9 = 18, and 24 + 18 = 42.
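The numbers on this slide can be reproduced with a short Python sketch. It assumes one particular reading of the figure, which the transcript does not confirm: the shifted checksum row is applied to a trusted copy of x, and the result is compared with the sum of the entries of y plus the shift times the sum of the entries of the possibly corrupted x. The variable names are invented for the example; the vectors are the ones shown on the slide.

```python
def dot(u, v):
    """Plain dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


checksum_row = [-1, 0, 1, 7, 0, 6, 0, 11]    # c^T A for c = (1, ..., 1)
shift = 2
shifted_row = [w + shift for w in checksum_row]

x_trusted = [1] * 8                    # assumed reliable copy of x
x_faulty = [1, 2, 1, 1, 1, 1, 1, 1]    # second entry corrupted
y_faulty = [1, 4, 2, 0, 5, 2, 4, 6]    # A x_faulty, as shown on the slide

# Plain test: the zero in the checksum row hides the fault.
plain_expected = dot(checksum_row, x_trusted)
plain_observed = sum(y_faulty)

# Shifted test: every entry of the shifted row is nonzero.
shifted_expected = dot(shifted_row, x_trusted)
shift_term = shift * sum(x_faulty)
shifted_observed = sum(y_faulty) + shift_term

print(plain_expected, plain_observed)                    # 24 24
print(shifted_expected, shift_term, shifted_observed)    # 40 18 42
```

Under this reading the plain checksums agree (24 and 24) while the shifted ones differ (40 and 42), so the fault is flagged.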

Summary of ABFT results

                          checksum computation   SpMxV overhead
single error detection    ∼ nnz                  ∼ 4n
k errors detection        ∼ k nnz                ∼ 4kn
single error correction   ∼ 2 nnz                ∼ 8n
k errors correction       ?                      ?

Table: ABFT techniques

Not all that seems so is an error

Theorem

Let $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^{n}$, $c \in \mathbb{R}^{n}$. Then, if all of the sums involved in the matrix operations are performed using some flavour of recursive summation, it holds that

\[ \bigl| \mathrm{fl}\bigl((c^{T} A)\, x\bigr) - \mathrm{fl}\bigl(c^{T} (A x)\bigr) \bigr| \le 2\, \gamma_{2n}\, |c^{T}|\, |A|\, |x| . \]

In terms of norms, this gives

\[ \bigl| \mathrm{fl}\bigl((c^{T} A)\, x\bigr) - \mathrm{fl}\bigl(c^{T} (A x)\bigr) \bigr| \le 2\, \gamma_{2n}\, n\, \|c^{T}\|_{\infty}\, \|A\|_{1}\, \|x\|_{\infty} . \]
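As a rough illustration of how the bound could serve as a detection threshold, the Python sketch below evaluates both sides of the first inequality for a random dense matrix. It is written under assumptions not stated on the slide: γ_m is taken to be m u / (1 − m u), with u the unit roundoff of IEEE double precision, and all helper names are hypothetical.

```python
import random
import sys


def recursive_sum(values):
    """Left-to-right (recursive) summation in floating point."""
    total = 0.0
    for v in values:
        total += v
    return total


def dot(u, v):
    """Dot product accumulated with recursive summation."""
    return recursive_sum(a * b for a, b in zip(u, v))


def gamma(m, u):
    """gamma_m = m u / (1 - m u); assumed definition, not given on the slide."""
    return m * u / (1.0 - m * u)


def checksum_gap_and_bound(c, A, x):
    """Return |fl((c^T A) x) - fl(c^T (A x))| and 2 gamma_2n |c^T| |A| |x|."""
    n = len(x)
    u = sys.float_info.epsilon / 2           # unit roundoff, 2**-53
    cTA = [dot(c, col) for col in zip(*A)]
    lhs = dot(cTA, x)                        # fl((c^T A) x)
    y = [dot(row, x) for row in A]
    rhs = dot(c, y)                          # fl(c^T (A x))
    abs_x = [abs(v) for v in x]
    abs_Ax = [dot([abs(a) for a in row], abs_x) for row in A]
    magnitude = dot([abs(ci) for ci in c], abs_Ax)
    return abs(lhs - rhs), 2 * gamma(2 * n, u) * magnitude


rng = random.Random(0)                       # arbitrary seed
n = 50
A = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
x = [rng.uniform(-1, 1) for _ in range(n)]
c = [1.0] * n

gap, bound = checksum_gap_and_bound(c, A, x)
within_tolerance = gap <= bound
print(gap, bound, within_tolerance)
```

A checksum mismatch smaller than the bound can be attributed to rounding; only a larger one would be treated as a silent error.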

Preliminaries

Why combining

checkpointing (CP) needs a verification mechanism

ABFT's worst case could require restarting from scratch

Why a trade-off

CP interval depends on the probability of uncorrectable errors

per-iteration overhead depends on the kind of ABFT protection

Goal: minimize the expected global execution time

Idea: minimize the expected overhead (ABFT and CP) of a frame

Expected execution time

p = probability that a frame sees only correctable errors

s = checkpoint interval

k = number of correctable errors

General form:

\[ E(T_s) = p\, s\, T_{\mathrm{iter}} + (1 - p)\,\bigl( E(T_{\mathrm{lost}}) + T_{\mathrm{recovery}} + E(T_s) \bigr) \]

With the expected lost time written out:

\[ E(T_s) = p\, s\, T_{\mathrm{iter}} + (1 - p)\,\Bigl( \frac{s + 1}{2}\, T_{\mathrm{iter}} + T_{\mathrm{recovery}} + E(T_s) \Bigr) \]

With k correctable errors:

\[ E\bigl(T_s^{(k)}\bigr) = p_k\, s\, T_{\mathrm{iter}}^{(k)} + (1 - p_k)\,\Bigl( \frac{s + 1}{2}\, T_{\mathrm{iter}}^{(k)} + T_{\mathrm{recovery}} + E\bigl(T_s^{(k)}\bigr) \Bigr) \]

\[ p_k = \sum_{i=0}^{k} q_i^{(k)}\bigl(s\, T_{\mathrm{iter}}^{(k)}\bigr), \qquad q_\ell^{(k)}(T) = \binom{M}{\ell} \bigl(1 - e^{-\lambda T}\bigr)^{\ell}\, e^{-\lambda T (M - \ell)} \]
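Since the expected time appears on both sides of the last recursion, it can be solved explicitly. The short derivation below is a worked step added for illustration; it assumes p_k > 0 and uses only the symbols of the slide.

```latex
\begin{align*}
E\bigl(T_s^{(k)}\bigr)
  &= p_k\, s\, T_{\mathrm{iter}}^{(k)}
   + (1 - p_k)\Bigl(\tfrac{s+1}{2}\, T_{\mathrm{iter}}^{(k)}
   + T_{\mathrm{recovery}} + E\bigl(T_s^{(k)}\bigr)\Bigr) \\
p_k\, E\bigl(T_s^{(k)}\bigr)
  &= p_k\, s\, T_{\mathrm{iter}}^{(k)}
   + (1 - p_k)\Bigl(\tfrac{s+1}{2}\, T_{\mathrm{iter}}^{(k)}
   + T_{\mathrm{recovery}}\Bigr) \\
E\bigl(T_s^{(k)}\bigr)
  &= s\, T_{\mathrm{iter}}^{(k)}
   + \frac{1 - p_k}{p_k}\Bigl(\tfrac{s+1}{2}\, T_{\mathrm{iter}}^{(k)}
   + T_{\mathrm{recovery}}\Bigr)
\end{align*}
```

The first term is the useful work of a frame; the second is the expected cost of the rollbacks caused by uncorrectable errors.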

A probabilistic model

Model

The checkpoint interval that minimizes the expected wasted time is

\[ s = \operatorname*{argmin}_{s \in \mathbb{N}} \frac{E\bigl(T_s^{(k)}\bigr) - s\, T_{\mathrm{iter}}^{(k)} + T_{\mathrm{checkpoint}}}{s\, T_{\mathrm{iter}}^{(k)}} . \]
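A brute-force search for the minimizing interval could look as follows. This is only a sketch under several assumptions: it uses the closed form E = s T_iter + (1 − p_k)/p_k ((s + 1)/2 T_iter + T_recovery), obtained by solving the recursion for the expected time; it reads q_ℓ(T) as the probability that exactly ℓ out of M units are hit within time T at rate λ (an interpretation, since the slides do not define M and λ); it uses the same T_iter for every k; and all parameter values and function names are made up.

```python
import math


def prob_exactly(l, M, lam, T):
    """q_l(T) = C(M, l) (1 - exp(-lam T))^l exp(-lam T (M - l))."""
    hit = 1.0 - math.exp(-lam * T)
    return math.comb(M, l) * hit ** l * math.exp(-lam * T * (M - l))


def prob_at_most(k, M, lam, T):
    """p_k: sum of q_i(T) for i = 0, ..., k."""
    return sum(prob_exactly(i, M, lam, T) for i in range(k + 1))


def expected_frame_time(s, k, T_iter, T_rec, M, lam):
    """Closed form of E(T_s^(k)) obtained by solving the recursion for E."""
    p = prob_at_most(k, M, lam, s * T_iter)
    return s * T_iter + (1.0 - p) / p * ((s + 1) / 2 * T_iter + T_rec)


def relative_overhead(s, k, T_iter, T_rec, T_cp, M, lam):
    """(E(T_s^(k)) - s T_iter + T_checkpoint) / (s T_iter)."""
    E = expected_frame_time(s, k, T_iter, T_rec, M, lam)
    return (E - s * T_iter + T_cp) / (s * T_iter)


def best_interval(k, T_iter, T_rec, T_cp, M, lam, s_max=1000):
    """Exhaustive search over s = 1, ..., s_max (s_max is an arbitrary cap)."""
    return min(
        range(1, s_max + 1),
        key=lambda s: relative_overhead(s, k, T_iter, T_rec, T_cp, M, lam),
    )


# Made-up parameters, for illustration only.
params = dict(T_iter=1.0, T_rec=5.0, T_cp=5.0, M=1000, lam=1e-6)
s0 = best_interval(0, **params)    # detection only: no error can be corrected
s1 = best_interval(1, **params)    # one error per frame can be corrected
print(s0, s1)
```

With these values the search returns a longer interval for k = 1 than for k = 0, since fewer frames have to be rolled back when one error can be corrected.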

Test problems

            n       nnz(A)   κ(A)          Convergence
BCSSTK09    1083    18437    3.10173e+04   linear
P3D         27000   183600   6.45723e+02   quadratic
THERMAL1    82654   574458   4.96250e+05   sublinear

[From similar studies]

Empirical validation

[Figure: six panels in two rows of three. Horizontal axes run from 0 to 100; vertical axes run from 0 to 9, from 2 to 7 and from 4 to 14 in the three columns.]

Figure: Execution time vs checkpoint interval. The expected execution time (continuous line) is compared with the experimentally obtained one (circles), for both CP + ABFT detection (top) and CP + ABFT correction (bottom).

Experimental comparison

[Figure: three panels, each with curves for CG-1D and CG-2D1C. Horizontal axis ticks run from 10^1 to 10^5; vertical axes run from 0.2 to 0.8, from 2.5 to 5.5 and from 4 to 10.]

Figure: Execution time vs reciprocal of the normalized fault rate for both plain checkpointing (CG-1D) and mixed strategy (CG-2D1C).

            min        max
BCSSTK09    -2.23 %    12.78 %
P3D         -0.60 %    26.76 %
THERMAL1    -0.08 %    40.44 %

Table: Relative gain of CG-2D1C with respect to CG-1D.

Summary

silent errors are treacherous

checkpointing needs a verification mechanism

detecting ABFT is cheap and reliable

correcting ABFT can improve checkpointing's performance

a trade-off can be established

the same analysis holds for other iterative linear solvers

Future work

General ABFT improvements

extend error correction capabilities for matrix representations

extension to other matrix operations

develop accurate estimates for floating-point errors

Other applications of the ABFT/checkpointing solution

Preconditioned Conjugate Gradient

ABFT for dense iterative methods