Tests and Tolerances for High-Performance Software-Implemented Fault Detection
Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou
Feb 05, 2016
Objective
Software fault detection in common numerical libraries by checking computed output
The faulty environment here essentially consists of bit flips in the application’s state space
Distinguish between errors and round-offs in computed results
Faults and EDMs
Single Event Upsets (SEUs)
  Radiation-induced errors causing bit flips in memory and cache
  Affect application data and code
  Data errors are more difficult to detect
Error Detecting Middleware (EDM)
  Wraps existing numerical libraries
  Avoids altering the internals of the library
  The check must be more efficient than the original computation
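To make the bit-flip fault model concrete, here is a minimal Python sketch that flips one bit in the IEEE-754 encoding of a double. The helper name `flip_bit` is hypothetical and is not taken from the slides; it only illustrates how much the effect of an upset depends on which bit is hit.

```python
import struct
import sys

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE-754 encoding flipped.

    bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

x = 1.0
low = flip_bit(x, 0)    # lowest mantissa bit: a change of one unit roundoff
sign = flip_bit(x, 63)  # sign bit
high = flip_bit(x, 62)  # top exponent bit

print("unit roundoff u:", sys.float_info.epsilon)
print("mantissa flip  :", low - x)
print("sign flip      :", sign)
print("exponent flip  :", high)
```

A flip in a low mantissa bit is comparable in size to round-off, which is why the tests later in the deck need carefully chosen tolerances.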
Numerical Error Checking - Summary
Consider common numerical matrix computations
Use “post-conditions” to evaluate correctness
  Post-condition: a necessary relation between inputs and computed outputs
Use well-known upper bounds on error propagation within numerical algorithms for matrix computations
Define tests and tolerances to separate errors and round-offs
Develop input-independent tolerances
Definitions: Vector & Matrix norms
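Vector norms:
  $\|v\|_1 = \sum_i |v_i|$
  $\|v\|_\infty = \max_i |v_i|$
  $\|v\|_2 = (\sum_i |v_i|^2)^{1/2}$
Matrix norms:
  $\|A\|_1$ = maximum absolute column sum of A
  $\|A\|_\infty$ = maximum absolute row sum of A
  $\|A\|_2$ = largest singular value of A
  $\|A\|_F = (\sum_{i,j} |a_{ij}|^2)^{1/2}$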
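The norm definitions above can be sketched in plain Python as follows. The function names are illustrative only. The matrix 2-norm is left out because it needs a singular value computation.

```python
import math

def vec_norm_1(v):
    return sum(abs(x) for x in v)

def vec_norm_inf(v):
    return max(abs(x) for x in v)

def vec_norm_2(v):
    return math.sqrt(sum(abs(x) ** 2 for x in v))

def mat_norm_1(A):
    """Maximum absolute column sum (A is a list of rows)."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

def mat_norm_inf(A):
    """Maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def mat_norm_fro(A):
    """Frobenius norm: square root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(x) ** 2 for row in A for x in row))

v = [3.0, -4.0]
A = [[1.0, -2.0], [3.0, 4.0]]
print(vec_norm_1(v), vec_norm_inf(v), vec_norm_2(v))
print(mat_norm_1(A), mat_norm_inf(A), mat_norm_fro(A))
```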
Matrices review
Orthogonal matrix: $A A^T = I \Rightarrow A^{-1} = A^T$
Unitary matrix: $A^{*T} = A^{-1}$
Permutation matrix: reordered rows of $I$
Sub-multiplicative property:
  $\|A v\| \le \|A\| \, \|v\|$
  $\|A B\| \le \|A\| \, \|B\|$
Numerical Functions
Matrix multiplication
QR decomposition: $A = Q R$
  A = input matrix, Q = orthogonal matrix, R = upper triangular matrix
Singular value decomposition: $A = U D V^T$
  A = input matrix, D = diagonal matrix, U and V = orthogonal matrices
Numerical Functions (contd.)
LU decomposition: $A = P L U$
  P = permutation matrix, L = lower triangular matrix, U = upper triangular matrix
System solution: solve for x in $A x = b$, given A and b
Matrix inverse: given A, find B such that $A B = I$
Numerical Functions (contd.)
Fourier transform: given x, find y such that $y = W x$, where W is the matrix of Fourier basis functions, $W_{nk} = e^{-j 2\pi k n / N}$
Inverse Fourier transform: given y, find x such that $x = n^{-1} W^T y$, where W is the $n \times n$ matrix of Fourier basis functions (taking $W^T$ as the conjugate transpose, $n^{-1} W^T = W^{-1}$)
Operations & Post-conditions
Checking the post-condition $A = Q R$ directly is computationally intensive
Instead, multiply both sides by a probe vector w and compare the resulting vectors: $A w$ against $Q R w$
Choice of w
  Elements of w should not vary greatly in magnitude
  w should be nonzero everywhere
  Can be a vector of all ones, except for the FFT
Error Propagation – Matrix multiplication
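Error matrix: $E = P - A B$, where $P = \mathrm{mult}(A, B)$ is the computed product
$\|E\|_\infty \le n \, \|A\|_\infty \|B\|_\infty \, u$
  u = difference between unity and the next larger floating-point number; n = dimension common to A and B
$d = P w - A B w = E w$
$\|d\|_\infty = \|E w\|_\infty \le \|E\|_\infty \|w\|_\infty \le n \, \|A\|_\infty \|B\|_\infty \|w\|_\infty \, u$
Test: $\|d\|_\infty / (\|A\|_\infty \|B\|_\infty \|w\|_\infty) \gtrless \tau u$, with threshold $\tau$
n is ignored: in the average case, round-off errors are independent of dimension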
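The matrix multiplication test can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the probe vector is all ones, the threshold `TAU = 100.0` is a hypothetical value, and the fault is simulated by perturbing one entry of the product.

```python
import sys

U = sys.float_info.epsilon  # unit roundoff u (gap between 1.0 and the next float)
TAU = 100.0                 # hypothetical threshold, chosen only for this example

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm_inf_mat(A):
    return max(sum(abs(x) for x in row) for row in A)

def norm_inf_vec(v):
    return max(abs(x) for x in v)

def matmul_score(A, B, P):
    """||P w - A (B w)||_inf / (||A||_inf ||B||_inf ||w||_inf u), with w = ones."""
    w = [1.0] * len(B[0])
    # Two matrix-vector products cost O(n^2), versus O(n^3) for recomputing A B.
    d = [p - q for p, q in zip(matvec(P, w), matvec(A, matvec(B, w)))]
    scale = norm_inf_mat(A) * norm_inf_mat(B) * norm_inf_vec(w)
    return norm_inf_vec(d) / (scale * U)

A = [[1.1, 2.2], [3.3, 4.4]]
B = [[0.5, -1.7], [2.9, 0.3]]

P = matmul(A, B)
clean_score = matmul_score(A, B, P)

P_bad = [row[:] for row in P]
P_bad[0][1] += 1e-3  # simulated data fault in the result
faulty_score = matmul_score(A, B, P_bad)

print("fault-free:", clean_score, "flagged" if clean_score > TAU else "accepted")
print("faulty    :", faulty_score, "flagged" if faulty_score > TAU else "accepted")
```

The score is expressed in units of u, so comparing it with `TAU` is the same as comparing the normalized difference with $\tau u$.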
Error Propagation
QRD: $\|d\|_F / (\|A\|_F \|w\|_F) \gtrless \tau u$, where $d = Q R w - A w$
SVD: $\|d\| / (\|A\| \, \|w\|) \gtrless \tau u$, where $d = U D V^T w - A w$
LUD: $\|d\| / (\|A\| \, \|w\|) \gtrless \tau u$, where $d = P L U w - A w$
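As an illustration of the QRD test, the sketch below factors a small matrix with modified Gram-Schmidt (standing in for a library QR routine, which the slides do not name) and evaluates $\|Q R w - A w\|_F / (\|A\|_F \|w\|_F)$ in units of u. The threshold `TAU` and the injected perturbation are hypothetical.

```python
import math
import sys

U = sys.float_info.epsilon
TAU = 100.0  # hypothetical threshold for this example

def qr_mgs(A):
    """Modified Gram-Schmidt QR of a full-column-rank matrix given as a list of rows."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(q * x for q, x in zip(q_cols[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, q_cols[k])]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / R[j][j] for x in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return Q, R

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def fro(M):
    return math.sqrt(sum(x * x for row in M for x in row))

def norm2(v):
    return math.sqrt(sum(x * x for x in v))

def qr_score(A, Q, R):
    """||Q (R w) - A w|| / (||A||_F ||w|| u) with w = ones (vector F-norm = 2-norm)."""
    w = [1.0] * len(A[0])
    d = [p - q for p, q in zip(matvec(Q, matvec(R, w)), matvec(A, w))]
    return norm2(d) / (fro(A) * norm2(w) * U)

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
Q, R = qr_mgs(A)
clean_score = qr_score(A, Q, R)

R_bad = [row[:] for row in R]
R_bad[0][1] += 1e-6  # simulated data fault in the factor R
faulty_score = qr_score(A, Q, R_bad)

print("fault-free:", clean_score)
print("faulty    :", faulty_score)
```

The same pattern applies to the SVD and LUD tests by swapping in $U D V^T w$ or $P L U w$ for $Q R w$.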
Error Propagation (contd.)
Solve $A x = b$: $\|d\| / (\|A\| \, \|x\|) \gtrless \tau u$, where $d = A x - b$
Matrix inverse: $\|d\| / (\|A\| \, \|B\| \, \|w\|) \gtrless \tau u$, where $d = B A w - w$
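A minimal sketch of the system-solution test follows. The solver is a plain Gaussian elimination with partial pivoting that stands in for a library routine. The infinity norm is an assumption, since the slide leaves the norm unspecified, and `TAU` is a hypothetical threshold.

```python
import sys

U = sys.float_info.epsilon
TAU = 100.0  # hypothetical threshold for this example

def solve(A, b):
    """Gaussian elimination with partial pivoting (stand-in for a library solver)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def solve_score(A, x, b):
    """||A x - b||_inf / (||A||_inf ||x||_inf u)."""
    d = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
    norm_A = max(sum(abs(a) for a in row) for row in A)
    norm_x = max(abs(xi) for xi in x)
    return max(abs(di) for di in d) / (norm_A * norm_x * U)

A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
b = [1.0, 2.0, 3.0]

x = solve(A, b)
clean_score = solve_score(A, x, b)

x_bad = x[:]
x_bad[1] += 1e-8  # simulated data fault in the solution vector
faulty_score = solve_score(A, x_bad, b)

print("fault-free:", clean_score)
print("faulty    :", faulty_score)
```

No probe vector is needed here because the post-condition $A x = b$ is already a vector relation.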
Error Propagation - FFT
Forward transform: $d = (y - W x)^T w$
  W is the $n \times n$ forward transform matrix containing the Fourier basis functions
  w cannot have a sparse transform
  Error propagation: $\|e\| \le 5 n \log_2 n \, \|x\| \, u$
  Test: $|d| / (n \log_2 n \, \|x\|_2 \|w\|_2) \gtrless \tau u$
Inverse transform: $d = (x - n^{-1} W^T y)^T w$
  Test: $|d| / (\log_2 n \, \|y\|_2 \|w\|_2) \gtrless \tau u$
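The forward-transform test can be sketched as follows. Expanding $d = y^T w - x^T (W^T w)$ shows why the check is cheap: $W^T w$ can be computed once per transform length, after which each check costs two dot products. Assumptions in this sketch: a plain (non-conjugate) transpose in $d$, a textbook radix-2 FFT standing in for the library routine, a hypothetical probe $w_k = 1 + k/n$ whose transform is not sparse, and a hypothetical threshold `TAU`.

```python
import cmath
import math
import sys

U = sys.float_info.epsilon
TAU = 100.0  # hypothetical threshold for this example

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def dft_matrix_times(v):
    """Naive O(n^2) product W v, with W[k][m] = exp(-2 pi j k m / n)."""
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def norm2(v):
    return math.sqrt(sum(abs(t) ** 2 for t in v))

def fft_score(x, y, w, Ww):
    """|y^T w - x^T (W w)| / (n log2(n) ||x||_2 ||w||_2 u); W is symmetric, so W^T w = W w."""
    n = len(x)
    d = sum(a * b for a, b in zip(y, w)) - sum(a * b for a, b in zip(x, Ww))
    return abs(d) / (n * math.log2(n) * norm2(x) * norm2(w) * U)

n = 8
w = [1.0 + k / n for k in range(n)]  # hypothetical probe with a dense transform
Ww = dft_matrix_times(w)             # precomputed once per transform length

x = [complex(k % 3 - 1, 0.5 * k) for k in range(n)]
y = fft(x)
clean_score = fft_score(x, y, w, Ww)

y_bad = list(y)
y_bad[3] += 1e-6  # simulated data fault in one output coefficient
faulty_score = fft_score(x, y_bad, w, Ww)

print("fault-free:", clean_score)
print("faulty    :", faulty_score)
```

An all-ones probe would have a transform with a single nonzero entry, so the check would then depend on only one element of x.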
Comparison Tests
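$\Delta = \mathrm{RHS} - \mathrm{LHS}$ and $\delta = \|\Delta w\|$ ($\Delta$ is never actually computed)
$\tau$ is the threshold; $\sigma_1, \sigma_2, \sigma_3$ are the normalizing scale factors of each test
T0: $\delta / \|w\| \gtrless \tau u$
  Trivial test: un-normalized comparison
T1: $\delta / (\sigma_1 \|w\|) \gtrless \tau u$
  Ideal test: may not always be computable
T2: $\delta / (\sigma_2 \|w\|) \gtrless \tau u$
  Approximate matrix test: based on computed quantities
T3: $\delta / (\|w\| + \sigma_3) \gtrless \tau u$
  Approximate vector test: higher chance of false alarms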
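The difference between the un-normalized test T0 and a norm-normalized test can be illustrated with a small experiment. In this sketch the scale factor is assumed to be $\|A\|_\infty \|B\|_\infty$ for matrix multiplication; that choice, the threshold `TAU`, and the injected relative error $2^{-40}$ are assumptions made for the example. The same relative corruption is applied to a product at two input scales.

```python
import sys

U = sys.float_info.epsilon
TAU = 100.0  # hypothetical threshold for this example

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm_inf_mat(A):
    return max(sum(abs(x) for x in row) for row in A)

def norm_inf_vec(v):
    return max(abs(x) for x in v)

def scores(A, B, P):
    """Return (un-normalized score, norm-normalized score), both in units of u."""
    w = [1.0] * len(B[0])
    d = [p - q for p, q in zip(matvec(P, w), matvec(A, matvec(B, w)))]
    delta = norm_inf_vec(d)
    t0 = delta / (norm_inf_vec(w) * U)
    sigma = norm_inf_mat(A) * norm_inf_mat(B)  # assumed scale factor
    return t0, delta / (sigma * norm_inf_vec(w) * U)

def corrupted_product(A, B):
    P = matmul(A, B)
    P[0][0] *= 1.0 + 2.0 ** -40  # same relative fault at every input scale
    return P

A = [[1.1, 2.2], [3.3, 4.4]]
B = [[0.5, -1.7], [2.9, 0.3]]
A_small = [[a * 2.0 ** -20 for a in row] for row in A]  # same data, different units

t0_big, tn_big = scores(A, B, corrupted_product(A, B))
t0_small, tn_small = scores(A_small, B, corrupted_product(A_small, B))

print("T0-style score :", t0_big, "vs", t0_small)
print("normalized     :", tn_big, "vs", tn_small)
```

With these numbers the un-normalized score falls below `TAU` once the input is scaled down, while the normalized score is unchanged, which is the motivation for input-independent tolerances.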
Experiments
Faults are injected in half of the runs by changing a random bit of the algorithm’s state space
Faults are injected at a random point during execution
The threshold value is chosen based on the error quantity computed under faulty and fault-free conditions
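Choosing the threshold from experimental scores can be sketched as below. The score lists are made-up numbers for illustration, not results from the experiments, and the selection rule (smallest threshold within a false-alarm budget) is one possible policy rather than the one used by the authors.

```python
def roc_points(clean_scores, faulty_scores):
    """For each candidate tau, return (tau, false_alarm_rate, detection_rate).

    A run is flagged when its score exceeds tau.
    """
    points = []
    for tau in sorted(set(clean_scores) | set(faulty_scores)):
        false_alarm = sum(s > tau for s in clean_scores) / len(clean_scores)
        detection = sum(s > tau for s in faulty_scores) / len(faulty_scores)
        points.append((tau, false_alarm, detection))
    return points

def pick_tau(clean_scores, faulty_scores, max_false_alarm=0.0):
    """Smallest candidate tau whose false-alarm rate stays within the budget."""
    for tau, false_alarm, detection in roc_points(clean_scores, faulty_scores):
        if false_alarm <= max_false_alarm:
            return tau, false_alarm, detection
    raise ValueError("no candidate threshold meets the false-alarm budget")

# Hypothetical normalized scores (in units of u) from fault-free and faulty runs.
clean = [0.2, 0.5, 0.9, 1.4, 3.0]
faulty = [0.8, 40.0, 2.5e3, 7.0e6, 1.0e9]

for point in roc_points(clean, faulty):
    print(point)
print("chosen:", pick_tau(clean, faulty))
```

In this made-up data one faulty run scores below the fault-free maximum, so no threshold reaches full detection without false alarms.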
Choosing $\tau$
[Figure: curves labeled T0, T1, T2, and T3]
ROC for FFT
Alternate tests for FFT:
  Parseval’s condition: $(\|x\|_2 - n^{-1/2} \|y\|_2) / \|x\|_2 \gtrless \tau u$
  Choose a vector $w_2$ with real and imaginary parts equal to $\cos(4\pi (k - n/2)/n)$, $k = 0, 1, \ldots, n-1$, and compute the difference as before
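Parseval's condition can be checked with two norm computations, as in this sketch. A naive DFT stands in for the library FFT, the absolute value of the difference is taken (an assumption, since the slide shows a signed quantity), and `TAU` is a hypothetical threshold.

```python
import cmath
import math
import sys

U = sys.float_info.epsilon
TAU = 100.0  # hypothetical threshold for this example

def dft(x):
    """Naive unnormalized DFT, y = W x."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def norm2(v):
    return math.sqrt(sum(abs(t) ** 2 for t in v))

def parseval_score(x, y):
    """| ||x||_2 - n^(-1/2) ||y||_2 | / (||x||_2 u)."""
    n = len(x)
    nx = norm2(x)
    return abs(nx - norm2(y) / math.sqrt(n)) / (nx * U)

x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
y = dft(x)
clean_score = parseval_score(x, y)

y_bad = list(y)
y_bad[0] += 1e-6  # simulated data fault in the zero-frequency coefficient
faulty_score = parseval_score(x, y_bad)

print("fault-free:", clean_score)
print("faulty    :", faulty_score)
```

A corruption that leaves $\|y\|_2$ unchanged, such as a sign flip of one coefficient, would pass this particular test, so it complements rather than replaces the probe-vector test.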
Related work
ABFT (algorithm-based fault tolerance): introduced by Huang & Abraham for matrix operations, 1984
  Error detection is based on the algorithm employed: the matrix is encoded with a checksum matrix
  Vastly extended by others for various numerical operations
Result Checking (RC): introduced by Blum & Wasserman, with a focus on computation errors, 1996
Prata & Silva compared the two and found RC more efficient than ABFT for matrix multiplication and QRD, 1999
Summary
Faults detected based on conditions that numerical output must satisfy
Implemented as wrappers around existing libraries
Run experiments under fault-free and faulty conditions and observe the decision criterion
$\tau_{ub} \gg \tau^*$ => $\tau$ can be set based on an average-case outlook rather than assuming the worst-case scenario
Selecting $\tau$ is a trade-off between fault detection and false alarms
Can be extended to other common computations such as sorting and integration
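The wrapper idea can be sketched generically: call the unmodified routine, evaluate a post-condition score on its inputs and output, and raise if the score exceeds the threshold. Everything here is hypothetical scaffolding for illustration (`with_postcondition`, `FaultDetected`, the square-root example, and the threshold value). It is not the middleware described in the slides.

```python
import math
import sys

U = sys.float_info.epsilon

class FaultDetected(RuntimeError):
    """Raised when a post-condition score exceeds its threshold."""

def with_postcondition(compute, score, tau):
    """Wrap `compute` so every result is checked without touching its internals."""
    def wrapped(*args):
        result = compute(*args)
        s = score(result, *args)
        if s > tau:
            raise FaultDetected(f"score {s:.3g} exceeds tau = {tau}")
        return result
    return wrapped

def sqrt_score(r, x):
    """Post-condition r * r = x for x > 0, normalized by |x| u."""
    return abs(r * r - x) / (abs(x) * U)

def faulty_sqrt(x):
    """Stand-in for a routine whose output was corrupted by a fault."""
    return math.sqrt(x) * (1.0 + 1e-9)

checked_sqrt = with_postcondition(math.sqrt, sqrt_score, tau=100.0)
checked_faulty = with_postcondition(faulty_sqrt, sqrt_score, tau=100.0)

print(checked_sqrt(2.0))
try:
    checked_faulty(2.0)
except FaultDetected as err:
    print("detected:", err)
```

The same wrapper shape would apply to the matrix routines, with `score` replaced by the corresponding probe-vector test.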