Towards Formal Approaches to System Resilience
Vishal Chandra Sharma*, Arvind Haran*, Zvonimir Rakamaric*, Ganesh Gopalakrishnan*§
{vcsharma, haran, zvonimir, ganesh}@cs.utah.edu
School of Computing, University of Utah
* Supported in part by NSF Award CCF 1255776 and SRC contract 2013-TJ-2426.
§ Faculty Associate, SUPER (http://super-scidac.org/)
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
Motivation
• Recent studies show resiliency as a growing area of concern [arg13] [lanl05]
  • MTBF decreasing at a faster rate in exascale computing
  • Dynamic voltage/frequency scaling in low-power computing
• Our goal is to improve application-level resiliency
• Primary focus is to detect transient faults in software
  • Silent data corruption (SDC)
Motivating Example
int x = 2;
int y = 11;
if (x < 3 && y > 10)
    y++;
else
    x++;
printf("x=%d, y=%d", x, y);
Motivating Example
int x = 2;
int y = 11;
if (x < 3 && y > 10)
    y++;
else
    x++;
printf("x=%d, y=%d", x, y);
Program output: x=2, y=12
Motivating Example
int x = 3;
int y = 11;
if (x < 3 && y > 10)
    y++;
else
    x++;
printf("x=%d, y=%d", x, y);
Fault: LSB position of x flipped (x = 2 becomes x = 3)
Motivating Example
int x = 3;
int y = 11;
if (x < 3 && y > 10)
    y++;
else
    x++;
printf("x=%d, y=%d", x, y);
Fault: LSB position of x flipped (x = 2 becomes x = 3)
Program output: x=4, y=11
Result: SDC in the output value of x (the fault-free output is x=2, y=12)
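The two runs of the motivating example can be compared side by side. The sketch below is illustrative only: `flip_bit`, `run_example`, and `shows_sdc` are hypothetical names introduced here, and the bit flip is applied in software to mimic a transient fault rather than being a real hardware error.

```c
#include <stdio.h>

/* Hypothetical helper: flip one bit of a value to mimic a transient fault. */
int flip_bit(int value, unsigned bit) {
    return (int)((unsigned)value ^ (1u << bit));
}

/* The branch from the motivating example; stores the final x and y. */
void run_example(int x, int y, int *out_x, int *out_y) {
    if (x < 3 && y > 10)
        y++;
    else
        x++;
    *out_x = x;
    *out_y = y;
}

/* Runs the example fault-free and with the LSB of x flipped.
   Returns 1 if the two outputs differ, i.e. silent data corruption. */
int shows_sdc(void) {
    int gx, gy, fx, fy;
    run_example(2, 11, &gx, &gy);              /* fault-free run */
    run_example(flip_bit(2, 0), 11, &fx, &fy); /* LSB of x flipped: 2 -> 3 */
    printf("fault-free: x=%d, y=%d\n", gx, gy);
    printf("faulty:     x=%d, y=%d\n", fx, fy);
    return gx != fx || gy != fy;
}
```

Calling `shows_sdc()` from a `main` function prints both outputs. The faulty run terminates normally, which is what makes the corruption silent: only a comparison against a fault-free run reveals it.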
Our Contribution
• An LLVM-level fault injector for evaluation purposes [llvm04]
• A simple case study on sorting algorithms
  • Demonstrates effectiveness of our solution
  • Highlights importance of design space exploration w.r.t. resiliency
• A software-level fault detector based on the idea of predicate abstraction
  • Applying it in resiliency research is a novel direction!
  • Introduced by Ball to define a novel program coverage metric [pct05]
Closely Related Work
• Low-cost software-level detectors
  • iSWAT by Sahoo et al. uses likely program invariants [iswat08]
    • Derives likely invariants by monitoring program properties
    • Hardware-assisted framework to detect false positives
  • Error detector by Sloan et al. [sloan13]
    • Algorithm-based error detector applied to linear solvers
    • Utilizes algorithmic properties of linear solvers to detect and isolate errors
• Software-level fault injectors
  • LLVM-level fault injector developed by de Kruijf et al. [relax10]
    • Not publicly available
    • A recent study done by a user suggests our fault injector has better fine-grained options [schen13]
  • LLFI fault injector by Thomas et al. [thomas13]
    • Developed around the same time as our fault injector; shares many similar features
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
Fault Injector
KULFI: Kontrollable Utah’s LLVM-based Fault Injector
(Kulfi is also an Indian dessert)
KULFI: Fault Injection Logic
[Flowchart. Nodes: Start; For all dynamic instructions; Feasible? (Yes / No branches); Inject fault with user-provided probability; Stop]
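The injection logic in the flowchart can be sketched roughly as follows. This is not KULFI's code: the `DynInstr` record, the `feasible` flag, the 32-bit value width, and the single-bit-flip fault model are assumptions made for illustration, and the slides do not say how feasibility is decided.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for the result of one dynamic instruction. */
typedef struct {
    uint32_t value;    /* value produced by the instruction */
    int      feasible; /* nonzero if a fault may be injected here (assumed criterion) */
} DynInstr;

/* Walks all dynamic instructions. At each feasible one, flips a randomly
   chosen bit with probability p. Returns the number of faults injected. */
size_t inject_faults(DynInstr *trace, size_t n, double p, unsigned seed) {
    size_t injected = 0;
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        if (!trace[i].feasible)
            continue; /* "Feasible?" answered No: skip this instruction */
        /* Uniform draw in [0, 1), compared with the user-provided probability. */
        double r = (double)rand() / ((double)RAND_MAX + 1.0);
        if (r < p) {
            unsigned bit = (unsigned)(rand() % 32);
            trace[i].value ^= (uint32_t)1 << bit; /* single-bit flip */
            injected++;
        }
    }
    return injected;
}
```

Passing the seed explicitly keeps an injection campaign reproducible, which helps when a particular faulty run needs to be replayed.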
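KULFI: Fault Injection Process
[Diagram, reconstructed from its labels]
• Program → Clang → LLVM bitcode
• LLVM bitcode → LLVM with KULFI → dynamic instruction count, fault-injecting LLVM bitcode
• Fault-injecting LLVM bitcode + program input vectors → LLVM → execution outcome
• KULFI: execution outcome is SDC, SegFault, or Benign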
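The three outcome categories named in the diagram could be assigned along the following lines. This is a sketch under assumptions: the slides do not say how KULFI decides the category, so the `crashed` flag and the string comparison against a fault-free ("golden") output are hypothetical, as are the names `Outcome` and `classify_outcome`.

```c
#include <string.h>

/* The three execution outcomes named on the slide. */
typedef enum { OUTCOME_BENIGN, OUTCOME_SDC, OUTCOME_SEGFAULT } Outcome;

/* crashed: nonzero if the fault-injected run terminated abnormally.
   golden:  output of the fault-free run.
   faulty:  output of the fault-injected run. */
Outcome classify_outcome(int crashed, const char *golden, const char *faulty) {
    if (crashed)
        return OUTCOME_SEGFAULT; /* the fault was visible as a crash */
    if (strcmp(golden, faulty) == 0)
        return OUTCOME_BENIGN;   /* the fault was masked; output unchanged */
    return OUTCOME_SDC;          /* normal exit but wrong output */
}
```

For the motivating example, comparing the fault-free output "x=2, y=12" with the faulty output "x=4, y=11" would yield `OUTCOME_SDC`.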
Overview
• Introduction
• Fault Injector
• Case Study
• Fault Detector
• Concluding Remarks
References
[arg13] M. Snir et al., "Addressing Failures in Exascale Computing," No. ANL/MCS-TM-33, Argonne National Laboratory (ANL), 2013.
[lanl05] S. E. Michalak et al., "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer," IEEE Transactions on Device and Materials Reliability, 2005.
[llvm04] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in International Symposium on Code Generation and Optimization (CGO), 2004.
[pct05] T. Ball, "A theory of predicate-complete test coverage and generation," in International Conference on Formal Methods for Components and Objects (FMCO), 2005.
[iswat08] S. K. Sahoo, M.-L. Li, P. Ramachandran, S. V. Adve, V. S. Adve, and Y. Zhou, "Using likely program invariants to detect hardware errors," in IEEE International Conference on Dependable Systems and Networks (DSN), 2008.
[sloan13] J. Sloan, R. Kumar, and G. Bronevetsky, "An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance," in IEEE International Conference on Dependable Systems and Networks (DSN), 2013.
References
[slu99] J. W. Demmel et al., "A supernodal approach to sparse partial pivoting," SIAM Journal on Matrix Analysis and Applications, 1999.
[slu05] X. S. Li, "An overview of SuperLU: Algorithms, implementation, and user interface," ACM Transactions on Mathematical Software (TOMS), 2005.
[slu11] X. S. Li, J. W. Demmel, J. R. Gilbert, L. Grigori, M. Shao, and I. Yamazaki, "SuperLU Users’ Guide," 2011. URL: http://crd.lbl.gov/~xiaoye/SuperLU/superlu_ug.pdf
[sprs11] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software (TOMS), 2011.
[parsec08] C. Bienia, S. Kumar, J. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," ser. PACT, 2008.
[relax10] M. de Kruijf, S. Nomura, and K. Sankaralingam, "Relax: An architectural framework for software recovery of hardware faults," in International Symposium on Computer Architecture (ISCA), 2010.
[thomas13] A. Thomas and K. Pattabiraman, "Error Detector Placement for Soft Computation," in International Conference on Dependable Systems and Networks (DSN), 2013.
[schen13] S. Chen, personal communication, 2013.
Acknowledgements
• Pedro Diniz
• Prabhakar Kudva
• Shuvendu Lahiri
• Karthik Pattabiraman
• Sui Chen
• Anonymous reviewers of PRDC conference who reviewed