Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures
Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP’05
Shimin Chen
LBA Reading Group Presentation
Feb 05, 2016
Motivation
High availability is important
– Critical applications: process control, etc.
– Financial company: an hour of downtime costs $6 million
SW defects account for up to 40% of system failures
– Common: memory-related bugs and concurrency bugs
Bugs still occur in production runs
– Even after SW companies spend enormous effort on testing
This calls for mechanisms for surviving software bugs
Previous Work on Surviving SW Failures
Four categories:
– Rebooting
– Checkpointing and recovery
– Application-specific mechanisms
– Recent proposals:
  Failure-oblivious computing
  Reactive immune system
Previous Work 1: Rebooting
Schemes:
– Whole program restart
– Micro-rebooting of partial system components
– SW rejuvenation (proactively restart processes)
Problem:
– Cannot deal with deterministic bugs
– Restart time
Previous Work 2: General checkpointing and recovery
Schemes:
– Checkpoint, rollback, re-execute
– Or use a backup server
Problems:
– Cannot deal with deterministic bugs
Progressive retry in distributed systems:
– Reorders messages to get around SW bugs, but does not handle bugs on a single system
N-version programming:
– Too expensive
Previous Work 3: Application-Specific Recovery Mechanisms
Multi-process model (MPM)
– Kill a request-handling process and start a new one
Problems:
– Cannot handle deterministic bugs
– What if a shared data structure is corrupted?
Previous Work 4: Recent Non-Conventional Proposals
Failure-oblivious computing
– Manufacture values for out-of-bounds reads
– Discard out-of-bounds writes
Reactive immune system
– Detect failures of function calls
– Forcefully return from the function with a manufactured error return value (e.g. -1 for int, 0 for unsigned int, etc.)
Problem:
– Unsafe for correctness-critical applications (e.g. banking)
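To make the failure-oblivious idea concrete, here is a minimal Python sketch of a bounds-checked buffer that manufactures a value for out-of-bounds reads and discards out-of-bounds writes. The class name `ObliviousBuffer` and the manufactured value 0 are illustrative assumptions, not details taken from the original proposal.

```python
class ObliviousBuffer:
    """Toy buffer that never faults on out-of-bounds accesses (illustrative only)."""

    def __init__(self, size, manufactured=0):
        self._data = [0] * size
        self._manufactured = manufactured  # assumed default returned for invalid reads

    def read(self, index):
        if 0 <= index < len(self._data):
            return self._data[index]
        return self._manufactured  # out-of-bounds read: make up a value

    def write(self, index, value):
        if 0 <= index < len(self._data):
            self._data[index] = value
        # out-of-bounds write: silently discarded


buf = ObliviousBuffer(4)
buf.write(2, 7)
buf.write(10, 99)              # discarded, no exception
in_bounds = buf.read(2)        # the stored value
out_of_bounds = buf.read(10)   # the manufactured value
```

The program keeps running, but it now computes with a made-up value and a lost write, which is why this style of recovery is risky for correctness-critical code.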
New Proposal: Rx
Roll back the program to a recent checkpoint when a bug is detected
Dynamically change the execution environment based on the failure symptoms
Re-execute the buggy code in the new environment
Features:
– Comprehensive: can deal with deterministic bugs
– Safe: does not speculatively “fix” bugs, only changes the environment
– Noninvasive: no changes to app source code
– Efficient
– Informative: helps locate the bugs
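The three steps above (roll back, change the environment, re-execute) can be sketched as a small retry loop. This is a toy model under stated assumptions: the checkpoint is a deep copy of a dictionary, the "bug" is a simulated overflow, and the names `run_with_rx`, `handle_request`, and the `padding` environment key are all hypothetical.

```python
import copy


class SimulatedCrash(Exception):
    """Stands in for a failure caught by a sensor (e.g. a segfault)."""


def handle_request(state, request, env):
    """Toy handler that overflows an 8-byte buffer unless padding is added."""
    buf_size = 8 + env.get("padding", 0)
    if len(request) > buf_size:
        raise SimulatedCrash("buffer overflow")
    state["handled"].append(request)


def run_with_rx(state, request, env_changes):
    """Checkpoint, execute, and on failure roll back and retry in changed environments."""
    checkpoint = copy.deepcopy(state)       # take a checkpoint
    attempts = [{}] + env_changes           # first try the unmodified environment
    for env in attempts:
        try:
            handle_request(state, request, env)
            return env                      # the environment that survived
        except SimulatedCrash:
            state.clear()
            state.update(copy.deepcopy(checkpoint))  # roll back to the checkpoint
    return None                             # no environment change helped


state = {"handled": []}
ok_env = run_with_rx(state, "short", [{"padding": 16}])
fix_env = run_with_rx(state, "a-request-longer-than-8",
                      [{"delay_free": True}, {"padding": 16}])
```

The environment returned by the second call is also the kind of information that could be recorded for offline diagnosis: it hints that the failure is an overflow rather than a dangling pointer.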
Outline
Introduction
Main Idea of Rx
Rx Design & Implementation
Evaluation
Summary
Main Idea
Record the changes for offline diagnosis
Useful Execution Environmental Changes
Must be safe and may avoid bugs
Memory management based
– Buffer overflows, dangling pointers, etc.
Timing based
– Concurrency bugs
User request based
– Dropping unexpected (malicious) user requests
– As a last resort
Rx Components Overview
[Figure: Rx components overview, with components numbered 1 to 5]
Sensors for Detecting SW Failures
OS-raised exceptions:
– Assertion failures, segfault, divide-by-zero, etc.
Fine-grain detection:
– Buffer overflow, accesses to freed memory, etc.
Only OS-raised exceptions are implemented
Checkpoint and Rollback (Flashback)
Memory state: fork-like operation
Files: keep a copy of each accessed file and its file pointer for each checkpoint
Checkpoint management:
– Equal intervals or exponential landmarks
– Bound the age of the oldest checkpoint by the recovery time goal
Multi-threaded process checkpointing
– Send a signal to all threads to make them exit from blocked syscalls with EINTR
– Take checkpoint
– Library wrapper in Rx retries syscalls
– High cost, so cannot be frequent
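The retry step in multi-threaded checkpointing can be illustrated with a small wrapper that re-issues a call when it is interrupted. This is only a sketch: `retry_on_eintr`, `FlakyRead`, and the retry limit are hypothetical names and values, and the interrupted call is simulated rather than a real blocking syscall.

```python
def retry_on_eintr(syscall, *args, max_retries=5):
    """Call `syscall`, retrying while it is interrupted by a signal (EINTR)."""
    for _ in range(max_retries):
        try:
            return syscall(*args)
        except InterruptedError:   # Python maps errno EINTR to InterruptedError
            continue               # the checkpoint has been taken; re-issue the call
    raise InterruptedError("still interrupted after retries")


class FlakyRead:
    """Hypothetical blocking read that is interrupted once by the checkpoint signal."""

    def __init__(self):
        self.calls = 0

    def __call__(self, nbytes):
        self.calls += 1
        if self.calls == 1:
            raise InterruptedError("interrupted by checkpoint signal")
        return b"x" * nbytes


flaky_read = FlakyRead()
data = retry_on_eintr(flaky_read, 4)
```

Python's standard library already retries most interrupted system calls on its own, so the explicit loop here only mirrors the behavior the slide attributes to the Rx library wrapper.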
Environment Wrappers
Memory wrapper (intercepting library calls):
– Delaying free: keep a freed buffer for a threshold amount of (process) time, with FIFO recycling
– Padding buffers: add fixed-size padding to both ends of allocated buffers
– Allocation isolation: put allocated buffers in isolated locations
– Zero-filling
– Do the above during re-execution, for the failed code region only
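A toy model of the memory wrapper helps show why padding and delayed free can mask memory bugs. The sketch below simulates a heap as a `bytearray`; `WrappedHeap`, `PAD`, and `DELAY_SLOTS` are assumed names and values, and the delay is counted in frees rather than process time for simplicity. Allocation isolation is not modeled.

```python
from collections import deque

PAD = 4           # assumed padding size on each side of a buffer
DELAY_SLOTS = 2   # assumed number of freed buffers held back before recycling


class WrappedHeap:
    """Toy bump allocator with padding, zero-filling, and delayed free."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.next_free = 0
        self.delayed = deque()   # FIFO of freed buffers not yet recyclable
        self.recycled = []       # freed buffers that may now be reused

    def malloc(self, size):
        start = self.next_free
        end = start + size + 2 * PAD
        if end > len(self.mem):
            raise MemoryError("toy heap exhausted")
        self.mem[start:end] = bytes(end - start)   # zero-fill buffer and padding
        self.next_free = end
        return start + PAD                          # address of the usable region

    def free(self, addr, size):
        self.delayed.append((addr, size))           # do not recycle immediately
        while len(self.delayed) > DELAY_SLOTS:
            self.recycled.append(self.delayed.popleft())


heap = WrappedHeap(128)
a = heap.malloc(8)
b = heap.malloc(8)
c = heap.malloc(8)
heap.mem[b:b + 8] = b"B" * 8
heap.mem[a:a + 10] = b"A" * 10        # 2-byte overflow lands in a's trailing padding
b_intact = bytes(heap.mem[b:b + 8]) == b"B" * 8

heap.free(a, 8)
heap.free(b, 8)
held_back = list(heap.recycled)       # nothing is recyclable yet
heap.free(c, 8)                       # the oldest freed buffer is now released
```

Because a dangling pointer into `a` would still see intact memory while `a` sits in the delay queue, delayed free gives such bugs a chance to pass harmlessly during re-execution.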
Other Wrappers
Message wrapper (in proxy)
– Randomly shuffle message orders of different connections while keeping the message order of the same connection
– Randomize packet sizes
Process scheduling: change process priority
Signal delivery: randomize hw interrupt delivery time while preserving order
Dropping user requests
– Binary search for bad requests
– Drop at most 10% of requests
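The binary search for a bad request can be sketched as follows. The sketch assumes exactly one bad request among those received since the checkpoint, and that replaying any subset containing it reproduces the failure; `find_bad_request`, `replay_crashes`, and `may_drop` are hypothetical names, and the way the 10% budget is accounted for here is also an assumption.

```python
def find_bad_request(requests, replay_crashes):
    """Return the index of the single request whose replay triggers the failure."""
    lo, hi = 0, len(requests)            # invariant: the bad request is in requests[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if replay_crashes(requests[lo:mid]):
            hi = mid                     # failure reproduced by the left half
        else:
            lo = mid                     # left half is clean, so look in the right half
    return lo


def may_drop(dropped_so_far, total_seen):
    """Allow one more drop only if at most 10% of all requests end up dropped."""
    return (dropped_so_far + 1) * 10 <= total_seen


requests = ["GET /a", "GET /b", "EVIL", "GET /c", "GET /d"]
bad_index = find_bad_request(requests, lambda subset: "EVIL" in subset)
```

Each probe costs one rollback and replay, so the search needs about log2(n) re-executions for n requests instead of n.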
Proxy
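The message wrapper lives in the proxy. As a rough sketch of its reordering rule (shuffle across connections, preserve order within a connection), the function below repeatedly picks a random connection and emits that connection's oldest pending message. The function name and the `(conn_id, payload)` message shape are assumptions made for illustration.

```python
import random
from collections import defaultdict, deque


def shuffle_across_connections(messages, rng):
    """Reorder (conn_id, payload) pairs across connections, keeping per-connection order."""
    queues = defaultdict(deque)
    for conn, payload in messages:
        queues[conn].append(payload)          # one FIFO queue per connection
    out = []
    while queues:
        conn = rng.choice(sorted(queues))     # pick any connection with pending messages
        out.append((conn, queues[conn].popleft()))
        if not queues[conn]:
            del queues[conn]                  # connection drained
    return out


messages = [("A", 1), ("B", 1), ("A", 2), ("B", 2), ("A", 3)]
shuffled = shuffle_across_connections(messages, random.Random(0))
```

Different seeds give different interleavings, which is what lets a re-execution avoid a timing-dependent failure without violating any single connection's ordering.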
Control Unit
Coordinates checkpoint/rollback, environment changes, etc.
Failure vector <S1, S2, …, Sm> per failure symptom (exception type, PC address, call chain, etc.)
– Si is the score for environmental change #i
– If change #i is successful, Si++; if failed, Si--
– Try the changes with scores greater than a certain threshold first
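The failure vector can be sketched as a per-symptom list of scores. The change names, the initial score of 0, the threshold value, and the tie-breaking order below are all assumptions; the slide only specifies the increment and decrement rule and that high-scoring changes are tried first.

```python
from collections import defaultdict

CHANGES = ["delay_free", "pad_buffers", "zero_fill", "shuffle_messages"]  # assumed names
THRESHOLD = 0   # assumed threshold


class ControlUnit:
    """Keeps one score vector <S1, ..., Sm> per failure symptom."""

    def __init__(self):
        self.vectors = defaultdict(lambda: [0] * len(CHANGES))

    def record(self, symptom, change, success):
        """Si++ when change #i let re-execution succeed, Si-- when it did not."""
        i = CHANGES.index(change)
        self.vectors[symptom][i] += 1 if success else -1

    def candidates(self, symptom):
        """Changes scoring above the threshold come first, then the rest by score."""
        scores = self.vectors[symptom]
        order = sorted(range(len(CHANGES)),
                       key=lambda i: (scores[i] <= THRESHOLD, -scores[i], i))
        return [CHANGES[i] for i in order]


cu = ControlUnit()
symptom = ("SIGSEGV", 0x4005D0, ("handle_request", "parse_header"))  # hypothetical symptom
cu.record(symptom, "pad_buffers", success=True)
cu.record(symptom, "delay_free", success=False)
plan = cu.candidates(symptom)
```

Keying the vector on the symptom means a recurring failure is met first with whatever worked last time, which shortens recovery for repeated occurrences of the same bug.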
Setup
A client machine and a server machine
– 2.4GHz x86 CPU, 512KB L2 cache, 1GB DRAM
– 100Mbps Ethernet
Injected bugs
Overall Results
Checkpoint Overhead
Time: with checkpoint interval of 200ms, 5% overhead (MySQL)
Workloads:
• Apache, Squid: 5 threads, GET files with sizes uniformly distributed in [1KB, 512KB]
• CVS: client exports a 30KB file
• MySQL: 5 client threads, transactions on a small table
Summary
Rx: re-executing the buggy program region in a modified execution environment
Not a panacea:
– Semantic bugs, resource leaks
– Latent bugs (long delay from bug to symptom)