Shimin Chen LBA Reading Group Presentation

Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures.

Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP’05


Motivation

High availability is important
– Critical applications: process control, etc.
– Financial company: an hour of downtime costs $6 million

SW defects account for up to 40% of system failures
– Common: memory-related bugs and concurrency bugs

Bugs still occur in production runs
– Even after a SW company spends enormous effort on testing

This calls for mechanisms for surviving software bugs

Previous Work on Surviving SW Failures

Four categories:
– Rebooting
– Checkpointing and recovery
– Application-specific mechanisms
– Recent proposals: failure-oblivious computing, reactive immune system

Previous Work 1: Rebooting

Schemes:
– Whole program restart
– Micro-rebooting of partial system components
– SW rejuvenation (proactively restart processes)

Problems:
– Cannot deal with deterministic bugs
– Restart time

Previous Work 2: General checkpointing and recovery

Schemes:
– Checkpoint, rollback, re-execute
– Or use a backup server

Problems:
– Cannot deal with deterministic bugs

Progressive retry in distributed systems:
– Reorder messages to get around SW bugs, but not bugs on a single system

N-version programming:
– Too expensive

Previous Work 3: Application-Specific Recovery Mechanisms

Multi-process model (MPM)
– Kill a request-handling process and start a new one

Problems:
– Cannot handle deterministic bugs
– What if a shared data structure is corrupted?

Previous Work 4: Recent Non-Conventional Proposals

Failure-oblivious computing
– Manufacture values for out-of-bounds reads
– Discard out-of-bounds writes

Reactive immune system
– Detect failures of function calls
– Forcefully return from the function with a manufactured error return value (e.g. -1 for int, 0 for unsigned int, etc.)

Problem:
– Unsafe for correctness-critical applications (e.g. banking)
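The failure-oblivious idea above can be illustrated with a small Python sketch. The class name `FailureObliviousBuffer` and the manufactured value 0 are assumptions made for illustration; they are not taken from the original system.

```python
class FailureObliviousBuffer:
    """Toy buffer that never fails on an out-of-bounds access (illustrative only)."""

    def __init__(self, size, manufactured=0):
        self._data = bytearray(size)
        self._manufactured = manufactured  # assumed stand-in value for bad reads

    def read(self, index):
        if 0 <= index < len(self._data):
            return self._data[index]
        # Out-of-bounds read: return a manufactured value instead of crashing.
        return self._manufactured

    def write(self, index, value):
        if 0 <= index < len(self._data):
            self._data[index] = value
        # Out-of-bounds write: silently discarded.
```

The program keeps running, but a caller that wrote to index 10 and later reads it back gets the manufactured value rather than its own data, which is why this is risky when results must be correct.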

New Proposal: Rx

Roll back the program to a recent checkpoint when a bug is detected

Dynamically change the execution environment based on the failure symptoms

Re-execute the buggy code in the new environment

Features:
– Comprehensive: can deal with deterministic bugs
– Safe: does not speculatively “fix” bugs, but changes the environment
– Noninvasive: no changes to app source code
– Efficient
– Informative: helps locate the bugs
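The rollback, change-environment, re-execute cycle can be sketched as a small Python loop. This is a minimal sketch, not the Rx implementation: `BugDetected`, `run_with_rx`, and the dictionary-based state and environments are hypothetical names, and an in-memory `deepcopy` stands in for a real process checkpoint.

```python
from copy import deepcopy


class BugDetected(Exception):
    """Stands in for a failure reported by a sensor (hypothetical)."""


def run_with_rx(state, region, environments):
    """Run region(state, env); on failure, roll back and retry in the next environment.

    environments is an ordered list of (name, env) pairs, normal environment first.
    Returns (result, names_of_environments_tried).
    """
    checkpoint = deepcopy(state)  # checkpoint taken before the region runs
    tried = []
    for name, env in environments:
        tried.append(name)
        try:
            return region(state, env), tried
        except BugDetected:
            # Roll back: restore the checkpointed state in place.
            state.clear()
            state.update(deepcopy(checkpoint))
    raise RuntimeError("no environment change avoided the failure")


def buggy_region(state, env):
    """Writes 5 bytes into a 4-byte buffer unless the environment adds padding."""
    state["writes"] = state.get("writes", 0) + 1
    if 5 > 4 + env.get("padding", 0):
        raise BugDetected("buffer overflow")
    return "ok"
```

Calling `run_with_rx({"writes": 0}, buggy_region, [("normal", {}), ("pad", {"padding": 8})])` fails once in the normal environment, rolls back, and succeeds with padding. The list of tried environments is the kind of record that could be kept for offline diagnosis.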

Outline

– Introduction
– Main Idea of Rx
– Rx Design & Implementation
– Evaluation
– Summary

Main Idea

Record the changes for offline diagnosis

Useful Execution Environmental Changes

Must be safe and may avoid bugs

Memory management based
– Buffer overflows, dangling pointers, etc.

Timing based
– Concurrency bugs

User request based
– Dropping unexpected (malicious) user requests
– As a last resort

Rx Components Overview

[Figure: Rx components overview diagram, with parts labeled 1 to 5]

Sensors for Detecting SW Failures

OS-raised exceptions:
– Assertion failures, segfault, divide-by-zero, etc.

Fine-grain detection:
– Buffer overflow, accesses to freed memory, etc.

Only OS-raised exceptions are implemented

Checkpoint and Rollback (Flashback)

Memory state: fork-like operation

Files: keep a copy of each accessed file and the file pointers for a checkpoint

Checkpoint management:
– Equal intervals or exponential landmarks
– Limit the oldest checkpoint by considering the recovery time goal

Multi-threaded process checkpointing
– Send a signal to all threads to make them exit from blocked syscalls with EINTR
– Take checkpoint
– Library wrapper in Rx retries syscalls
– High cost, so cannot be frequent
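The equal-interval policy with a bound on the oldest checkpoint might look like the following Python sketch. It is an assumption-laden illustration: `CheckpointLog`, `interval_ms`, and `max_age_ms` (standing in for the recovery time goal) are hypothetical names, snapshots are plain Python objects, and the exponential-landmark policy is not shown.

```python
from collections import deque


class CheckpointLog:
    """Keeps checkpoints at equal intervals and drops ones that are too old."""

    def __init__(self, interval_ms, max_age_ms):
        self.interval_ms = interval_ms
        self.max_age_ms = max_age_ms  # assumed proxy for the recovery time goal
        self._log = deque()           # (timestamp_ms, snapshot), oldest first

    def maybe_checkpoint(self, now_ms, snapshot):
        # Take a new checkpoint once a full interval has elapsed.
        if not self._log or now_ms - self._log[-1][0] >= self.interval_ms:
            self._log.append((now_ms, snapshot))
        # Drop checkpoints older than the allowed age.
        while self._log and now_ms - self._log[0][0] > self.max_age_ms:
            self._log.popleft()

    def timestamps(self):
        return [ts for ts, _ in self._log]
```

Rolling back further than `max_age_ms` would mean re-executing more work than the recovery time goal allows, so older checkpoints are not worth keeping in this sketch.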

Environment Wrappers

Memory wrapper (intercepts library calls):
– Delaying free: keep a freed buffer for a threshold of (process) time, then recycle in FIFO order
– Padding buffers: add fixed-size padding to both ends of each allocated buffer
– Allocation isolation: place allocated buffers in isolated locations
– Zero-filling
– Apply the above only during re-execution, and only for the failed code region
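A toy model of the memory wrapper shows how padding, delayed free, and zero-filling each mask a class of bug. This is a sketch under assumptions: `MemoryWrapper` and `Buffer` are hypothetical names, buffers are Python bytearrays rather than real heap memory, a `MemoryError` models a crash, the delay is counted in buffers rather than time, and 0xAA stands in for uninitialized garbage.

```python
from collections import deque


class Buffer:
    def __init__(self, size, pad, fill):
        self.pad = pad
        self.data = bytearray([fill] * (size + 2 * pad))  # padding on both ends
        self.live = True

    def _pos(self, index):
        pos = index + self.pad
        if not self.live or not 0 <= pos < len(self.data):
            raise MemoryError("invalid access")  # models a crash
        return pos

    def read(self, index):
        return self.data[self._pos(index)]

    def write(self, index, value):
        self.data[self._pos(index)] = value


class MemoryWrapper:
    def __init__(self, pad=0, zero_fill=False, delay=0):
        self.pad = pad
        self.zero_fill = zero_fill
        self.delay = delay     # number of freed buffers kept before recycling
        self._freed = deque()  # FIFO of freed-but-not-yet-recycled buffers

    def malloc(self, size):
        fill = 0x00 if self.zero_fill else 0xAA
        return Buffer(size, self.pad, fill)

    def free(self, buf):
        self._freed.append(buf)
        while len(self._freed) > self.delay:
            self._freed.popleft().live = False  # recycle oldest first
```

With the default wrapper, writing one element past a 4-byte buffer or reading a freed buffer raises; with `pad=8` the overflow lands in padding, and with `delay=2` the dangling read still sees the old contents.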

Other Wrappers

Message wrapper (in proxy)
– Randomly shuffle message order across connections while keeping the message order within each connection
– Randomize packet sizes

Process scheduling: change the process's priority

Signal delivery: randomize hardware interrupt delivery time while preserving order

Dropping user requests
– Binary search for bad requests
– Drop at most 10% of requests
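Two of these wrappers can be sketched in Python. Both functions are illustrative assumptions rather than the Rx code: `shuffle_across_connections` and `find_bad_request` are hypothetical names, messages are modeled as (connection, payload) pairs, and the search assumes exactly one bad request whose presence alone makes a replay fail.

```python
import random
from collections import defaultdict, deque


def shuffle_across_connections(messages, rng):
    """Randomly interleave messages while preserving order within each connection."""
    queues = defaultdict(deque)
    for conn, payload in messages:
        queues[conn].append(payload)
    # Shuffle which connection owns each delivery slot, then fill slots in order.
    slots = [conn for conn, _ in messages]
    rng.shuffle(slots)
    return [(conn, queues[conn].popleft()) for conn in slots]


def find_bad_request(requests, replay_fails):
    """Binary search for the single request that makes replay_fails(batch) return True."""
    lo, hi = 0, len(requests)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if replay_fails(requests[lo:mid]):
            hi = mid  # the bad request is in the first half
        else:
            lo = mid  # otherwise it must be in the second half
    return requests[lo]
```

For example, `shuffle_across_connections(msgs, random.Random(0))` returns the same messages in a different global order, and `find_bad_request(reqs, lambda batch: "BAD" in batch)` isolates the offending request in a logarithmic number of replays.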

Proxy

Control Unit

Coordinates checkpoint/rollback, environment changes, etc.

Failure vector <S1, S2, …, Sm> per failure symptom (exception type, PC address, call chain, etc.)
– Si is the score for environmental change #i
– If change #i is successful, Si++; if it fails, Si--
– Try the changes with scores greater than a certain threshold first
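The failure-vector bookkeeping can be sketched in Python as follows. `ControlUnit`, the change names, and the default threshold of 0 are hypothetical; ordering the above-threshold changes by descending score is also an assumption, since the slide only says they are tried first.

```python
from collections import defaultdict


class ControlUnit:
    """Tracks one score vector per failure symptom (illustrative sketch)."""

    def __init__(self, changes, threshold=0):
        self.changes = list(changes)
        self.threshold = threshold
        # symptom -> [S1, ..., Sm], all scores start at 0
        self.scores = defaultdict(lambda: [0] * len(self.changes))

    def order(self, symptom):
        """Return the changes to try, those scoring above the threshold first."""
        scores = self.scores[symptom]
        indices = range(len(self.changes))
        preferred = [i for i in indices if scores[i] > self.threshold]
        rest = [i for i in indices if scores[i] <= self.threshold]
        preferred.sort(key=lambda i: -scores[i])  # assumed tie-break: best score first
        return [self.changes[i] for i in preferred + rest]

    def record(self, symptom, change, succeeded):
        """Si++ when change #i avoided the failure, Si-- when it did not."""
        i = self.changes.index(change)
        self.scores[symptom][i] += 1 if succeeded else -1
```

A symptom here could be a tuple such as `("SIGSEGV", 0x4005D0)`, so that a recurring failure at the same location immediately retries the change that worked last time.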

Setup

A client machine and a server machine
– 2.4GHz x86 CPU, 512KB L2 cache, 1GB DRAM
– 100Mbps Ethernet

Injected bugs

Overall Results

Checkpoint Overhead

Time: with checkpoint interval of 200ms, 5% overhead (MySQL)

Workloads:

• apache, squid: 5 threads, GET files with size uniform [1KB, 512KB]

• CVS: client exports a 30KB file

• MySQL: 5 client threads, transactions on a small table

Summary

Rx: re-executing the buggy program region in a modified execution environment

Not a panacea:
– Semantic bugs, resource leaks
– Latent bugs (long delay from bug to symptom)
