Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures. Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP’05 Shimin Chen LBA Reading Group Presentation

Shimin Chen LBA Reading Group Presentation

Feb 05, 2016

Transcript
Page 1: Shimin Chen LBA Reading Group Presentation

Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures.

Qin, Tucek, Sundaresan, Zhou (UIUC). SOSP’05

Shimin Chen
LBA Reading Group Presentation

Page 2: Shimin Chen LBA Reading Group Presentation

Motivation

• High availability is important
  – Critical applications: process control, etc.
  – Financial company: an hour of downtime costs $6 million
• SW defects account for up to 40% of system failures
  – Common: memory-related bugs and concurrency bugs
• Bugs still occur in production runs
  – Even after SW companies spend enormous effort on testing
• This calls for mechanisms for surviving software bugs

Page 3: Shimin Chen LBA Reading Group Presentation

Previous Work on Surviving SW Failures

• Four categories:
  – Rebooting
  – Checkpointing and recovery
  – Application-specific mechanisms
  – Recent proposals:
    – Failure-oblivious computing
    – Reactive immune system

Page 4: Shimin Chen LBA Reading Group Presentation

Previous Work 1: Rebooting

• Schemes:
  – Whole program restart
  – Micro-rebooting of partial system components
  – SW rejuvenation (proactively restart processes)
• Problem:
  – Cannot deal with deterministic bugs
  – Restart time

Page 5: Shimin Chen LBA Reading Group Presentation

Previous Work 2: General checkpointing and recovery

• Schemes:
  – Checkpoint, rollback, re-execute
  – Or use a backup server
• Problems:
  – Cannot deal with deterministic bugs
• Progressive retry in distributed systems:
  – Reorders messages to get around SW bugs, but does not help with bugs on a single system
• N-version programming:
  – Too expensive

Page 6: Shimin Chen LBA Reading Group Presentation

Previous Work 3: Application-Specific Recovery Mechanisms

• Multi-process model (MPM)
  – Kill a request-handling process and start a new one
• Problems:
  – Cannot handle deterministic bugs
  – What if a shared data structure is corrupted?

Page 7: Shimin Chen LBA Reading Group Presentation

Previous Work 4: Recent Non-Conventional Proposals

• Failure-oblivious computing
  – Manufacture values for out-of-bounds reads
  – Discard out-of-bounds writes
• Reactive immune system
  – Detect failures of function calls
  – Forcefully return from the function with a manufactured error return value (e.g. -1 for int, 0 for unsigned int, etc.)
• Problem:
  – Unsafe for correctness-critical applications (e.g. banking)

Page 8: Shimin Chen LBA Reading Group Presentation

New Proposal: Rx

• Roll back the program to a recent checkpoint when a bug is detected
• Dynamically change the execution environment based on the failure symptoms
• Re-execute the buggy code in the new environment
• Features:
  – Comprehensive: can deal with deterministic bugs
  – Safe: does not speculatively “fix” bugs, only changes the environment
  – Noninvasive: no changes to app source code
  – Efficient
  – Informative: helps locate the bugs
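The rollback / change-environment / re-execute cycle can be sketched in a few lines. This is a toy model under stated assumptions, not the Rx implementation: the program state is a plain dict, a checkpoint is a deep copy of it, and the names `Failure`, `run_with_rx`, `buggy_region`, and the environment labels are all hypothetical.

```python
import copy


class Failure(Exception):
    """Stands in for an OS-raised exception such as a segfault (hypothetical)."""


def run_with_rx(state, region, environments):
    """Run region(state, env); on failure, roll back and retry in a changed environment.

    Returns (result, env_that_worked). None means the unmodified environment.
    """
    checkpoint = copy.deepcopy(state)            # take a checkpoint before the region
    for env in [None] + list(environments):
        try:
            return region(state, env), env
        except Failure:
            state.clear()                        # roll back: discard partial effects
            state.update(copy.deepcopy(checkpoint))
    raise Failure("no environment change avoided the failure")


def buggy_region(state, env):
    state["requests_served"] += 1                # side effect that rollback must undo
    if env != "pad_buffers":                     # deterministic bug unless buffers are padded
        raise Failure("buffer overflow")
    return "ok"


state = {"requests_served": 0}
result, env = run_with_rx(state, buggy_region, ["delay_free", "pad_buffers"])
print(result, env, state)
```

Because every failed attempt is rolled back, the counter ends at 1 even though the region ran three times. A plain restart would hit the same deterministic bug again, which is the case the environment change is meant to cover.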

Page 9: Shimin Chen LBA Reading Group Presentation

Outline

• Introduction
• Main Idea of Rx
• Rx Design & Implementation
• Evaluation
• Summary

Page 10: Shimin Chen LBA Reading Group Presentation

Main Idea

Record the changes for offline diagnosis

Page 11: Shimin Chen LBA Reading Group Presentation

Useful Execution Environmental Changes

• Must be safe and may avoid bugs
• Memory management based
  – Buffer overflows, dangling pointers, etc.
• Timing based
  – Concurrency bugs
• User request based
  – Dropping unexpected (malicious) user requests
  – As a last resort

Page 12: Shimin Chen LBA Reading Group Presentation
Page 13: Shimin Chen LBA Reading Group Presentation

Outline

• Introduction
• Main Idea of Rx
• Rx Design & Implementation
• Evaluation
• Summary

Page 14: Shimin Chen LBA Reading Group Presentation

Rx Components Overview

[Figure: diagram of the Rx components, numbered 1 to 5]

Page 15: Shimin Chen LBA Reading Group Presentation

Sensors for Detecting SW Failures

• OS-raised exceptions:
  – Assertion failures, segfault, divide-by-zero, etc.
• Fine-grain detection:
  – Buffer overflows, accesses to freed memory, etc.
• Only OS-raised exceptions are implemented

Page 16: Shimin Chen LBA Reading Group Presentation

Checkpoint and Rollback (Flashback)

• Memory state: fork-like operation
• Files: keep a copy of each accessed file and the file pointers for a checkpoint
• Checkpoint management:
  – Equal intervals or exponential landmarks
  – Limit the oldest checkpoint by considering the recovery time goal
• Multi-threaded process checkpointing
  – Send a signal to all threads to make them exit from blocked syscalls with EINTR
  – Take checkpoint
  – Library wrapper in Rx retries syscalls
  – High cost, so cannot be frequent
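The "library wrapper retries syscalls" step can be illustrated with a small retry helper. This is a sketch of the idea only: `retry_on_eintr` and `FlakyRead` are hypothetical names, and the interrupted call is simulated rather than produced by a real signal.

```python
import errno


def retry_on_eintr(syscall, *args, max_retries=10):
    """Call syscall(*args), retrying while it is interrupted with EINTR."""
    for _ in range(max_retries):
        try:
            return syscall(*args)
        except InterruptedError:                 # the OSError subclass used for errno.EINTR
            continue
    raise OSError(errno.EINTR, "still interrupted after retries")


class FlakyRead:
    """Simulated blocking read that is interrupted a fixed number of times."""

    def __init__(self, interruptions):
        self.remaining = interruptions
        self.calls = 0

    def __call__(self, nbytes):
        self.calls += 1
        if self.remaining > 0:                   # models a checkpoint signal arriving
            self.remaining -= 1
            raise InterruptedError(errno.EINTR, "Interrupted system call")
        return b"x" * nbytes


fake_read = FlakyRead(interruptions=2)
data = retry_on_eintr(fake_read, 4)
print(data, fake_read.calls)
```

The application code calling the wrapper never sees the interruption. CPython's own I/O functions already retry on EINTR since version 3.5, so the helper here models the wrapper's behaviour, not something Python programs normally need.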

Page 17: Shimin Chen LBA Reading Group Presentation

Environment Wrappers

• Memory wrapper (intercepting library calls):
  – Delaying free: keep a freed buffer for a threshold of (process) time; FIFO recycling
  – Padding buffers: add fixed-size padding to both ends of allocated buffers
  – Allocation isolation: put allocated buffers in isolated locations
  – Zero-filling
  – Do the above during re-execution, for the failed code region only
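To make the memory-wrapper changes concrete, here is a toy allocator over a `bytearray` that models padding, delayed free, and zero-filling. It is an illustrative sketch, not how Rx intercepts `malloc`/`free`: `ToyHeap` and the two helper functions are hypothetical, delayed free is counted in buffers where the slide uses time, allocation isolation is not modeled, and there is no bounds checking.

```python
from collections import deque


class ToyHeap:
    """Toy bump allocator; models the wrappers, not a real malloc."""

    def __init__(self, size=256, pad=0, zero_fill=False, delay_free=0):
        self.mem = bytearray(b"\xAA" * size)     # 0xAA models stale, uninitialized data
        self.next = 0
        self.pad = pad                           # padding bytes on each side of a buffer
        self.zero_fill = zero_fill
        self.delay_free = delay_free             # freed buffers held back from reuse
        self.quarantine = deque()                # FIFO of freed-but-not-recycled buffers
        self.free_list = []

    def malloc(self, n):
        for i, (start, length) in enumerate(self.free_list):
            if length == n:                      # recycle a released buffer
                del self.free_list[i]
                break
        else:
            start = self.next + self.pad         # leave padding before the buffer
            self.next = start + n + self.pad     # and after it
        if self.zero_fill:
            self.mem[start:start + n] = bytes(n)
        return start

    def free(self, start, n):
        self.quarantine.append((start, n))
        while len(self.quarantine) > self.delay_free:
            self.free_list.append(self.quarantine.popleft())   # FIFO recycling


def overflow_corrupts_neighbor(heap):
    a = heap.malloc(8)
    b = heap.malloc(8)
    heap.mem[b:b + 8] = b"B" * 8
    heap.mem[a:a + 10] = b"A" * 10               # bug: writes 2 bytes past the end of a
    return bytes(heap.mem[b:b + 8]) != b"B" * 8


def dangling_pointer_sees_reuse(heap):
    p = heap.malloc(8)
    heap.free(p, 8)
    q = heap.malloc(8)                           # later allocation of the same size
    return q == p                                # True: the stale pointer p aliases q


print(overflow_corrupts_neighbor(ToyHeap()), overflow_corrupts_neighbor(ToyHeap(pad=4)))
print(dangling_pointer_sees_reuse(ToyHeap()), dangling_pointer_sees_reuse(ToyHeap(delay_free=1)))
```

In both cases the buggy code is unchanged. The overflow still happens but lands in padding, and the stale pointer still exists but no longer aliases a live buffer.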

Page 18: Shimin Chen LBA Reading Group Presentation

Other Wrappers

• Message wrapper (in proxy)
  – Randomly shuffle message order across connections while keeping the message order within each connection
  – Randomize packet sizes
• Process scheduling: change the process’s priority
• Signal delivery: randomize hardware interrupt delivery time while preserving order
• Dropping user requests
  – Binary search for bad requests
  – Drop at most 10% of requests
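A binary search for a bad request might look like the sketch below. The slides do not give the procedure, so this assumes a single bad request that triggers the failure whenever it is replayed. `find_bad_request`, `within_drop_budget`, and the `replay_fails` callback (roll back, replay a batch, report whether the failure recurs) are hypothetical names.

```python
def find_bad_request(requests, replay_fails):
    """Return the index of the single request that triggers the failure."""
    lo, hi = 0, len(requests)                    # the bad request lies in requests[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if replay_fails(requests[lo:mid]):       # failure recurs: culprit is in the left half
            hi = mid
        else:                                    # left half is clean: culprit is on the right
            lo = mid
    return lo


def within_drop_budget(dropped, total, limit=0.10):
    """Check the 'drop at most 10% of requests' cap."""
    return dropped <= limit * total


requests = [f"GET /page{i}" for i in range(16)]
requests[11] = "GET /%n%n%n"                     # made-up malicious request

replays = []

def replay_fails(batch):
    replays.append(len(batch))                   # count how many replays were needed
    return any("%n" in r for r in batch)

bad = find_bad_request(requests, replay_fails)
print(bad, len(replays), within_drop_budget(1, len(requests)))
```

With 16 buffered requests the culprit is isolated in 4 replays, compared with up to 16 when dropping requests one at a time.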

Page 19: Shimin Chen LBA Reading Group Presentation

Proxy
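The message wrapper lives in the proxy. Its shuffling rule (random order across connections, original order within each connection) can be sketched as below. This is an assumption-laden illustration: messages are modeled as `(connection_id, payload)` pairs, and `shuffle_across_connections` is a hypothetical name.

```python
import random
from collections import deque


def shuffle_across_connections(messages, rng):
    """Reorder (conn_id, payload) pairs randomly across connections.

    Messages of the same connection keep their original relative order.
    """
    queues = {}
    for conn, payload in messages:
        queues.setdefault(conn, deque()).append(payload)

    out = []
    while queues:
        conn = rng.choice(list(queues))          # pick any connection with pending messages
        out.append((conn, queues[conn].popleft()))
        if not queues[conn]:
            del queues[conn]
    return out


messages = [("A", 1), ("B", 1), ("A", 2), ("B", 2), ("A", 3)]
shuffled = shuffle_across_connections(messages, random.Random(0))
print(shuffled)
```

Only the interleaving between connections changes, so each client still sees its own messages handled in the order it sent them.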

Page 20: Shimin Chen LBA Reading Group Presentation

Control Unit

• Coordinates checkpoint/rollback, environment changes, etc.
• Failure vector <S1, S2, …, Sm> per failure symptom (exception type, PC address, call chain, etc.)
  – Si is the score for environmental change #i
  – If change #i is successful, Si++; if it fails, Si--
  – Try the changes with scores greater than a certain threshold first
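The failure-vector bookkeeping can be sketched as a small table of scores keyed by symptom. The list of changes, the threshold of 0, and the ordering of the changes that are not above the threshold (left in their default order) are assumptions; the slide specifies only the ++/-- update and the "above threshold first" rule. `ControlUnit` and `CHANGES` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical labels for environmental changes #1..#m
CHANGES = ["delay_free", "pad_buffers", "zero_fill", "shuffle_messages", "drop_request"]


class ControlUnit:
    def __init__(self, threshold=0):
        self.threshold = threshold
        # symptom -> failure vector <S1, ..., Sm>, one score per change
        self.vectors = defaultdict(lambda: [0] * len(CHANGES))

    def record(self, symptom, change, succeeded):
        i = CHANGES.index(change)
        self.vectors[symptom][i] += 1 if succeeded else -1   # Si++ on success, Si-- on failure

    def order(self, symptom):
        """Changes scoring above the threshold first, the rest in default order."""
        scores = self.vectors[symptom]
        ranked = sorted(range(len(CHANGES)),
                        key=lambda i: (scores[i] <= self.threshold, i))
        return [CHANGES[i] for i in ranked]


cu = ControlUnit()
# A symptom is (exception type, PC address, call chain); the values are made up.
symptom = ("SIGSEGV", 0x4005D0, ("handle_request", "copy_header"))
cu.record(symptom, "delay_free", succeeded=False)
cu.record(symptom, "pad_buffers", succeeded=True)
cu.record(symptom, "pad_buffers", succeeded=True)
print(cu.vectors[symptom], cu.order(symptom))
```

The next time the same symptom appears, `pad_buffers` is tried first, so a recurring failure is recovered with fewer re-executions.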

Page 21: Shimin Chen LBA Reading Group Presentation

Outline

• Introduction
• Main Idea of Rx
• Rx Design & Implementation
• Evaluation
• Summary

Page 22: Shimin Chen LBA Reading Group Presentation

Setup

• A client machine and a server machine
  – 2.4GHz x86 CPU, 512KB L2 cache, 1GB DRAM
  – 100Mbps Ethernet
• Injected bugs

Page 23: Shimin Chen LBA Reading Group Presentation

Overall Results

Page 24: Shimin Chen LBA Reading Group Presentation

Checkpoint Overhead

Time: with checkpoint interval of 200ms, 5% overhead (MySQL)

Workloads:

• apache, squid: 5 threads, GET files with size uniform [1KB, 512KB]

• CVS: client exports a 30KB file

• MySQL: 5 client threads, transactions on a small table

Page 25: Shimin Chen LBA Reading Group Presentation

Summary

• Rx: re-executing the buggy program region in a modified execution environment
• Not a panacea:
  – Semantic bugs, resource leaks
  – Latent bugs (long delay from bug to symptom)