
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou, University of Illinois at Urbana-Champaign. Triage: Diagnosing Production Run Failures.

Mar 28, 2015

Transcript
  • Slide 1

Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou, University of Illinois at Urbana-Champaign. Triage: Diagnosing Production Run Failures at the User's Site.

  • Slide 2

Motivation. Software failures are a major contributor to system downtime and to security holes. Software has grown in size, complexity, and cost, and software testing has become more difficult, so software packages inevitably contain bugs — even production releases.

  • Slide 3

Motivation. The result: software failures occur during production runs at the user's site. One solution is offsite software diagnosis, but it has drawbacks: failure-triggering conditions are difficult to reproduce; it cannot provide timely online recovery (e.g., from fast Internet worms); programmers cannot be provided to every site; and there are privacy concerns.

  • Slide 4

Goal: automatically diagnose software failures that occur during production runs at the end user's site. Understand a failure that has happened, find the root causes, and minimize manual debugging.

  • Slide 5

Current state of the art. Offsite diagnosis: interactive debuggers, program slicing, and core-dump analysis (partial execution-path construction) — all require manual analysis. Primitive onsite diagnosis: unprocessed failure-information collection and deterministic replay tools — their large overhead makes them impractical for production sites, and they raise privacy concerns.

  • Slide 6

Onsite diagnosis should: efficiently (i.e., quickly and automatically) reproduce the failure that occurred; impose little overhead during normal execution; require no human involvement; and require no prior knowledge.

  • Slide 7

Triage: captures the failure point and conducts just-in-time failure diagnosis with checkpoint/re-execution, plus delta generation and delta analysis — an automated, top-down, human-like software failure diagnosis protocol. It reports: the failure's nature and type; the failure-triggering conditions; and the failure-related code/variables and the fault propagation chain.

  • Slide 8

Triage architecture: three groups of components — 1. the runtime group, 2. the control group, 3. the analysis group.
  • Slide 9

Checkpoint & re-execution. Uses Rx (previous work by the authors). Rx checkpointing: uses fork()-like operations; keeps a copy of accessed files and file pointers; records messages using a network proxy. The replay may be deliberately modified.

  • Slide 10

Lightweight monitoring for detecting failures. Monitoring must not impose high overhead. The cheapest approach is to catch fault traps: assertions, access violations, divide-by-zero. Possible extensions: branch histories, system-call traces. Triage uses only exceptions and assertions.

  • Slide 11

Control layer. Implements the Triage diagnosis protocol: controls re-executions with different inputs based on past results, chooses the analysis technique, and collects results to send to off-site programmers.

  • Slide 12

Analysis layer: techniques.

  • Slide 13

TDP: Triage Diagnosis Protocol. Steps: simple replay → core-dump analysis → dynamic bug detection → delta generation → delta analysis → report. Example findings along the way: deterministic bug; stack/heap OK; segmentation fault in strlen(); null-pointer dereference; a collection of good and bad inputs; the code paths leading to the fault.

  • Slide 14

TDP: Triage Diagnosis Protocol — example report.

  • Slide 15

Protocol extensions and variations. Add different debugging techniques, reorder the diagnosis steps, or omit steps (e.g., memory checks for Java programs); the protocol may be custom-designed for specific applications. It can also try to fix bugs: filter failure-triggering inputs, dynamically delete risky code, or change variable values. Automatic patch generation is possible future work.

  • Slide 16

Delta generation. Two goals: 1. generate many similar replays, some that fail and some that don't; 2. identify the signature of failure-triggering inputs. Signatures may be used for failure analysis and reproduction, or for input filtering (e.g., Vigilante, Autograph, etc.).

  • Slide 17

Delta generation. Changing the input: replay previously stored client requests via the proxy, trying different subsets and combinations; isolate the bug-triggering part by data fuzzing; find non-failing inputs with minimum distance from the failing ones. Changing the environment:
Make protocol-aware changes; use a normal form of the input if the specific triggering portion is known; pad or zero-fill new allocations; change the order of messages; drop messages; manipulate thread scheduling; modify the system environment. Make use of information from prior steps (e.g., target specific buffers).

  • Slide 18

Delta generation: results passed to the next stage. Break the code into basic blocks; for each replay, extract a vector of the execution count of each block, plus the block trace. The granularity can be changed.

  • Slide 19

Example revisited.
Good run — trace: AHIKBDEFEFEG; block vector: {A:1, B:1, D:1, E:11, F:10, G:1, H:1, I:1, K:1}.
Bad run — trace: AHIJBCDE; block vector: {A:1, B:1, C:1, D:1, E:1, H:1, I:1, J:1}.

  • Slide 20