Troubleshooting SDN Control Software with Minimal Causal Sequences COLIN SCOTT, ANDREAS WUNDSAM, BARATH RAGHAVAN, AUROJIT PANDA, ANDREW OR, JEFFERSON LAI, EUGENE HUANG, ZHI LIU, AHMED EL-HASSANY, SAM WHITLOCK, HRISHIKESH B. ACHARYA, KYRIAKOS ZARIFIS, ARVIND KRISHNAMURTHY, SCOTT SHENKER

Dec 18, 2015


Leona Kelly
Page 1

Troubleshooting SDN Control Software with Minimal Causal Sequences

COLIN SCOTT, ANDREAS WUNDSAM, BARATH RAGHAVAN, AUROJIT PANDA, ANDREW OR, JEFFERSON LAI, EUGENE HUANG, ZHI LIU, AHMED EL-HASSANY, SAM WHITLOCK, HRISHIKESH B. ACHARYA, KYRIAKOS ZARIFIS, ARVIND KRISHNAMURTHY, SCOTT SHENKER

Page 2

Bugs are costly and time-consuming

Software bugs cost the US economy $59.6 billion annually

Developers spend ~50% of their time debugging

Best developers devoted to debugging

Page 3

Distributed Systems are Bug-Prone

Distributed correctness faults:

Race conditions

Atomicity violations

Deadlock

Livelock

Page 4

Example Bug (Floodlight, 2012)

Page 5

Best Practice: Logs

Human analysis of log files

Page 6

Best Practice: Logs

Page 7

Best Practice: Logs

Page 8

Our Goal

Allow developers to focus on fixing the underlying bug

Page 9

Problem Statement

Identify a minimal sequence of inputs that triggers the bug in a blackbox fashion

Page 10

Why minimization?

Smaller event traces are easier to understand

Page 11

Minimal Causal Sequence

Page 12

Minimal Causal Sequence

Page 13

Minimal Causal Sequence

Page 14

Where Bugs are Found

Symptoms found:

• On developer’s local machine (unit and integration tests)

• In production environment

• On quality assurance testbed

Page 15

Approach: Delta Debugging Replay

Page 16

Approach: Modify Testbed

Page 17

Testbed Observables

Invariant violation detected by testbed

Event sequence:

1. External events (link failures, host migrations, ...) injected by testbed

2. Internal events (message deliveries) observed by testbed (incomplete)

Page 18

Replay Definition

A replay of log L involves replaying the external events E_L, possibly taking into account the occurrence of internal events I_L

The output of replay is a sequence of configurations

Ideally replay(L) reproduces the original configuration sequence
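
The definition can be illustrated with a minimal sketch. Everything here is a toy assumption made for illustration: the `Event` type, the `apply_event` transition, and the choice of "set of links currently down" as the configuration are hypothetical, and this version only injects the external events E_L while ignoring I_L.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Event:
    external: bool  # True for injected inputs (E_L), False for observed internal events (I_L)
    label: str      # hypothetical label format, e.g. "link_down:s1-s2"


# Toy configuration: the set of links that are currently down.
Config = FrozenSet[str]


def apply_event(config: Config, event: Event) -> Config:
    """Toy transition function standing in for the real system under test."""
    action, _, target = event.label.partition(":")
    if action == "link_down":
        return config | {target}
    if action == "link_up":
        return config - {target}
    return config  # any other event leaves this toy configuration unchanged


def replay(log: List[Event]) -> List[Config]:
    """Inject the external events of `log` in order and return the configurations seen."""
    config: Config = frozenset()
    configs = [config]
    for event in log:
        if event.external:  # internal events are not injected in this sketch
            config = apply_event(config, event)
            configs.append(config)
    return configs
```

For a deterministic toy system like this one, replay(L) reproduces the original configuration sequence exactly; the later slides cover why that is not guaranteed for a real controller.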

Page 19

Approach: Delta Debugging Replay

Events (link failures, crashes, host migrations) injected by test orchestrator
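
A minimal sketch of delta debugging over an event trace, assuming the general ddmin scheme (test subsets, then complements, then refine the split). This is not the presenters' implementation: `triggers_bug` is a hypothetical callback that stands for "replay this subsequence in the testbed and check whether the invariant violation reappears".

```python
from typing import Callable, List, Sequence, TypeVar

E = TypeVar("E")


def ddmin(events: Sequence[E], triggers_bug: Callable[[List[E]], bool]) -> List[E]:
    """Return a subsequence that still triggers the bug and from which
    no single event can be removed without losing the bug (1-minimal)."""
    current = list(events)
    if not triggers_bug(current):
        raise ValueError("the full trace must reproduce the bug")
    n = 2  # target number of chunks
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        # 1. Try each chunk on its own.
        for subset in subsets:
            if triggers_bug(subset):
                current, n, reduced = subset, 2, True
                break
        # 2. Otherwise try removing one chunk at a time.
        if not reduced:
            for i in range(len(subsets)):
                complement = [e for j, s in enumerate(subsets) if j != i for e in s]
                if triggers_bug(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # 3. Otherwise split more finely, or stop at single-event granularity.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)
    return current
```

Each call to `triggers_bug` corresponds to one replay of a candidate subsequence, so the number of replays depends on how quickly chunks can be discarded.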

Page 20

Key Point

Must Carefully Schedule Replay Events To Achieve Minimization!

Page 21

Challenges

Asynchrony

Divergent execution

Non-determinism

Page 22

Challenge: Asynchrony

Asynchrony definition:

No fixed upper bound on relative speed of processors

No fixed upper bound on time for messages to be delivered

Page 23

Challenge: Asynchrony

Need to maintain original event order

Page 24

Challenge: Asynchrony

Page 25

Coping with Asynchrony

Use interposition to maintain causal dependencies
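
One way to picture interposition is a buffer that sits on the message path and releases messages only in the recorded order. The sketch below is an illustration under that assumption; the class name `OrderingInterposer` and the use of string message IDs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


class OrderingInterposer:
    """Buffers messages that arrive early and delivers them in the recorded order."""

    def __init__(self, recorded_order: Iterable[str]) -> None:
        self._expected: List[str] = list(recorded_order)
        self._next = 0                 # index of the next message allowed through
        self._buffered: Counter = Counter()
        self.delivered: List[str] = []

    def on_message(self, msg_id: str) -> None:
        """Called when a message reaches the interposition point."""
        self._buffered[msg_id] += 1
        self._flush()

    def _flush(self) -> None:
        # Release messages while the next expected one is already buffered.
        while (self._next < len(self._expected)
               and self._buffered[self._expected[self._next]] > 0):
            msg_id = self._expected[self._next]
            self._buffered[msg_id] -= 1
            self.delivered.append(msg_id)
            self._next += 1
```

If messages arrive as c, a, b while the recorded order is a, b, c, the interposer holds c until a and b have been delivered.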

Page 26

Challenge: Divergence

Divergence: Absent Internal Events

Prune Earlier Input...

Page 27

Divergence: Absent Internal Events

Some Events No Longer Appear

Divergence: Absent Internal Events

Page 28

Solution: Peek Ahead

Infer which internal events will occur
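
A rough sketch of the peek-ahead idea, under simplifying assumptions: the log is a list of `(kind, label)` pairs, and `observe` is a hypothetical callback that replays a prefix of inputs in the testbed and returns the set of internal events seen. Internal events from the original log are kept only if they still occur after the pruned inputs are gone.

```python
from typing import Callable, List, Set, Tuple

LogEntry = Tuple[str, str]  # (kind, label); kind is "external" or "internal"


def peek_ahead(
    original_log: List[LogEntry],
    kept_inputs: Set[str],
    observe: Callable[[Tuple[str, ...]], Set[str]],
) -> List[LogEntry]:
    """Return the events expected when only `kept_inputs` are replayed."""
    expected: List[LogEntry] = []
    prefix: List[str] = []  # external events injected so far
    for kind, label in original_log:
        if kind == "external":
            if label in kept_inputs:
                prefix.append(label)
                expected.append((kind, label))
        elif label in observe(tuple(prefix)):
            # The internal event still occurs for this input prefix, so expect it.
            expected.append((kind, label))
    return expected
```

Calling `observe` once per internal event keeps the sketch short; a real tool would presumably batch these observations, since each one costs a partial replay.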

Page 29

Challenge: Non-determinism

Coping With Non-Determinism

Replay multiple times per subsequence

Assuming i.i.d., probability of not finding bug modeled by:

If not i.i.d., override gettimeofday(), multiplex sockets, interpose on logging statements
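
As a worked example of the i.i.d. case: if each replay reproduces the bug independently with probability p, the chance that n replays all miss it is (1 - p)^n. This is the standard result for independent trials and is offered as an illustration; the symbols p and n and the function names below are introduced here and are not taken from the slides.

```python
import math


def miss_probability(p: float, n: int) -> float:
    """Probability that n independent replays all fail to trigger the bug."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return (1.0 - p) ** n


def replays_needed(p: float, confidence: float) -> int:
    """Smallest n such that the bug is seen at least once with the given confidence."""
    if not (0.0 < p < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

For instance, with p = 0.5 the miss probability after 3 replays is 0.125, and 5 replays are enough for 95% confidence.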

Page 30

Divergence: Syntactic Changes

Page 31

Divergence: Syntactic Changes

Sequence Numbers Differ!

Page 32

Solution: Equivalence Classes

Mask Over Extraneous Fields
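
Masking can be sketched as computing a fingerprint that ignores fields expected to change between runs. The field names `xid`, `seq`, and `timestamp` below are illustrative assumptions, as is the representation of a message as a flat dictionary.

```python
from typing import Any, Dict, Iterable, Tuple

# Hypothetical set of fields that vary across replays without changing meaning.
DEFAULT_MASKED = ("xid", "seq", "timestamp")


def fingerprint(
    message: Dict[str, Any], masked: Iterable[str] = DEFAULT_MASKED
) -> Tuple[Tuple[str, Any], ...]:
    """Return a hashable summary of `message` with the masked fields dropped."""
    hidden = set(masked)
    # Keys are unique, so sorting the pairs never needs to compare values.
    return tuple(sorted((k, v) for k, v in message.items() if k not in hidden))


def same_class(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """Two messages are functionally equivalent if their fingerprints match."""
    return fingerprint(a) == fingerprint(b)
```

Two messages that differ only in a sequence number then fall into the same equivalence class, while a change to any unmasked field keeps them apart.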

Page 33

Solution: Peek ahead

Page 34

Divergence: Unexpected Events

Prune Input...

Page 35

Divergence: Unexpected Events

Unexpected Events Appear

Page 36

Solution: Empirical Heuristic

Theory:

• Divergent paths -> exponential possibilities

Practice:

• Allow unexpected events through

Page 37

Approach Recap

Replay events in QA testbed

Apply delta debugging to inputs

Asynchrony: interpose on messages

Divergence: infer absent events

Non-determinism: replay multiple times

Page 38

Evaluation Methodology

Evaluate on 5 open source SDN controllers (Floodlight, NOX, POX, Frenetic, ONOS)

Quantify minimization for:

• Synthetic bugs

• Bugs found in the wild

Qualitatively relay experience troubleshooting with MCSes

Page 39

Case Studies

Page 40

Comparison to Naïve Replay

Naïve replay: ignore internal events

Naïve replay often not able to replay at all

• 5 / 7 discovered bugs not replayable

• 1 / 7 synthetic bugs not replayable

Naïve replay did better in one case

• 2 event MCS vs. 7 event MCS with our techniques

Page 41

Qualitative Results

15 / 17 MCSes useful for debugging

• 1 non-replayable case (not surprising)

• 1 misleading MCS (expected)

Page 42

Case Studies

Page 43

Case Studies

Page 44

Runtime

Page 45

Scalability

Page 46

Coping with Non-Determinism

Page 47

Complexity

Page 48

Ongoing work

Formal analysis of approach

Apply to other distributed systems (databases, consensus protocols)

Investigate effectiveness of various interposition points

Integrate STS into ONOS (ON.Lab) development workflow

Page 49

Related work

Thread Schedule Minimization

• Isolating Failure-Inducing Thread Schedules. SIGSOFT ’02.

• A Trace Simplification Technique for Effective Debugging of Concurrent Programs. FSE ’10.

Program Flow Analysis

• Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction. ISSTA ’07.

• Toward Generating Reducible Replay Logs. PLDI ’11.

Best-Effort Replay of Field Failures

• A Technique for Enabling and Supporting Debugging of Field Failures. ICSE ’07.

• Triage: Diagnosing Production Run Failures at the User’s Site. SOSP ’07.

Page 50

Improvements

Parallelize delta debugging

Smarter delta debugging time splits

Apply program flow analysis to further prune

Compress time

Page 51

Conclusion

Possible to automatically minimize execution traces for SDN control software

System (23K+ lines of Python) evaluated on 5 open source SDN controllers (Floodlight, NOX, POX, Frenetic, ONOS) and one proprietary controller

Currently generalizing, formalizing approach