Page 1: Benchmarking for Observability

Data Storage Lab

Benchmarking for Observability: The Case of Diagnosing Storage Failures

Duo Zhang, Mai Zheng

Department of Electrical and Computer Engineering, Iowa State University, USA

Page 2: Benchmarking for Observability

Outline

• Background & Motivation

• Understanding Real-World Storage Failures

• Deriving BugBenchk

• Measuring Observability

• Conclusion and Future Work

1

Benchmarking for Observability: The Case of Diagnosing Storage Failures Outline


Page 4: Benchmarking for Observability

The Storage Stack & Failures

• The storage stack in the OS kernel is too complex to be bug-free
• May contain many classic bugs
  • E.g., data races, deadlocks, dangling pointers, buffer overflows, crash-consistency bugs, …
  • EXPLODE@OSDI’06
  • Lu et al.@FAST’13
  • HYDRA@SOSP’19
  • KRace@S&P’20
  • …
• Being optimized aggressively for non-volatile memories (NVM), which might introduce new bugs
  • E.g., Duo et al.@SYSTOR’21

3

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

Page 5: Benchmarking for Observability

The Storage Stack & Failures

• Bugs may lead to various storage failures in practice, which are often difficult & time-consuming to diagnose
• E.g., Algolia data center incident:

• Servers crashed and files corrupted for unknown reason

• After weeks of diagnosis, Samsung SSDs were mistakenly blamed

• After one month, a Linux kernel bug was identified as root cause

4

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

Page 6: Benchmarking for Observability

The Storage Stack & Failures

• Bugs may lead to various storage failures in practice, which are often difficult & time-consuming to diagnose
• E.g., Algolia data center incident:

• Servers crashed and files corrupted for unknown reason

• After weeks of diagnosis, Samsung SSDs were mistakenly blamed

• After one month, a Linux kernel bug was identified as root cause

5

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

More advanced debugging support is needed!

Page 7: Benchmarking for Observability

Limitations of Existing Tools?

• Three categories of debugging tools exist for failure diagnosis
• (1) Interactive debuggers

• Support fine-grained manual inspection

• Set breakpoints and inspect variable values

• Step forward/backward to examine the state before/after a point of interest

• E.g., GDB/KDB

6

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

Debugging with GDB (https://www.gnu.org/software/gdb/)
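A minimal sketch, not from the slides, of how an interactive debugger such as GDB is typically attached to a failing kernel: run the kernel under QEMU with its gdbstub enabled and connect GDB to a vmlinux built with debug symbols. The disk image name and the breakpoint target are illustrative (the latter is a critical function from a later slide).

    qemu-system-x86_64 -hda bugbench_case.qcow2 -m 2G -s -S &   # -s: gdbstub on tcp::1234, -S: start paused
    gdb ./vmlinux \
        -ex 'target remote :1234' \
        -ex 'break ext4_do_update_inode' \
        -ex 'continue'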

Page 8: Benchmarking for Observability

Limitations of Existing Tools?

• Three categories of debugging tools exist for failure diagnosis
• (2) Tracers

• Collect various events from a running system automatically

• Software tracers, e.g., Ftrace

• Hardware tracers, e.g., Storage Protocol Analyzer

7

Teledyne LeCroy Summit T34 Protocol Analyzer

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

https://blog.selectel.com/kernel-tracing-ftrace/
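A minimal sketch of how a software tracer like Ftrace is typically enabled through tracefs (these control files exist on mainline kernels; older systems expose them under /sys/kernel/debug/tracing, and the filter pattern is just an example):

    cd /sys/kernel/tracing
    echo 0 > tracing_on
    echo function_graph > current_tracer
    echo 'ext4_*' > set_ftrace_filter        # optional: limit tracing to one subsystem
    echo 1 > tracing_on
    # ... run the failing workload here ...
    echo 0 > tracing_on
    cp trace /tmp/ftrace_output.txt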

Page 9: Benchmarking for Observability

Limitations of Existing Tools?

• Three categories of debugging tools exist for failure diagnosis
• (3) Record & replay tools

• Record program executions and replay segments of the past execution

• Instructions, non-deterministic inputs, and snapshots

• E.g., TTVM@USENIX ATC’05, PANDA

8

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

TTVM framework; PANDA (https://panda-re.mit.edu/)
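A rough sketch of a record & replay workflow with PANDA, based on its documented usage; binary names, monitor commands, and plugin names can differ across PANDA versions, so treat this as an assumption rather than exact syntax:

    # Record: boot the guest with a QEMU monitor, then use begin_record/end_record
    panda-system-x86_64 -m 2G -hda guest.qcow2 -monitor stdio
    (qemu) begin_record storage_failure_rec
    ... trigger the failure inside the guest ...
    (qemu) end_record

    # Replay the recording through analysis plugins (plugin names illustrative)
    panda-system-x86_64 -m 2G -replay storage_failure_rec -panda osi -panda asidstory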

Page 10: Benchmarking for Observability

Limitations of Existing Tools?

• Little measurement of their limitations for debugging!
• Existing efforts simply measure tools’ runtime overhead

• Cannot tell how effective the tools are for diagnosing the root causes of failures

9

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

Page 11: Benchmarking for Observability

Limitations of Existing Tools?

• Little measurement of their limitations for debugging!
• Existing efforts simply measure tools’ runtime overhead

• Cannot tell how effective the tools are for diagnosing the root causes of failures

10

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

More benchmarking effort is needed!

Page 12: Benchmarking for Observability

What Should We Measure?

• Quinn et al.@HotOS’19
• Observability:

• “The observations that they (debugging tools) allow developers to make”

• Three high-level properties:
• Visibility

• Repeatability

• Expressibility

• No qualitative/quantitative metrics for measurement

11

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

Page 13: Benchmarking for Observability

What Should We Measure?

• Quinn et al.@HotOS’19
• Observability:

• “The observations that they (debugging tools) allow developers to make”

• Three high-level properties:
• Visibility

• Repeatability

• Expressibility

• No qualitative/quantitative metrics for measurement

12

Benchmarking for Observability: The Case of Diagnosing Storage Failures Background & Motivation

More concrete metrics are needed!

Page 14: Benchmarking for Observability

Outline

• Background & Motivation

• Understanding Real-World Storage Failures

• Deriving BugBenchk

• Measuring Observability

• Conclusion and Future Work

13

Benchmarking for Observability: The Case of Diagnosing Storage Failures Outline


Page 16: Benchmarking for Observability

Methodology

15

Benchmarking for Observability: The Case of Diagnosing Storage Failures Understanding Storage Failures

• Collect data-corruption-related incidents reported on the kernel Bugzilla
  • https://bugzilla.kernel.org/

• Focus on storage-related components
  • E.g., File system, IO/Storage

Advanced search on Bugzilla
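A hypothetical sketch of how such incidents can be collected programmatically, assuming the kernel Bugzilla exposes the standard Bugzilla REST API and that the product/component names below match its web UI:

    curl -s 'https://bugzilla.kernel.org/rest/bug?product=File%20System&component=ext4&limit=100' \
        | jq -r '.bugs[] | [.id, .creation_time, .last_change_time, .status] | @tsv'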

Page 17: Benchmarking for Observability

Methodology

16

• Collected information
  • Reported time
  • Last comment time
  • Kernel version
  • Component(s) involved
  • Number of comments
  • Number of participants

• Calculated information
  • Duration (from reported time to last comment time)

An Example of Incident Page

Benchmarking for Observability: The Case of Diagnosing Storage Failures Understanding Storage Failures
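The calculated duration is simply the gap between the reported time and the last comment time; a small sketch with made-up timestamps:

    reported='2019-03-01 10:15'
    last_comment='2019-07-20 08:30'
    echo "$(( ( $(date -d "$last_comment" +%s) - $(date -d "$reported" +%s) ) / 86400 )) days"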

Page 18: Benchmarking for Observability

Characteristics of Failure Incidents

• Observation 1
• The incidents took multiple months to resolve on average, and required multiple rounds of discussion and multiple participants
• Quantitatively shows the difficulty of diagnosing storage failures

17

Group | Count (%) | Avg. Days | Avg. Comments/Participants
Resolved | 136 (49.1%) | 146.9 | 8/3
Unresolved | 141 (50.9%) | 1444.2 | 5/2
Overall | 277 | 807.3 | 6/2

Table: Summary of Incidents

Benchmarking for Observability: The Case of Diagnosing Storage Failures Understanding Storage Failures

Page 19: Benchmarking for Observability

Characteristics of Failure Incidents

• Observation 2
• The incidents involve all major storage components

• Consistent with previous studies [e.g., Duo et al.@SYSTOR’21]

• Debugging tools need to provide full-stack observability!

18

Benchmarking for Observability: The Case of Diagnosing Storage Failures Understanding Storage Failures

Page 20: Benchmarking for Observability

Characteristics of Failure Incidents

• Observation 3
• The average debugging time is consistently long across different components

• Better debugging support is needed for all components!

19

Benchmarking for Observability: The Case of Diagnosing Storage Failures Understanding Storage Failures

Page 21: Benchmarking for Observability

Characteristics of Failure Incidents

• Observation 4
• 37 out of 136 (26.3%) resolved issues involve multiple OS distributions or kernel versions
• Bugs may elude intensive testing and sneak into new releases
• Testing in the development environment is not enough
• Debugging support for failure diagnosis is always needed in practice

• Observation 5
• Only 5 out of 136 (3.7%) resolved issues were caused by hardware
• Software bugs remain the dominant cause of storage failures
• Observing the behavior of the storage software stack is important!

20

Benchmarking for Observability: The Case of Diagnosing Storage Failures Understanding Storage Failures

Page 22: Benchmarking for Observability

Outline

• Background & Motivation

• Understanding Real-World Storage Failures

• Deriving BugBenchk

• Measuring Observability

• Conclusion and Future Work

21

Benchmarking for Observability: The Case of Diagnosing Storage Failures Outline


Page 24: Benchmarking for Observability

BugBenchk

• A collection of realistic, reproducible storage failure cases
• Includes all necessary information for reproducing the cases
  • Bug-triggering workloads
    • C/Bash, based on the report
  • Environmental information
    • OS distribution
    • Specific kernel version
    • System configurations
  • Root causes
    • Critical functions containing bugs
• Packaged as virtual machine (VM) images
  • Portable and convenient
• Enables realistic benchmarking & measurements of debugging tools

23

Benchmarking for Observability: The Case of Diagnosing Storage Failures Deriving BugBenchk

OS: Ubuntu 18.10 with kernel 4.19.0

Configuration: GRUB_CMDLINE_LINUX_DEFAULT="fsck.mode=force fsck.repair=no emergency scsi_mod.use_blk_mq=1"

Workload: Install and uninstall multiple packages multiple times

Other requirements: Run with small memory, e.g., 300MB
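A hypothetical reconstruction of the bug-triggering workload described above ("install and uninstall multiple packages multiple times"); the package list and iteration count are illustrative and not taken from the original incident report:

    #!/bin/bash
    set -e
    PKGS="apache2 mariadb-server php"          # example packages, not from the report
    for i in $(seq 1 20); do
        apt-get install -y $PKGS               # repeatedly dirty and sync the filesystem
        apt-get remove  -y $PKGS
    done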

Page 25: Benchmarking for Observability

BugBenchk

• The current prototype includes 9 cases
• Covering 4 different file systems as well as the block I/O layer

24

Case ID | Critical Function (partial) | Bug Type | Bug Size | WKLD Size
1-EX4 | ext4_do_update_inode, ext4_clear_inode_state | Semantics | 8 | 70 (C)
2-EX4 | parse_options | Semantics | 6 | 55 (C & Bash)
3-BTRFS | btrfs_ioctl_snap_destroy, btrfs_set_log_full_commit | Semantics | 71 | 54 (C)
4-BTRFS | btrfs_log_trailing_hole | Semantics | 121 | 61 (C)
5-BTRFS | btrfs_log_all_parents, btrfs_record_unlink_dir | Semantics | 13 | 58 (C)
6-F2FS | f2fs_submit_page_bio, f2fs_is_valid_blkaddr | Memory | 94 | 141 (C & Bash)
7-GFS | gfs2_check_sb, fs_warn | Memory | 18 | 2 (Bash)
8-BLK | blkdev_fsync, sync_blkdev | Semantics | 12 | 43 (C)
9-BLK | __blk_mq_issue_directly, blk_mq_requeue_request | Semantics | 9 | 17 (Bash)

Benchmarking for Observability: The Case of Diagnosing Storage Failures Deriving BugBenchk

Page 26: Benchmarking for Observability

Outline

• Background & Motivation

• Understanding Real-World Storage Failures

• Deriving BugBenchk

• Measuring Observability

• Conclusion and Future Work

25

Benchmarking for Observability: The Case of Diagnosing Storage Failures Outline


Page 28: Benchmarking for Observability

Experiment Overview

• Evaluated two state-of-the-art debugging tools via BugBenchk

• FTrace: a tracing utility built directly into the Linux kernel
• PANDA: a VM-based record & replay platform for program analysis

• Measured a concrete set of metrics according to the tools’ features and identified limitations in both tools
• E.g., zero observability in case of kernel panics

• Enhanced the observability of both tools by adding command-level information

27

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability

Page 29: Benchmarking for Observability

Experiment Result: FTrace

• Key Feature Measured
• Kernel function tracing

# tracer: function
#  TASK-PID      CPU#   TIMESTAMP          EXECUTION TIME   FUNCTION
   <idle>-0      [000]  2965.191280657: funcgraph_entry:               |  switch_mm_irqs_off() {
   <idle>-0      [000]  2965.191281919: funcgraph_entry:     0.283 us  |    load_new_mm_cr3();
   <idle>-0      [000]  2965.191282543: funcgraph_exit:      2.175 us  |  }
   fsync1-5526   [000]  2965.191283332: funcgraph_entry:               |  finish_task_switch() {
   fsync1-5526   [000]  2965.191283719: funcgraph_entry:               |    smp_irq_work_interrupt() {

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability

Case ID | Still Reproducible? | Total # of Func. Traced | Total # of Unique Func. | Critical Func. Observed | Shortest Distance
1-EX4 | Yes | 12,506 (±4.1%) | 1,152 (±0.7%) | 1/7 | -
2-EX4 | Yes | 54,796 (±2.3%) | 1,436 (±15.9%) | 0/1 | 2
3-BTRFS | Yes | 46,370 (±5.6%) | 1,339 (±1.5%) | 3/6 | -
4-BTRFS | Yes | 92,476 (±5.5%) | 1,381 (±1.0%) | 0/1 | 1
5-BTRFS | Yes | 30,528 (±3.6%) | 1,419 (±1.5%) | 3/4 | -
6-F2FS | Yes | 0 | 0 | 0/7 | -
7-GFS | Yes | 0 | 0 | 0/2 | -
8-BLK | Yes | 6,876 (±2.7%) | 901 (±4.3%) | 1/2 | -
9-BLK | Yes | 110,772,722 (±6.4%) | 1,165 (±0.8%) | 2/3 | -

28
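A rough sketch of how the per-case counts in the table can be derived from a captured trace, assuming the plain 'function' tracer's one-line-per-call output (echo function > current_tracer); the function_graph format shown above needs different parsing, and field positions vary across kernel versions:

    TRACE=/tmp/ftrace_output.txt
    grep -v '^#' "$TRACE" | wc -l                                      # total functions traced
    grep -v '^#' "$TRACE" | awk '{print $(NF-1)}' | sort -u | wc -l    # unique functions
    grep -cw 'ext4_do_update_inode' "$TRACE"                           # was a critical function observed?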

Page 30: Benchmarking for Observability

Experiment Result: FTrace

• Key Observations
• All 9 cases in BugBenchk can still be reproduced when applying FTrace

• i.e., FTrace is non-intrusive for failure diagnosis

29

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 31: Benchmarking for Observability

Experiment Result: FTrace

• Key Observations
• Provides rich function-level information for diagnosis

• The number of unique functions is much smaller than the total number of functions traced

30

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 32: Benchmarking for Observability

Experiment Result: FTrace

• Key Observations
• Cannot trace all critical functions!

• Limited by the available_filter_functions file in debugfs

31

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability
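Whether a critical function can be traced at all can be checked up front against available_filter_functions (under /sys/kernel/tracing, or /sys/kernel/debug/tracing on older setups); functions that are inlined or annotated notrace never appear there. The function names below are the critical functions of case 6-F2FS from the earlier table:

    for fn in f2fs_submit_page_bio f2fs_is_valid_blkaddr; do
        grep -qw "$fn" /sys/kernel/tracing/available_filter_functions \
            && echo "$fn: traceable" || echo "$fn: NOT traceable"
    done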


Page 33: Benchmarking for Observability

Experiment Result: FTrace

• Key Observations
• Can trace closely related functions (e.g., parent functions) even if it misses the critical function itself

32

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 34: Benchmarking for Observability

Experiment Result: FTrace

• Key Observations
• Cannot help at all in case of “6-F2FS” and “7-GFS” (i.e., zero observability)

• Both cases cause kernel panics, which affect the in-kernel FTrace

33

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 35: Benchmarking for Observability

Experiment Result: FTrace

34

Built-in tracer is fundamentally limited for diagnosing severe storage failures

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


• Key Observations
• Cannot help at all in case of “6-F2FS” and “7-GFS” (i.e., zero observability)

• Both cases cause kernel panics, which affect the in-kernel FTrace

Page 36: Benchmarking for Observability

Experiment Result: FTrace

• Key Observations
• May generate substantial traces which may dilute the debugging focus

35

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 37: Benchmarking for Observability

Experiment Result: FTrace

• Key Observations
• May generate substantial traces which may dilute the debugging focus

36

More intelligent debugging methods are needed to reason about the root cause automatically!

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 38: Benchmarking for Observability

Experiment Result: PANDA

• Key Features Measured
• Record & replay
• 4 Plugins

37

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability

Case ID | Still Reproducible? | Record & Replay | Plugin: Show Instructions | Plugin: Taint Analysis | Plugin: Identify Processes | Plugin: Process-Block Relationship
1-EX4 | Yes | √ | √ | √ | √ | √
2-EX4 | Yes | √ | √ | √ | √ | √
3-BTRFS | Yes | √ | √ | √ | √ | √
4-BTRFS | Yes | √ | √ | √ | √ | √
5-BTRFS | Yes | √ | √ | √ | √ | √
6-F2FS | Yes | √ | √ | √ | √ | √
7-GFS | Yes | √ | √ | √ | √ | √
8-BLK | Yes | √ | √ | √ | √ | √
9-BLK | No | N/A | N/A | N/A | N/A | N/A

Page 39: Benchmarking for Observability

Experiment Result: PANDA

• Key Observations
• PANDA can be applied to diagnose 8 out of the 9 cases in BugBenchk

38

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 40: Benchmarking for Observability

Experiment Result: PANDA

• Key Observations
• Can handle severe failures (e.g., kernel panics) that FTrace cannot deal with

• By leveraging VM

39

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 41: Benchmarking for Observability

Experiment Result: PANDA

• Key Observations
• Can handle severe failures (e.g., kernel panics) that FTrace cannot deal with

• By leveraging VM

40

Isolating the target storage software stack from the tool itself is important!

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 42: Benchmarking for Observability

Experiment Result: PANDA

• Key Observations
• PANDA failed in “9-BLK”

• Heavy non-deterministic events overwhelm the full-stack recording

41

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 43: Benchmarking for Observability

Experiment Result: PANDA

• Key Observations
• PANDA failed in “9-BLK”

• Heavy non-deterministic events overwhelm the full-stack recording

42

Reducing the overhead of VM-based full-stack tools is critically important!

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability


Page 44: Benchmarking for Observability

Experiment Result: Extensions

• Neither tool can provide complete, full-stack observability
• Missing the lowest level of the storage software stack

• i.e., the communications between the OS kernel and the storage device

• Typically relies on special hardware support to collect such information (e.g., storage protocol analyzer)

43

Teledyne LeCroy Summit T34 Protocol Analyzer Trace View Software

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability

Page 45: Benchmarking for Observability

Experiment Result: Extensions

• FTrace Extension
• Intercept device commands via a customized iSCSI driver
• Stitch kernel functions with device commands based on timestamps (a rough sketch follows this slide)
• Enhance the observability by combining function-level information with command-level information
• E.g., a missing SYNC_CACHE command in an fsync system call code path is problematic

44

FTrace with extended command-level information

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability
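A rough sketch, not the authors' implementation, of the timestamp-based stitching described above, assuming both logs have been normalized so that the first field is a numeric timestamp in seconds and each log is already in time order:

    sort -m -k1,1g func_trace.txt cmd_trace.txt > stitched.txt
    # Example check: flag an fsync code path with no cache-flush command among the
    # following events (the SYNC_CACHE label depends on how the customized iSCSI
    # driver logs commands)
    grep -A 20 'vfs_fsync' stitched.txt | grep -q 'SYNC_CACHE' \
        || echo 'suspicious: no SYNC_CACHE observed after fsync'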

Page 46: Benchmarking for Observability

Experiment Result: Extensions

45

• PANDA Extension
• Intercept device commands through a customized QEMU
• Record device commands together with the instruction trace

• Enhance the observability by combining instruction-level information with command-level information

QEMU with extended command-level information

Benchmarking for Observability: The Case of Diagnosing Storage Failures Measuring Observability
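PANDA is built on QEMU, so a low-effort approximation of the command-level interception described above is QEMU's built-in trace events rather than a custom patch (a sketch only; event names, the -trace syntax, and available trace backends vary across QEMU versions and builds):

    qemu-system-x86_64 -hda bugbench_case.qcow2 -m 2G \
        -trace 'scsi_*' -D /tmp/qemu_scsi_trace.log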

Page 47: Benchmarking for Observability

Outline

• Background & Motivation

• Understanding Real-World Storage Failures

• Deriving BugBenchk

• Measuring Observability

• Conclusion and Future Work

46

Benchmarking for Observability: The Case of Diagnosing Storage Failures Outline


Page 49: Benchmarking for Observability

Conclusion

• Derived BugBenchk from real-world storage failures
  • Enables realistic benchmarking of debugging tools
• Measured the observability of representative debugging tools via BugBenchk
• Tracers relying on built-in tracepoints/probes/instrumentation may be fundamentally limited
  • Zero observability when the storage failure is too severe
  • This is also when debugging support is needed the most
• VM-based record & replay tools may be more viable due to isolation
  • But heavy overhead can make a failure un-reproducible
• Both methods may generate substantial information
  • Still need much human effort to understand the root cause
• Both methods may miss low-level information
  • Can be remedied by extensions
• Less intrusive and more intelligent solutions are needed for enhancing debugging observability
  • More benchmarking efforts can guide the design of new solutions

48

Benchmarking for Observability: The Case of Diagnosing Storage Failures Conclusion and Future Work


Page 51: Benchmarking for Observability

Future Work

• Enrich BugBenchk
  • Add more diverse cases
    • E.g., driver bugs, persistent-memory-specific bugs

• Derive more metrics and evaluate more tools
  • E.g., quantify the human effort needed to understand the root cause given the limited observability provided by existing tools

• Build intelligent debugging tools to reduce debugging efforts
  • E.g., kernel-level data provenance tracking & querying

50

Benchmarking for Observability: The Case of Diagnosing Storage Failures Conclusion and Future Work

Thanks!