A Software Layer for Disk Fault Injection
Jake Adriaens
Dan Gibson
CS 736 Spring 2005
Instructor: Remzi Arpaci-Dusseau
Jan 03, 2016
Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details & IDE Driver
4. Fault Model
5. Methods & Evaluation
6. Summary
Overview - 1
Software system for modeling IDE disk faults in an x86/Linux-based computer
Modification to IDE driver for read/write event interception
Overview - 2
Disk faults described at a high level
Faults passed to kernel-level module
On read/write event:
– IDE driver calls kernel module to perform request modification
– Before write event, module may modify data to be written
– After read event, module may modify data read from disk
Motivation – Why purposely cause disk failures?
Commodity HW (and SW!) fails, usually at unexpected times
– Causing failures at expected times can help improve fault tolerance measures
Can be used to determine fault tolerance of systems
– Various flavors of RAID need fault injection
Motivation
Faults can happen at the worst time
– In the middle of a PowerPoint presentation…
Challenges
Drivers are typically written with reliability in mind
– May have error detection / correction measures
– Should these be removed? Fooled? Applauded?
Low-level drivers critically affect performance and stability of the system
– Disk faults need not be “stable,” but shouldn’t have unusual “side effects”
Challenges
Failure models difficult to justify
– Disk manufacturers don’t offer details on how/why their disks fail
– Failstop model is widely used: models complete, detected disk failure
– Other models must be chosen generally to account for many different disks, controllers, etc.
Related Work
Software fault injection
– Huang et al. (and many others) use software fault injection for modifying cached web pages (ACM/ProcWWW)
– Jarboui et al. inject software faults into the Linux kernel and observe system behavior
– Nagaraja et al. inject faults into cluster-based systems
Related Work
Disk Faults, Modeling, Detection
– Kaaniche et al. inject disk faults to study RAID behavior
– Kari et al. present fault detection and diagnosis techniques (separate studies)
– Various other RAID and/or FS papers use some form of fault injection to model failures
Related Work
Hardware Fault Injection
Implementation
Core components
– User-level parser
– In-kernel injection module
– In-driver upcalls
– System calls
Added ~20 lines to IDE driver code
Kernel module is demand-loaded, ~250 lines in size
2 system calls, inject_fault and getdrivesize, ~120 lines
Implementation – User-level Console
Used for fault definition
– Console interface for fault definition
– Processes batch files
– Checks faults for validity
  Sector ranges, probability, etc. (more later)
– Passes faults to kernel module
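The validity checks listed above can be sketched as follows. This is only an illustration, not the project's parser: the one-line batch format (`type first_sector last_sector probability`), the `FaultDef` class, and the `parse_fault` function are assumptions made for this example. Only the fault type names come from the fault model slides.

```python
from dataclasses import dataclass

# Fault type names as listed on the fault model slides.
FAULT_TYPES = {"sectorfail", "sectorro", "sectorwrong",
               "transaddr", "transdata", "failstop"}


@dataclass(frozen=True)
class FaultDef:
    """Hypothetical in-memory form of one fault definition."""
    kind: str
    first_sector: int
    last_sector: int
    probability: float


def parse_fault(line: str, drive_sectors: int) -> FaultDef:
    """Parse one line of an assumed batch format and reject invalid faults."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    kind = fields[0]
    first, last = int(fields[1]), int(fields[2])
    prob = float(fields[3])
    if kind not in FAULT_TYPES:
        raise ValueError(f"unknown fault type: {kind}")
    # The sector range must be ordered and lie inside the drive.
    if not 0 <= first <= last < drive_sectors:
        raise ValueError(f"sector range {first}-{last} is outside the drive")
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"probability {prob} is not in [0, 1]")
    return FaultDef(kind, first, last, prob)
```

In the real system the drive size would presumably come from the sizing system call; here it is simply passed in as `drive_sectors`.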
Implementation – IDE Driver Modification
Added “upcalls” to injection module
– Pass I/O requests to module for modification
– Provide callback service on I/O completion
Added special-purpose code for certain fault models
– Failstop model requires in-driver actions
Implementation – Kernel Module
Receives fault lists from user-level console
Called by IDE driver to perform insertion when:
– LBA sector (SCSI-like) becomes known – sector may be modified
– Write is initiated – data to be written may be modified
– Read completes – data may be modified before returning control to I/O initiator
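The three call points above can be modeled in user space as a set of hooks that a driver invokes around each request. The sketch below is a simulation for illustration, not kernel code: `Injector`, `SimDriver`, `ZeroOnRead`, and the hook names are hypothetical, and a dictionary stands in for the disk.

```python
SECTOR_SIZE = 512


class Injector:
    """Stand-in for the injection module; by default every hook is a no-op."""

    def on_sector_known(self, sector: int) -> int:
        return sector            # the sector number may be modified here

    def before_write(self, sector: int, data: bytes) -> bytes:
        return data              # data to be written may be modified here

    def after_read(self, sector: int, data: bytes) -> bytes:
        return data              # data read may be modified here


class SimDriver:
    """Toy driver that makes the three upcalls around each request."""

    def __init__(self, injector: Injector):
        self.injector = injector
        self.media = {}          # sector number -> stored bytes

    def write(self, sector: int, data: bytes) -> None:
        sector = self.injector.on_sector_known(sector)
        self.media[sector] = self.injector.before_write(sector, data)

    def read(self, sector: int) -> bytes:
        sector = self.injector.on_sector_known(sector)
        raw = self.media.get(sector, bytes(SECTOR_SIZE))
        return self.injector.after_read(sector, raw)


class ZeroOnRead(Injector):
    """Example fault: reads of the listed sectors come back as zeros."""

    def __init__(self, bad_sectors):
        self.bad_sectors = set(bad_sectors)

    def after_read(self, sector: int, data: bytes) -> bytes:
        return bytes(len(data)) if sector in self.bad_sectors else data
```

Keeping the driver unaware of fault semantics mirrors the small driver patch described earlier: the driver only forwards requests, and all policy lives in the injector.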
Implementation – System Calls
Added two system calls
– inject_faults()
  Used to pass fault definitions to kernel module from user space
– getsectors()
  Used to determine raw sector ranges of IDE devices by name (there are other ways to do this)
Implementation
[Figure: path of a disk request through the injection layer. Recoverable labels: Faults Defined, Faults Injected, Disk Request, I/O Initiated, Upcall, Modified Request, Bus Traffic, I/O Returns, Control Returns]
IDE Driver (2.4.26 Linux Kernel)
Important structures
– struct request
  Information about an IDE request
  – READ / WRITE
  – Number of sectors
  – Etc.
– struct ide_drive_s (_t)
  Information about a drive
  – Drive name (e.g. “hdc”)
  – Sizing/addressing information
  – Etc.
IDE Driver (2.4.26 Linux Kernel)
Functions
– ide_do_rw_disk (3 versions)
  Common choke-point for reads & writes
  Many other similar functions, only this one in use
  Two versions, swapped by preprocessor directives (one for DMA, one for PIO)
Failure Model
Models selected to represent “generic IDE” disk
– No modeling of specific failure (e.g. Western Digital’s “classic” servo malfunction)
– Models based on ranges of affected logical sectors (à la SCSI)
Failure Model – Fault Types
sectorfail
– Models inability of a given sector (block) or sector range to store data reliably
– Excited on read of sector:
  Data read is permuted in some way:
  – Randomized
  – Set to specific value
  – Added to offset
  – Shifted by one or more bytes
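The four data permutations can be sketched as pure functions over a sector buffer. The slide does not define them precisely, so the details here are assumptions: "added to offset" is read as adding a constant to every byte modulo 256, and "shifted" as moving the data toward higher addresses with zero fill while keeping the buffer length. All function names are hypothetical.

```python
import random


def randomize(data: bytes, rng: random.Random) -> bytes:
    """Replace every byte with a random value."""
    return bytes(rng.randrange(256) for _ in data)


def set_value(data: bytes, value: int) -> bytes:
    """Overwrite the whole buffer with one byte value (0-255)."""
    return bytes([value]) * len(data)


def add_offset(data: bytes, offset: int) -> bytes:
    """Add a constant to each byte, wrapping modulo 256 (assumed meaning)."""
    return bytes((b + offset) % 256 for b in data)


def shift_bytes(data: bytes, count: int) -> bytes:
    """Shift the data by `count` bytes, zero-filling the start (assumed meaning)."""
    count = min(count, len(data))
    return bytes(count) + data[:len(data) - count]
```

For example, `shift_bytes(b"abcd", 1)` yields `b"\x00abc"` and `add_offset(b"\x00\xff", 1)` yields `b"\x01\x00"`.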
Failure Model – Fault Types
sectorro
– Writes to block have no effect on stored value
– Excited on writes to sector:
  Write requests ignored
sectorwrong
– Traffic to a given block is directed to a different block
– Excited on reads & writes
  Address permuted, similarly to data
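A small simulated disk makes the difference between the two faults concrete: sectorro drops writes, while sectorwrong redirects both reads and writes. The `FaultyDisk` class is hypothetical, and how the two faults interact on the same sector is not specified in the slides; this sketch checks the read-only set against the requested sector before any redirection.

```python
class FaultyDisk:
    """Toy disk with sectorro and sectorwrong behavior (illustrative only)."""

    def __init__(self):
        self.media = {}          # sector number -> stored bytes
        self.read_only = set()   # sectorro: writes to these sectors are dropped
        self.redirect = {}       # sectorwrong: requested sector -> sector used

    def _resolve(self, sector: int) -> int:
        # sectorwrong: traffic for a faulty sector goes somewhere else.
        return self.redirect.get(sector, sector)

    def write(self, sector: int, data: bytes) -> None:
        if sector in self.read_only:
            return               # silently ignored, no error reported
        self.media[self._resolve(sector)] = data

    def read(self, sector: int) -> bytes:
        return self.media.get(self._resolve(sector), bytes(512))
```

Usage: after `disk.read_only.add(5)`, a later `disk.write(5, ...)` leaves the old contents in place; after `disk.redirect[11] = 10`, reads and writes addressed to sector 11 operate on sector 10.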
Failure Model – Fault Types
transaddr
– Sector number wrong for first fault excitation, but right for all others
– Excited on reads & writes
  Sector permuted as in sectorwrong
transdata
– Data is wrong for first fault excitation
  Data permuted as in sectorfail
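The "first excitation only" behavior amounts to a one-shot flag on the fault. The sketch below shows this for transdata; the class name and the choice to disarm once per fault (rather than once per sector) are assumptions, since the slides do not say which is meant.

```python
from typing import Callable


class TransientDataFault:
    """One-shot data corruption over a sector range (illustrative only)."""

    def __init__(self, first: int, last: int,
                 corrupt: Callable[[bytes], bytes]):
        self.sectors = range(first, last + 1)
        self.corrupt = corrupt   # any sectorfail-style permutation
        self.armed = True        # cleared after the first excitation

    def after_read(self, sector: int, data: bytes) -> bytes:
        if self.armed and sector in self.sectors:
            self.armed = False   # later reads see correct data
            return self.corrupt(data)
        return data
```

A transaddr fault would follow the same pattern, with the one-shot flag guarding a sector-number permutation instead of a data permutation.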
Failure Model – Fault Types
failstop
– Drive is totally unresponsive: performs no reads or writes
– Differs from traditional Failstop in that our failstop is invisible
  Drive does not report any errors, simply fails to perform reads or writes to any sector
Verification of Faults (?)
Faults excited and observed by microbenchmarks tailored to individual fault types
Techniques similar to latent fault detection (Kari et al., and other studies)
Verification of faults is fault-specific
Verification - sectorfail
Corrupts data when read from disk
1. Write known data to disk - observe location using printk statement
2. Inject sectorfail fault at location of file on disk
3. Unmount/remount FS (flush cache)
4. Attempt to read faulty file (with cat)
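The cache flush in step 3 matters because sectorfail is excited only when a read actually reaches the disk. The simulation below illustrates this with a toy write-through cache; `CachedDisk` is hypothetical and uses the "set to specific value" permutation (all zeros) for simplicity.

```python
class CachedDisk:
    """Toy disk with a file cache in front of it (illustrative only)."""

    def __init__(self):
        self.media = {}           # what is stored on the platter
        self.cache = {}           # recently read or written sectors
        self.bad_sectors = set()  # sectors with an injected sectorfail

    def write(self, sector: int, data: bytes) -> None:
        self.cache[sector] = data
        self.media[sector] = data          # write-through, for simplicity

    def read(self, sector: int) -> bytes:
        if sector in self.cache:
            return self.cache[sector]      # the disk is never touched
        data = self.media[sector]
        if sector in self.bad_sectors:
            data = bytes(len(data))        # fault excited on the real read
        self.cache[sector] = data
        return data

    def flush_cache(self) -> None:
        self.cache.clear()
```

Reading right after the injection returns the cached, correct data; only after `flush_cache()` does the read reach the faulty sector and come back corrupted.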
Verification - sectorro
Ignores writes to a given location
1. Write known data to disk
2. Inject sectorro fault
3. Flush file cache
4. Write different data to same location
5. Flush file cache
6. Read from disk, expect the data written in (1)
Verification - sectorwrong
Changes address (sector) to another sector number
1. Write known data to disk
2. Flush file cache
3. Inject sectorwrong fault—redirect to known location
4. Read from file – observe data from other sector
Verification - transdata
Data modified after read, but only the first time
1. Verify sectorfail functionality
2. Flush file cache
3. Re-read, expect correct data
Verification - transaddr
Sector number modified before reads & writes, but only the first time
1. Verify sectorwrong functionality
2. Flush file cache
3. Repeat read, expect correct data
Verification - failstop
Easy!
1. Install failstop fault
2. Attempt to access any portion of affected drive
3. Expect bad things
– Usually causes kernel panic
Evaluation
Execution time overhead of injection SW
– Overhead << standard dev. of runtime for unaffected regions of disk space
– Overhead << standard dev. of runtime for affected regions
– Averaged over 250 accesses
                    Avg. (ms)   Std.Dev.
No injection        3.025       0.075
Unaffected region   3.020       0.076
Affected region
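The "overhead << standard deviation" claim can be checked directly from the two complete rows of the table. The short calculation below uses only those reported figures; the affected-region row has no values in the source and is left out, and the variable names are illustrative.

```python
# (mean ms, standard deviation) as reported in the table above
no_injection = (3.025, 0.075)
unaffected = (3.020, 0.076)

# Difference in mean access time attributable to the injection layer.
diff = unaffected[0] - no_injection[0]

# Size of that difference relative to the run-to-run spread.
ratio = abs(diff) / no_injection[1]

print(f"difference in means: {diff:+.3f} ms")      # -0.005 ms
print(f"|difference| / std. dev.: {ratio:.3f}")    # 0.067
```

The means differ by about 0.005 ms, roughly one fifteenth of the standard deviation, and the run with injection enabled was marginally faster, so the difference is indistinguishable from measurement noise.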
Summary
Present five new failure models for disk accesses, and the ability to inject them
Verified fault manifestation
– Did not verify potential side effects?
Fault injection has no noticeable effect on access times
– Small SW overhead much smaller than access time to physical device