FT 101
Jim Gray
Microsoft Research
http://research.microsoft.com/~gray/Talks/
80% of slides are not shown (are hidden), so view with PPT to see them all.

Outline
• Terminology and empirical measures
• General methods to mask faults.
• Software-fault tolerance
• Summary

• Reliability / Integrity: does the right thing. (Also large MTTF)
• Availability: does it now. (Also small MTTR / (MTTF + MTTR))
• System Availability: if 90% of terminals up & 99% of DB up?
  (=> 89% of transactions are serviced on time).
• Holistic vs. Reductionist view
[Figure labels: Security, Integrity, Reliability, Availability]
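
The 89% figure follows from multiplying the two availabilities, which assumes that terminals and the database fail independently. A minimal sketch of that arithmetic (the function name is illustrative and not from the talk):

```python
def series_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up,
    assuming the components fail independently."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


terminals_up = 0.90
database_up = 0.99
system = series_availability(terminals_up, database_up)
print(f"{system:.3f}")  # 0.891, i.e. about 89% of transactions serviced
```

The product can never exceed the weakest component, which is one reading of the holistic view: the user sees the whole chain, not the 99% database.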

High Availability System Classes
Goal: Build Class 6 Systems

System Type              Unavailable (min/year)   Availability   Availability Class
Unmanaged                50,000                   90%            1
Managed                  5,000                    99%            2
Well Managed             500                      99.9%          3
Fault Tolerant           50                       99.99%         4
High-Availability        5                        99.999%        5
Very-High-Availability   .5                       99.9999%       6
Ultra-Availability       .05                      99.99999%      7

UnAvailability = MTTR/MTBF
Can cut it in ½ by halving MTTR or doubling MTBF.
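
The class numbers in the table count the leading nines of availability. A rough sketch of how the columns relate, assuming a 365-day year (the function names are illustrative; the slide rounds its minutes column, for example 52,560 to 50,000):

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def unavailable_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_class(availability: float) -> int:
    """Number of leading nines: class 4 means 99.99% available."""
    unavailability = 1.0 - availability
    # Small tolerance so floating-point rounding does not drop 0.99 to class 1.
    return math.floor(-math.log10(unavailability) + 1e-9)


for a in (0.9, 0.99, 0.999, 0.9999, 0.99999, 0.999999, 0.9999999):
    print(availability_class(a), f"{unavailable_minutes_per_year(a):,.2f} min/year")
```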

Demo: looking at some nodes
• Look at http://uptime.netcraft.com/
• Internet Node availability: 92% mean, 97% median
  Darrell Long (UCSC) ftp://ftp.cse.ucsc.edu/pub/tr/
  – ucsc-crl-90-46.ps.Z "A Study of the Reliability of Internet Sites"
  – ucsc-crl-91-06.ps.Z "Estimating the Reliability of Hosts Using the Internet"
  – ucsc-crl-93-40.ps.Z "A Study of the Reliability of Hosts on the Internet"
  – ucsc-crl-95-16.ps.Z "A Longitudinal Survey of Internet Host ..."

SYSTEM 8 20 21 Years
Problem: Systematic Under-reporting

Many Software Faults are Soft
After
  Design Review
  Code Inspection
  Alpha Test
  Beta Test
  10k Hrs Of Gamma Test (Production)
Most Software Faults Are Transient
  MVS Functional Recovery Routines 5:1
  Tandem Spooler 100:1
  Adams >100:1
Terminology:
  Heisenbug: Works On Retry
  Bohrbug: Faults Again On Retry
Adams: "Optimizing Preventative Service of Software Products", IBM J R&D, 28.1, 1984
Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.

Summary of FT Studies
• Current Situation: ~4-year MTTF => Fault Tolerance Works.
• Hardware is GREAT (maintenance and MTTF).
• Software masks most hardware faults.
• Many hidden software outages in operations:
  – New Software.
  – Utilities.
• Must make all software ONLINE.
• Software seems to define a 30-year MTTF ceiling.
• Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.

Fault Tolerance vs Disaster Tolerance
• Fault-Tolerance: masks local faults
  – RAID disks
  – Uninterruptible Power Supplies
  – Cluster Failover
• Disaster Tolerance: masks site failures
  – Protects against fire, flood, sabotage, ...
  – Redundant system and service at remote site.
  – Use design diversity

Outline
• Terminology and empirical measures
• General methods to mask faults.
• Software-fault tolerance
• Summary

Fault Model
• Failures are independent
  So, single fault tolerance is a big win
• Hardware fails fast (blue-screen)
• Software fails fast (or goes to sleep)
• Software often repaired by reboot:
  – Heisenbugs
• Operations tasks: major source of outage
  – Utility operations
  – Software upgrades

Fault Tolerance Techniques
• Fail fast modules: work or stop

Hardware Maintenance:
  On-Line Maintenance "Works" 999 Times Out Of 1000.
  The chance a duplexed disc will fail during maintenance? 1:1000
  Risk Is 30x Higher During Maintenance
  => Do It Off Peak Hour
Software Maintenance:
  Repair Only Virulent Bugs
  Wait For Next Release To Fix Benign Bugs

OK: So Far
  Hardware fail-fast is easy
  Redundancy plus Repair is great (Class 7 availability)
  Hardware redundancy & repair is via modules.
  How can we get instant software repair?
We Know How To Get Reliable Storage
  RAID Or Dumps And Transaction Logs.
We Know How To Get Available Storage
  Fail Soft Duplexed Discs (RAID 1...N).
? How do we get reliable execution?
? How do we get available execution?

Outline
• Terminology and empirical measures
• General methods to mask faults.
• Software-fault tolerance
• Summary

Key Idea
  Architecture               {  Hardware Faults
  Software      }  Masks     {  Environmental Faults
  Distribution               {  Maintenance
• Software automates / eliminates operators
So,
• In the limit there are only software & design faults.
Software-fault tolerance is the key to dependability.
INVENT IT!

Software Techniques: Learning from Hardware
Recall that most outages are not hardware.
Most outages in Fault Tolerant Systems are SOFTWARE
Fault Avoidance Techniques: Good & Correct design.
After that: Software Fault Tolerance Techniques:
  Modularity (isolation, fault containment)
  Design diversity
  N-Version Programming: N-different implementations
  Defensive Programming: Check parameters and data
  Auditors: Check data structures in background
  Transactions: to clean up state after a failure
Paradox: Need Fail-Fast Software
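
Two of the techniques listed, defensive programming and auditors, are what make a software module fail-fast: it checks its inputs and its data, and stops rather than continuing with bad state. A minimal sketch, assuming a toy account table (all names here are hypothetical, not from the talk):

```python
class FailFast(Exception):
    """Raised when a module detects bad input or corrupt state and stops."""


def debit(balances: dict, account: str, amount: int) -> None:
    # Defensive programming: check parameters and data before acting.
    if account not in balances:
        raise FailFast(f"unknown account {account!r}")
    if not isinstance(amount, int) or amount <= 0:
        raise FailFast(f"invalid amount {amount!r}")
    if balances[account] < amount:
        raise FailFast("insufficient funds")
    balances[account] -= amount


def audit(balances: dict) -> None:
    # Auditor: scan the data structure for invariant violations.
    for account, balance in balances.items():
        if balance < 0:
            raise FailFast(f"negative balance in {account!r}")


balances = {"alice": 10}
debit(balances, "alice", 3)
audit(balances)
print(balances)  # {'alice': 7}
```

In a real system the auditor would run in the background; here it is called directly to keep the sketch short.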

Fail-Fast and High-Availability Execution
Software N-Plexing: Design Diversity
  N-Version Programming
    Write the same program N times (N ≥ 3)
    Compare outputs of all programs and take majority vote
Process Pairs: Instant restart (repair)
  Use defensive programming to make a process fail-fast
  Have restarted process ready in separate environment
  Second process “takes over” if primary faults
  Transaction mechanism can clean up distributed state if takeover in middle of computation.
[Figure labels: SESSION, PRIMARY PROCESS, BACKUP PROCESS, STATE INFORMATION. LOGICAL PROCESS = PROCESS PAIR]
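
The N-version idea (run independently written versions, compare outputs, take the majority) can be sketched as follows. This is an illustration only: the three "versions" are toy functions, one with a planted bug, and a version that raises is treated as fail-fast and simply loses its vote.

```python
from collections import Counter
from typing import Callable, Sequence


def n_version_call(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and return the majority output."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass  # fail-fast version: it stopped instead of answering
    if outputs:
        value, votes = Counter(outputs).most_common(1)[0]
        if votes > len(versions) // 2:  # strict majority of all versions
            return value
    raise RuntimeError("no majority among versions")


def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_buggy(x: int) -> int:
    return x * x + (1 if x == 3 else 0)  # planted bug for x == 3


print(n_version_call([square_a, square_b, square_buggy], 3))  # 9
```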

What Is MTTF of N-Version Program?
  First fails after MTTF/N
  Second fails after MTTF/(N-1), ...
  so MTTF × (1/N + 1/(N-1) + ... + 1/2)
  harmonic series goes to infinity, but VERY slowly
  for example 100-version programming gives ~4 MTTF of 1-version programming
  Reduces variance
N-Version Programming Needs REPAIR
  If a program fails, must reset its state from other programs.
  => programs have common data/state representation.
  How does this work for Database Systems? Operating Systems? Network Systems?
  Answer: I don’t know.
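
The "~4 MTTF" figure for 100 versions can be checked by summing the series on the slide directly. A small sketch, assuming no repair and independent failures as the slide does (the function name is illustrative):

```python
def n_version_mttf(n: int, mttf: float = 1.0) -> float:
    """MTTF * (1/N + 1/(N-1) + ... + 1/2), the slide's formula without repair."""
    return mttf * sum(1.0 / k for k in range(2, n + 1))


print(round(n_version_mttf(100), 2))  # 4.19
```

The sum grows only logarithmically with N, which is why adding versions without repair buys so little.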

Why Process Pairs Mask Faults: Many Software Faults are Soft
After
  Design Review
  Code Inspection
  Alpha Test
  Beta Test
  10k Hrs Of Gamma Test (Production)
Most Software Faults Are Transient
  MVS Functional Recovery Routines 5:1
  Tandem Spooler 100:1
  Adams >100:1
Terminology:
  Heisenbug: Works On Retry
  Bohrbug: Faults Again On Retry
Adams: "Optimizing Preventative Service of Software Products", IBM J R&D, 28.1, 1984
Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.

Heisenbugs: A Probabilistic Approach to Availability
There is considerable evidence that (1) production systems have about one bug per thousand lines of code, (2) these bugs manifest themselves stochastically: failures are due to a confluence of rare events, and (3) system mean-time-to-failure has a lower bound of a decade or so. To make highly available systems, architects must tolerate these failures by providing instant repair (unavailability is approximated by repair_time/time_to_fail, so cutting the repair time in half makes things twice as good). Ultimately, one builds a set of standby servers which have both design diversity and geographic diversity. This minimizes common-mode failures.
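
The approximation repair_time/time_to_fail can be compared with the exact ratio MTTR / (MTTF + MTTR). A quick sketch; the decade-long MTTF and the one-hour and half-hour repair times are assumed values chosen only for illustration:

```python
def unavailability(repair_time: float, time_to_fail: float) -> float:
    """Exact fraction of time down: MTTR / (MTTF + MTTR)."""
    return repair_time / (time_to_fail + repair_time)


def approx_unavailability(repair_time: float, time_to_fail: float) -> float:
    """The approximation used above, good when repair_time << time_to_fail."""
    return repair_time / time_to_fail


HOURS_PER_YEAR = 365 * 24
mttf = 10 * HOURS_PER_YEAR  # assumed: a decade between failures
slow, fast = 1.0, 0.5       # assumed repair times, in hours

ratio = unavailability(slow, mttf) / unavailability(fast, mttf)
print(f"{ratio:.4f}")  # very close to 2: halving repair time halves downtime
```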

Process Pair Repair Strategy
If software fault (bug) is a Bohrbug, then there is no repair:
  “wait for the next release” or
  “get an emergency bug fix” or
  “get a new vendor”
If software fault is a Heisenbug, then repair is
  reboot and retry or
  switch to backup process (instant restart)
PROCESS PAIRS Tolerate
  Hardware Faults
  Heisenbugs
Repair time is seconds, could be milliseconds if time is critical
Flavors Of Process Pair: Lockstep
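
The difference between the two repair strategies can be sketched with a toy process pair. This is an illustration under simplifying assumptions, not Tandem's mechanism: a Heisenbug is modeled as a fault that goes away on retry, a Bohrbug as a fault that recurs on a given input in every copy of the program. All class and parameter names are hypothetical.

```python
class ProcessFault(Exception):
    """A fail-fast process stopped instead of returning a wrong answer."""


class Process:
    def __init__(self, name, transient_faults=0, bohrbug_input=None):
        self.name = name
        # Heisenbug model: fail this many times, then work on retry.
        self.transient_faults = transient_faults
        # Bohrbug model: always fail on this particular input.
        self.bohrbug_input = bohrbug_input

    def handle(self, request):
        if request == self.bohrbug_input:
            raise ProcessFault(f"{self.name}: deterministic fault on {request!r}")
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise ProcessFault(f"{self.name}: transient fault")
        return f"{self.name} handled {request}"


def process_pair(primary, backup, request):
    """Send the request to the primary; if it faults, the backup takes over."""
    try:
        return primary.handle(request)
    except ProcessFault:
        return backup.handle(request)


# Heisenbug: the primary faults once, the backup answers.
print(process_pair(Process("primary", transient_faults=1), Process("backup"), "debit"))
# prints: backup handled debit
```

With a Bohrbug both processes run the same code, so the backup faults on the same input and `process_pair` raises `ProcessFault`: there is no repair short of new software.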

Server Resets At Takeover
But What About
  Application State?
  Database State?
  Network State?
Answer: Use Transactions To Reset State!
  Abort Transaction If Process Fails.
  Keeps Network "Up"
  Keeps System "Up"
  Reprocesses Some Transactions On Failure
[Figure labels: SESSION, PRIMARY PROCESS, BACKUP PROCESS, STATE INFORMATION. LOGICAL PROCESS = PROCESS PAIR]
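
The idea of using transactions to reset state at takeover can be sketched with a toy in-memory transaction: writes are buffered, so a process that fails midway leaves the shared state untouched, and the backup simply reprocesses the request. This is a simplified illustration with hypothetical names; a real system would use a transaction log and a database, not a Python dict.

```python
class Transaction:
    """Buffers writes; the shared state changes only at commit."""

    def __init__(self, state: dict):
        self.state = state
        self.writes = {}

    def read(self, key):
        return self.writes.get(key, self.state.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        self.state.update(self.writes)
        self.writes = {}

    def abort(self):
        self.writes = {}  # discard everything the failed process did


def transfer(state: dict, amount: int, crash_midway: bool = False) -> None:
    txn = Transaction(state)
    try:
        txn.write("a", txn.read("a") - amount)
        if crash_midway:
            raise RuntimeError("process failed in the middle of the computation")
        txn.write("b", txn.read("b") + amount)
        txn.commit()
    except RuntimeError:
        txn.abort()
        raise


def run_with_takeover(state: dict, amount: int) -> None:
    try:
        transfer(state, amount, crash_midway=True)  # primary fails midway
    except RuntimeError:
        transfer(state, amount)  # backup reprocesses the transaction


state = {"a": 100, "b": 0}
run_with_takeover(state, 30)
print(state)  # {'a': 70, 'b': 30}
```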

PROCESS PAIRS - SUMMARY
Transactions Give Reliability
Process Pairs Give Availability
Process Pairs Are Expensive & Hard To Program
Transactions + Persistent Process Pairs
  => Fault Tolerant Sessions & Execution
When Tandem Converted To This Style
  Saved 3x Messages
  Saved 5x Message Bytes
  Made Programming Easier

SYSTEM PAIRS FOR HIGH AVAILABILITY
Programs, Data, Processes Replicated at two sites.
Pair looks like a single system.
System becomes logical concept
Like Process Pairs: System Pairs.
Backup receives transaction log (spooled if backup down).
If primary fails or operator switches, backup offers service.
[Figure labels: Primary, Backup]

SYSTEM PAIR CONFIGURATION OPTIONS
Mutual Backup:
  each has 1/2 of Database & Application
Hub:
  One site acts as backup for many others
In General can be any directed graph
Stale replicas: Lazy replication
[Figure labels: Primary, Backup, Copy]

SYSTEM PAIRS FOR: SOFTWARE MAINTENANCE
Step 1: Both systems are running V1.        [V1 (Primary), V1 (Backup)]
Step 2: Backup is cold-loaded as V2.        [V1 (Primary), V2 (Backup)]
Step 3: SWITCH to Backup.                   [V1 (Backup), V2 (Primary)]
Step 4: Backup is cold-loaded as V2 D30.    [V2 (Backup), V2 (Primary)]
Similar ideas apply to:
  Database Reorganization
  Hardware modification (e.g. add discs, processors, ...)
  Hardware maintenance
  Environmental changes (rewire, new air conditioning)
  Move primary or backup to new location.
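
The four upgrade steps can be walked through in a few lines. This sketch only tracks versions and roles; the `System` class, site names, and `upgrade_pair` function are hypothetical, and the actual cold-load and switch mechanics are not modeled.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    version: str
    role: str  # "primary" or "backup"


def upgrade_pair(a: System, b: System, new_version: str) -> list:
    """Apply the four steps, recording (name, version, role) after each one."""
    history = []

    def snapshot() -> None:
        history.append(tuple((s.name, s.version, s.role) for s in (a, b)))

    primary, backup = (a, b) if a.role == "primary" else (b, a)
    snapshot()                         # Step 1: both systems run the old version
    backup.version = new_version       # Step 2: backup is cold-loaded with the new version
    snapshot()
    primary.role, backup.role = "backup", "primary"  # Step 3: switch to backup
    snapshot()
    primary.version = new_version      # Step 4: old primary (now backup) is cold-loaded
    snapshot()
    return history


steps = upgrade_pair(System("site1", "V1", "primary"), System("site2", "V1", "backup"), "V2")
for step in steps:
    print(step)
```

At every step exactly one system holds the primary role, which is the point of the scheme: service continues while each site is reloaded in turn.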

SYSTEM PAIR BENEFITS
Protects against ENVIRONMENT:
  weather
  utilities
  sabotage
Protects against OPERATOR FAILURE:
  two sites, two sets of operators
Protects against MAINTENANCE OUTAGES
  work on backup
  software/hardware install/upgrade/move...
Protects against HARDWARE FAILURES
  backup takes over
Protects against TRANSIENT SOFTWARE ERRORS
Allows design diversity
  (different sites have different software/hardware)

Key Idea
  Architecture               {  Hardware Faults
  Software      }  Masks     {  Environmental Faults
  Distribution               {  Maintenance
• Software automates / eliminates operators
So,
• In the limit there are only software & design faults.
  Many are Heisenbugs
Software-fault tolerance is the key to dependability.
INVENT IT!

References
Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15’th FTCS. 2-11.
Long, D.D., J. L. Carroll, and C.J. Park (1991). A study of the reliability of Internet sites. Proc 10’th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
Long, D., A. Muir, and R. Golding (1995). “A Longitudinal Study of Internet Host Reliability.” Proceedings of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany: IEEE, September 1995, pp. 2-9.