Gray & Reuter FT 2: 1
Dependable Computing Systems
Jim Gray, Microsoft, Gray@Microsoft.com
Andreas Reuter, International University, Andreas.Reuter@i-u.de

       9:00              11:00         1:30           3:30          7:00
Mon    Overview          Faults        Tolerance      T Models      Party
Tue    TP mons           Lock Theory   Lock Techniq   Queues        Workflow
Wed    Log               ResMgr        CICS & Inet    Adv TM        Cyberbrick
Thur   Files & Buffers   COM+          Corba          Replication   Party
Fri    B-tree            Access Path   Groupware      Benchmark
Hardware Modules: MTTF 100,000 hr, MTTR 10 hr (many are transient)
Software:
  1 Bug/1000 Lines Of Code (after vendor-user testing)
  => Thousands of bugs in System!
  Most software failures are transient: dump & restart system.
Case Studies - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).
  Vendor (hardware and software)   5 Months
  Application software             9 Months
  Communications lines             1.5 Years
  Operations                       2 Years
  Environment                      2 Years
PROCESS PAIRS: Mask Hardware & Software Faults
TRANSACTIONS: give A.C.I.D. (simple fault model)
Gray & Reuter FT 2: 15
Example: the FT Bank
Modularity & Repair are KEY:
  vonNeumann needed 20,000x redundancy in wires and switches
  We use 2x redundancy.
Redundant hardware can support peak loads (so not redundant)
Fault Tolerant Computer Backup System
System MTTF >10 YEAR (except for power & terminals)
Gray & Reuter FT 2: 16
Fail-Fast is Good, Repair is Needed
Lifecycle of a module: Fault, Detect, Repair, Return
fail-fast gives short fault latency
High Availability is low UN-Availability
Unavailability ≈ MTTR / MTTF
Improving either MTTR or MTTF gives benefit
Simple redundancy does not help much.
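The relation between MTTF, MTTR, and unavailability can be checked numerically. The sketch below is only an illustration: the function names are made up for this example, and the "availability class" is assumed to mean the number of leading nines, floor(log10(1 / unavailability)).

```python
import math


def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Exact fraction of time a module is down: MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)


def approx_unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """The slide's approximation MTTR / MTTF, good when MTTR << MTTF."""
    return mttr_hours / mttf_hours


def availability_class(mttf_hours: float, mttr_hours: float) -> int:
    """Assumed definition: number of leading nines of availability."""
    return math.floor(math.log10(1.0 / unavailability(mttf_hours, mttr_hours)))


if __name__ == "__main__":
    # Hypothetical module: fails every 100,000 hours, takes 10 hours to repair.
    print(approx_unavailability(100_000, 10))
    print(availability_class(100_000, 10))
    # Halving MTTR helps as much as doubling MTTF.
    print(approx_unavailability(100_000, 5), approx_unavailability(200_000, 10))
```

Because the ratio is what matters, a faster repair and a longer time between failures are interchangeable ways to buy availability.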
Gray & Reuter FT 2: 17
Hardware Reliability/Availability (how to make it fail fast)
Comparator Strategies:
Duplex:
  Fail-Fast: fail if either fails (e.g. duplexed cpus)
  vs Fail-Soft: fail if both fail (e.g. disc, atm,...)
  Note: in recursive pairs, parent knows which is bad.
Triplex:
  Fail-Fast: fail if 2 fail (triplexed cpus)
  Fail-Soft: fail if 3 fail (triplexed FailFast cpus)
[Figure: Basic FailFast Designs (Pair, Triplex); Recursive Designs; Recursive Availability Designs (Pair & Spare, Triple Modular Redundancy)]
Gray & Reuter FT 2: 18
Redundant Designs Have Worse MTTF!
THIS IS NOT GOOD: Variance is lower but MTTF is worse
Simple redundancy does not improve MTTF (sometimes hurts).
This is just an example of the airplane rule.
[Figure: state diagrams (N work -> ... -> 0 work) for each design, without repair]
Design                     Time spent in working states         System MTTF
Duplex: fail fast          mttf/2                               mttf/2
Duplex: fail soft          mttf/2 + mttf/1                      1.5*mttf
TMR: fail fast             mttf/3 + mttf/2                      5/6*mttf
TMR: fail soft             mttf/3 + mttf/2 + mttf/1             11/6*mttf
Pair & Spare: fail fast    mttf/4 + mttf/2                      3/4*mttf
Pair & Spare: fail soft    mttf/4 + mttf/3 + mttf/2 + mttf/1    ~2.1*mttf
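The MTTF figures for these designs follow from adding up the expected time spent in each working state: with k modules running, the first of them fails after mttf/k on average. The sketch below reproduces that arithmetic. It assumes independent, memoryless module failures and no repair; the `DESIGNS` table and function name are illustrative, not taken from the slides.

```python
from fractions import Fraction

# For each design, the number of live modules in each state the system
# passes through while it is still delivering service (assumed model).
DESIGNS = {
    "Duplex: fail fast": [2],
    "Duplex: fail soft": [2, 1],
    "TMR: fail fast": [3, 2],
    "TMR: fail soft": [3, 2, 1],
    # One pair dies on the first fault, leaving a single fail-fast pair.
    "Pair & Spare: fail fast": [4, 2],
    "Pair & Spare: fail soft": [4, 3, 2, 1],
}


def system_mttf(live_counts) -> Fraction:
    """System MTTF in units of one module's mttf, without repair."""
    return sum((Fraction(1, k) for k in live_counts), Fraction(0))


if __name__ == "__main__":
    for name, states in DESIGNS.items():
        factor = system_mttf(states)
        print(f"{name:26s} {factor} * mttf  (~{float(factor):.2f})")
```

Only the fail-soft designs exceed a single module's mttf, and even four modules buy barely a factor of two.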
Gray & Reuter FT 2: 19
Add Repair: Get 10^4 Improvement
[Figure: state diagrams with repair (mttr) transitions added]
Design               System MTTF
Duplex: fail fast    mttf/2
Duplex: fail soft    10^4 mttf
TMR: fail fast       10^4 mttf
TMR: fail soft       10^5 mttf
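A rough way to see where the large gain from repair comes from is to solve the three-state model (2 work, 1 work, 0 work) for a fail-soft duplex. The sketch below assumes exponentially distributed failure and repair times with rates 1/mttf and 1/mttr; the function name and the sample numbers are illustrative only.

```python
def duplex_failsoft_mttf_with_repair(mttf: float, mttr: float) -> float:
    """Mean time until both modules are down, starting with both up.

    With lam = 1/mttf and mu = 1/mttr:
        T2 = 1/(2*lam) + T1                         (wait for first failure)
        T1 = 1/(lam + mu) + (mu/(lam + mu)) * T2    (repair races 2nd failure)
    Solving gives T2 = (3*lam + mu) / (2*lam**2)
                     = mttf * (3 + mttf/mttr) / 2.
    """
    return mttf * (3.0 + mttf / mttr) / 2.0


if __name__ == "__main__":
    mttf, mttr = 100_000.0, 10.0   # hypothetical module: 100,000 hr / 10 hr
    with_repair = duplex_failsoft_mttf_with_repair(mttf, mttr)
    without_repair = 1.5 * mttf    # fail-soft duplex with no repair
    print(with_repair / mttf, without_repair / mttf)
```

When mttr is much smaller than mttf the result is close to mttf^2 / (2 * mttr), so the improvement scales with the ratio mttf/mttr rather than with the number of spare modules.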
Gray & Reuter FT 2: 20
When To Repair?
Chances Of Tolerating A Fault are 1000:1 (class 3)
A 1995 study: Processor & Disc Rated At ~ 10khr MTTF
  Computed Single Failures | Observed Double Fails | Ratio
On-Line Maintenance "Works" 999 Times Out Of 1000.
The chance a duplexed disc will fail during maintenance ~ 1:1000
Risk Is 30x Higher During Maintenance
=> Do It Off Peak Hour
Software Maintenance:
  Repair Only Virulent Bugs
  Wait For Next Release To Fix Benign Bugs
Gray & Reuter FT 2: 21
OK: So Far
Hardware fail-fast is easy
Redundancy plus Repair is great (Class 7 availability)
Hardware redundancy & repair is via modules.
How can we get instant software repair?
We Know How To Get Reliable Storage
  RAID Or Dumps And Transaction Logs.
We Know How To Get Available Storage
Defensive Programming: Check parameters and data
Auditors: Check data structures in background
Transactions: to clean up state after a failure
Paradox: Need Fail-Fast Software
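As a small illustration of defensive programming and auditing, the sketch below checks parameters before touching any state and stops at the first violated check. The `FailFast` exception, the `debit` and `audit` functions, and the ledger layout are all hypothetical names invented for this example.

```python
class FailFast(Exception):
    """Raised at the first failed check so the error cannot spread."""


def debit(balances: dict, account: str, amount: int) -> None:
    # Defensive programming: validate parameters and data before any update.
    if account not in balances:
        raise FailFast(f"unknown account {account!r}")
    if not isinstance(amount, int) or amount <= 0:
        raise FailFast(f"bad amount {amount!r}")
    if balances[account] < amount:
        raise FailFast("insufficient funds")
    balances[account] -= amount


def audit(balances: dict, expected_total: int) -> None:
    # Auditor: an independent check of a data-structure invariant,
    # of the kind that could run periodically in the background.
    if sum(balances.values()) != expected_total:
        raise FailFast("ledger total does not match expected total")


if __name__ == "__main__":
    ledger = {"alice": 10}
    debit(ledger, "alice", 3)
    audit(ledger, expected_total=7)
    try:
        debit(ledger, "alice", 100)
    except FailFast as err:
        print("stopped:", err)
```

Since every check runs before the update, a rejected call leaves the ledger exactly as it was, which is what makes a clean restart or takeover possible.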
Gray & Reuter FT 2: 24
Fail-Fast and High-Availability Execution
Software N-Plexing: Design Diversity
  N-Version Programming
  Write the same program N-Times (N > 3)
  Compare outputs of all programs and take majority vote
Process Pairs: Instant restart (repair)
  Use Defensive programming to make a process fail-fast
  Have restarted process ready in separate environment
  Second process "takes over" if primary faults
  Transaction mechanism can clean up distributed state if takeover in middle of computation.
[Figure: SESSION, PRIMARY PROCESS, BACKUP PROCESS, STATE INFORMATION; LOGICAL PROCESS = PROCESS PAIR]
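The majority-vote step of N-version programming can be sketched in a few lines. This is a toy under stated assumptions: the three "versions" are trivial functions written for the example, and a version that raises an exception is simply treated as having cast no vote.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on x and return the answer a strict majority gives."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a crashed version casts no vote
    counts = Counter(results)
    if counts:
        value, votes = counts.most_common(1)[0]
        if votes * 2 > len(versions):
            return value
    raise RuntimeError("no majority among versions")


# Three independently written (hypothetical) implementations of squaring.
def square_a(x: int) -> int:
    return x * x


def square_b(x: int) -> int:
    return x ** 2


def square_buggy(x: int) -> int:
    return x * x + (1 if x == 3 else 0)  # deliberately wrong for x == 3


if __name__ == "__main__":
    print(majority_vote([square_a, square_b, square_buggy], 3))
```

The vote masks the faulty version only while a majority of versions still agree, which is why the approach also needs a way to repair a failed version.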
Gray & Reuter FT 2: 25
What Is MTTF of N-Version Program?
  First fails after MTTF/N
  Second fails after MTTF/(N-1),...
  so MTTF * (1/N + 1/(N-1) + ... + 1/2)
  harmonic series goes to infinity, but VERY slowly
  for example 100-version programming gives ~4 MTTF of 1-version programming
  Reduces variance
N-Version Programming Needs REPAIR
  If a program fails, must reset its state from other programs.
  => programs have common data/state representation.
  How does this work for Database Systems?
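The harmonic-sum estimate for an N-version program without repair is easy to evaluate directly. The helper below is an illustrative name, and it simply sums the terms 1/N + 1/(N-1) + ... + 1/2 as written on the slide.

```python
def n_version_mttf_factor(n: int) -> float:
    """MTTF of an n-version program, in units of one version's MTTF,
    using the sum 1/n + 1/(n-1) + ... + 1/2 (no repair)."""
    return sum(1.0 / k for k in range(2, n + 1))


if __name__ == "__main__":
    for n in (3, 10, 100, 1000):
        print(n, round(n_version_mttf_factor(n), 3))
```

Each tenfold increase in the number of versions adds only about 2.3 to the factor, which is the sense in which the series grows "VERY slowly".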
Why Process Pairs Mask Faults
Many Software Faults are Soft
After
  Design Review
  Code Inspection
  Alpha Test
  Beta Test
  10k Hrs Of Gamma Test (Production)
Most Software Faults Are Transient
  MVS Functional Recovery Routines   5:1
  Tandem Spooler                     100:1
  Adams                              >100:1
Terminology:
  Heisenbug: Works On Retry
  Bohrbug: Faults Again On Retry
Adams: "Optimizing Preventative Service of Software Products", IBM J R&D, 28.1, 1984
Gray: "Why Do Computers Stop", Tandem TR85.7, 1985
Mourad: "The Reliability of the IBM/XA Operating System", 15 ISFTCS, 1985.
Gray & Reuter FT 2: 27
Process Pair Repair Strategy
If software fault (bug) is a Bohrbug, then there is no repair
  "wait for the next release" or
  "get an emergency bug fix" or
  "get a new vendor"
If software fault is a Heisenbug, then repair is
  reboot and retry or
  switch to backup process (instant restart)
PROCESS PAIRS Tolerate
  Hardware Faults
  Heisenbugs
Repair time is seconds, could be milliseconds if time is critical
Flavors Of Process Pair:
  Lockstep
  Automatic
  State Checkpointing
  Delta Checkpointing
  Persistent
Gray & Reuter FT 2: 28
How Takeover Masks Failures
Server Resets At Takeover But What About
  Application State?
  Database State?
  Network State?
Answer: Use Transactions To Reset State!
  Abort Transaction If Process Fails.
  Keeps Network "Up"
  Keeps System "Up"
  Reprocesses Some Transactions On Failure
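The interplay of takeover and transaction abort can be sketched with a toy process pair. Everything here is an assumption made for illustration: the class names, the in-memory dictionaries standing in for state, and the `crash_midway` flag that simulates a primary fault. Real process pairs checkpoint over messages between separate processors.

```python
class Process:
    def __init__(self, name: str):
        self.name = name
        self.committed = {}  # state as of the last committed transaction


class ProcessPair:
    """One logical process made of a primary and a backup."""

    def __init__(self):
        self.primary = Process("primary")
        self.backup = Process("backup")

    def run_transaction(self, updates: dict, crash_midway: bool = False) -> bool:
        working = dict(self.primary.committed)  # uncommitted working copy
        for key, value in updates.items():
            working[key] = value
            if crash_midway:
                # Primary faults mid-transaction: its working copy is lost,
                # which aborts the transaction, and the backup takes over.
                self.takeover()
                return False
        # Commit: install the new state and checkpoint it to the backup.
        self.primary.committed = working
        self.backup.committed = dict(working)
        return True

    def takeover(self) -> None:
        self.primary = self.backup
        self.backup = Process("new backup")
        self.backup.committed = dict(self.primary.committed)


if __name__ == "__main__":
    pair = ProcessPair()
    pair.run_transaction({"x": 1})
    ok = pair.run_transaction({"x": 2, "y": 3}, crash_midway=True)
    print(ok, pair.primary.name, pair.primary.committed)
    print(pair.run_transaction({"x": 2, "y": 3}))  # reprocessed after takeover
```

The caller sees an aborted transaction and retries it against the same logical process, so the session stays up even though the physical primary changed.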
Gray & Reuter FT 2: 29
PROCESS PAIRS - SUMMARY
Transactions Give Reliability
Process Pairs Give Availability
Process Pairs Are Expensive & Hard To Program
Transactions + Persistent Process Pairs
When Tandem Converted To This Style
  Saved 3x Messages
  Saved 5x Message Bytes
  Made Programming Easier
Gray & Reuter FT 2: 30
SYSTEM PAIRS FOR HIGH AVAILABILITY
Programs, Data, Processes Replicated at two sites.
Pair looks like a single system.
System becomes logical concept
Like Process Pairs: System Pairs.
Backup receives transaction log (spooled if backup down).
If primary fails or operator Switches, backup offers service.
[Figure: Primary and Backup sites]
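Shipping the transaction log to the backup, and spooling it while the backup is down, might look like the following sketch. The classes, the `(key, value)` log-record format, and the `up` flag are assumptions chosen to keep the example small; they are not any vendor's actual interface.

```python
from collections import deque


class BackupSite:
    def __init__(self):
        self.up = True
        self.database = {}

    def apply(self, record) -> None:
        key, value = record
        self.database[key] = value


class PrimarySite:
    def __init__(self, backup: BackupSite):
        self.backup = backup
        self.database = {}
        self.spool = deque()  # log records not yet delivered to the backup

    def commit(self, key, value) -> None:
        self.database[key] = value
        self.spool.append((key, value))  # every commit produces a log record
        self.drain()

    def drain(self) -> None:
        # Deliver spooled records in order for as long as the backup is up.
        while self.backup.up and self.spool:
            self.backup.apply(self.spool.popleft())


if __name__ == "__main__":
    backup = BackupSite()
    primary = PrimarySite(backup)
    primary.commit("a", 1)
    backup.up = False          # backup goes down: log is spooled
    primary.commit("b", 2)
    backup.up = True           # backup returns: spool is drained
    primary.drain()
    print(backup.database == primary.database)
```

Keeping the spool in commit order means the backup replays exactly the primary's history, so it can offer service from a consistent state after a switch.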
Gray & Reuter FT 2: 31
SYSTEM PAIR CONFIGURATION OPTIONS
Mutual Backup:
  each has 1/2 of Database & Application
Hub:
  One site acts as backup for many others
In General can be any directed graph
Stale replicas: Lazy replication
[Figure: Primary, Backup, and Copy sites in mutual backup, hub, and lazy replication configurations]
Gray & Reuter FT 2: 32
SYSTEM PAIRS FOR: SOFTWARE MAINTENANCE
Step 1: Both systems are running V1. (Primary: V1, Backup: V1)
Step 2: Backup is cold-loaded as V2. (Primary: V1, Backup: V2)
Step 3: SWITCH to Backup. (Backup: V1, Primary: V2)
Step 4: Backup is cold-loaded as V2 D30. (Backup: V2, Primary: V2)
Similar ideas apply to:
  Database Reorganization
  Hardware modification (e.g. add discs, processors,...)
  Hardware maintenance
  Environmental changes (rewire, new air conditioning)
  Move primary or backup to new location.
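The four upgrade steps can be traced with a tiny simulation that records which site is primary and which version each site runs after every step. The `Site` class, the site names, and the `rolling_upgrade` function are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    version: str


def rolling_upgrade(primary: Site, backup: Site, new_version: str):
    """Upgrade both sites while one of them is always serving as primary."""
    def snapshot():
        return (primary.name, primary.version, backup.name, backup.version)

    history = [snapshot()]             # Step 1: both run the old version
    backup.version = new_version       # Step 2: cold-load backup as new version
    history.append(snapshot())
    primary, backup = backup, primary  # Step 3: switch roles
    history.append(snapshot())
    backup.version = new_version       # Step 4: cold-load the old primary
    history.append(snapshot())
    return primary, backup, history


if __name__ == "__main__":
    new_primary, new_backup, history = rolling_upgrade(
        Site("A", "V1"), Site("B", "V1"), "V2"
    )
    for step, state in enumerate(history, start=1):
        print(f"Step {step}: primary={state[0]}/{state[1]} backup={state[2]}/{state[3]}")
```

Only the backup is ever taken down for loading, which is why the same pattern carries over to hardware changes and site moves.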
Gray & Reuter FT 2: 33
SYSTEM PAIR BENEFITS
Protects against ENVIRONMENT: different sites
  weather
  utilities
  sabotage
Protects against OPERATOR FAILURE: two sites, two sets of operators
Protects against MAINTENANCE OUTAGES
  work on backup
  software/hardware install/upgrade/move...
Protects against HARDWARE FAILURES
  backup takes over
Protects against TRANSIENT SOFTWARE ERRORS
Commercial systems:
  Digital's Remote Transaction Router (RTR)
  Tandem's Remote Database Facility (RDF)
  IBM's Cross Recovery XRF (both in same
FT systems fail for the conventional reasons
  Environment   mostly
  People        sometimes
  Software      mostly
  Hardware      Rarely
MTTF of FT SYSTEMS ~ 50X conventional
In the limit there are only software & design faults.
Software-fault tolerance is the key to dependability.
INVENT IT!
Gray & Reuter FT 2: 36
References
Adams, E. (1984). “Optimizing Preventative Service of Software Products.” IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). “A Census of Tandem System Availability between 1985 and 1990.” IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N., Reuter, A. (1993). Transaction Processing Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15’th FTCS. 2-11.
Long, D.D., J. L. Carroll, and C.J. Park (1991). A study of the reliability of Internet sites. Proc 10’th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.