DS - IX - NFT - 1
HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS
Lecture 1
INTRODUCTION
Winter Semester 2000/2001
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/~rok/ftc
DS - IX - NFT - 2
FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:
1. Introduction (Unit I)
– Motivation
– System views
– Dependability rings
– Dependable design methodology
2. Dependability Concepts, Measures and Models (UNIT DCMM)
– Basic definitions
– Dependability measures
– Dependability models
– Examples
– Dependability evaluation tools
3. Testing Techniques (UNIT TT)
– Testing techniques principles
– Processor testing
– Memory testing
– Network testing
DS - IX - NFT - 3
FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:
4. Fault Diagnosis Techniques (UNIT FST)
– Fault detection techniques
– Fault location (isolation) methods
5. Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
– Dynamic techniques
– Static techniques
– Hybrid techniques
6. Fault-tolerant and Fault-secure Memories (UNIT FRTT)
– Fault-tolerant techniques in manufacturing
– Replication
– Coding
– Reconfiguration
DS - IX - NFT - 4
FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:
7. Network Fault Tolerance (UNIT NFT)
– Computer networks
– Basic techniques
– Example – multistage networks
8. Case Studies (UNIT CS)
– ESS and 3B20
– FTMP – Fault-tolerant Multiprocessor
– SIFT – Software-implemented Fault Tolerance
– Communication controller
– Fault-tolerant Building Block Architecture
DS - IX - NFT - 5
COURSE ACTIVITIES
• PROJECT
• PRESENTATION
• INVITED SPEAKERS
• CONFERENCES AND WORKSHOPS
• Some Websites:
– www.dependability.org
– www.paradise.caltech.edu
– www.milan.eas.asu.edu
– www.crhc.uiuc.edu
DS - IX - NFT - 6
Major References on Fault-tolerant Computing (Books/General) 1
• Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.
• Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.
• Breuer, M. A. and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.
• Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.
• Anderson, T. and P.A. Lee, Fault Tolerance Principles and Practice, Prentice-Hall, 1982.
• Siewiorek, D.P. and R. S. Swarz, The Theory and Practice of Reliable Systems Design, Digital Press, 1982 & 1995.
• Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.
• Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.
DS - IX - NFT - 7
Major References on Fault-tolerant Computing (Books/General) 2
• Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.
• Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
• Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.
• Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.
• Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: System Implementation, Kluwer Academic Publishers, 1994.
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.
DS - IX - NFT - 8
Major References on Fault-tolerant Computing (Books/General) 3
• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.
• Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.
• Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.
• Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.
• Pradhan, D. K., Fault-Tolerant Computer System Design, 1996.
• Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.
• Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.
• Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser, Muenchen, 1999.
DS - IX - NFT - 9
Major References on Fault-tolerant Computing (Books/Reliability Evaluation)
• Myers, G. J., Software Reliability Principles and Practice, Wiley-Interscience, 1976.
• Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.
• Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.
• Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.
• W. Schneeweiss, Petri Nets for Reliability Modeling, LiLoLe, 1999
DS - IX - NFT - 10
Major References on Fault-tolerant Computing (Books/Coding)
• Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.
• Peterson, W. W. and E. J. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.
• Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.
• Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.
• Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.
• Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.
DS - IX - NFT - 11
Major References on Fault-tolerant Computing (Books/Software)
• Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.
• Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.
• Shooman, M. L., Software Engineering, McGraw-Hill, 1983.
• Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.
• Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.
• Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.
• Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.
• Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.
DS - IX - NFT - 12
Major References on Fault-tolerant Computing (Journals)
• Special Issue of Proc. of the IEEE, October 1978
• Special Issue of Computer, October 1979
• Special Issue of Computer, March 1980
• Special Issue of Computer, August 1984
• Special Issue of IEEE Software, May 1995
• IEEE Trans. on Reliability
• IEEE Trans. on Software Engineering
• Computer
• Design and Test
• Electronics
• Proc. of the IEEE
• Computer Design
• Journal of Electronic Testing: Theory and Applications
• Journal of Parallel and Distributed Computing
• IEEE Trans. on Parallel and Distributed Systems
• Real-Time Systems Journal
DS - IX - NFT - 13
Major References on Fault-tolerant Computing (Conference Proceedings)
• Fault-Tolerant Computing Symposium
• Reliability and Maintainability Symposium
• Reliability in Distributed Software and Database Systems Symposium
• Test Conference
• Distributed Computing Systems Conference
• Parallel Processing Conference
• Real-Time Systems Symposium
• Computer Architecture Symposium
DS - IX - NFT - 14
INTRODUCTION
• OBJECTIVES:
– MOTIVATION FOR FAULT-TOLERANT SYSTEMS
– TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
– TO PRESENT BASIC CONCEPTS AND APPROACHES
– TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
• CONTENTS:
– MOTIVATION
– SYSTEM VIEWS
– SYSTEM DEPENDABILITY CONCEPTS
– APPROACHES TO DEPENDABLE DESIGN
– DEPENDABILITY RINGS
– DEPENDABLE DESIGN METHODOLOGY
DS - IX - NFT - 15
TYPES OF SYSTEMS
• Dependable (Reliable) System
– A system that delivers a required service during its lifetime
• Fault-Tolerant Computer System
– A system that has the capability to continue the correct execution of its programs and input/output functions in the presence of faults
• Real-Time Computer System
– A system that delivers service to a user within a specified deadline (physical time, duration, etc.)
• Responsive Computer System
– A fault-tolerant real-time system that delivers satisfactory service in a timely manner
DS - IX - NFT - 16
MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING
• ECONOMIC NECESSITY
• LIFE SAVING
• NOVICE USERS
• HARSH ENVIRONMENTS
• MORE COMPLEX SYSTEMS
DS - IX - NFT - 17
DEVICE RELIABILITY AND SYSTEM RELIABILITY
[Figure: device reliability and system reliability, 1950 to 1990. Vertical axis: Mean Time between Failures (MTBF) in years, logarithmic scale from 1 to 10^6. Curves: Equivalent Device Reliability, System Reliability, Minimum Acceptable Reliability. Technology progression along the time axis: Relays, Vacuum Tubes, Semiconductors, SSI, MSI, LSI, VLSI.]
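
The gap between device reliability and system reliability can be illustrated with simple arithmetic. The following is a minimal sketch, not a result from the lecture: it assumes a series system (any device failure fails the system), independent identical devices, and constant failure rates. The function name and the numbers are illustrative assumptions.

```python
def series_system_mtbf(device_mtbf_years: float, n_devices: int) -> float:
    """MTBF of a series system built from identical, independent devices.

    ASSUMPTION: constant failure rates (exponential lifetimes), so the
    failure rates add up: lambda_system = n_devices * lambda_device.
    """
    if device_mtbf_years <= 0 or n_devices <= 0:
        raise ValueError("MTBF and device count must be positive")
    return device_mtbf_years / n_devices


# Illustrative numbers only: 10,000 devices, each with an MTBF of 10^5 years.
print(series_system_mtbf(1e5, 10_000))
```

Under these assumptions the system MTBF shrinks in proportion to the number of devices, which is one way to read why very reliable devices can still yield a far less reliable system.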
DS - IX - NFT - 18
DEPENDABILITY – PERFORMANCE TRADE-OFF
[Figure: Availability (0.9 to 0.99999) versus Throughput (1 to 100000 MIPS). Regions shown: Ultra Reliable Systems, Commercial Fault-Tolerant Systems, Massively Parallel/Distributed Systems.]
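
To make the availability scale concrete, an availability value can be converted into expected downtime per year. This is a minimal sketch under the assumption of a 365-day year and a steady-state availability; the helper name `downtime_hours_per_year` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - availability) * HOURS_PER_YEAR


# The five availability levels marked on the axis of the trade-off chart.
for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
    print(f"{a}: {downtime_hours_per_year(a):.3f} hours/year")
```

Each additional "nine" cuts the expected downtime by a factor of ten.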
DS - IX - NFT - 19
EXAMPLES
• DEFENSE SYSTEMS
• FLIGHT SYSTEMS
• AIR TRAFFIC CONTROL
• COMMUNICATION SYSTEMS
• BANKING SYSTEMS
• AIRLINE SEAT RESERVATIONS
• TELEPHONE SYSTEMS
• HOUSEHOLD APPLIANCES
• VIDEO GAMES
DS - IX - NFT - 20
VIEW 1: SYSTEM LIFE CYCLE
[Figure: system life cycle. Inputs shown: SYSTEM CONSTRAINTS, OBSOLESCENCE, NEEDS, NEW TECHNOLOGY. Phases, in order:]
1. CONCEPT FORMULATION
2. SYSTEM SPECIFICATION
3. DESIGN
4. PROTOTYPE
5. PRODUCTION
6. INSTALLATION
7. OPERATIONAL LIFE
8. MODIFICATION AND RETIREMENT
• Note that testing, verification, or validation should occur after every phase of the life cycle
• Very few tools exist, and those cover only some steps of the cycle
DS - IX - NFT - 21
VIEW 2: PACKAGING LEVELS OF INTEGRATION
• APPLICATIONS
• APPLICATION MODULES
• SPECIAL-PURPOSE LANGUAGES
• STANDARD LANGUAGES
• OPERATING SYSTEMS
• CABINETS/FRAMES
• BOXES/CAGES
• PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
• INTEGRATED CIRCUITS (CHIPS)

• Dependability must be considered at every level
• System decomposition (partitioning) may have a significant impact on dependability
DS - IX - NFT - 22
VIEW 3: WORKLOAD VIEW
[Figure: workload view. Activity categories: PREPARATION, USEFUL WORK, SEMI-USEFUL WORK, FAULT SERVICING, IDLING. Shown for LIVEWARE and for HARDWARE/SOFTWARE.]
• ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY
DS - IX - NFT - 23
VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS
LEVEL (SUBLEVEL): COMPONENTS
• PMS: Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os
• Program (HLL; ISP, Instruction Set Processor): Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution
• Logic (Register Transfer Level, RTL): Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore)
• Circuit: Transistors, Resistors, Capacitors, Inductors, Power Sources, Diodes
• Quantum & Electromagnetic: Disks, Tapes
• DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL
DS - IX - NFT - 24
VIEW 5: COMPUTER SYSTEM
SOFTWARE
– PACKAGES
– ASSEMBLERS
– COMPILERS
– OPERATING SYSTEMS
– UTILITY PROGRAMS
– DEBUGGING PROGRAMS
– FILE PROCESSING PROGRAMS
FIRMWARE
– MICROPROGRAM & MICROPROGRAMMING SYSTEMS
HARDWARE
– CPUs
– I/O DEVICES
– MEMORIES
– INTERCONNECTION NETWORKS
LIVEWARE
– MAINTENANCE PERSONNEL
– OPERATORS
– SYSTEM DESIGNERS
– SYSTEM ANALYSTS
– PROGRAMMERS
– USERS
FAULTS ARE ATTRIBUTED TO: HARDWARE: 20%-65%; SOFTWARE: 20%-80%; PEOPLE: 15%-40%; AT&T’s: 20-40-40%; (2/3 applications + 1/3 OS)
DS - IX - NFT - 25
(WARNING!!!)
VIEW 6: IF YOU DO NOT FOLLOW DEPENDABLE DESIGN METHODOLOGY
YOU MAY END UP WITH THE FOLLOWING:
SIX PHASES OF A PROJECT
1. ENTHUSIASM
2. DISILLUSIONMENT
3. PANIC AND HYSTERIA
4. SEARCH FOR THE GUILTY
5. PUNISHMENT OF THE INNOCENT
6. PRAISE AND AWARDS FOR THE NON-PARTICIPANTS
(Author unknown – found in one of the computer companies)
DS - IX - NFT - 26
SYSTEM DEPENDABILITY CONCEPTS
• RELIABILITY
– The conditional probability that the system will perform its intended function without failure up to time t, provided it was fully operational at time t = 0
• AVAILABILITY
– Instantaneous availability is the probability that a system is performing correctly at time t; for non-repairable systems it equals the reliability: $A(t) = R(t)$
– Steady-state availability is the probability that a system will be operational at any random point of time, expressed as the fraction of time the system is operational during its expected lifetime: $A_s = \frac{\text{UPTIME}}{\text{LIFETIME}}$
• SURVIVABILITY
– The probability that a system will deliver the required service in the presence of a set of faults defined a priori, or any subset of it
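
The two measures above can be sketched numerically. The availability function follows the uptime/lifetime definition directly. The reliability function is only an illustration under an added assumption, a constant failure rate (exponential failure law), which the slide does not state. All function and parameter names are hypothetical.

```python
import math


def steady_state_availability(uptime: float, lifetime: float) -> float:
    """Fraction of the lifetime during which the system was operational."""
    if lifetime <= 0 or not 0 <= uptime <= lifetime:
        raise ValueError("need lifetime > 0 and 0 <= uptime <= lifetime")
    return uptime / lifetime


def reliability_exponential(t: float, failure_rate: float) -> float:
    """R(t) = exp(-failure_rate * t).

    ASSUMPTION: constant failure rate; other failure laws give other R(t).
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


# Illustrative values: 8750 operational hours out of an 8760-hour lifetime,
# and a failure rate of 1e-4 per hour evaluated at t = 1000 hours.
print(steady_state_availability(8750, 8760))
print(reliability_exponential(1000, 1e-4))
```

For a non-repairable system, uptime ends at the first failure, which is why instantaneous availability and reliability coincide in that case.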
DS - IX - NFT - 27
APPROACHES
• FAULT INTOLERANCE
• FAULT TOLERANCE
• MAINTAINABILITY
• HARDWARE/SOFTWARE TRADE-OFFS
DS - IX - NFT - 28
HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION
[Figure: continuum running from HARDWARE to SOFTWARE. Functions provided as instructions, with example machines:]
INSTRUCTIONS: EXAMPLES
• INTEGER ARITHMETIC ADD/SUB: M6800
• MPY/DIV: MC68000
• FLOATING-POINT ARITHMETIC: VAX-11/780, IBM-30XX
• VECTOR PROCESSING: CRAY-XMP, C-205
• MULTIPROCESSING (e.g., submachine set-up): SYSTOLIC ARRAYS, RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS
VERTICAL MIGRATION is the transfer of a function's implementation from software to firmware and/or hardware, or vice versa.
Vertical migration can improve performance and dependability, and can reduce cost.
DS - IX - NFT - 29
DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE
[Figure: Dependability Rings. Each ring is closed by an Acceptance Test. Rings shown: Logic Level; Register-Transfer Level; System Hardware; Operating System, Languages and Application.]
Each Dependability Ring should provide measures and mechanisms for Fault Tolerance (Detection, Location, Testability and Recovery)
DS - IX - NFT - 30
A BOOTSTRAP – TEST RINGS IN A MULTICOMPUTER SYSTEM
[Figure: Test Rings in a multicomputer system. The Diagnostic and Maintenance Processor(s) form the Hardcore; the Test Rings cover Processor, Memories, and Network.]
DS - IX - NFT - 31
DEPENDABLE DESIGN METHODOLOGY
• Identify fault classes, fault latency and fault impact
• Determine qualitative and quantitative specs for fault tolerance and evaluate your design in a specific environment
• Identify “weak spots” and assess potential damage
• Decompose the system
• Develop fault and error detection techniques and algorithms
• Develop fault isolation techniques and algorithms
• Develop recovery/reintegration/restart
• Evaluate degree of fault tolerance
• Refine, iterate for improvement; try to eliminate “weak spots” and minimize potential damage
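
The detection, isolation, and recovery steps can be illustrated with one well-known static technique, triple modular redundancy (TMR) with majority voting. This choice is an assumption made for illustration; the methodology does not prescribe it. The names `vote` and `run_replicas` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple


def vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Majority vote over replica outputs.

    Returns (majority value, indices of dissenting replicas). If there is
    no strict majority, returns (None, all indices): the fault is detected
    but can be neither masked nor isolated.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    # Isolation: replicas that disagree with the majority are the suspects.
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


def run_replicas(replicas: Sequence[Callable[[int], int]],
                 x: int) -> Tuple[Optional[int], List[int]]:
    """Run every replica on the same input and vote on the results."""
    return vote([replica(x) for replica in replicas])


# Illustrative replicas: the third one is deliberately faulty.
replicas = [lambda x: x * x, lambda x: x * x, lambda x: x * x + 1]
result, suspects = run_replicas(replicas, 3)
print(result, suspects)
```

Here a disagreement among outputs is the detection step, the list of dissenting replicas is the isolation step, and delivering the majority value masks the fault, which serves as recovery for a single faulty replica.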
DS - IX - NFT - 32
REAL-TIME SYSTEMS DESIGN
• Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
• Characterize timing of a system (hardware and software).
• Map timing specification onto a system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
• Verify and validate the design for quantitative and qualitative specifications.
• Refine, iterate and fine-tune the design.
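
One possible way to check a mapping of timing specifications onto a system is a schedulability test. The sketch below uses the Liu and Layland utilization bound for rate-monotonic scheduling, a technique chosen here for illustration and not named in the lecture. It assumes independent periodic tasks on a single processor, preemptive fixed priorities, and deadlines equal to periods. The test is sufficient but not necessary. All names are hypothetical.

```python
from typing import Sequence, Tuple

Task = Tuple[float, float]  # (worst-case execution time, period), same unit


def utilization(tasks: Sequence[Task]) -> float:
    """Total processor utilization of a periodic task set."""
    return sum(execution / period for execution, period in tasks)


def rm_bound(n: int) -> float:
    """Liu-Layland bound n * (2^(1/n) - 1) for n tasks."""
    if n <= 0:
        raise ValueError("need at least one task")
    return n * (2 ** (1.0 / n) - 1)


def passes_rm_test(tasks: Sequence[Task]) -> bool:
    """True if the task set passes the sufficient utilization test."""
    return utilization(tasks) <= rm_bound(len(tasks))


# Illustrative task sets (times in milliseconds).
light = [(1, 4), (1, 5), (2, 10)]
heavy = [(2, 4), (2, 5), (2, 10)]
print(passes_rm_test(light), passes_rm_test(heavy))
```

A task set that fails this test may still be schedulable; an exact response-time analysis would be needed to decide.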
DS - IX - NFT - 33
RESPONSIVE SYSTEM DESIGN
• Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
• Determine system timing (hardware and software), and assess damage, availability and responsiveness.
• Develop and time fault and error detection techniques and algorithms.
• Develop and time fault isolation techniques and algorithms.
• Develop and time recovery/reintegration/restart.
• Map timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.
• Evaluate responsiveness.
• Refine and iterate for improvement.
RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME
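
Timing the detection, isolation, and recovery steps makes it possible to check whether a task still meets its deadline when faults occur. The sketch below is a deliberately pessimistic model under stated assumptions: faults are handled one after another, and every fault forces a full re-execution of the task. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedTask:
    execution: float  # fault-free execution time
    detection: float  # time to detect a fault or error
    isolation: float  # time to isolate the fault
    recovery: float   # time for recovery/reintegration/restart
    deadline: float   # all values in the same time unit


def worst_case_response(task: TimedTask, faults: int = 1) -> float:
    """Response time if `faults` faults occur.

    ASSUMPTION: each fault costs detection + isolation + recovery plus one
    complete re-execution of the task.
    """
    if faults < 0:
        raise ValueError("faults must be non-negative")
    per_fault = task.detection + task.isolation + task.recovery + task.execution
    return task.execution + faults * per_fault


def meets_deadline(task: TimedTask, faults: int = 1) -> bool:
    """True if the task finishes by its deadline despite `faults` faults."""
    return worst_case_response(task, faults) <= task.deadline


# Illustrative numbers in milliseconds.
task = TimedTask(execution=10, detection=1, isolation=2, recovery=3, deadline=40)
print(meets_deadline(task, faults=1), meets_deadline(task, faults=2))
```

In this model the task tolerates one fault within its deadline but not two, which shows how fault tolerance and timeliness have to be budgeted together.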
DS - IX - NFT - 34
REFERENCES (TEXTBOOK)
• C. G. Bell, J. C. Mudge and J. E. McNamara “Seven Views of Computer Systems”, Chapter 1 in the book by the same authors titled “Computer Engineering”, Digital Press, 1978.
• G.J. Lipovski and M. Malek, “Parallel Computing: Theory and Comparisons”, Wiley-Interscience, New York, 1987.
• M. Malek, “Parallel Computer Systems Testing and Integration”, in the book titled “Testing and Diagnosis of VLSI and LSI”, M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.
• Pankaj Jalote, Fault Tolerance in Distributed Systems, 1994.
• Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, 1996.