ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Motivation and Introduction Lecture Set 1.
Post on 02-Apr-2015
ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
Motivation and Introduction
Lecture Set 1
ECE 753 Fault Tolerant Computing 2
Overview
• Motivation
• About the Course and the Instructor
– Conduct, Outline, Coursepack
• Introduction
• Terminology and definitions
– Sources, Overview and Comments
– System defined
• Dependability/Security and their attributes
• Threats to dependability and modeling the FEF chain
• Means to attain dependability
• Fundamental Principles
Motivation
• Informal Definition
• Key Attributes
• Who, What and Why Study
• Examples
Motivation
• What is Fault-Tolerance?
A “fault-tolerant system” is one that continues to perform at the desired level of service despite failures in some of the components that constitute the system.
Motivation (contd.)
• Key attributes
Fault - Error - Failure
Performance - Availability - Reliability
More recently, the concept of “survivability”
Including these constraints at the design stage is likely to be more cost-effective.
Motivation (contd.)
• Who is concerned about fault-tolerance?
– System Users – irrespective of the application, but some are a lot more concerned than others
• Who is concerned at design stages?
– Universities
• R, d, and a (Research, development, applications)
– Industry
• r, D, and A (research, Development, Applications)
• Issues
– Design, Analysis/Validation, Implementation, Testing/Validation, Evaluation
Motivation (contd.)
Examples
• General Purpose Systems
– PCs: RAMs with parity checks and possibly ECC (re-execution on failure detection is being investigated)
– Workstations/Servers: error detection (HW), occasional corrective action (SW), even ECC (HW), keeping a log (SW)
Motivation (contd.)
Examples
• Reliable Systems
– Telephone systems
– Banking systems, e.g. ATM
– Stock market
– CAE - exams/projects
– Football games - display/ticketing
Motivation (contd.)
Examples
• Critical and Life Critical Systems
– Manned and unmanned space borne systems
– Aircraft control systems
– Nuclear reactor control systems
– Life support systems
Motivation (contd.)
Examples
• Reliable -> Critical Systems
– 911 telephone switching system
– Traffic light control system
– Automotive control systems (ABS, fuel injection system)
About the Course and the Instructor
• Conduct
– homeworks, exam, project, grading
• Outline
• Coursepack
– references and reading list
Introduction
– Historical perspective and major push
– New initiatives
– Goals of fault-tolerance
– Applications of fault-tolerance
Introduction (contd.)
• Historical Perspective
– not a new concept
– first use by J. von Neumann, 1956
• "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components," Annals of Mathematics Studies, Princeton University Press
• Major push
– Space program
– HW fault tolerance - then
– SW fault tolerance later
– Merge the two
Introduction (contd.)
• New initiatives
– Density of devices: more failures likely
– Power issue: scheduler, on-chip sensors
– Failures due to soft errors, lifetime degradation
- hardening, re-execution
- on-chip ECC
- reconfiguration
- microarchitectural solutions
- architectural solutions
Introduction (contd.)
• New initiatives (contd.)
– Deep submicron technology and time-to-market pressure: designs not fully verified
– Implementation of numerous functionalities on chip/board/system: possibility of system hang-up
– Speculative execution: results may need to be re-checked
– Low cost of HW and SW: affordable/economical
• Hot issues: Soft errors, Life-time failures, Power and Thermal Management
Introduction (contd.)
• Goals - different goals for different applications
The key word is “reliability” – it has different meanings for different users and applications
• Intuitive explanations
– Dependability
– Service
– Specification
Introduction (contd.)
• Intuitive concepts
– Reliability – continues to work
– Availability – works when I need it
– Safety – does not put me in jeopardy
– Performability
– Maintainability
– Testability
– Survivability – will the system survive catastrophic events?
– Security
Introduction (contd.)
• Applications
– Space borne system
• long life system
– Airplane control system
• critical system
– Transaction processing system
• high availability system
– Switching system
• high availability over certain level of performance
Terminology and definitions
• Reliability and concept of probability
– R(t): conditional probability that a system provides continuous proper service in the interval [0,t], given that it provided the desired service at time 0.
• Availability
• Performability – An Example
• Dependability
• Security
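A small numeric illustration of the reliability and availability terms above. This is only a sketch: it assumes a constant failure rate (the exponential failure law) and the steady-state formula MTTF / (MTTF + MTTR), neither of which is stated on the slide, and the failure rate and repair time are hypothetical values.

```python
import math

def reliability(t_hours: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t), assuming a constant failure rate lambda."""
    return math.exp(-failure_rate * t_hours)

def steady_state_availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is ready for service."""
    return mttf / (mttf + mttr)

# Hypothetical numbers, chosen only for illustration.
failure_rate = 1e-4            # failures per hour
mttf = 1.0 / failure_rate      # mean time to failure: 10,000 hours
mttr = 10.0                    # mean time to repair: 10 hours

r_1000 = reliability(1000.0, failure_rate)              # exp(-0.1), about 0.905
availability = steady_state_availability(mttf, mttr)    # 10000/10010, about 0.999

print(f"R(1000 h) = {r_1000:.3f}")
print(f"Availability = {availability:.4f}")
```

Under these assumptions the system is almost always available, yet the chance of running 1000 hours without any interruption is only about 90%, which is why reliability and availability are treated as separate attributes.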
Sources, Overview and Comments (1/4)
Key reference:
• Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.
Other references:
• Israel Koren and C. Mani Krishna, Fault Tolerant Systems, Elsevier, 2007.
• D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
• B. W. Johnson, Design and analysis of fault tolerant digital systems, Addison-Wesley, First edition, 1989.
• My course (Fault-Tolerant Computing) URL: http://homepages.cae.wisc.edu/~ece753/INFO.html
Sources, Overview and Comments (2/4)
• What does the paper cover?
– Very basic definitions of the terminology used in dependable computing
– It categorizes definitions in three groups
• System, attributes of dependability, threats to dependability
– Covers very briefly methods to attain dependability
Sources, Overview and Comments (3/4)
• How to read the paper?
– It is easy to read – scan it first and then read it
– I have organized the material differently – you may find it helpful
• What is not covered?
– One attribute almost missing - survivability
– Basic methods of Fault Tolerance and their characterization
Sources, Overview and Comments (4/4)
• Chronology of Developments
– Need for fault-tolerance - inception of the space program (recall that “Voyager”, launched in 1977, is still sending signals)
– First standard glossary in 1985
– Integration of performance etc into fault tolerance – and hence the term “Dependability” – book published in 1992
– Recognition of “Security” as a basic attribute of dependability – this paper in 2004
System Defined (1/4)
• “. . . an entity that interacts with other entities”
– First entity (system) – limited to be “electronic (mostly digital)” or “computer based”
– Second entity
• Hardware, software, human, other systems, ... (can also be called “environment”)
• Characterization and fundamental properties
– Functionality
– Performance
– Dependability and security
– Cost
(usability, manageability, adaptability: not directly included in the paper)
System Defined (2/4)
• Function – “what the system is intended to do”
– functional specifications: describe it in terms of functionality and performance
– behavior – described as a sequence of states to implement the functionality
– Total states – set of states as system evolves
• Internal states
• External states – as viewed by the environment and users
• Structure – “what enables system behavior (function)”
– Interconnected components – recursively defined to “atomic” level
System Defined (3/4)
• System Life Cycle
– Development phase
– Use phase
• Service – what is delivered by the system to its “environment” (user)
– Environment sees only the “external states”
– Development Phase – activities from concept to decision that system is ready for “use phase”
– Use Phase - More meaningful and includes service delivery, service outage, service shutdown, maintenance
System Defined (4/4)
• Development phase environment
– Physical world
– Human developers
– Development tools
– Production and test facilities
• Use phase environment
– Physical world
– Administrators – maintainers
– Users and intruders
– Providers and infrastructure
Dependability/Security Attributes (1/6)
• Original definition: “ability to deliver service that can justifiably be trusted”
• Encompassing the following attributes
– Availability
– Reliability
– Safety
– Integrity
– Maintainability
Dependability/Security Attributes (2/6)
• New definition: “ability to avoid service failures that are more frequent or more severe than is acceptable” - deliver service that can justifiably be trusted
• Reason for modification
– Security related issues
– This recognizes that a system can fail and it usually does fail and it still can be called dependable
– This definition also enables a connection with “development failures”
Dependability/Security Attributes (3/6)
Dependability
• availability: readiness for correct service.
• reliability: continuity of correct service.
• safety: absence of catastrophic consequences on the user(s) and the environment.
• integrity: absence of improper system alterations.
• maintainability: ability to undergo modification and repairs.
When addressing security, there is an additional attribute, confidentiality: the absence of unauthorized disclosure of information.
Dependability/Security Attributes (4/6)
Security is the concurrent existence of a composite of the attributes
1) availability (for authorized actions only),
2) confidentiality, and
3) integrity (with “improper” meaning “unauthorized”)
Dependability/Security Attributes (5/6)
Dependability/Security Attributes (6/6)
• Other related concepts – summarized in table (Fig 15) - these are
– Dependability
– High confidence
– Survivability
– Trustworthiness
• Example: all these have similar goals, such as 1) ability to deliver service, 2) predictable service, 3) fulfill mission, 4) assurance of expected service delivery
Threats and modeling threats (1/12)
• Different phases are open to different types of threats – generally termed as “faults”
• Faults lead to “errors” – a total state of the system different from the “true total state”
• Errors can lead to “failure” – the service deviates from the desired service
• This creates a FEF chain – a hierarchical phenomenon (see next and more later)
Threats and modeling threats (2/12)
Fault -> Error -> Failure
Fault activation – Error manifestation – Failure
Fault – active or dormant
Error – masked or latent
Failure – incorrect response
Threats and modeling threats (3/12)
FEF chain in a hierarchy
Threats and modeling threats (4/12)
Fault classes
• Groups (not exclusive)
– Development, Physical (those that affect hardware; I disagree with this definition), Interaction
• Viewpoints:
– phase, system boundary, cause, dimension, objective, intent, capability, persistence
Threats and modeling threats (5/12)
Fault Taxonomy and Examples
Production defect: physical, hardware, natural
Bug: physical, software, natural
Omission (absence of an action): Human made, system generated
Malicious (meant to cause harm): Human made, Hardware or software
Notes:
1. Paper has a classification – Fig 4 and 5
2. Examples and definition of many other faults given. Some listed on next slide
Threats and modeling threats (6/12)
Fault Taxonomy (contd.)
Permanent faults
Intermittent faults – repeat at some interval
Transient faults – no specific interval
Malicious logic faults – caused by natural faults
Intrusion attempts – caused by humans
Interaction faults – may be development phase or use phase
Configuration faults – incorrect setting of parameters
Threats and modeling threats (7/12)
Error classes
• Detected
• Latent
An example
– An adder gives an incorrect sum for certain operands
– The fault is active when those operands appear; otherwise it is dormant
– The incorrect sum is a latent error unless it is used or checked for correctness
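The adder example can be sketched in a few lines. The fault model here (bit 3 of an 8-bit sum stuck at 0) and the function names are hypothetical, chosen only to make the dormant/active and latent/detected distinctions concrete.

```python
def faulty_add(a: int, b: int) -> int:
    """8-bit adder with a hypothetical fault: bit 3 of the sum is stuck at 0."""
    return ((a + b) & 0xFF) & ~0x08

def checked_add(a: int, b: int) -> tuple[int, bool]:
    """Run the faulty adder and compare against an independent reference.

    Returns (result, error_detected). Without this check an incorrect
    sum would stay latent until something downstream used it.
    """
    result = faulty_add(a, b)
    expected = (a + b) & 0xFF
    return result, result != expected

# Fault dormant: 1 + 2 = 3 does not need bit 3, so the sum is correct.
print(checked_add(1, 2))   # (3, False)

# Fault active: 4 + 4 = 8 needs bit 3, so the sum is wrong and the
# checker turns the latent error into a detected one.
print(checked_add(4, 4))   # (0, True)
```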
Threats and modeling threats (8/12)
Failure classes
• Development failures
• Service failures
• Security failures
Threats and modeling threats (9/12)
• Development failures – introduced during the development phase
– Human developers
– Tools
– Production facility
– Budgetary reasons
– Scheduling issue (time to market)
(basically the system delivered is a downgraded system)
Threats and modeling threats (10/12)
• Service failures - delivery of incorrect service
– Four viewpoints
1. Failure domain
– Content failure
– Timing failure – early or late delivery of the service(s)
• Special case: silent failure, halt failure, crash failure
• Erratic failure (like Byzantine failure)
Threats and modeling threats (11/12)
2. Failure detectability
– Signal provided by some checking mechanism
• Signaled failure
• Unsignaled failure
• False alarm
3. Consistency
– Consistent failure – all services see the same data
– Inconsistent failure – different services see different data (like Byzantine failure)
Threats and modeling threats (12/12)
4. Consequence of failure
– Need to rate the failure and hence develop criteria – examples:
• Duration of outage (availability related)
• Lives being endangered (safety related)
• Extent of corrupted service (integrity related)
• Amount of information disclosed (confidentiality related)
Means to attain dependability (1/6)
• Fault Prevention or Fault Avoidance
– Improvement of development process
– Elimination of causes that can induce faults
• Fault Tolerance
– Techniques and implementations (more later)
Means to attain dependability (2/6)
• Fault Removal
– Remove faults during development phase (extensive simulation and validation)
• Testing
– Deterministic testing
– Random and statistical testing
– Back to back testing
Test/validation quality: fault injection, design for test/verification
Means to attain dependability (3/6)
• Fault Forecasting – evaluate the system behavior and then use one or more methods previously discussed to improve dependability
– Qualitative evaluation
– Quantitative evaluation
– Use of benchmarks
– Use of simulators
Examples: 1) Error and failure logs
2) when and where commissioned
Means to attain dependability (4/6)
• Fault Tolerance Techniques
– Error detection - needs redundancy
• Duplicate execution
• Use of parity
• Checker programs and/or hardware
• More later
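As one concrete instance of error detection through redundancy, here is a minimal sketch of a single even-parity bit protecting a data word. The function names and the 8-bit example word are illustrative assumptions, not something taken from the slides.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2

def encode(word: int) -> tuple[int, int]:
    """Store the word together with its redundant parity bit."""
    return word, parity_bit(word)

def check(word: int, stored_parity: int) -> bool:
    """True if the word is consistent with its stored parity bit."""
    return parity_bit(word) == stored_parity

data, p = encode(0b10110010)        # four 1 bits, so the parity bit is 0

single_flip = data ^ 0b00000100     # one bit corrupted
double_flip = data ^ 0b00000110     # two bits corrupted

print(check(data, p))          # True: no error
print(check(single_flip, p))   # False: single-bit error detected
print(check(double_flip, p))   # True: double-bit error escapes detection
```

One redundant bit detects any odd number of bit flips but cannot locate or correct them, and it misses even-sized errors; that gap is what ECC addresses with more redundancy.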
Means to attain dependability (5/6)
• Recovery - key is redundancy
• Error handling
– Masking and compensation
– Rollback
– Rollforward
• Fault handling
– Diagnosis
– Isolation
– Reconfiguration
– Initialization
Means to attain dependability (6/6)
• Key to fault tolerance
– Break FEF chain
– Use “redundancy” to improve “use phase” dependability and security
– See next “fundamental principles”
Fundamental Principles
• Hardware Redundancy
– Low level
– High level
• Software Redundancy
• Time Redundancy
• Information Redundancy
Fundamental Principles (contd.)
• Hardware Redundancy - Low level
– logic level
• Example 1 - Self checking circuits
• Example 2 - Arithmetic code: a modular adder using the mathematical principle
(A+B) mod k = ((A mod k) + (B mod k)) mod k
• Hardware Redundancy - High level
– Triplicate or 5-copies as in space shuttle
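The arithmetic-code identity above can be turned into a residue check on an adder. The sketch below assumes a check modulus k = 3 and a hypothetical single-bit fault in the main adder; both are illustrative choices, not values given on the slide.

```python
from typing import Callable

K = 3  # hypothetical check modulus

def residue_checked_add(a: int, b: int,
                        adder: Callable[[int, int], int]) -> tuple[int, bool]:
    """Add with `adder`, then verify (A+B) mod k against the residues.

    The small modulo-k path plays the role of the redundant checker.
    Returns (sum, ok).
    """
    s = adder(a, b)
    expected_residue = ((a % K) + (b % K)) % K
    return s, (s % K) == expected_residue

def good_adder(a: int, b: int) -> int:
    return a + b

def faulty_adder(a: int, b: int) -> int:
    """Hypothetical fault: bit 2 of the sum is inverted."""
    return (a + b) ^ 0b100

print(residue_checked_add(25, 17, good_adder))    # (42, True)
print(residue_checked_add(25, 17, faulty_adder))  # (46, False)
```

With k = 3, any error that changes the sum by a single power of two is caught, because no power of two is divisible by 3. An error that happens to be a multiple of 3 would pass the check unnoticed.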
Fundamental Principles (contd.)
• Software Redundancy
– Use two different programs/algorithms
• Time Redundancy
– Re-compute or redo the task and compare the results
– May or may not use the same hardware/software
• Information Redundancy
– backup information
– Use of ECC
• Question - What kind of FT is achieved?
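One way to see the difference between detecting and masking an error is a majority voter over redundant results, as in triplicated hardware or repeated computation. This is a minimal sketch; the function name and the choice to raise an exception when no majority exists are assumptions made for illustration.

```python
from collections import Counter

def majority_vote(results: list[int]) -> int:
    """Return the value produced by a strict majority of the redundant copies.

    Raises RuntimeError when no value has a strict majority, in which
    case the error is detected but cannot be masked.
    """
    value, count = Counter(results).most_common(1)[0]
    if count > len(results) // 2:
        return value
    raise RuntimeError("no majority among redundant results")

# Three copies, one of them faulty: the wrong result is outvoted (masked).
print(majority_vote([42, 42, 41]))   # 42
```

Two copies that disagree only reveal that something went wrong; three copies let a single wrong result be outvoted.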
Fault-Error-Failure
• Intuitive definitions
• Origins of faults
• Methods to break FEF chain
• Attributes of faults
Fault-Error-Failure concept (contd.)
Intuitive definitions
• Fault
– An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), design flaw, …
– Causes
• Error - Effect of activation of a fault
• Failure - over-all system effect of an error
Fault -> Error -> Failure
Fault-Error-Failure concept (contd.)
Origins of faults
• Physical device level (HW)
• Logic level (HW)
• Chip level (HW)
• System level (HW/SW)
– interfacing, specifications, …
• Why systems fail
Fault-Error-Failure concept (contd.)
Methods to break FEF chain
• Flow FEF
• Barriers
– Fault avoidance
– Fault masking
– Fault removal
– Fault forecasting
Fault-Error-Failure concept (contd.)
Attributes of faults
• Cause
• Nature
• Duration
• Extent
• Value