SENG 521 Software Reliability & Software Quality

Source: people.ucalgary.ca/~far/Lectures/SENG521/PDF/SENG521-08.pdf

Chapter 8: System Reliability

Department of Electrical & Computer Engineering, University of Calgary

B.H. Far ([email protected]) http://www.enel.ucalgary.ca/People/far/Lectures/SENG521


Contents

• System reliability
• Reliability Block Diagram (RBD)
• Serial and parallel configuration
• Active redundancy
• Hazard analysis: FMEA, FTA


Dependability Models

Dependability models capture the conditions that make a system fail in terms of the structural relationships between the system components.

Dependability models:
• Reliability Graph
• Fault Tree model
• Reliability Block Diagram


System Reliability /1

• A system usually consists of components.
• Each component consists of sub-components.
• Components may have:
  – Different reliability
  – Different dependencies among each other
• System reliability is a function of the reliabilities of the (sub-)components and of the relationships between the components.


Reliability Block Diagram (RBD)

• A Reliability Block Diagram (RBD) is a graphical representation of how the components of a system are connected from a reliability point of view.
• Reliability of the system is derived in terms of the reliabilities of its individual components.
• The most common configurations of an RBD are the series and parallel configurations.
• A system is usually composed of combinations of serial and parallel configurations.
• RBD analysis is essential for determining the reliability, availability and downtime of the system.


Reliability Block Diagram (RBD)

• In a serial system configuration, the elements must all work for the system to work, and the system fails if one of the components fails. The overall reliability of a serial system is lower than the reliability of its individual components.
• In a parallel configuration, the components are considered to be redundant, and the system ceases to work only if all the parallel components fail. The overall reliability of a parallel system is higher than the reliability of its individual components.


RBD Example: Storage /1

[Figure: storage system example; labeled component: host bus adapter (HBA). Copyright: Dell]

RBD Example: Storage /2

[Figure: storage system example; labeled component: host bus adapter (HBA). Copyright: Dell]

RBD: Process Steps

1. Define the boundary of the system for analysis.
2. Break the system into functional components.
3. Determine serial-parallel combinations.
4. Represent each component as a separate block in the diagram.
5. Draw lines connecting the blocks in a logical order for mission success.


Serial System Reliability /1

• No redundancy.
• ALL components of the system are needed to make the system function properly.
• If any one of the components fails, the system fails.

Example:

[Figure: system block diagram with Power Unit, CPU, Hard Drive and GPU]

Serial System Reliability /2

Reliability Block Diagram

• System is composed of n independent serially connected components.
• Failure of any component has a cross-system effect, i.e., results in failure of the whole system.

[Figure: blocks R1/λ1, R2/λ2, ... connected in series]

Example:

[Figure: Power Unit (R1/λ1), CPU (R2/λ2), Hard Drive (R3/λ3) and GPU (R4/λ4) connected in series]

Combining Reliabilities

Serial system reliability can be calculated from component reliabilities, if the components fail independently of each other.

For serial systems:

$$R = \prod_{k=1}^{Q_p} R_k \quad \text{and} \quad \lambda = \sum_{k=1}^{Q_p} \lambda_k$$

• Q_p: number of components
• R_k: component reliability
• λ_k: component failure rate

Component reliabilities (R_k) must be expressed with respect to a common interval.

A serial system always has lower reliability than its components (because R_k ≤ 1).


Example: Serial System

The system is composed of 3 independent serially connected components:
• R1 = 0.95
• R2 = 0.87
• R3 = 0.82

Note: all Rs must be given for a common duration, e.g., 10 hours of operation.

Rsystem = 0.95 × 0.87 × 0.82 = 0.6777

Serial system reliability is smaller than any individual reliability of the components.
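The serial calculation above can be checked with a few lines of Python. This is a minimal sketch that assumes independent component failures; the function name `series_reliability` is illustrative and not part of the course material.

```python
from math import prod


def series_reliability(reliabilities):
    """Reliability of independent components connected in series.

    All values must refer to the same operating interval.
    """
    return prod(reliabilities)


# Component reliabilities from the serial example (common interval assumed).
components = [0.95, 0.87, 0.82]
r_system = series_reliability(components)

print(round(r_system, 4))          # 0.6777
print(r_system < min(components))  # True: lower than every component
```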


Parallel System Reliability /1

• System with redundancy.
• Only ONE out of N (identical) components is needed to make the system function properly.
• If ALL of the components fail, the system fails.

Example:

[Figure: system block diagram with Server 1 and Server 2]


Parallel System Reliability /2

• System is composed of n independent components connected in parallel.
• Failure of all components results in the failure of the whole system (principle of active redundancy).

[Figure: blocks R1/λ1, R2/λ2, ... connected in parallel]

System unreliability:

$$F = \prod_{k=1}^{Q_p} F_k$$

System reliability:

$$R = 1 - F = 1 - \prod_{k=1}^{Q_p} (1 - R_k)$$


Parallel System Example

The system is composed of 2 identical servers connected in parallel:
• Server 1 (Rs1/λs1): R1 = 0.6777
• Server 2 (Rs2/λs2): R2 = 0.6777

Rsystem = 1 – ((1 – 0.6777) × (1 – 0.6777)) = 0.8961

Parallel system reliability is greater than any individual reliability of the components.
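The same check for the parallel case, again as a small sketch that assumes independent failures and active redundancy; `parallel_reliability` is an illustrative name.

```python
from math import prod


def parallel_reliability(reliabilities):
    """Reliability of independent components in parallel (active redundancy).

    The system fails only if every component fails, so the system
    unreliability is the product of the component unreliabilities.
    """
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Two identical servers from the parallel example.
servers = [0.6777, 0.6777]
r_system = parallel_reliability(servers)

print(round(r_system, 4))          # 0.8961
print(r_system > max(servers))     # True: higher than every component
```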


Sequential System Reliability

• When one component fails, the next one is assigned to fulfill the job (e.g., telephone switching circuits).
• This is done in sequence until no components are left.
• Reliability for such a system:

$$R_{sys}(t) = \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!} \, e^{-\lambda t}$$

[Figure: components C1, C2, ..., Ci, ..., Cn used in sequence]
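To see how the sequential formula behaves, the sketch below evaluates it for a growing number of components. It assumes n identical components with the same constant failure rate λ and perfect switching; the numeric values (λ = 0.1 per hour, t = 10 hours) are hypothetical and chosen only for illustration.

```python
from math import exp, factorial


def sequential_reliability(n, failure_rate, t):
    """R_sys(t) = sum_{i=0}^{n-1} (lambda*t)^i / i! * exp(-lambda*t).

    Assumes n identical components used one after another, each with
    the same constant failure rate, and ideal switch-over.
    """
    lt = failure_rate * t
    return exp(-lt) * sum(lt**i / factorial(i) for i in range(n))


# Hypothetical values: lambda = 0.1 failures/hour, mission time 10 hours.
single = sequential_reliability(1, 0.1, 10.0)   # same as exp(-1)
double = sequential_reliability(2, 0.1, 10.0)   # exp(-1) * (1 + 1)
triple = sequential_reliability(3, 0.1, 10.0)   # exp(-1) * (1 + 1 + 0.5)

print(single < double < triple)  # True: each spare adds reliability
```

With n = 1 the expression reduces to the single-component exponential law, which is a quick sanity check on the reconstruction.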

Mixed System Reliability /1

• System with redundancy.
• Only ONE out of N sets of components is needed to make the system function properly.
• If ALL sets of components fail, the system fails.

Example:

[Figure: system block diagram with two sets of components, each with a Power Unit, a Processor and a GPU]


Mixed System Example

Reliability Block Diagram

• System level: Server 1 (Rs1/λs1) and Server 2 (Rs2/λs2)
• Component level: each server consists of Power Unit (R1/λ1), Processor (R2/λ2) and GPU (R3/λ3)

Parallel-Series System

[Figure: m parallel paths; path i consists of components Ri1, Ri2, ..., Rij, ..., Rin connected in series]

Path reliability:

$$R_i = \prod_{j=1}^{n} R_{ij}$$

System reliability:

$$R = 1 - \prod_{i=1}^{m} (1 - R_i) = 1 - \prod_{i=1}^{m} \left( 1 - \prod_{j=1}^{n} R_{ij} \right)$$
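A short sketch of the parallel-series formula, assuming independent components. The component values reuse the numbers from the earlier serial example purely for illustration; the slides do not give values for this configuration.

```python
from math import prod


def parallel_series_reliability(paths):
    """Parallel-series system: m paths in parallel, each path a series chain.

    `paths` is a list of lists; paths[i][j] is R_ij.
    """
    path_reliabilities = [prod(path) for path in paths]        # R_i
    return 1.0 - prod(1.0 - r for r in path_reliabilities)     # 1 - prod(1 - R_i)


# Illustrative assumption: two identical paths of three components each.
paths = [
    [0.95, 0.87, 0.82],
    [0.95, 0.87, 0.82],
]
r_parallel_series = parallel_series_reliability(paths)

print(round(r_parallel_series, 4))  # 0.8961
```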

Various Configurations /3

Series-Parallel configuration

[Figure: Power Unit 1, Processor 1, GPU 1 and Power Unit 2, Processor 2, GPU 2, connected through Bus 1 and Bus 2]

Various Configurations /4

Reliability Block Diagram for Series-Parallel configuration

[Figure: RBD with Power Unit (R1/λ1), Processor (R2/λ2) and GPU (R3/λ3) blocks, each block duplicated]


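Series-Parallel System

[Figure: n subsystems connected in series; subsystem j consists of components R1j, R2j, ..., Rij, ..., Rmj connected in parallel]

Subsystem reliability:

$$R_j = 1 - \prod_{i=1}^{m} (1 - R_{ij})$$

System reliability:

$$R = \prod_{j=1}^{n} R_j = \prod_{j=1}^{n} \left( 1 - \prod_{i=1}^{m} (1 - R_{ij}) \right)$$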
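The series-parallel formula can be sketched the same way, again assuming independent components. The values are illustrative (each component of the earlier serial example duplicated); they are not taken from the slides for this configuration.

```python
from math import prod


def series_parallel_reliability(subsystems):
    """Series-parallel system: n subsystems in series, each a parallel group.

    `subsystems` is a list of lists; subsystems[j] holds the R_ij of
    the components that are in parallel within subsystem j.
    """
    return prod(
        1.0 - prod(1.0 - r for r in group)   # R_j of one parallel group
        for group in subsystems
    )


# Illustrative assumption: three stages, each with two identical components.
stages = [
    [0.95, 0.95],
    [0.87, 0.87],
    [0.82, 0.82],
]
r_series_parallel = series_parallel_reliability(stages)

print(round(r_series_parallel, 4))
```

For these illustrative numbers, duplicating each component gives a higher value than duplicating the whole chain of the same components.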

Other ConstructsOther Constructs

One-way bridge

Two-way bridge


Active Redundancy

• Employs parallel systems.
• All components are active at the same time.
• Each component is able to meet the functional requirements of the system.
• Only one component is required to meet the functional requirements of the system.
• Each component satisfies the minimum reliability condition for the system.
• System only fails if all components fail.

m-out-of-n System

• System has n components.
• At least m components need to work correctly for the system to function properly (m ≤ n).
• m = n: serial system
• m = 1: parallel system
• e.g.: an airplane with 4 engines can fly with only 2 engines.

[Figure: components R1, R2, ..., Ri, ..., Rn feeding an m/n block]

$$R_{sys}(t) = 1 - \sum_{i=0}^{m-1} \binom{n}{i} R(t)^{i} \, (1 - R(t))^{n-i}, \qquad \binom{n}{i} = \frac{n!}{i! \, (n-i)!}$$

Assumption: All components have the same reliability.
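The m-out-of-n formula is easy to evaluate directly. The sketch below assumes identical, independent components, as the slide states; the engine reliability of 0.9 is a hypothetical value used only to illustrate the 4-engine case.

```python
from math import comb


def m_out_of_n_reliability(m, n, r):
    """R_sys = 1 - sum_{i=0}^{m-1} C(n, i) * r^i * (1 - r)^(n - i).

    The sum is the probability that fewer than m of the n identical,
    independent components are working.
    """
    return 1.0 - sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(m))


r = 0.9  # hypothetical component reliability

serial_case = m_out_of_n_reliability(3, 3, r)     # m = n: equals r**3
parallel_case = m_out_of_n_reliability(1, 3, r)   # m = 1: equals 1 - (1 - r)**3
airplane = m_out_of_n_reliability(2, 4, r)        # 2 of 4 engines needed

print(round(serial_case, 4), round(parallel_case, 4), round(airplane, 4))
```

The two limiting cases reproduce the serial and parallel formulas, which matches the m = n and m = 1 remarks on the slide.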

Example: Bicycle

Can you find any redundancy? Handle? Saddle? Frame? Brake? Gear? Wheel? Wheel spokes?

SENG521 (Winter 2008)

RBD: Example /1

Possible single point of failure (SPF)


RBD: Example /2

Possible over-redundancy


SPF: How to Avoid?

• Adopt redundancy.
• Use dissimilar methods: consider common-cause vulnerability.
• Adopt a fundamental design change.
• Use equipment which is extremely reliable/robust.
• Perform frequent Preventive Maintenance/Replacement: repair before failure happens.
• Reduce or eliminate service and/or environmental stresses: extreme stress leads to failure.


Example 1

In reliability prediction for an assembled system we usually use a “bottom-up” approach by estimating the failure rate for each subsystem and then combining the failure rates for the entire assembly. The following figure illustrates a system in which the subsystems A, B, C, and D are in a serial configuration. Each subsystem is composed of several parts which are connected as serial or parallel as shown.


Example 1 (cont’d)

The failure rates of serial parts 1, 2, and 3 are 0.1, 0.3, and 0.5 per hour, respectively. Determine the failure rate and MTTF for subsystem A.

$$\lambda_A = \lambda_1 + \lambda_2 + \lambda_3 = 0.1 + 0.3 + 0.5 = 0.9 \text{ failures per hour}$$

$$MTTF_A = \frac{1}{\lambda_A} = \frac{1}{0.9} = 1.11 \text{ hours}$$


Example 1 (cont’d)

The failure rates of parallel parts 4, 5, and 6 in subsystem B are 0.2, 0.4, and 0.25 per hour, respectively. Determine the failure rate and MTTF for subsystem B.

$$R_B = e^{-\lambda_B t} = 1 - (1 - R_4)(1 - R_5)(1 - R_6)$$

Evaluating at t = 1 hour:

$$R_B = 1 - (1 - e^{-0.20})(1 - e^{-0.40})(1 - e^{-0.25}) = 0.9868$$

$$R_B = e^{-\lambda_B t} \;\Rightarrow\; \lambda_B = -\ln(0.9868) = 0.0133 \text{ failures per hour}$$

$$MTTF_B = \frac{1}{\lambda_B} = 75.1879 \text{ hours}$$
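The subsystem B calculation can be reproduced numerically. The sketch below follows the approach used in this example: compute the parallel reliability at t = 1 hour and convert it to an equivalent constant failure rate via R = e^(-λt). The function name and the choice of t = 1 hour as the evaluation point are assumptions; a parallel group does not actually have a constant failure rate, so the result depends on the chosen t.

```python
from math import exp, log, prod


def parallel_equivalent_rate(rates, t=1.0):
    """Return (R, equivalent lambda, MTTF) for parallel parts at time t.

    Each part is assumed to fail independently with a constant rate.
    The equivalent rate is defined by R = exp(-lambda_eq * t).
    """
    reliability = 1.0 - prod(1.0 - exp(-lam * t) for lam in rates)
    lambda_eq = -log(reliability) / t
    return reliability, lambda_eq, 1.0 / lambda_eq


# Parts 4, 5 and 6 of subsystem B (failures per hour).
r_b, lambda_b, mttf_b = parallel_equivalent_rate([0.20, 0.40, 0.25])

print(round(r_b, 4))       # 0.9868
print(round(lambda_b, 4))  # 0.0133
```

Without intermediate rounding the MTTF comes out slightly above 75.1 hours; the 75.1879 figure corresponds to 1/0.0133 with the failure rate rounded to four decimals.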

Example 1 (cont’d)

The failure rate of each of the parallel parts 7 is constant and is 0.2 per hour. Determine the failure rate and MTTF for subsystem C, in which at least 2 out of 3 items must be working.

Evaluating at t = 1 hour:

$$R_7 = e^{-\lambda_7 t} = e^{-0.2} = 0.8187$$

$$R_C = 1 - \sum_{i=0}^{m-1} \binom{n}{i} R_7^{\,i} (1 - R_7)^{n-i} = 1 - \binom{3}{0} R_7^{0} (1 - R_7)^{3} - \binom{3}{1} R_7^{1} (1 - R_7)^{2}$$

$$R_C = 1 - (1 - 0.8187)^3 - 3 \times 0.8187 \times (1 - 0.8187)^2 = 0.9133$$

$$R_C = e^{-\lambda_C t} = 0.9133 \;\Rightarrow\; \lambda_C = -\ln(0.9133) = 0.0907 \text{ failures per hour}$$

$$MTTF_C = \frac{1}{\lambda_C} = 11.03 \text{ hours}$$


Example 1 (cont’d)

The failure rates of parallel parts 8 and 9 in subsystem D are 0.25 and 0.2 per hour, respectively. Determine the failure rate and MTTF for subsystem D.

$$R_D = e^{-\lambda_D t} = 1 - (1 - R_8)(1 - R_9)$$

Evaluating at t = 1 hour:

$$R_D = 1 - (1 - e^{-0.25})(1 - e^{-0.20}) = 0.9599$$

$$R_D = e^{-\lambda_D t} \;\Rightarrow\; \lambda_D = -\ln(0.9599) = 0.0409 \text{ failures per hour}$$

$$MTTF_D = \frac{1}{\lambda_D} = 24.4498 \text{ hours}$$


Example 1 (cont’d)

Determine the overall failure rate and MTTF for the whole system.

$$\lambda_{system} = \lambda_A + \lambda_B + \lambda_C + \lambda_D = 0.9 + 0.0133 + 0.0907 + 0.0409 = 1.0449 \text{ failures per hour}$$

$$MTTF_{system} = \frac{1}{\lambda_{system}} = 0.9570 \text{ hours}$$


Example 2

The fastening of two mechanical parts in an automobile brake system should be reliable. It is done by means of two flanges which are pressed together with 4 bolt and nut pairs, E1 to E4, placed 90 degrees to each other.

[Figure: flange with bolt and nut pairs E1, E2, E3, E4]

Experience shows that the fastening holds when at least the two bolt and nut pairs located opposite to each other work, i.e., (E1 and E2) or (E3 and E4).

Example 2 (cont’d)

a) Draw the reliability block diagram (RBD) for this fixation of the system.

b) Compute the reliability of the fixation described in (a) if all bolt and nut pairs have the reliability of R = 0.90 for 50K miles of operation.

R1 = R2 = 0.9 × 0.9 = 0.81
Rsystem = 1 – (1 – R1)(1 – R2) = 1 – (1 – 0.81)(1 – 0.81) = 0.9639

c) Name the redundancy type if any pair of bolts and nuts were sufficient for the required brake function.

Two-out-of-four redundancy.

d) Compute the reliability for case (c) if all bolt and nut pairs have the reliability of R = 0.90 for 50K miles of operation.

$$R = 1 - \binom{4}{0} 0.9^{0} (1 - 0.9)^{4} - \binom{4}{1} 0.9^{1} (1 - 0.9)^{3} = 1 - 0.1^{4} - 3.6 \times 0.1^{3} = 0.9963$$

e) What can be concluded?
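The numbers in (b) and (d) can be compared side by side with a short script. This is a sketch under the stated assumption that all four bolt and nut pairs fail independently with R = 0.90; variable names are illustrative.

```python
from math import comb

r = 0.90  # reliability of one bolt and nut pair for 50K miles

# Case (b): two paths in parallel, each path is two opposite pairs in series.
r_path = r * r
r_opposite_pairs = 1.0 - (1.0 - r_path) ** 2

# Case (d): any two of the four pairs are sufficient (2-out-of-4).
r_two_of_four = 1.0 - sum(
    comb(4, i) * r**i * (1.0 - r) ** (4 - i) for i in range(2)
)

print(round(r_opposite_pairs, 4))  # 0.9639
print(round(r_two_of_four, 4))     # 0.9963
```

Under these assumptions the configuration in which any two pairs suffice is the more reliable one, which is a useful starting point for question (e).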


RBD: When and How

When to use RBD?
• When the reliability of a complex system must be calculated and the reliability-wise weaknesses of the system must be identified.

How to use RBD?
• Draw the RBD (sometimes not that simple!).
• Calculate the system reliability using the RBD.
• Perform calculations such as availability and downtime.

There are a number of automated tools, integrated with other methods such as Fault Tree Analysis (FTA), to generate the diagram and to analyze it.


RBD: Benefits & Limitations

RBD is the simplest way of visualizing the reliability of complex systems.

The benefits of the RBD are:
• Establishes reliability goals
• Evaluates component failure impact on overall system safety
• Provides a basis for “what if” analysis
• Allocates component reliability by calculating system MTBF
• Provides cost savings in large system troubleshooting
• Estimates system reliability
• Analyzes various system configurations in trade-off studies
• Identifies potential design problems
• Determines system sensitivity to component failures

Disadvantages are that some complex constructs, such as standby, branching and load sharing, cannot be clearly represented using the traditional RBD constructs.


RBD: Conclusion

Restricting assumptions:
• Statistical independence
• Failure independence
• Repair independence


Hazard Analysis

“A common mistake in engineering, is to put too much confidence in software. There seems to be a feeling among non-software professionals that software will not or cannot fail, which leads to complacency and over reliance on computer functions.”

Nancy G. Leveson

Leveson, N.G., Safeware – System Safety and Computers, Addison-Wesley, 1995.

Hazard Analysis

Goal
• Identify events that may eventually lead to accidents
• Determine impact on system

Techniques
• FMEA: Failure Modes and Effects Analysis
• FMECA: Failure Modes, Effects and Criticality Analysis
• ETA: Event Tree Analysis
• FTA: Fault Tree Analysis
• HAZOP: HAZard and OPerability studies


FMEA

• FMEA is a technique to identify and prioritize how systems fail, and identify the effects of failure.
• FMEA is used when:
  – Designing products or processes, to identify and avoid failure-prone designs.
  – Investigating why existing systems have failed, to identify possible causes.
  – Investigating possible solutions, to help select one with an acceptable risk.
  – Planning actions, in order to identify risks in the plan and hence identify countermeasures.


FMEA Questions

• Identification: How (i.e., in what ways) can this element fail (failure modes)?
• Ramification: What will happen to the system and its environment if this element does fail in each of the ways available to it (failure effects)?
• Prevention: What needs to be done to prevent or mitigate the problem?


FMEA: How to /1

• Interface matrix: reflects how components of the system are connected and function together.
• Example: Range hood

[Figure: interface matrix with Components on both axes]

FMEA: How to /2FMEA: How to /2


FMEA Template

Sample FMEA template

Columns: No. | Part Name / Part No. | Function | Failure Mode | Mechanisms & Causes of Failure | Effect(s) of Failure | Current Control | P.R.A. (P, S, D, R) | Recommended Corrective Action | Action Taken

Sample entry:
• No.: 1
• Part Name: Position Controller
• Function: Receive a demand position
• Failure Mode: loose cable connection; incorrect demand signal
• Mechanisms & Causes of Failure: wear and tear; operator error
• Effect(s) of Failure: motor fails to move; position controller breakdown in the long run
• P.R.A.: P = 2, S = 4, D = 1, R = 8; P = 4, S = 4, D = 3, R = 48
• Recommended Corrective Action / Action Taken: replace faulty wire; QC checked; intensive training for operators

Legend:
• P = Probability (chance) of occurrence
• S = Seriousness (impact) of failure
• D = Likelihood that the defect will reach the customer
• R = Risk priority measure (P × S × D)

Rating scale:
• 1 = very low or none
• 2 = low or minor
• 3 = moderate or significant
• 4 = high
• 5 = very high or catastrophic
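The risk priority measure R = P × S × D lends itself to a small ranking script. The sketch below uses the two sets of ratings from the sample template; pairing each set with a specific failure mode (by order of appearance) is an assumption, and the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    p: int  # probability (chance) of occurrence, 1..5
    s: int  # seriousness (impact) of failure, 1..5
    d: int  # likelihood that the defect reaches the customer, 1..5

    @property
    def risk_priority(self) -> int:
        """Risk priority measure R = P x S x D."""
        return self.p * self.s * self.d


# Ratings from the sample template; the mode-to-rating pairing is assumed.
modes = [
    FailureMode("Loose cable connection", p=2, s=4, d=1),
    FailureMode("Incorrect demand signal", p=4, s=4, d=3),
]

# Highest risk first, so corrective actions can be prioritized.
ranked = sorted(modes, key=lambda m: m.risk_priority, reverse=True)
for mode in ranked:
    print(f"{mode.name}: R = {mode.risk_priority}")
```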

FMEA: When?

• FMEA cannot be done until design has proceeded to the point that system elements have been selected at the level the analysis is to explore. In a software system, this means after the software architecture is finalized (requirement volatility kills FMEA!).
• FMEA can be done anytime in the system lifetime, from initial design onward.


FMEA: Usage

Industries that frequently use FMEA:
• Consumer products: toys/home appliances/home electronics
• Automotive industry
• Aero/space: Boeing, NASA
• Defence industry: DoD
• Process industries, e.g., oil and gas, chemical plants, industrial control systems
• Software industry? Not popular yet.


FMEA: Summary

• Method: from known or predicted failure modes of components, determine possible effects on the system.
• Good for hazard identification early in development, by considering possible failures of system functions:
  – loss of function (omission failure)
  – function performed incorrectly
  – function performed when not required (commission failure)
• No good for concurrent multiple failures.
• No good when requirements change.


Fault Tree Analysis (FTA)

• Fault tree analysis is a graphical representation of the major faults or critical failures associated with a product, the causes for the faults, and potential countermeasures. FTA helps identify areas of concern for new system design or for improvement of existing systems. It also helps identify corrective actions to correct or mitigate problems.
• FTA can also be defined as a graphic “model” of the pathways within a system that can lead to a foreseeable, undesirable event. The pathways interconnect contributory events and conditions, using standard logic symbols.
• Fault tree analysis is useful both in designing new systems, products or services and in dealing with identified problems in existing products/services. As part of process improvement, it can be used to help identify root causes of trouble and to design remedies and countermeasures.

How to Use FTA?

1. Select a component for analysis. Draw a box at the top of the diagram and list the component inside.
2. Identify critical failures or “faults” related to the component. Using Failure Mode and Effect Analysis is a good way to identify faults during quality planning. For quality improvement, faults may be identified through Brainstorming or as the output of Cause and Effect Analysis.
3. Identify causes for each fault. List all applicable causes for faults in ovals below the fault. Connect the ovals to the appropriate fault box.
4. Work toward a root cause. Continue identifying causes for each fault until you reach a root or controllable cause.
5. Identify countermeasures for each root cause. Use Brainstorming or a modified version of Force Field Analysis to develop actions to counteract the root cause of each critical failure. Create boxes for each countermeasure, draw the boxes below the appropriate root cause, and link the countermeasure and cause.


Steps in FTA

FMEA (Failure Modes and Effects Analysis) determines the failure modes that are likely to cause failure events. Then it determines what single or multiple point failures could produce those top level events. FMEA asks the question, “What can go wrong?” even if the product meets specification. In order to perform FTA one first needs to perform FMEA.


FTA: Example 1

[Figure: fault tree with top event "Tank overflow", built from AND and OR gates. Recoverable event and component labels: inlet open, outlet valve closed, wrong control to inlet valve, inlet valve B, outlet valve A, controller fails, sensor X fails, sensor Y fails]

FTA: Example 2


FTA: Benefits & Limitations

Benefits:
• Producing meaningful data for evaluation and improvement of the overall reliability of the system.
• Evaluating effectiveness of and need for redundancy.

Limitations:
• The undesired event evaluated must be foreseen, and all significant contributors to the failure must be anticipated. This effort may be very time consuming and expensive.
• Overall success of the process depends on the skill of the analyst involved.


FTA: Summary

• Method: trace faults stepwise back through the system design to possible causes:
  – a tree with a top event at the root
  – logic gates at branches, linking each event with its “immediate” causes
  – initiating faults at leaves (eventually)
• Good for tracing system hazards to component failures, and allocating safety requirements
• Good for systems with multiple failures
• Good for checking completeness of safety requirements
• Can be difficult, time-consuming, hard to maintain
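A fault tree with AND and OR gates can be evaluated quantitatively once the basic events have probabilities. The sketch below is only an illustration: it assumes independent basic events, the tree structure is a hypothetical reading loosely inspired by the tank overflow example (the original figure is not recoverable), and all probabilities are made up.

```python
from math import prod


def and_gate(*probabilities):
    """Output event occurs only if all input events occur."""
    return prod(probabilities)


def or_gate(*probabilities):
    """Output event occurs if at least one input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities (not from the slides).
p_sensor_x_fails = 0.01
p_sensor_y_fails = 0.01
p_controller_fails = 0.001
p_inlet_valve_fails = 0.005
p_outlet_valve_closed = 0.002

# Assumed structure: wrong control needs both sensors to fail, or the controller.
p_wrong_control = or_gate(
    and_gate(p_sensor_x_fails, p_sensor_y_fails),
    p_controller_fails,
)
p_inlet_open = or_gate(p_inlet_valve_fails, p_wrong_control)

# Top event: inlet stuck open AND outlet closed.
p_tank_overflow = and_gate(p_inlet_open, p_outlet_valve_closed)

print(f"{p_tank_overflow:.3e}")
```

In failure space, an AND gate plays the role of a parallel RBD structure (all inputs must fail) and an OR gate plays the role of a series structure (any single failure is enough).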


