Dependability: What is it?staff.cs.upt.ro/~vancusa/fsc/c3.pdf · Factors Influencing Software Reliability •A user’s perception of the reliability of a software depends upon two

SOFTWARE RELIABILITY Part 2

Where are we now?

• Previous course

• System vs software reliability

• Model

• Module vs operation mode

• Software Reliability Prediction

• Metrics

• Software FRACAS

• Musa model

• This course

• Operational profile

• Human reliability

• SRE best practices

Factors Influencing Software Reliability

• A user’s perception of the reliability of software depends upon two categories of information.

• The number of faults present in the software.

• The ways users operate the system.

• This is known as the operational profile.

• The fault count in a system is influenced by the following.

• Size and complexity of code

• Characteristics of the development process used

• Education, experience, and training of development personnel

• Operational environment

Human error analysis and reliability assessment

• explore difficulties of use early in design with the aim of improving design

  • hence comparable with other usability and walkthrough techniques

• assess the likelihood of human error in a developed design as part of an assessment process

  • hence comparable with other reliability assessment techniques

• “It must be shown by analysis, substantiated where necessary by test, that as far as reasonably practicable all design precautions have been taken to prevent human errors in production, maintenance and operation causing hazardous or catastrophic effect”

• Used extensively in the nuclear power industry

Different approaches

Engineering approach

•Quantitative ‘decomposition’

•Human treated as a “component”

•The mechanistic assumption: “The human / mind as a fallible machine”

•The atomistic assumption: Human performance can be adequately described by considering individual elements of the performance. Total performance is an aggregate of the individual performance elements

Cognitive approach

•models and theories of cognitive functions which underlie human behavior

•Cognitive psychology still immature

•Problem: human cognition is not directly observable

Quantification techniques

• HEART: a human performance model-based technique utilizing some standard probabilities

  • J.C. Williams (1988), "A data-based method for assessing and reducing human error to improve operational performance", IEEE Fourth Conference on Human Factors and Power Plants (pp. 436-450)

  • Based on a long-term, sizeable human reliability database; weighting factors based on HF literature.

  • Assumes human performance usually deteriorates when Error Producing Conditions (EPCs) interact

• SLIM: a utility-based technique using team-based judgments

• THERP: earliest method

HEART generic categories

Generic task: nominal human unreliability (5th-95th percentile bounds)

(A) Totally unfamiliar, performed at speed with no real idea of likely consequences: 0.55 (0.35 - 0.97)

(B) Shift or restore system to a new or original state on a single attempt without supervision or procedures: 0.26 (0.14 - 0.42)

(C) Complex task requiring high level of comprehension and skill: 0.16 (0.12 - 0.28)

(D) Fairly simple task performed rapidly or given scant attention: 0.09 (0.06 - 0.13)

(E) Routine, highly practiced, rapid task involving relatively low level of skill: 0.02 (0.007 - 0.045)

(F) Restore or shift a system to original or new state following procedures, with some checking: 0.003 (0.0008 - 0.007)

(G) Completely familiar, well-designed, highly practiced, routine task occurring several times per hour, performed to highest possible standards by highly motivated, highly trained and experienced personnel, with time to correct potential error, but without the benefit of significant job aids: 0.0004 (0.00008 - 0.009)

(H) Respond correctly to system command even when there is an augmented or automated supervisory system providing accurate interpretation of system state: 0.00002 (0.000006 - 0.0009)

Error producing Conditions (EPCs) (selection) Factor

Unfamiliarity with a situation which is potentially important but which only occurs infrequently or which is novel 17

A shortage of time available for error detection and correction 11

A low signal-noise ratio 10

A means of suppressing or over-riding information or features which is too easily accessible 9

No obvious means of reversing an unintended action 8

A need to unlearn a technique and apply one which requires the application of an opposing philosophy 6

The need to transfer specific knowledge from task to task without loss 5.5

Ambiguity in the required performance standards 5

A means of suppressing or over-riding information or features which is too easily accessible 4

A mismatch between perceived and real risk. 4

No clear, direct and timely confirmation of an intended action from the portion of the system over which control is exerted. 4

Operator inexperience (e.g., a newly qualified tradesman but not an expert) 3

A mismatch between the educational achievement level of an individual and the requirements of the task 2

Little opportunity to exercise mind and body outside the immediate confines of a job 1.8

Little or no intrinsic meaning in a task 1.4

High level emotional stress 1.3

Evidence of ill-health amongst operatives especially fever. 1.2

Low workforce morale 1.2

A poor or hostile environment 1.15

Prolonged inactivity or highly repetitious cycling of low mental workload tasks (1st half hour) 1.1

(thereafter) 1.05

Disruption of normal work sleep cycles 1.1

Task pacing caused by the intervention of others 1.06

Additional team members over and above those necessary to perform task normally and satisfactorily (per additional team member) 1.03

How does it all come together?

• Find out task level:

• (E) Routine, highly practiced, rapid task involving relatively low level of skill

• R=1-0.02

• Are there any EPCs?

• A mismatch between perceived and real risk. E1=4

• Little or no intrinsic meaning in a task E2=1.4

• Additional team members (per member) E3=1.03*3

• Assess proportion of each EPC ( ≠ 1)

  • P1 = 0.5, P2 = 0.2, P3 = 0.5

• Assessed effect = ((E-1)*P)+1

  • F1 = 2.5, F2 = 1.08, F3 = 2.045

• Assessed probability of failure

• 0.02*2.5*1.08*2.045=0.11043
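The HEART arithmetic above can be reproduced in a few lines. This is a minimal sketch, not a reference implementation: the function names are hypothetical, and capping the result at 1.0 is a guard added here that does not appear on the slide.

```python
from math import prod


def assessed_effect(epc_factor: float, proportion: float) -> float:
    """Assessed effect of one EPC: ((E - 1) * P) + 1."""
    return (epc_factor - 1.0) * proportion + 1.0


def heart_failure_probability(nominal_unreliability, epcs):
    """Scale the nominal unreliability by the assessed effect of each (E, P) pair."""
    effects = [assessed_effect(e, p) for e, p in epcs]
    # A probability cannot exceed 1, so cap the product (assumption, not on the slide).
    return min(1.0, nominal_unreliability * prod(effects)), effects


# Values from the worked example: task level (E) with three EPCs.
epcs = [(4.0, 0.5), (1.4, 0.2), (1.03 * 3, 0.5)]
probability, effects = heart_failure_probability(0.02, epcs)

print([round(f, 3) for f in effects])  # [2.5, 1.08, 2.045]
print(round(probability, 5))           # 0.11043
```

The routine task with a nominal unreliability of 0.02 ends up with an assessed failure probability of about 0.11 once the three EPCs are taken into account.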

Human + software reliability

How do they interact?

The operational profile

• A software-based product’s reliability depends on just how

a customer will use it.

• Making a good reliability estimate depends on testing the product

as if it were in the field.

• The operational profile

• quantitative characterization of how a system will be used

• Works also for hardware, human components

• Can be used for the whole system

Who develops an operational profile?

• Developed by:

  • systems engineers

  • high-level designers (architecture)

  • test planners

• Strong participation by:

  • product planning

  • marketing professionals

  • key customers, if available

• Developed by John Musa et al at AT&T to guide testing.

• An AT&T PBX switching system combined an operational profile with other quality improvement techniques.

• Adopted by HP to re-organize system-test process for multi-processor operating system.

• HP system-test process revision reduced system-test time by 50%

• First published results, 1987; active use since.

Overall creation process

• A progressively narrowing perspective from customers down to operation

• At each step, quantify how often each of the elements in that step will be used; convert to probabilities

• Process has been refined many times. Most AT&T applications have been real-time telecommunications systems

• Profile = a set of disjoint alternatives and the probability that each will occur

• On the way to creating the operational profile, several intermediate profiles will be produced

• The usage data is not a profile until you add the probability info

• Example:

  • 100 X-type transactions an hour

  • 500 Y-type transactions an hour

  • 300 Z-type transactions an hour

• Interesting but not useful until you know the total number of transactions per hour, so you can compute probabilities

  • With a total of 2000 transactions per hour: 100/2000 = .05; 500/2000 = .25; 300/2000 = .15

• Completeness check: do the probabilities add up to 1?

  • .05 + .25 + .15 = .45, so transaction types accounting for the remaining .55 are missing.

  • (Note: Can use raw data to re-create appropriate traffic levels in test.)

• How accurate does this (combination of data and probabilities) need to be?
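The conversion from usage counts to probabilities, together with the completeness check from the example, can be sketched as follows. The function name is hypothetical; the counts and the total of 2000 transactions per hour come from the example.

```python
def usage_to_profile(counts, total):
    """Convert hourly usage counts into occurrence probabilities.

    Returns the probabilities and the probability mass that is not yet
    covered by any listed alternative (0 when the profile is complete).
    """
    probabilities = {name: n / total for name, n in counts.items()}
    missing = 1.0 - sum(probabilities.values())
    return probabilities, missing


counts = {"X": 100, "Y": 500, "Z": 300}
probabilities, missing = usage_to_profile(counts, total=2000)

for name, p in probabilities.items():
    print(f"{name}-type: {p:.2f}")
print(f"unaccounted: {missing:.2f}")  # 0.55, so some transaction types are missing
```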

Degree of Accuracy Required

• What is the economic gain expected from better decisions resulting from more accurate data? (classic “risk management” question!)

• In practice, often use “informed engineering judgment” rather than formal economic analysis

  • Emphasis on the word informed

• Infrequently executed functions of a highly critical nature ARE important, e.g.,

  • pilot ejection from cockpit

  • overheating nuclear reactor shutdown procedures

• Must incorporate notion of criticality as well as use of operations

5 steps to create the operational profile

1. customer profile

2. user profile

3. system-mode profile

4. functional profile

5. operational profile

• (based on all of the above)

The O. P. Triangle

[Figure: triangle with levels, in order: Customer groups, User groups, System-modes, Functional, Operational]

Example: retail store market

• Customer groups:

  • large retail stores

  • small chains

  • grocery chains

• User groups:

  • Cashiers

  • marketing analysts

  • I-S specialists

• System-modes:

  • I-S specialists do database cleanup and also report generation

• Functions

  • each mode has several functions (e.g., various reports in report generation mode)

  • Note: the word function is used from the user perspective, i.e. user task

• Operations:

  • user functions are mapped onto the software product’s operations


Notes: Some steps may be unnecessary

Uniformity of detail is not required

Step 1: Customer Profile

Ex: software spreadsheet package

Customer                   Occurrence Probability
Educational Institution    0.45
Business Organization      0.35
Individual Home User       0.20

For instance, schools might use them for tabulating and updating student grades. Businesses might use them mainly for financial and operations controls. Home users could keep track of their monthly income and expenses, as well as investments and savings plans.

The customer profile is the list of customer types and the associated probabilities. These probabilities are simply the proportions of time that each type of customer would be using the system.

Step 2: User Profile

The user profile is the set of user types and their associated probabilities of using the system.

Within a customer group: use the proportion of the customer group’s usage that the user group represents.

If usage can’t be determined, use the number of users as a proportion of the total users in that group.

Combine same user groups found in different customer groups.

Customer                   Occurrence Probability
Educational Institution    0.45
  Secretary                50%
  Managers                 30%
  Teachers                 20%
Business Organization      0.35
  Secretary                40%
  Managers                 60%
Individual Home User       0.20
  Individuals              100%

Step 2: User Profile (continued)

User                 Occurrence Probability
Secretary            0.5*0.45 + 0.4*0.35 = 0.365
Managers             0.3*0.45 + 0.6*0.35 = 0.345
Teachers             0.2*0.45 = 0.09
Other individuals    0.2
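Combining user groups across customer groups is a weighted sum, which the following sketch makes explicit. The dictionary layout and the function name are illustrative assumptions; the numbers are those of the spreadsheet example.

```python
from collections import defaultdict

customer_profile = {
    "Educational Institution": 0.45,
    "Business Organization": 0.35,
    "Individual Home User": 0.20,
}

# Share of each customer group's usage contributed by each user group.
user_shares = {
    "Educational Institution": {"Secretary": 0.5, "Managers": 0.3, "Teachers": 0.2},
    "Business Organization": {"Secretary": 0.4, "Managers": 0.6},
    "Individual Home User": {"Other individuals": 1.0},
}


def build_user_profile(customer_profile, user_shares):
    """Sum P(customer) * share over all customer groups containing a user group."""
    profile = defaultdict(float)
    for customer, p_customer in customer_profile.items():
        for user, share in user_shares[customer].items():
            profile[user] += p_customer * share
    return dict(profile)


user_profile = build_user_profile(customer_profile, user_shares)
for user, p in user_profile.items():
    print(f"{user}: {p:.3f}")
```

The resulting probabilities sum to 1, which is the same completeness check applied to the customer profile.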

Step 3: System Mode Profile

System mode              Occurrence Probability
Batch Mode               0.35
User-Interactive Mode    0.65

A system mode is a way that a system can operate. The system includes both hardware and software. Most systems have more than one mode of operation. For example, system testing may take place in batch mode or user-interactive mode. An airplane flight consists of takeoff and ascent mode, level flight mode and descent and land mode. An automobile may be in normal mode or four-wheel drive; it may also be in normal mode or cruise control. System modes can be thought of as independent segments of a system operation or various different ways of using a system. A system can switch among modes sequentially, or it can permit several modes to operate concurrently, sharing the same system resources. For each system mode, if there are more than one or two, an operational profile (and sometimes functional profile) should be developed. There are no technical limits on how many system modes may be established.

Short recap - Operational Profile Development

• Musa, J.D., “Operational Profiles in Software Reliability Engineering,” IEEE Software Magazine, March 1993

Functional profile – 1/2

• After a good system mode profile has been developed, the focus should turn to evaluation of each system mode for the functions performed during that mode, and then assigning probabilities to each of the functions.

• Functions

  • are essentially tasks that an external entity such as a user can perform with the system.

    • A user of an e-mail system would want the following functions: create message, look up address, send message, open message

  • are based on what activities the customer wants the system to be able to perform.

    • Developing a functional profile is, in that respect, a part of developing requirements.

• A functional profile need not have a defined number of functions, but generally contains 20 to more than a hundred. The number will vary based on project size, number of system modes, environmental considerations, and function breadth.

Functional profile – 2/2

• The functional profile can be either explicit or implicit, depending on the key input variables

• A key input variable is an external parameter whose value affects the execution path that the software system traverses.

• Key input variables consist of ranges of values that cause different operations to be performed.

• These various ranges are referred to as levels.

• A profile is explicit if each element is designated by simultaneously specifying the levels of all key input variables needed for its identification.

• A profile is implicit if it is expressed by subprofiles of each key variable.

• That is, each key environmental parameter is assigned probabilities associated with the ranges it can legally use.

Example

Implicit Profile

Subprofile C                                        Subprofile D
Key input variable value   Occurrence probability   Key input variable value   Occurrence probability
X1                         0.6                      Y1                         0.7
X2                         0.3                      Y2                         0.2
X3                         0.1                      Y3                         0.1

Explicit Profile

Key input variable value   Occurrence probability
X1Y1                       0.42
X2Y1                       0.21
X1Y2                       0.12
X3Y1                       0.07
X1Y3                       0.06
X2Y2                       0.06
X2Y3                       0.03
X3Y2                       0.02
X3Y3                       0.01

Suppose there are two key independent parameters, X and Y, each taking on three discrete values. Nine operations can be defined based on the combinations of the variables

The main advantage of using the implicit profile is that a significantly smaller number of elements need to be specified, as few as the sum of the number of levels of key input parameters.

The explicit profile can have as many as the product of the number of levels for each variable. For five variables with five levels, assuming complete independence, the implicit profile requires only 25 elements whereas the explicit profile would call for 5^5, or 3,125 elements.

In most cases it is not necessary to generate the explicit profile, because it exists by default from the implicit profile
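Under the independence assumption stated above, the explicit profile follows mechanically from the implicit one. The sketch below expands the two subprofiles of the example and compares element counts; the function name is hypothetical.

```python
from itertools import product
from math import prod


def explicit_profile(subprofiles):
    """Expand independent subprofiles into an explicit profile.

    Each element is identified by one level per key input variable, and
    its probability is the product of the level probabilities.
    """
    result = {}
    for combo in product(*(sp.items() for sp in subprofiles)):
        key = "".join(level for level, _ in combo)
        result[key] = prod(p for _, p in combo)
    return result


subprofile_c = {"X1": 0.6, "X2": 0.3, "X3": 0.1}
subprofile_d = {"Y1": 0.7, "Y2": 0.2, "Y3": 0.1}

explicit = explicit_profile([subprofile_c, subprofile_d])
for key, p in sorted(explicit.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{key}: {p:.2f}")

print(len(subprofile_c) + len(subprofile_d))  # implicit size: sum of levels = 6
print(len(explicit))                          # explicit size: product of levels = 9
print(5 * 5, 5 ** 5)                          # five variables, five levels: 25 vs 3125
```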

How to develop a function list

• Construct work-flow chart showing overall process, including software, hardware, and people.

• The work flow shows the context and suggests necessary functions

• Usually done during requirements phase

• Basic requirements definition: ensure that almost all important input values (commands, their variables, global data) and environment variables are covered by the defined functions.

• Function differentiation is independent of that.

• The more refined the differentiation, the more detailed profile you obtain.

• Wait a second….

It’s not that simple…

1. Generate an initial function list

   • features and capabilities needed by the users

   • organized by functions relevant to each key input variable if an implicit profile is used

2. Determine environmental variables

   • environmental variables characterize the conditions that influence the paths traversed by a program, but do not correspond directly to features

   • Ex: hardware configuration and traffic load

3. Create final function list

   • environmental and feature variables should be examined for dependencies

   • Partial dependencies can cause difficulties because all possible combinations of levels of both variables may need to be listed

   • The final number of functions in the list is then calculated as the product of the number of functions in the initial list and the number of environmental variable levels, minus the combinations of initial functions and environmental variable values that do not occur.

4. Assign occurrence probabilities

Sample final function list

Function                Environmental Variable
Standard Deviation      X
                        Y
Correlation             X
                        Y
Analysis of Variance    X
                        Y
Regression              X
                        Y

Functional Profile Segment

Final function list

Function                System Mode Occurrence Probability   Overall Occurrence Probability
Chi-Square
Standard Deviation      0.60                                 0.12
Correlation             0.22                                 0.044
Analysis of Variance    0.10                                 0.02
Regression              0.08                                 0.016

Environmental Profile

Variable count    Occurrence Probability
One (X)           0.6
Multiple (Y)      0.4

For the assignment of occurrence probabilities, the ideal data source consists of usage measurements taken on the latest release or a similar system. These measurements may be obtained from system logs or data storage devices. Occurrence probabilities computed with the historical data should be updated to account for new functions, users, or environments. In the event that a system is completely new the functional profile might be very inaccurate. It should still be developed, however, and updated as more is known about how the system will be operated. The process of predicting usage forces interaction with the customer, which can be very important. The required dialogue may highlight the relative importance of the various functions, indicating that some functions may not be necessary while others are most significant.

Reducing the number of functions should increase reliability

Final Functional Profile Segment

Function                Environmental Variable   Overall Occurrence Probability
Chi-Square
Standard Deviation      X                        0.072
                        Y                        0.048
Correlation             X                        0.0264
                        Y                        0.0176
Analysis of Variance    X                        0.012
                        Y                        0.008
Regression              X                        0.0096
                        Y                        0.0064

All adds up (actually, it multiplies)
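The final functional profile multiplies three factors: the system mode probability, the function's probability within that mode, and the environmental variable probability. In the sketch below the mode probability of 0.2 is not stated on the slides; it is inferred from the ratio 0.12 / 0.60 in the functional profile segment and should be read as an assumption.

```python
# Inferred from the segment table (overall 0.12 / in-mode 0.60); not stated explicitly.
mode_probability = 0.2

function_profile = {
    "Standard Deviation": 0.60,
    "Correlation": 0.22,
    "Analysis of Variance": 0.10,
    "Regression": 0.08,
}
environment_profile = {"X": 0.6, "Y": 0.4}

final_profile = {
    (function, variable): mode_probability * p_function * p_variable
    for function, p_function in function_profile.items()
    for variable, p_variable in environment_profile.items()
}

for (function, variable), p in final_profile.items():
    print(f"{function:22s} {variable}  {p:.4f}")
```

The eight entries sum to the probability of the system mode itself, which is a quick way to check a profile segment for missing rows.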

Final step: operational profile

• The functional profile is a user-oriented view of system capabilities. From the developers’ perspective, it is operations that actually implement the functions.

• Operations are usually the focus of testing.

• An operation represents a task being accomplished by the system from the viewpoint of the people who will test the system. To allocate testing effort and develop a test description, the operational profile must be available for the purposes of test planning.

How to get to the operational profile

1. Divide execution into runs

2. Identify input space

3. Partition input space

4. Occurrence Probabilities

• Test selections

• according to their occurrence probabilities

• Prioritize development

• This was a much more radical concept in 1993.

• Attention! The operational profile may create an unrealistic set of tests because the list of operations is too long
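Selecting tests according to occurrence probabilities amounts to weighted random sampling over the operations. The sketch below uses a made-up three-operation profile (the names and probabilities are hypothetical) and a fixed seed so that a test run is repeatable; it is one possible way to do this, not a prescribed procedure.

```python
import random
from collections import Counter


def select_tests(operational_profile, n_tests, seed=0):
    """Draw n_tests operations, each with probability given by the profile."""
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    operations = list(operational_profile)
    weights = [operational_profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n_tests)


# Hypothetical profile, for illustration only.
profile = {"op_a": 0.7, "op_b": 0.2, "op_c": 0.1}

selected = select_tests(profile, n_tests=1000)
print(Counter(selected).most_common())
```

With enough tests, the share of each operation in the selection approaches its occurrence probability, so rarely used operations receive few tests. Critical but infrequent operations may therefore need to be added deliberately.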

Let’s reduce the number of operations

1. Reduce the number of run types.

   1. Reduce the size of the input variable list

      1. Reduce functionality.

      2. Reduce the number of possible hardware configurations.

      3. Restrict the environment the program must operate in.

      4. Reduce the number of fault types.

      5. Reduce unnecessary interactions between successive runs *****

         1. Minimize the input variables that application programs can access at any one time.

         2. Reinitialize variables between runs.

         3. Use synchronous, as opposed to asynchronous, design.

   2. Reduce the number of levels of the input variables

2. Increase the number of run types grouped per operation.

3. Ignore the remaining set of run types expected to have total occurrence probability appreciably less than the failure intensity objective

Example

• Financial and billing systems are commonly data driven.

• Suppose a cable television billing system was designed as an account processing system. This system processes the charge entries for each account for the current billing period and generates bills. The reliability to evaluate is the probability of generating a correct bill. This involves determining the reliability over the time required to process the bill and its entries.

• Assume that the design was not anticipated when the functional profile was developed, so the relationship between the functional profile and operational profile is complex. For instance, typical functions might have been bill processing, bill correction, and delinquency identification.

• The account-processing system has an operational profile that relates to account attributes. Its operations are classified by customer type (business or residential), service type (basic, expanded basic, premium package), and payment status (paid, delinquent).

• Assume that 90 percent of the customers are residential and 10 percent are businesses. Forty percent of the customers subscribe to the basic cable service. Half of all customers receive expanded basic, and the remaining 10 percent pay for the full premium package. History shows that 2 percent of the accounts are delinquent, on average.

Example

Operation Occurrence Probability

Residential, Expanded Basic, Paid 0.4410

Residential, Basic, Paid 0.3528

Residential, Premium, Paid 0.0882

Business, Expanded, Paid 0.0490

Business, Basic, Paid 0.0392

Business, Premium, Paid 0.0098

Residential, Expanded, Delinquent 0.0090

Residential, Basic, Delinquent 0.0072

Residential, Premium, Delinquent 0.0018

Business, Expanded, Delinquent 0.0010

Business, Basic, Delinquent 0.0008

Business, Premium, Delinquent 0.0002

Operations and the

associated probabilities
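The table of operations can be regenerated from the three account attributes. The sketch assumes the attributes are independent, which the slides do not state outright but which the tabulated values are consistent with.

```python
from itertools import product

customer_type = {"Residential": 0.90, "Business": 0.10}
service_type = {"Basic": 0.40, "Expanded Basic": 0.50, "Premium": 0.10}
payment_status = {"Paid": 0.98, "Delinquent": 0.02}

# One operation per combination of attribute values (independence assumed).
operations = {
    (c, s, p): pc * ps * pp
    for (c, pc), (s, ps), (p, pp) in product(
        customer_type.items(), service_type.items(), payment_status.items()
    )
}

ranked = sorted(operations.items(), key=lambda kv: kv[1], reverse=True)
for operation, probability in ranked:
    print(f"{', '.join(operation):42s} {probability:.4f}")
```

Sorting by probability puts the operations that dominate field usage first, which is the order in which test effort would normally be allocated.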

SRE Best practices

(short)

Design for reliability

• Functional and Non-functional Requirements

• System functional requirements may specify

• error checking

• recovery features

• system failure protection

• Non-functional requirements

• System reliability

• Hardware reliability

• probability a hardware component fails

• Software reliability

• probability a software component will produce an incorrect output

• software does not wear out

• software can continue to operate after a bad result

• Operator reliability

• probability system user makes an error

• Availability

Examples:

• Functional Reliability Requirements

  • The system will check all operator inputs to see that they fall within their required ranges.

  • The system will check all disks for bad blocks each time it is booted.

  • The system must be implemented using a standard implementation of Ada.

• Non-functional Reliability Specification

  • The required level of reliability must be expressed quantitatively.

  • Reliability is a dynamic system attribute.

  • Source code reliability specifications are meaningless (e.g. N faults/1000 LOC)

  • An appropriate metric should be chosen to specify the overall system reliability

• Probability of Failure on Demand (POFOD)

  • POFOD = 0.001

  • The service fails for one in every 1000 requests

• Rate of Occurrence of Failures (ROCOF)

  • ROCOF = 0.02

  • Two failures for each 100 operational time units
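Both metrics are simple ratios over observed behaviour, differing only in the denominator: demands for POFOD, operational time for ROCOF. A minimal sketch with hypothetical function names:

```python
def pofod(failed_demands: int, total_demands: int) -> float:
    """Probability of failure on demand: failed requests / all requests."""
    return failed_demands / total_demands


def rocof(failures: int, operational_time_units: float) -> float:
    """Rate of occurrence of failures: failures per operational time unit."""
    return failures / operational_time_units


print(pofod(1, 1000))  # 0.001
print(rocof(2, 100))   # 0.02
```

Because the denominators differ, a specification needs to name the unit explicitly, for example transactions for POFOD and days for ROCOF.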

Building Reliability Specification

1. For each sub-system, analyze the consequences of possible system failures

2. From the system failure analysis, partition failures into appropriate classes

3. For each class, set out the appropriate reliability metric

Failure Class                Example                                                        Metric
Permanent, non-corrupting    ATM fails to operate with any card, must restart to correct    ROCOF = .0001, time unit = days
Transient, non-corrupting    Magnetic stripe can't be read on undamaged card                POFOD = .0001, time unit = transactions

Specification validation

• It is impossible to empirically validate high reliability specifications

• No database corruption really means POFOD class < 1 in 200 million

• If each transaction takes 1 second to verify, simulation of one day’s transactions takes 3.5 days

Statistical Testing

• The test data used needs to follow typical software usage patterns

• Error counts need to include both errors of omission (failing to do the right thing) and errors of commission (doing the wrong thing)

• Uncertainty when creating the operational profile

• High cost of generating the operational profile

• Statistical uncertainty problems when high reliabilities are specified

Six steps to SRE

1. Quantify product usage by specifying how frequently customers will use various features and how frequently various environmental conditions that influence processing will occur.

2. Define quality quantitatively with your customers by defining failures and failure severities and by specifying the balance among the key quality objectives of reliability, delivery date, and cost to maximize customer satisfaction.

3. Employ product usage data and quality objectives to guide design and implementation of your product and to manage resources to maximize productivity (i.e., customer satisfaction per unit cost).

4. Measure reliability of reused software and acquired software components delivered to you by suppliers, as an acceptance requirement.

5. Track reliability during test and use this information to guide product release.

6. Monitor reliability in field operation and use results to guide new feature introduction, as well as product and process improvement.

Why user opinion matters

• 80% of AT&T users: the most important quality attribute = RELIABILITY

• AT&T developed the operational profile idea

• SRE will help your project

• Satisfy customer needs more precisely.

• Having precise reliability requirements focuses development on meeting your customers’ reliability needs. Reliability requirements enable system testers to concretely verify that the finished product meets customers’ needs before it is released.

• Deliver earlier.

• Delivering the exact reliability needed by the customer avoids wasting time for unneeded extra testing.

• Increase productivity.

• By using the functional and operational profiles to focus resources on the high-usage functions or operations and by developing and testing for exactly the reliability needed, productivity is improved.

• Plan project resources better.

• Before testing begins, SRE supports prediction of the amount of system test resources needed, avoiding unnecessary waste and disruption due to unpleasant surprises.

SRE activities

•Determine functional profile

•Define and classify failures

•Identify customer reliability needs

•Conduct trade-off studies

•Set reliability objectives

Feasibility; Requirements and Development plan

•Allocate reliability among components

•Engineer to meet reliability objectives

•Focus resources based on functional profile

•Manage fault introduction and propagation

•Measure reliability of acquired software

Design and Implementation

•Determine operational profile

•Conduct reliability growth testing

•Track testing progress

•Project additional testing needed

•Certify reliability objectives are met

System test and Field Trial

•Project post-release staff needs

•Monitor field reliability vs. objectives

•Track customer satisfaction with reliability

•Time new feature introduction by monitoring reliability

•Guide product and process improvement with reliability measures

Post Delivery; Operation and Maintenance

Cost and release date trade-offs

People involved in SRE
