SOFTWARE
RELIABILITY Part 2
Where are we now?
• Previous course
• System vs software reliability
• Model
• Module vs operation mode
• Software Reliability Prediction
• Metrics
• Software FRACAS
• Musa model
• This course
• Operational profile
• Human reliability
• SRE best practices
Factors Influencing Software Reliability
• A user’s perception of the reliability of a software system
depends upon two categories of information.
• The number of faults present in the software.
• The ways users operate the system.
• This is known as the operational profile.
• The fault count in a system is influenced by the following.
• Size and complexity of code
• Characteristics of the development process used
• Education, experience, and training of development personnel
• Operational environment
Human error analysis and reliability assessment
• explore difficulties of use early in design with the aim of
improving design
• hence comparable with other usability and walkthrough techniques
• assessing likelihood of human error of a developed
design as part of an assessment process
• hence comparable with other reliability assessment techniques
• “It must be shown by analysis, substantiated where
necessary by test, that as far as reasonably practicable all
design precautions have been taken to prevent human
errors in production, maintenance and operation causing
hazardous or catastrophic effect”
• Used extensively in the nuclear power industry
Different approaches
Engineering approach
• Quantitative ‘decomposition’
• Human treated as a “component”
• The mechanistic assumption: “The human / mind as a fallible machine”
• The atomistic assumption: Human performance can be adequately described by considering individual elements of the performance. Total performance is an aggregate of the individual performance elements
Cognitive approach
• Models and theories of cognitive functions which underlie human behavior
• Cognitive psychology still immature
• Problem: human cognition is not directly observable
Quantification techniques
• HEART: a human performance model-based technique
utilizing some standard probabilities
• A data-based method for assessing and reducing human error to
improve operational performance.
• J.C. Williams (1988) IEEE Fourth Conference on Human Factors and
Power Plants (pp.436-450)
• Based on long-term sizeable human reliability database; weighting
factors based on HF literature.
• Assumes human performance usually deteriorates when Error
Producing Conditions (EPCs) interact
• SLIM: a utility-based technique using team based
judgments
• THERP: earliest method
HEART generic categories
Generic Task, followed by nominal human unreliability (5th-95th percentile bounds)
(A) Totally unfamiliar, performed at speed with no real idea of likely consequences 0.55 (0.35-0.97)
(B) Shift or restore system to a new or original state on a single attempt without supervision or procedures 0.26 (0.14-0.42)
(C) Complex task requiring high level of comprehension and skill 0.16 (0.12-0.28)
(D) Fairly simple task performed rapidly or given scant attention 0.09 (0.06-0.13)
(E) Routine, highly practiced, rapid task involving relatively low level of skill 0.02 (0.007-0.045)
(F) Restore or shift a system to original or new state following procedures, with some checking 0.003 (0.0008-0.007)
(G) Completely familiar, well-designed, highly practiced, routine task occurring several times per hour, performed to highest possible standards by highly motivated, highly-trained and experienced personnel, with time to correct potential error, but without the benefit of significant job aids 0.0004 (0.00008-0.009)
(H) Respond correctly to system command even when there is an augmented or automated supervisory system providing accurate interpretation of system state 0.00002 (0.000006-0.0009)
Error producing Conditions (EPCs) (selection) Factor
Unfamiliarity with a situation which is potentially important but which only occurs infrequently or which is novel 17
A shortage of time available for error detection and correction 11
A low signal-noise ratio 10
A means of suppressing or over-riding information or features which is too easily accessible 9
No obvious means of reversing an unintended action 8
A need to unlearn a technique and apply one which requires the application of an opposing philosophy 6
The need to transfer specific knowledge from task to task without loss 5.5
Ambiguity in the required performance standards 5
A means of suppressing or over-riding information or features which is too easily accessible 4
A mismatch between perceived and real risk. 4
No clear, direct and timely confirmation of an intended action from the portion of the system over which control is exerted. 4
Operator inexperience (e.g., a newly qualified tradesman but not an expert) 3
A mismatch between the educational achievement level of an individual and the requirements of the task 2
Little opportunity to exercise mind and body outside the immediate confines of a job 1.8
Little or no intrinsic meaning in a task 1.4
High level emotional stress 1.3
Evidence of ill-health amongst operatives especially fever. 1.2
Low workforce morale 1.2
A poor or hostile environment 1.15
Prolonged inactivity or highly repetitious cycling of low mental workload tasks (1st half hour) 1.1
(thereafter) 1.05
Disruption of normal work sleep cycles 1.1
Task pacing caused by the intervention of others 1.06
Additional team members over and above those necessary to perform task normally and satisfactorily. (per additional team member) 1.03
How does it all come together?
• Find out task level:
• (E) Routine, highly practiced, rapid task involving relatively low level of skill
• R=1-0.02
• Are there any EPCs?
• A mismatch between perceived and real risk. E1=4
• Little or no intrinsic meaning in a task E2=1.4
• Additional team members (per member) E3=1.03*3
• Assess proportion of EPC ( ≠ 1)
• P1= 0.5, P2=0.2, P3=0.5
• Assess effect = ((E-1)*P)+1
• F1=2.5, F2=1.08, F3=2.045
• Assessed probability of failure
• 0.02*2.5*1.08*2.045=0.11043
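The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference HEART tool: the function name `heart_probability` and the cap at 1.0 are assumptions added here, and the third EPC is scaled as 1.03*3 only because the slide does so.

```python
def heart_probability(nominal, epcs):
    """Return the assessed probability of failure for one task.

    nominal: nominal human unreliability of the generic task
    epcs: list of (factor, proportion) pairs, one per EPC
    """
    probability = nominal
    for factor, proportion in epcs:
        # Assessed effect of one EPC: ((E - 1) * P) + 1
        probability *= (factor - 1.0) * proportion + 1.0
    # Assumption: a probability cannot exceed 1, so cap the result.
    return min(probability, 1.0)


# Values from the slide: generic task (E) with three EPCs.
epcs = [
    (4.0, 0.5),       # mismatch between perceived and real risk
    (1.4, 0.2),       # little or no intrinsic meaning in the task
    (1.03 * 3, 0.5),  # three additional team members, scaled as on the slide
]
assessed = heart_probability(0.02, epcs)
print(round(assessed, 5))
```

Keeping each EPC as a (factor, proportion) pair makes it easy to see which condition dominates: here the perceived-risk mismatch alone multiplies the nominal value by 2.5.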
Human + software reliability
How do they interact?
The operational profile
• A software-based product’s reliability depends on just how
a customer will use it.
• Making a good reliability estimate depends on testing the product
as if it were in the field.
• The operational profile
• quantitative characterization of how a system will be used
• Works also for hardware, human components
• Can be used for the whole system
Who develops an operational profile?
• Developed by:
• systems engineers
• high-level designers (architecture)
• test planners
• Strong participation by:
• product planning
• marketing professionals
• key customers, if available
• Developed by John Musa et al at AT&T to guide testing.
• An AT&T PBX switching system combined an operational profile with other quality improvement techniques.
• Adopted by HP to re-organize system-test process for multi-processor operating system.
• HP system-test process revision reduced system-test time by 50%
• First published results, 1987; active use since.
Overall creation process
• A progressively narrowing perspective from customers down to operation
• At each step, quantify how often each of the elements in that step will be used; convert to probabilities
• Process has been refined many times. Most AT&T applications have been real-time telecommunications systems
• Profile = A set of disjoint alternatives and the probability that each will occur
• On the way to creating the operational profile, several intermediate
profiles will be produced
• The usage data is not a profile until you add the probability info
Example:
• 100 X-type transactions an hour
• 500 Y-type transactions an hour
• 300 Z-type transactions an hour
• Interesting, but not useful until you know the total number of
transactions per hour (here 2000), so you can compute probabilities
• 100/2000 = .05; 500/2000 = .25; 300/2000 = .15
• Completeness check: do the probabilities add up to 1?
• .05 + .25 + .15 = .45, so some transaction types are missing.
• (Note: Can use raw data to re-create appropriate traffic levels in
test.)
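A short sketch of the same conversion and completeness check, using the counts from the example. The names `to_profile`, `covered`, and `missing` are illustrative only.

```python
# Hourly transaction counts from the example; the total includes
# transaction types that have not been itemized yet.
counts = {"X": 100, "Y": 500, "Z": 300}
total_per_hour = 2000


def to_profile(counts, total):
    """Convert raw usage counts into occurrence probabilities."""
    return {name: n / total for name, n in counts.items()}


profile = to_profile(counts, total_per_hour)

# Completeness check: a full profile sums to 1.
covered = sum(profile.values())
missing = 1.0 - covered
print(profile, round(covered, 2), round(missing, 2))
```

A nonzero `missing` share signals that the list of alternatives is incomplete, which is exactly what the check on the slide detects.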
• How accurate does this (combination of data and probabilities) need to be?
Degree of Accuracy Required
• What is the economic gain expected from better decisions
resulting from more accurate data? (classic “risk
management” question!)
• In practice, often use “informed engineering judgment”
rather than formal economic analysis
• Emphasis on the word informed
• Infrequently executed functions of a highly critical nature
ARE important, e.g.,
• pilot ejection from cockpit
• overheating nuclear reactor shutdown procedures
• Must incorporate notion of criticality as well as use of
operations
5 steps to create the operational profile
1. customer profile
2. user profile
3. system-mode profile
4. functional profile
5. operational profile
• (based on all of the above)
The O. P. Triangle
[Figure: triangle with five levels, top to bottom: Customer groups, User groups, System-modes, Functional, Operational]
Example: retail store market
• Customer groups:
• large retail stores
• small chains
• grocery chains
• User groups:
• Cashiers
• marketing analysts
• I-S specialists
• System-modes:
• I-S specialists do database cleanup and
also report generation
• Functions:
• each mode has several functions (e.g.,
various reports in report generation mode)
• Note use of word function is from user perspective, i.e. user task
• Operations:
• user functions are mapped onto the
software product’s operations
Notes: Some steps may be unnecessary
Uniformity of detail is not required
Step 1: Customer Profile
The customer profile is the list of customer types and the associated probabilities. These probabilities are simply the proportions of time that each type of customer would be using the system.
Ex: software spreadsheet package. For instance, schools might use them for tabulating and updating student grades. Businesses might use them mainly for financial and operations controls. Home users could keep track of their monthly income and expenses, as well as investments and savings plans.
Customer Occurrence Probability
Educational Institution 0.45
Business Organization 0.35
Individual Home User 0.20
Step 2: User Profile
The user profile is the set of user types and their associated probabilities of using the system.
Within a customer group: use the proportion of the customer group’s usage that the user group represents.
If usage cannot be determined, use the number of users as a proportion of the total users in that group.
Combine same user groups found in different customer groups.
Customer Occurrence Probability (user shares within each customer group listed beneath it)
Educational Institution 0.45
Secretary 50%
Managers 30%
Teachers 20%
Business Organization 0.35
Secretary 40%
Managers 60%
Individual Home User 0.20
Individuals 100%
User Occurrence Probability
Secretary 0.5*0.45+0.4*0.35=0.365
Managers 0.3*0.45+0.6*0.35=0.345
Teachers 0.2*0.45=0.09
Other individuals 0.2
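The combination step can be sketched as follows: weight each user group's share by its customer group's probability, then add up the contributions of user groups that appear under several customers. The dictionary layout and the name `user_profile` are illustrative assumptions; the numbers are the ones from the tables.

```python
from collections import defaultdict

customer_profile = {
    "Educational Institution": 0.45,
    "Business Organization": 0.35,
    "Individual Home User": 0.20,
}

# Share of each customer group's usage attributed to each user group.
users_within_customer = {
    "Educational Institution": {"Secretary": 0.5, "Managers": 0.3, "Teachers": 0.2},
    "Business Organization": {"Secretary": 0.4, "Managers": 0.6},
    "Individual Home User": {"Other individuals": 1.0},
}


def user_profile(customers, users):
    """Combine same-named user groups across customer groups."""
    combined = defaultdict(float)
    for customer, p_customer in customers.items():
        for user, share in users[customer].items():
            combined[user] += p_customer * share
    return dict(combined)


profile = user_profile(customer_profile, users_within_customer)
for user, probability in profile.items():
    print(user, round(probability, 3))
```

Because each customer group's shares sum to 1 and the customer profile sums to 1, the combined user profile also sums to 1, which is a useful sanity check.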
Step 3: System Mode Profile
System mode Occurrence Probability
Batch Mode 0.35
User-Interactive Mode 0.65
A system mode is a way that a system can operate. The system includes both hardware and software. Most systems have more than one mode of operation. For example, system testing may take place in batch mode or user-interactive mode. An airplane flight consists of takeoff and ascent mode, level flight mode, and descent and landing mode. An automobile may be in normal mode or four-wheel drive; it may also be in normal mode or cruise control. System modes can be thought of as independent segments of a system operation or as different ways of using a system. A system can switch among modes sequentially, or it can permit several modes to operate concurrently, sharing the same system resources. For each system mode, if there are more than one or two, an operational profile (and sometimes a functional profile) should be developed. There are no technical limits on how many system modes may be established.
Short recap - Operational Profile Development
• Musa, J.D., “Operational Profiles in Software Reliability
Engineering,” IEEE Software Magazine, March 1993
Functional profile – 1/2
• After a good system mode profile has been developed, the focus should turn to evaluation of each system mode for the functions performed during that mode, and then assigning probabilities to each of the functions.
• Functions
• are essentially tasks that an external entity such as a user can perform
with the system.
• e.g., the user of an e-mail system would want the following functions: create
message, look up address, send message, open message
• are based on what activities the customer wants the system to be able to perform.
• Developing a functional profile is, in that respect, a part of developing
requirements.
• A functional profile need not have a defined number of functions, but generally contains 20 to more than a hundred. The number will vary based on project size, number of system modes, environmental considerations, and function breadth.
Functional profile – 2/2
• The functional profile can be either explicit or implicit, depending on the key input variables
• A key input variable is an external parameter which affects the execution path a software system traverses based on the different values the parameter takes on.
• Its values fall into ranges that cause different operations to be performed.
• These various ranges are referred to as levels.
• A profile is explicit if each element is designated by simultaneously specifying the levels of all key input variables needed for its identification.
• A profile is implicit if it is expressed by subprofiles of each key variable.
• That is, each key environmental parameter is assigned probabilities associated with the ranges it can legally use.
Example
Implicit Profile
Subprofile C
Key input variable value Occurrence probability
X1 0.6
X2 0.3
X3 0.1
Subprofile D
Key input variable value Occurrence probability
Y1 0.7
Y2 0.2
Y3 0.1
Explicit Profile
Key input variable value Occurrence probability
X1Y1 0.42
X2Y1 0.21
X1Y2 0.12
X3Y1 0.07
X1Y3 0.06
X2Y2 0.06
X2Y3 0.03
X3Y2 0.02
X3Y3 0.01
Suppose there are two key independent parameters, X and Y, each taking on three discrete values. Nine operations can be defined based on the combinations of the variables
The main advantage of using the implicit profile is that a significantly smaller number of elements need to be specified, as few as the sum of the number of levels of key input parameters.
The explicit profile can have as many as the product of the number of levels for each variable. For five variables with five levels, assuming complete independence, the implicit profile requires only 25 elements whereas the explicit profile would call for 5^5, or 3,125 elements.
In most cases it is not necessary to generate the explicit profile, because it exists by default from the implicit profile
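As a sketch of how the explicit profile follows from the implicit one, the snippet below multiplies the subprofile probabilities for every combination of levels. It assumes the key input variables are independent, as the example does; the helper name `explicit_profile` is illustrative.

```python
from itertools import product

# Subprofiles from the example (one per key input variable).
subprofile_x = {"X1": 0.6, "X2": 0.3, "X3": 0.1}
subprofile_y = {"Y1": 0.7, "Y2": 0.2, "Y3": 0.1}


def explicit_profile(*subprofiles):
    """Expand independent subprofiles into one explicit profile."""
    explicit = {}
    for combo in product(*(sp.items() for sp in subprofiles)):
        name = "".join(level for level, _ in combo)
        probability = 1.0
        for _, p in combo:
            probability *= p
        explicit[name] = probability
    return explicit


explicit = explicit_profile(subprofile_x, subprofile_y)

# Element counts: sum of levels (implicit) vs product of levels (explicit).
implicit_size = len(subprofile_x) + len(subprofile_y)
explicit_size = len(explicit)
print(implicit_size, explicit_size, round(explicit["X1Y1"], 2))
```

The size gap grows quickly: with five variables of five levels each, the same comparison gives 25 elements against 5**5 = 3125.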
How to develop a function list
• Construct work-flow chart showing overall process, including software, hardware, and people.
• The work flow shows the context and suggests necessary functions
• Usually done during requirements phase
• Basic requirements definition: ensure that almost all important input values (commands, their variables, global data) and environment variables are covered by the defined functions.
• Function differentiation is independent of that.
• The more refined the differentiation, the more detailed profile you obtain.
• Wait a second….
It’s not that simple…
1. Generate an initial function list
• features and capabilities needed by the users
• organized by functions relevant to each key input variable if an implicit profile is used
2. Determine environmental variables
• environmental variables characterize the conditions that influence the paths
traversed by a program, but do not correspond directly to features
• Ex: hardware configuration and traffic load
3. Create final function list
• environmental and feature variables should be examined for dependencies
• Partial dependencies can cause difficulties because all possible combinations of levels of both variables may need to be listed
• The final number of functions in the list is then calculated as the product of the number of functions in the initial list and the number of environmental variable levels, minus the combinations of initial functions and environmental variable values that do not occur.
4. Assign occurrence probabilities
Sample final function list
Function Environmental Variable
Standard Deviation X
Standard Deviation Y
Correlation X
Correlation Y
Analysis of Variance X
Analysis of Variance Y
Regression X
Regression Y
Functional Profile Segment
Final function list
Columns: Function; Chi-Square System Mode Occurrence Probability; Overall Occurrence Probability
Standard Deviation 0.60 0.12
Correlation 0.22 0.044
Analysis of Variance 0.10 0.02
Regression 0.08 0.016
Environmental Profile
Variable count Occurrence Probability
One (X) 0.6
Multiple (Y) 0.4
For the assignment of occurrence probabilities, the ideal data source consists of usage measurements taken on the latest release or a similar system. These measurements may be obtained from system logs or data storage devices. Occurrence probabilities computed with the historical data should be updated to account for new functions, users, or environments. In the event that a system is completely new the functional profile might be very inaccurate. It should still be developed, however, and updated as more is known about how the system will be operated. The process of predicting usage forces interaction with the customer, which can be very important. The required dialogue may highlight the relative importance of the various functions, indicating that some functions may not be necessary while others are most significant.
Reducing the number of functions should increase reliability
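Final Functional Profile Segment
Columns: Function; Environmental Variable; Overall Occurrence Probability
(Each value is the function's overall occurrence probability multiplied by the occurrence probability of the environmental variable.)
Standard Deviation X 0.072
Standard Deviation Y 0.048
Correlation X 0.0264
Correlation Y 0.0176
Analysis of Variance X 0.012
Analysis of Variance Y 0.008
Regression X 0.0096
Regression Y 0.0064
All adds up (actually, it multiplies)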
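The multiplication behind this table can be checked with a short sketch. It assumes, as the table does, that function choice and environmental variable are independent; the variable names are illustrative.

```python
# Overall occurrence probabilities from the functional profile segment.
functional_profile = {
    "Standard Deviation": 0.12,
    "Correlation": 0.044,
    "Analysis of Variance": 0.02,
    "Regression": 0.016,
}

# Environmental profile: one variable (X) or multiple variables (Y).
environmental_profile = {"X": 0.6, "Y": 0.4}

# Final functional profile: one entry per (function, environment) pair.
final_profile = {
    (function, env): p_function * p_env
    for function, p_function in functional_profile.items()
    for env, p_env in environmental_profile.items()
}

# The segment keeps the same total weight as before the split.
segment_total = sum(final_profile.values())
print(round(final_profile[("Standard Deviation", "X")], 3), round(segment_total, 3))
```

Splitting by environment doubles the number of entries but leaves the segment's total probability unchanged, which is a quick consistency check on a hand-built table.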
Final step: operational profile
• The functional profile is a user-oriented view of system capabilities. From the developers’ perspective, it is operations that actually implement the functions.
• Operations are usually the focus of testing.
• An operation represents a task being accomplished by the system from the viewpoint of the people who will test the system. To allocate testing effort and develop a test description, the operational profile must be available for the purposes of test planning.
How to get to the operational profile
1. Divide execution into runs
2. Identify input space
3. Partition input space
4. Occurrence Probabilities
• Test selections
• according to their occurrence probabilities
• Prioritize development
• This was a much more radical concept in 1993.
• Attention! operational profile may create an unrealistic set
of tests because the list of operations is too long
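Selecting tests according to occurrence probabilities amounts to weighted random sampling from the operational profile. The sketch below uses a made-up three-operation profile (`op_a`, `op_b`, `op_c` are hypothetical names, not from the slides) and a fixed seed so that a test run can be repeated.

```python
import random
from collections import Counter

# Hypothetical operational profile used only for illustration.
operational_profile = {"op_a": 0.7, "op_b": 0.2, "op_c": 0.1}


def select_tests(profile, n, seed=0):
    """Draw n operations, each with its occurrence probability."""
    rng = random.Random(seed)  # seeded so the selection is repeatable
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)


selected = select_tests(operational_profile, 1000)
frequency = Counter(selected)
print(frequency)
```

With enough draws, the share of tests per operation approaches its occurrence probability, so testing effort tracks field usage. Rare but critical operations may still need tests added by hand.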
Let’s reduce the number of operations
1. Reduce the number of run types.
  1. Reduce the size of the input variable list
    1. Reduce functionality.
    2. Reduce the number of possible hardware configurations.
    3. Restrict the environment the program must operate in.
    4. Reduce the number of fault types.
    5. Reduce unnecessary interactions between successive runs
      1. Minimize the input variables that application programs can access at any one time.
      2. Reinitialize variables between runs.
      3. Use synchronous, as opposed to asynchronous, design.
  2. Reduce the number of levels of the input variables
2. Increase the number of run types grouped per operation.
3. Ignore the remaining set of run types expected to have total occurrence probability appreciably less than the failure intensity objective
Example
• Financial and billing systems are commonly data driven.
• Suppose a cable television billing system was designed as an account processing system. This system processes the charge entries for each account for the current billing period and generates bills. The reliability to evaluate is the probability of generating a correct bill. This involves determining the reliability over the time required to process the bill and its entries.
• Assume that the design was not anticipated when the functional profile was developed, so the relationship between the functional profile and operational profile is complex. For instance, typical functions might have been bill processing, bill correction, and delinquency identification.
• The account-processing system has an operational profile that relates to account attributes. Its operations are classified by customer type (business or residential), service type (basic, expanded basic, premium package), and payment status (paid, delinquent).
• Assume that 90 percent of the customers are residential and 10 percent are businesses. Forty percent of the customers subscribe to the basic cable service. Half of all customers receive expanded basic, and the remaining 10 percent pay for the full premium package. History shows that 2 percent of the accounts are delinquent, on average.
Example
Operation Occurrence Probability
Residential, Expanded Basic, Paid 0.4410
Residential, Basic, Paid 0.3528
Residential, Premium, Paid 0.0882
Business, Expanded, Paid 0.0490
Business, Basic, Paid 0.0392
Business, Premium, Paid 0.0098
Residential, Expanded, Delinquent 0.0090
Residential, Basic, Delinquent 0.0072
Residential, Premium, Delinquent 0.0018
Business, Expanded, Delinquent 0.0010
Business, Basic, Delinquent 0.0008
Business, Premium, Delinquent 0.0002
Operations and the
associated probabilities
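The table can be reproduced from the three attribute distributions, assuming the attributes are independent (the example implies this by multiplying them). The ranking and the 99% coverage threshold below are illustrative additions, not part of the original example.

```python
from itertools import product

customer_type = {"Residential": 0.90, "Business": 0.10}
service_type = {"Basic": 0.40, "Expanded Basic": 0.50, "Premium": 0.10}
payment_status = {"Paid": 0.98, "Delinquent": 0.02}

# One operation per combination of account attributes.
operations = {
    (c, s, p): pc * ps * pp
    for (c, pc), (s, ps), (p, pp) in product(
        customer_type.items(), service_type.items(), payment_status.items()
    )
}

# Rank operations from most to least frequent.
ranked = sorted(operations.items(), key=lambda item: item[1], reverse=True)

# Assumed threshold: smallest set of top operations covering 99% of usage.
covered = 0.0
top = []
for operation, probability in ranked:
    top.append(operation)
    covered += probability
    if covered >= 0.99:
        break

for operation, probability in ranked:
    print(", ".join(operation), round(probability, 4))
```

Under these assumptions, a handful of operations carries almost all usage. The rare delinquent-account operations would be dropped by a pure frequency cut, which is where criticality has to be weighed alongside usage.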
SRE Best practices
(short)
Design for reliability
• Functional and Non-functional Requirements
• System functional requirements may specify
• error checking
• recovery features
• system failure protection
• Non-functional requirements
• System reliability
• Hardware reliability
• probability a hardware component fails
• Software reliability
• probability a software component will produce an incorrect output
• software does not wear out
• software can continue to operate after a bad result
• Operator reliability
• probability system user makes an error
• Availability
Examples:
• Functional Reliability Requirements
• The system will check all operator inputs to see that they fall within their
required ranges.
• The system will check all disks for bad blocks each time it is booted.
• The system must be implemented using a standard implementation of Ada
• Non-functional Reliability Specification
• The required level of reliability must be expressed quantitatively.
• Reliability is a dynamic system attribute.
• Source code reliability specifications are meaningless (e.g. N faults/1000 LOC)
• An appropriate metric should be chosen to specify the overall system reliability
• Probability of Failure on Demand (POFOD)
• POFOD = 0.001
• The service fails for one in every 1000 requests
• Rate of Occurrence of Failures (ROCOF)
• ROCOF = 0.02
• Two failures per 100 operational time units
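The two metrics differ mainly in their denominator: demands for POFOD, operational time for ROCOF. A minimal sketch with the example values follows; the function names and input validation are illustrative.

```python
def pofod(failed_demands, total_demands):
    """Probability of failure on demand: failures per service request."""
    if total_demands <= 0:
        raise ValueError("total_demands must be positive")
    return failed_demands / total_demands


def rocof(failures, operational_time_units):
    """Rate of occurrence of failures: failures per unit of operating time."""
    if operational_time_units <= 0:
        raise ValueError("operational_time_units must be positive")
    return failures / operational_time_units


pofod_example = pofod(1, 1000)  # one failed request in 1000
rocof_example = rocof(2, 100)   # two failures in 100 time units
print(pofod_example, rocof_example)
```

The time unit for ROCOF (seconds, days, transactions) has to be stated alongside the number, otherwise the value cannot be interpreted.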
Building Reliability Specification
1. For each sub-system, analyze the consequences of possible system failures
2. From the system failure analysis, partition failures into appropriate classes
3. For each class, set out the appropriate reliability metric
Columns: Failure Class; Example; Metric
Permanent, Non-corrupting; ATM fails to operate with any card, must restart to correct; ROCOF = .0001, Time unit = days
Transient, Non-corrupting; Magnetic stripe can't be read on undamaged card; POFOD = .0001, Time unit = transactions
Specification validation
• It is impossible to empirically validate high reliability
specifications
• No database corruption really means POFOD class < 1 in
200 million
• If each transaction takes 1 second to verify, simulation of
one day’s transactions takes 3.5 days
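The scale of the problem can be illustrated with a standard zero-failure argument, which is an addition here and not taken from the slides: if demands are independent and no failure is observed in n demands, supporting the claim POFOD < p with confidence C requires n >= ln(1 - C) / ln(1 - p). The 95% confidence level and the one-transaction-per-second rate are assumptions chosen for illustration.

```python
import math


def failure_free_demands(pofod_target, confidence):
    """Failure-free demands needed to support POFOD < pofod_target.

    Assumes independent demands and zero observed failures.
    """
    # log1p(-p) computes ln(1 - p) accurately for very small p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-pofod_target))


# Target from the slide: less than 1 failure in 200 million demands.
demands = failure_free_demands(1 / 200e6, 0.95)

# Duration if one transaction is verified per second, around the clock.
days_at_one_per_second = demands / 86_400
print(demands, round(days_at_one_per_second))
```

Under these assumptions the required test campaign runs to hundreds of millions of failure-free transactions, which is why such specifications cannot be validated empirically in practice.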
Statistical Testing
• Test data needs to follow typical software usage
patterns
• Error counts need to cover both errors
of omission (failing to do the right thing) and errors of
commission (doing the wrong thing)
• Uncertainty when creating the operational profile
• High cost of generating the operational profile
• Statistical uncertainty problems when high reliabilities are
specified
Six steps to SRE
1. Quantify product usage by specifying how frequently customers will use various features and how frequently various environmental conditions that influence processing will occur.
2. Define quality quantitatively with your customers by defining failures and failure severities and by specifying the balance among the key quality objectives of reliability, delivery date, and cost to maximize customer satisfaction.
3. Employ product usage data and quality objectives to guide design and implementation of your product and to manage resources to maximize productivity (i.e., customer satisfaction per unit cost).
4. Measure reliability of reused software and acquired software components delivered to you by suppliers, as an acceptance requirement.
5. Track reliability during test and use this information to guide product release.
6. Monitor reliability in field operation and use results to guide new feature introduction, as well as product and process improvement.
Why user opinion matters
• For 80% of AT&T users, the most important quality attribute = RELIABILITY
• AT&T developed the operational profile idea
• SRE will help your project
• Satisfy customer needs more precisely.
• Having precise reliability requirements focuses development on meeting your customers’ reliability needs. Reliability requirements enable system testers to concretely verify that the finished product meets customers’ needs before it is released.
• Deliver earlier.
• Delivering the exact reliability needed by the customer avoids wasting time for unneeded extra testing.
• Increase productivity.
• By using the functional and operational profiles to focus resources on the high-usage functions or operations and by developing and testing for exactly the reliability needed, productivity is improved.
• Plan project resources better.
• Before testing begins, SRE supports prediction of the amount of system test resources needed, avoiding unnecessary waste and disruption due to unpleasant surprises.
SRE activities
Feasibility; Requirements and Development plan
• Determine functional profile
• Define and classify failures
• Identify customer reliability needs
• Conduct trade-off studies
• Set reliability objectives
Design and Implementation
• Allocate reliability among components
• Engineer to meet reliability objectives
• Focus resources based on functional profile
• Manage fault introduction and propagation
• Measure reliability of acquired software
System test and Field Trial
• Determine operational profile
• Conduct reliability growth testing
• Track testing progress
• Project additional testing needed
• Certify reliability objectives are met
Post Delivery; Operation and Maintenance
• Project post-release staff needs
• Monitor field reliability vs. objectives
• Track customer satisfaction with reliability
• Time new feature introduction by monitoring reliability
• Guide product and process improvement with reliability measures
Cost and release date trade-offs
People involved in SRE