VALUE-BASED, DEPENDENCY-AWARE INSPECTION AND TEST
PRIORITIZATION
by
Qi Li
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)
December 2012
Copyright 2012 Qi Li
Dedication
To my parents
Acknowledgements
My Ph.D. dissertation could not have been completed without the support of many hearts and minds. I am deeply indebted to my Ph.D. advisor, Dr. Barry Boehm, for his great and generous support of all my Ph.D. research. I am deeply honored to be one of his students and to receive direct and close advice from him all the time. My sincere thanks are also extended to the other committee members, Dr. Stan Settles, Dr. Nenad Medvidovic, Dr. Richard Selby, Dr. William Halfond, and Dr. Sunita Chulani, for their invaluable guidance in focusing my research and their efforts in reviewing drafts of my dissertation.
Special thanks to my ISCAS advisors, Professor Mingshu Li, Professor Qing Wang, and Professor Ye Yang. They led me into the academic world, continuously encourage and support my research, and promote the in-depth collaborative research in our joint lab of USC-CSSE & ISCAS.
The realization of this research effort also owes much to the tremendous support of Dr. Jo Ann Lane and Dr. Ricardo Valerdi. In addition, this research could not have been conducted without support from the University of Southern California Center for Systems and Software Engineering courses and its corporate and academic affiliates. Special thanks to Galorath Incorporated and NFS-China for giving me the chance to apply this research to real industrial projects, and to the Year 2009-2011 students of the USC-CSSE graduate-level software engineering courses 577ab for their collaborative effort on the value-based inspection and testing experiments. To all my USC and ISCAS colleagues and friends: life could not be more colorful without you.
Lastly, from the bottom of my heart, I would like to thank my family for their unconditional love and support during my study.
ICSM phases: VC: Valuation Commitment; FC: Foundation Commitment; DC: Development Commitment; TRR: Transition Readiness Review; RDC: Rebaselined Development Commitment; IOC: Initial Operational Capability; TS: Transition & Support.
Artifacts developed and reviewed for this course: OCD: Operational Concept Description; SSRD: System and Software Requirements Description; SSAD: System and Software Architecture Description; LCP: Life Cycle Plan; FED: Feasibility Evidence Description; SID: Supporting Information Document; QMP: Quality Management Plan; IP: Iteration Plan; IAR: Iteration Assessment Report; TP: Transition Plan; TPC: Test Plan and Cases; TPR: Test Procedures and Results; UM: User Manual; SP: Support Plan; TM: Training Materials.
PRO    Most critical/important use cases    100%    100%          100%
SID    100%    100%    100%    100%
QMP    N/A     N/A     Section 1,2    100%
ATPC   N/A     N/A     N/A     100%
IP     N/A     N/A     N/A     100%
Year 2009 teams used a value-neutral formal V&V process (FV&V), a variant of the Fagan inspection practice [Fagan, 1976], to review the three artifact packages. The steps they followed are:
Table 10. Value-neutral Formal V&V process

Step 1: Create Exit Criteria: From the original team assignment's description and the related ICSM EPG completion criteria, generate a set of exit criteria that identify what needs to be present and the standard for acceptance of each document.

Step 2: Review and Report Concerns: Based upon the exit criteria, read (review) the documents and report concerns and issues in the Bugzilla [USC_CSSE_Bugzilla] system.

Step 3: Generate Evaluation Report
- Management Overview: List any features of the solution described in this artifact that are particularly good and of which a non-technical client should be aware.
- Technical Details: List any features of the solution described in this artifact that you feel are particularly good and of which a technical reviewer should be aware.
- Major Errors & Omissions: List the top 3 errors or omissions in the solution described in this artifact that a non-technical client would care about. The description of an error (or omission) should be understandable to a non-technical client and should explain why the error is worth the client's attention.
- Critical Concerns: List the top 3 concerns with the solution described in this artifact that a non-technical client would care about. The description of the concern should be understandable to a non-technical client and should explain why the client should be aware of it. You should also suggest step(s) that would reduce or eliminate your concern.
Year 2010 and 2011 teams applied the value-based, dependency-aware prioritization strategy to the review process, with the guidelines for inspection summarized in Table 11.
Table 11. Value-based V&V process

Step 1: Value-based V&V Artifacts Prioritization

Importance (5: most important; 3: normal; 1: least important)
- Without this document, the project can't move forward or could even fail; such a document should be rated with high importance.
- Some documents serve a supporting function; without them, the project could still move on. This kind of document should be rated with lower importance.

Quality Risk (5: highly risky; 3: normal; 1: least risky)
- Based on previous reviews, documents with intensive defects might still be fault-prone; this indicates a high quality risk.
- Personnel factors, e.g., the author of the document is not proficient or motivated enough; this indicates a high quality risk.
- A more complex document might have a high quality risk.
- A new document, or an old document with a large portion of newly added sections, might have a high quality risk.

Dependency (5: highly dependent; 3: normal; 1: not dependent)
- Sometimes some lower-priority artifacts are required to be reviewed, at least for reference, before reviewing a higher-priority one. For example, in order to review the SSAD or TPC, the SSRD is required for reference.
- Basically, the more documents this document depends on, the higher its Dependency rating is, and the lower its reviewing priority will be.

Review Cost (5: needs intensive effort; 3: needs moderate effort; 1: needs little effort)
- A new document, or an old document with a large portion of newly added sections, usually takes more time to review, and vice versa.
- A more complex document usually takes more time to review, and vice versa.

Determine Weights: Weights for each factor (Importance, Quality Risk, Review Cost, and Dependency) can be set according to the project context. The default value is 1.0 for each factor.

Priority Calculation: Priority = (Importance x Quality Risk) / (Review Cost x Dependency). E.g., for a document with Importance = 5, Quality Risk = 3, Review Cost = 2, Dependency = 1 and default weights, Priority = (5*3)/(2*1) = 7.5. A spreadsheet [USC_577a_VBV&VPS, 2010] helps to calculate the priority automatically; the 5-level ratings for each factor are VH, H, M, L, VL with values from 5 to 1, and intermediate values 2 and 4 are also allowed.

Step 2: Review artifacts based on prioritization and report defects/issues
- The artifact with the higher priority value should be reviewed first.
- For each document's review, review the core part of the document first. Report issues in the Bugzilla system [USC_CSSE_Bugzilla].

Step 3: List top 10 defects/issues
- List the top 10 highest-risk defects or issues based on the issues' priority and severity.
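To make the Step 1 arithmetic concrete, the following minimal Python sketch reproduces the priority calculation with the default weights of 1.0; the spreadsheet's handling of non-default weights is not reproduced here, and the rating scale and worked example are from Table 11.

```python
# A minimal sketch of the Table 11 priority calculation. Ratings use the
# 5-level scale VH/H/M/L/VL = 5/4/3/2/1; intermediate values 2 and 4 are
# also allowed. With the default weights of 1.0 this reduces to
# (Importance * Quality Risk) / (Review Cost * Dependency).
RATING = {"VH": 5, "H": 4, "M": 3, "L": 2, "VL": 1}

def review_priority(importance, quality_risk, review_cost, dependency):
    num = lambda r: RATING.get(r, r)  # accept "VH".."VL" labels or raw numbers
    return (num(importance) * num(quality_risk)) / (
        num(review_cost) * num(dependency))

print(review_priority(5, 3, 2, 1))             # Table 11 example: 7.5
print(review_priority("VH", "VH", "VH", "L"))  # SSRD row of Table 12: 2.5
```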
A real example of artifact prioritization in one package review by a 2010 team [USC_577a_VBV&VAPE, 2010] is displayed in Table 12. The default weight of 1.0 for each factor is used. Based on the calculated priorities, the reviewing order is SSRD, OCD, PRO, SSAD, LCP, FED, SID. SSRD has the highest reviewing priority, with the rationales provided: SSRD contains the requirements of the system; without this document, the project can't move forward and could even fail (Very High Importance). It is a complex document and needs to be consistent with the win conditions negotiation, which might not be complete at this point; also, a lot of rework was required based on comments from the TA (Very High Quality Risk). SSRD depends on few other artifacts (Low Dependency). It is an old document, but it is complex, with a lot of rework (Very High Review Cost).
Table 12. An example of value-based artifact prioritization

LCP
- Importance: M. This document gives the life cycle plan of the project. It serves a supporting function; without it, the project could still move on, but with this document, the project could move more smoothly.
- Quality Risk: L. Based on previous reviews, the author of this document has a strong sense of responsibility.
- Dependency: L.
- Review Cost: M. A lot of new sections were added, but this document is not very complex.
- Priority: 1.00

OCD
- Importance: H. This document gives the overall operational concept of the system. It is important, but it is not critical for the success of the system.
- Quality Risk: VH. This is a complex document, and a lot of its sections needed to be redone based on the comments received from the TA.
- Dependency: M (SSRD).
- Review Cost: H. Old document, but a lot of rework done.
- Priority: 1.67

FED
- Importance: H. This document should be rated high because it provides feasibility evidence for the project. Without it, we don't know whether the project is feasible.
- Quality Risk: H. The author of this document does not have appropriate time to complete it with quality work.
- Dependency: H (SSRD, SSAD).
- Review Cost: H. A lot of new sections were added to this version of the document.
- Priority: 1.00

SSRD
- Importance: VH. This document contains the requirements of the system. Without it, the project can't move forward and could even fail.
- Quality Risk: VH. This is a complex document. It needs to be consistent with the win conditions negotiation, which might not be complete at this point. Also, a lot of rework was required based on comments from the TA.
- Dependency: L.
- Review Cost: VH. This is an old document, but it is complex, with a lot of rework.
- Priority: 2.50

SSAD
- Importance: VH. This document contains the architecture of the system. Without it, the project can't move forward and could even fail.
- Quality Risk: VH. This is a complex document and it is a new document. The author did not know that it was due until the morning of the due date.
- Dependency: H (SSRD, OCD).
- Review Cost: VH. This is an old document, but it is complex, with a lot of rework done for this version.
- Priority: 1.25

SID
- Importance: VL. This document serves a supporting function; without it, the project could still move on, but the project could move on more smoothly with it.
- Quality Risk: L. This is an old document; only additions were made to existing sections.
- Dependency: VH (OCD, SSRD, FED, LCP, SSAD, PRO).
- Review Cost: VL. This is an old document, and it has no technical contents.
- Priority: 0.40

PRO
- Importance: H. Without this document, the project can probably move forward, but the system might not be what the customer is expecting. This document allows the customer to have a glimpse of the system.
- Quality Risk: L. This is an old document with little new content. The author has a high sense of responsibility, and he fixed bugs from the last review in reasonable time.
- Dependency: M (FED).
- Review Cost: L. This is an old document with little content added since the last version and not much rework required.
- Priority: 1.33
An example of the Top 10 issues produced by this team for the CoreFCP evaluation is displayed in Table 13. These Top 10 issues are communicated in a timely manner to the artifact authors to attract sufficient attention. An interesting finding is the relation between the artifact priority sequence and the Top 10 issue sequence: the issues with higher impact usually exist in the artifacts with high priority, showing that the artifact prioritization enables reviewers to focus on issues with high impact, at least in this context. However, the Top 10 list also helps avoid the potential problem of neglecting high-impact issues in lower-priority artifacts, as in Issues 8 and 10.
Table 13. An example of Top 10 Issues

1. SSRD: Missing important requirements. A lot of important requirements are missing. Without these requirements, the system will not succeed.
2. SSRD: Requirement supporting information too generic. The output, destination, precondition, and postcondition should be defined better. These descriptions will allow the development team and the client to better understand the requirements. This is important for system success.
3. SSAD: Wrong cardinality in the system context diagram. The cardinality of this diagram needs to be accurate, since it describes the top level of the system context. This is important for system success.
4. OCD: The client and client advisor stakeholders should be concentrating on the deployment benefits. It is important that the benefits chain diagram accurately show the benefits of the system during deployment, so that the client can show it to potential investors to gather funds to support the continuation of system development.
5. OCD: The system boundary and environment are missing support infrastructure. It is important for the system boundary and environment diagram to capture all necessary support infrastructure, so that the team considers all risks and requirements related to the system support infrastructure.
6. FED: Missing use case references in the FED. The capability feasibility table proves the feasibility of all system capabilities to date. References to the use cases are important for the key stakeholders to understand the capabilities and their feasibility.
7. FED: Incorrect mitigation plan. Mitigation plans for project risks are important to overcome the risks. This is important for system success.
8. LCP: Missing skills and roles. The LCP did not identify the skills and roles required for next semester. This information is important for the success of the project, because the next-semester team can use it to recruit new team members who meet the identified skill needs.
9. FED: CR# in FED doesn't match CR# in SSRD. The CR numbers need to match in both FED and SSRD for correct requirement references.
10. LCP: COCOMO drivers rework. COCOMO driver values need to be accurate to give the client a better estimate.
The three-year experiment issue data for the evaluations of the CoreFCP, DraftFCP, and FC/DCP from a total of 35 teams was collected and extracted from the Bugzilla database. The generic term "Issue" covers both "Concerns" and "Problems": if the IV&Vers find any issue, they report it as a "Concern" in Bugzilla and assign it to the relevant artifact author, and the author determines whether the concern is a problem or not.

As mapped in Table 14, Severity is rated High (corresponding to the Bugzilla ratings Blocker, Critical, and Major), Medium (corresponding to the Bugzilla rating Normal), or Low (the Bugzilla ratings Minor, Trivial, and Enhancement), with values from 3 to 1. Priority is rated High (Resolve Immediately), Medium (Normal Queue), or Low (Not Urgent, Low Priority, Resolved Later), with values from 3 to 1. The Impact of an issue is the product of its Severity and Priority; the impact of an issue with high severity and high priority is 9. The impact of an issue is therefore an element of the set {1, 2, 3, 4, 6, 9}.
Table 14. Issue Severity & Priority rating mapping

Measurement  Rating  Rating in Bugzilla                         Value
Severity     High    Blocker, Critical, Major                   3
             Medium  Normal                                     2
             Low     Minor, Trivial, Enhancement                1
Priority     High    Resolve Immediately                        3
             Medium  Normal Queue                               2
             Low     Not Urgent, Low Priority, Resolved Later   1
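For concreteness, the Table 14 mapping and the resulting Impact score can be written as a small sketch; the helper function itself is illustrative, not part of the Bugzilla tooling.

```python
# Map Bugzilla severity/priority ratings to the 3/2/1 scale of Table 14,
# then compute Impact = Severity value * Priority value.
SEVERITY = {"Blocker": 3, "Critical": 3, "Major": 3,
            "Normal": 2,
            "Minor": 1, "Trivial": 1, "Enhancement": 1}
PRIORITY = {"Resolve Immediately": 3,
            "Normal Queue": 2,
            "Not Urgent": 1, "Low Priority": 1, "Resolved Later": 1}

def impact(severity: str, priority: str) -> int:
    return SEVERITY[severity] * PRIORITY[priority]

print(impact("Critical", "Resolve Immediately"))  # 9
print(impact("Normal", "Low Priority"))           # 2
```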
The artifact author determines whether an issue needs fixing by choosing an option for "Resolution", as displayed in Table 15. Whether an issue is a problem or not is thus easily determined by querying the issue's "Resolution": "Fixed" and "Won't Fix" mean the issue is a problem, and the other two options mean that it is not.
Table 15. Resolution options in Bugzilla

Fixed: If the issue is a problem, fix the problem in the artifact and then choose "Fixed".
Won't Fix: If the issue is a problem but won't be fixed this time, choose "Won't Fix" and provide a clear reason in "Additional Comments" why it can't be fixed this time.
Invalid: If the issue is not a problem, choose "Invalid" and provide a clear reason in "Additional Comments".
WorksForMe: If the issue really works fine, choose "WorksForMe" and let the IV&Ver review it again.
4.3. Results

The measures in Table 16 are used to compare the performance of the 2011 and 2010 value-based review processes against the 2009 value-neutral one. The main goal of the value-based review or inspection is to increase the review cost effectiveness as defined in Chapter 3.
Table 16. Review effectiveness measures

Number of Concerns: The number of concerns found by reviewers.
Number of Problems: The number of problems found by reviewers.
Number of Concerns per reviewing hour: The number of concerns found by reviewers per reviewing hour.
Number of Problems per reviewing hour: The number of problems found by reviewers per reviewing hour.
Review Effort: Effort spent on all activities in the package review.
Review Effectiveness of total Concerns: As defined in Chapter 3, but for concerns.
Review Effectiveness of total Problems: As defined in Chapter 3, but for problems.
Average of Impact per Concern: Review Effectiveness of total Concerns / Number of Concerns.
Average of Impact per Problem: Review Effectiveness of total Problems / Number of Problems.
Review Cost Effectiveness of Concerns: As defined in Chapter 3, but for concerns.
Review Cost Effectiveness of Problems: As defined in Chapter 3, but for problems.
Table 17 to Table 22 list the 35 teams' performance across the three years on the different measures for concerns; the problems data is similar and is not listed here due to space limitations. Mean and standard deviation values are given at the bottom of each measure.
Table 17. Number of Concerns

2011 Teams        2010 Teams        2009 Teams
T-1     180       T-1     141       T-1      58
T-3      82       T-2     198       T-2      45
T-4     138       T-3      53       T-3     102
T-5     211       T-4      33       T-4      87
T-6      38       T-5      60       T-5      32
T-7      78       T-6     116       T-6      58
T-8     117       T-7      98       T-7     103
T-9     163       T-8      94       T-8     119
T-10     80       T-9     157
T-11    148       T-10     61
T-12     58       T-11    108
T-13    147       T-12     41
T-14     44       T-13     34
                  T-14     33
Mean  114.15      Mean   99.13      Mean   74.14
Stdev  54.99      Stdev  53.28      Stdev  38.75
Table 18. Number of Concerns per reviewing hour

2011 Teams        2010 Teams        2009 Teams
T-1     4.81      T-1     2.79      T-1     0.81
T-3     1.86      T-2     3.07      T-2     1.25
T-4     5.17      T-3     1.22      T-3     2.15
T-5     7.54      T-4     1.12      T-4     1.43
T-6     1.10      T-5     1.08      T-5     0.79
T-7     2.41      T-6     3.02      T-6     1.17
T-8     3.74      T-7     2.89      T-7     1.46
T-9     6.15      T-8     1.46      T-8     2.08
T-10    4.88      T-9     2.18
T-11    7.22      T-10    1.14
T-12    2.32      T-11    1.60
T-13    5.08      T-12    1.53
T-14    1.90      T-13    0.75
                  T-14    0.69
Mean    4.17      Mean    2.08      Mean    1.36
Stdev   2.12      Stdev   0.93      Stdev   0.52
Table 19. Review Effort

2011 Teams        2010 Teams        2009 Teams
T-1    37.44      T-1    50.5       T-1    71.2
T-3    44.06      T-2    64.6       T-2    36.1
T-4    26.69      T-3    43.5       T-3    47.5
T-5    27.98      T-4    29.5       T-4    61
T-6    34.6       T-5    55.35      T-5    40.5
T-7    32.4       T-6    38.4       T-6    49.5
T-8    31.25      T-7    33.95      T-7    70.5
T-9    26.5       T-8    64.3       T-8    57.2
T-10   16.4       T-9    72
T-11   20.5       T-10   53.5
T-12   25         T-11   67.5
T-13   28.95      T-12   26.85
T-14   23.1       T-13   45.5
                  T-14   48
Mean   28.84      Mean   47.51      Mean   53.35
Stdev   7.30      Stdev  13.37      Stdev  13.97
Table 20. Review Effectiveness of total Concerns

2011 Teams        2010 Teams        2009 Teams
T-1     888       T-1     790       T-1     242
T-3     396       T-2     872       T-2     186
T-4     527       T-3     233       T-3     334
T-5    1153       T-4     147       T-4     349
T-6     139       T-5     233       T-5     151
T-7     331       T-6     480       T-6     186
T-8     487       T-7     404       T-7     486
T-9     811       T-8     406       T-8     422
T-10    333       T-9     631
T-11    646       T-10    229
T-12    226       T-11    442
T-13    562       T-12    160
T-14    191       T-13    133
                  T-14    137
Mean  514.62      Mean  445.63      Mean  292.00
Stdev 297.92      Stdev 263.08      Stdev 155.05
Table 21. Average of Impact per Concern

2011 Teams        2010 Teams        2009 Teams
T-1     4.93      T-1     5.60      T-1     4.17
T-3     4.83      T-2     4.40      T-2     4.13
T-4     3.82      T-3     4.40      T-3     3.27
T-5     5.46      T-4     4.45      T-4     4.01
T-6     3.66      T-5     3.88      T-5     4.72
T-7     4.24      T-6     4.14      T-6     3.21
T-8     4.16      T-7     4.12      T-7     4.72
T-9     4.98      T-8     4.32      T-8     3.55
T-10    4.16      T-9     4.02
T-11    4.36      T-10    3.75
T-12    3.90      T-11    4.09
T-13    3.82      T-12    3.90
T-14    4.34      T-13    3.91
                  T-14    4.15
Mean    4.36      Mean    4.42      Mean    3.97
Stdev   0.54      Stdev   0.52      Stdev   0.44
Table 22. Review Cost Effectiveness of Concerns

2011 Teams        2010 Teams        2009 Teams
T-1    23.72      T-1    15.64      T-1     3.40
T-3     8.99      T-2    13.50      T-2     5.15
T-4    19.75      T-3     5.36      T-3     7.03
T-5    41.21      T-4     4.98      T-4     5.72
T-6     4.02      T-5     4.21      T-5     3.73
T-7    10.22      T-6    12.50      T-6     3.76
T-8    15.58      T-7    11.90      T-7     6.89
T-9    30.60      T-8     6.31      T-8     7.38
T-10   20.30      T-9     8.76
T-11   31.51      T-10    4.28
T-12    9.04      T-11    6.55
T-13   19.41      T-12    5.96
T-14    8.27      T-13    2.92
                  T-14    2.85
Mean   18.66      Mean    9.30      Mean    5.31
Stdev  10.94      Stdev   4.53      Stdev   1.86
Table 23 compares the mean and standard deviation values for all the measures across the three years' teams. To determine whether the differences between years on a given measure are statistically significant, Table 24 compares each pair of years' data using the F-test and T-test. The F-test determines whether two samples have different variances: if the significance (p-value) for the F-test is 0.05 or below, the two samples have different variances. This determines which type of T-test is used to check whether the two samples have the same mean; the two types are the two-sample equal-variance (homoscedastic) and the two-sample unequal-variance (heteroscedastic) T-tests. If the significance (p-value) for the T-test is 0.05 or below, the two samples have different means. For example, Table 24 shows that 2010's value-based review teams had a 75.04% higher Review Cost Effectiveness of Concerns than 2009's value-neutral teams. The F-test p-value of 0.0060 leads to choosing the two-sample unequal-variance T-test, and the T-test p-value of 0.0218 is strong evidence (well below 0.05) that the 75.04% improvement is statistically significant; the same holds for the comparison between 2011 and 2009 (F-test p = 0.0000, T-test p = 0.0004). This rejects hypothesis H-r1.
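The F-test-then-T-test procedure can be reproduced roughly as follows with SciPy, using the 2010 and 2009 columns of Table 22. The exact tool and tail conventions behind Table 24 are not stated here, so this two-tailed variance-ratio sketch may give slightly different p-values.

```python
import numpy as np
from scipy import stats

# Review Cost Effectiveness of Concerns (Table 22), 2010 vs. 2009 teams.
y2010 = np.array([15.64, 13.50, 5.36, 4.98, 4.21, 12.50, 11.90,
                  6.31, 8.76, 4.28, 6.55, 5.96, 2.92, 2.85])
y2009 = np.array([3.40, 5.15, 7.03, 5.72, 3.73, 3.76, 6.89, 7.38])

# F-test for equality of variances (two-tailed variance-ratio test).
f = np.var(y2010, ddof=1) / np.var(y2009, ddof=1)
dfn, dfd = len(y2010) - 1, len(y2009) - 1
p_f = 2 * min(stats.f.sf(f, dfn, dfd), stats.f.cdf(f, dfn, dfd))

# If the F-test rejects equal variances at 0.05, use the unequal-variance
# (Welch) T-test; otherwise the equal-variance (Student) T-test.
equal_var = p_f > 0.05
t, p_t = stats.ttest_ind(y2010, y2009, equal_var=equal_var)
print(f"F-test p={p_f:.4f}, equal_var={equal_var}, T-test p={p_t:.4f}")
```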
Table 23. Data Summaries based on all Metrics

                                        2011 Teams         2010 Teams         2009 Teams
Measure                                 Mean     Stdev     Mean     Stdev     Mean     Stdev
Number of Concerns                      114.15   54.99     99.13    53.28     74.14    38.75
Number of Problems                      108.62   52.81     93.38    52.96     68.79    35.35
Number of Concerns per reviewing hour   4.17     2.12      2.08     0.93      1.36     0.52
Number of Problems per reviewing hour   3.96     2.04      1.96     0.92      1.26     0.48
Review Effort                           28.84    7.30      47.51    13.37     53.35    13.97
Review Effectiveness of total Concerns  514.62   297.92    445.63   263.08    292.00   155.05
Review Effectiveness of total Problems  491.85   287.84    416.25   254.15    272.07   141.78
Average of Impact per Concern           4.36     0.54      4.42     0.52      3.97     0.44
Average of Impact per Problem           4.37     0.57      4.37     0.52      3.99     0.45
The first row, TP (RRL), in Table 34 shows the testing order we followed, testing the scenarios with higher RRL first. This order enabled us to focus the limited effort on testing the more frequently used scenarios with a higher risk probability of failing, and was expected to improve the testing efficiency, especially when testing time and resources are limited. The testing results obtained using the value-based testing prioritization strategy are shown in Table 35 and Table 36. Due to the schedule constraint, and following the TP order, we did not do a thorough test of the WinXP (x32) virtual machine running on a Vista (x32) host, nor of the Vista (x64) host machine: both have the lowest frequency of use, so they can be skipped if time runs out. Win7 (x32) was never tested on a physical host, but it is expected to pass, since its virtual machine copy, which should have even lower performance, passed the testing. Besides, installing Win7 (x32) on a host machine just to test it would have cost more time, and we could not then have finished the other scenario tests, which have higher TP and do not require installing a new OS before testing. The testing strategy therefore combines all the critical factors and makes the testing results near-optimal under scarce testing resources.
Table 35. Testing Results

Local Installation

Host Machine         Virtual machine(s) running on that host
WinXP (x32): pass    Vista (x32): pass
Win7 (x64): pass     WinXP (x32): pass; Win7 (x32): pass; Vista (x32): pass
Vista (x32): pass    WinXP (x32): never tested; we were running out of time, and its frequency of use is the lowest, so there is no need to test it when testing time is limited
Vista (x64): never tested; we do not even have a VM for it, we were running out of time, and its frequency of use is the lowest, so there is no need to test it when testing time is limited
Win7 (x32): never tested; we do not have a host machine for it, but it is expected to pass, since its VM passed
Table 36. Testing Results (continued)

Server Installation

Host Machine    Virtual machines running on that host
Win 7 (64)      WinServer 2003 x32: pass; WinServer 2008 x64: pass; WinServer 2008 x32: pass
Figure 16 shows the results of the value-based testing prioritization compared with two other situations that might also be common in test planning. The three situations compared are:

Situation 1: the value-based testing prioritization strategy. This is exactly what we did for the macro testing at Galorath, Inc., using the value-based scenario testing strategy: we followed the Testing Priority (TP) order. Since our testing time was limited, we had to stop testing when the Accumulated Cost (AC) reached 18 units, as shown in Figure 16. At this point, the Percentage of Business Importance Earned (PBIE) is as high as 93.83%.

Situation 2: the reverse of the value-based, risk-driven testing strategy. This situation's testing order is the reverse of Situation 1; when the AC reaches 18 units, the PBIE is only 22.22%. This is the worst case, but it might also be a common value-neutral situation in practice.

Situation 3: a partial value-based prioritization. The prioritization in Situation 1 brings all variables into the value-based testing prioritization: it prioritizes not only the various operating systems, but also the different products and installation types. In Situation 3, by contrast, we still prioritize products and operating systems, but we assume that all installation types are equally important, so the client installation type, which had been shown to be defect-free, is also tested. The results show a significant difference: when the AC reaches 18 units, the PBIE is only 58.02%, because much of the testing effort is wasted on the defect-free installation type. In fact, this "partial" value-based prioritization is common in practice: testing managers often do prioritize tests, but the way they prioritize is often intuitive and tends to leave some factors out of the prioritization, so this situation can represent the most common situations in practice as well. Since it still treats all installation types as equally important, we still consider it a value-neutral one, to differentiate it from the "complete, systematic, comprehensive and integrated" value-based prioritization of Situation 1.
Figure 16. Comparison among 3 Situations (PBIE-1, PBIE-2, and PBIE-3 curves of PBIE versus accumulated cost from 8 to 30 units, with the stopping point marked at AC = 18)
Table 37 compares the APBIE of the three situations; value-based testing prioritization is clearly the best in terms of APBIE. The case study at Galorath, Inc. validates that value-based prioritization can improve scenario testing's cost-effectiveness in terms of APBIE.

Table 37. APBIE Comparison

Situation 1 (Value-based)     70.99%
Situation 2 (Inverse Order)   10.08%
Situation 3 (Value-neutral)   32.10%
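The PBIE curve and APBIE of any candidate testing order can be computed along the following lines. This is a minimal sketch assuming that APBIE is the average of PBIE sampled after each unit of accumulated cost, per the Chapter 3 definitions; the three-item data is illustrative rather than the Galorath data.

```python
from typing import List, Tuple

def pbie_curve(items: List[Tuple[float, float]]) -> List[float]:
    """items: (business importance, cost) pairs in planned testing order.
    Returns PBIE sampled after each whole unit of accumulated cost; an
    item's BI is credited in the cost unit where its test completes."""
    total_bi = sum(bi for bi, _ in items)
    curve, earned = [], 0.0
    for bi, cost in items:
        units = max(1, int(round(cost)))
        curve.extend([earned / total_bi] * (units - 1))
        earned += bi
        curve.append(earned / total_bi)
    return curve

def apbie(curve: List[float]) -> float:
    return sum(curve) / len(curve)

# Illustrative example: three scenarios tested in descending value order.
order = [(50.0, 2.0), (30.0, 3.0), (20.0, 5.0)]
curve = pbie_curve(order)
print([f"{p:.0%}" for p in curve], f"APBIE = {apbie(curve):.2%}")
```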
The PBIE curves of other value-neutral (or partially value-based) situations should lie between Situation 1 and Situation 2 in Figure 16, and are representative of the most common situations in reality. From the comparative analysis, we can reject hypothesis H-t1, which means that value-based prioritization can improve the testing cost-effectiveness.
5.4. Lessons Learned

Integrate and leverage the merits of state-of-the-art test prioritization techniques: in this case study, we synthetically incorporated the merits of various test prioritization techniques to maximize the testing cost effectiveness, i.e., coverage-based and defect-proneness-driven prioritization and, most importantly, the incorporation of business value into the testing prioritization. The value-based testing strategy introduced here is not independent of other prioritization techniques; on the contrary, it is a synthesis of the merits of those techniques, with a focus on bridging the gap between the business or mission value from customers and the testing process.
Think about trade-offs for automated testing at the same time: from our experience establishing automated testing at Galorath, Inc. in this case study, we can also see that establishing automated testing is a high-risk as well as a high-investment project [Bullock, 2000]. Test automation is itself software development, which can also be expensive and fault-prone, and which faces its own evolution and maintenance problems. Furthermore, automated testing usually treats every scenario as equally important.

However, the combination of value-based test prioritization and automated testing might be a promising strategy and could further improve the testing cost-effectiveness. For example, if adopting the value-based test case prioritization strategy shrinks the testing scope by 60%, the remaining tedious manual testing effort can be further replaced by a small initial investment in automated scripts that let the tests run overnight and save 90% of the human effort. By combining value-based test case prioritization and automated testing, the cost is then reduced to (1-60%)*(1-90%) = 4%, a factor-of-25 improvement in RRL. Still, this is a trade-off question of how much automated testing is enough, based on its savings versus the investment needed to establish it.

In fact, any testing strategy has its own advantages; the most important thing for testing practitioners is to have a strong sense of how to combine the merits of these strategies to continuously improve the testing process.
Team work is recommended to determine ratings: prioritization factor ratings, i.e., the ratings of business importance, risk probability, and testing cost, should not be determined by a single person; this might introduce subjective bias that could make the prioritization misleading. Ratings should be discussed and brainstormed at team meetings with more stakeholders involved, in order to acquire more comprehensive information, resolve disagreements, and negotiate to consensus. For example, if we had not sent out the questionnaire to get the frequency of use of each scenario, we would have treated all scenarios as equally important and could not have finished all the testing in the limited time. The worst outcome would have been installing some seldom-used operating systems, testing the macros on them, and finally finding that there was no need to test them at all. The same holds for risk probability: if we had not known that the Client installation did not need to be tested, because it had seldom failed before and was assumed to be defect-free, a large amount of testing effort would have been put into this unnecessary testing. So team work to discuss and understand the project under test is very important for determining the testing scope and testing order.
Business case analysis is based on project contexts: from these empirical studies so far, the most difficult, yet most flexible, part is determining the business importance of the testing items via business case analysis. The business case analysis can be implemented with various methods, considering their ease of use and adaptation to each experiment's environment. For example, in this case study of value-based testing scenario prioritization, we use frequency of use (FU) combined with product importance as a variant of business importance for operational scenarios. In the case study of value-based feature prioritization for software testing in Chapter 6, Karl Wiegers' requirement prioritization approach [Wiegers, 1999] is adopted, which considers both the positive benefit of the presence of a feature and the negative impact of its absence. In the case study of value-based test case prioritization in Chapter 7, the classic S-curve production function with segments of investment, high payoff, and diminishing returns [Boehm, 1981] is used to train students for their project features' business case analysis, with the Kano model [Kano] as a reference to complement their analysis for feature business importance ratings; each test case's business importance is then determined by the importance of its corresponding functions, components, or features, and by the test case's usage, i.e., whether it tests the core function of the feature or not. As for the case study of determining the priority of artifacts (system capabilities) in Chapter 3, the business importance is tailored to ratings of their influence/impact on the project's success. The similarity among these different business case analyses is that all use well-defined, context-based relative business importance ratings.
Additional prioritization effort is a trade-off as well: prioritization can be as easy as in this case study, or it can be more deliberate. Too much effort on prioritization might bring diminishing testing cost-effectiveness. "How much is enough" depends on the project context and how easily we can get the information required for prioritization. It should be kept in mind that value-based testing prioritization aims at saving effort, not increasing it. In this case study, the information required for the prioritization came from expert estimation (project managers, the product manager, and project developers) at little cost, yet generated high pay-offs for the limited testing effort. However, to apply this method to large-scale projects, which might have thousands of test items to be prioritized, there has to be a consensus mechanism to collect all the data. We have started to implement automated support for this method's application to large-scale industrial projects. This automation is designed to support establishing traceability among requirements, code, test cases, and defects, so that business importance ratings for requirements can be reused for test items, and code change and defect data can be used for predicting risk probability. The automation will also support sensitivity analysis for judging the correctness of ratings and how rating changes impact the testing order. It is intended to generate recommended ratings, in order to save effort and provide reasonable ratings that facilitate value-based testing prioritization.
Chapter 6: Case Study III-Prioritize Software Features to be Functionally Tested

6.1. Background

This case study on prioritizing features for testing was carried out during the system and acceptance testing phase of one of the main releases of an industrial product (named "Qone" [Qone]) in a Chinese software organization. The release under test added nine features totaling 32.6 KLOC of Java code. The features are mostly independent amendments or patches to existing modules. The value-based prioritization strategy was applied to prioritize the nine features to be tested, based on their ratings of Business Importance, Quality Risk Probability, and Testing Cost. The features' testing value priorities provide decision support for the testing manager to enact the testing plan and adjust it according to feedback from quality risk indicators, such as defect counts and defect density, and to updated testing cost estimates. Defect data was collected automatically and displayed in real time by the organization's defect reporting and tracking system, giving immediate feedback for adjusting the testing priorities of the next testing round.
6.2. Case Study Design

6.2.1. The step to determine Business Value

To determine the business importance of each feature, Karl Wiegers' approach [Wiegers, 1999] is applied in this case study. This approach considers both the positive benefit of the presence of a feature and the negative impact of its absence: each feature is assessed in terms of the benefit it will bring if implemented, as well as the penalty that will be incurred if it is not. The estimates of benefits and penalties are relative, on a scale of 1 to 9. For each feature, the weighted relative benefit and penalty are summed and entered in the Total BI (Business Importance) column in Table 38, using the following formula:

Total BI = Benefit Weight x Relative Benefit + Penalty Weight x Relative Penalty

The sum of the Total BI column represents the total BI of delivering all features. To calculate the relative contribution of each feature, its total BI is divided by the sum of the Total BI column. As we can see, there is an approximate Pareto distribution, in which F1 and F2 constitute 22.2% of the features but contribute 59.3% of the total BI.
Table 38. Relative Business Importance Calculation

          Benefit   Penalty   Total BI   BI %
Weights   2         1
F1        9         7         25         30.9%
F2        8         7         23         28.4%
F3        1         3         5          6.2%
F4        2         1         5          6.2%
F5        1         1         3          3.7%
F6        2         1         5          6.2%
F7        3         2         8          9.9%
F8        1         2         4          4.9%
F9        1         1         3          3.7%
Sum       28        25        81         100%
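The Table 38 numbers can be reproduced with a few lines of Python; this minimal sketch applies Wiegers' weighted benefit/penalty formula with the weights 2 and 1 from Table 38.

```python
# Wiegers-style relative business importance: BI = 2*benefit + 1*penalty,
# then each feature's share of the total. Benefit/penalty pairs are the
# 1-9 ratings from Table 38.
features = {"F1": (9, 7), "F2": (8, 7), "F3": (1, 3), "F4": (2, 1),
            "F5": (1, 1), "F6": (2, 1), "F7": (3, 2), "F8": (1, 2),
            "F9": (1, 1)}
W_BENEFIT, W_PENALTY = 2, 1

total_bi = {f: W_BENEFIT * b + W_PENALTY * p for f, (b, p) in features.items()}
grand_total = sum(total_bi.values())  # 81
for f, bi in total_bi.items():
    print(f, bi, f"{bi / grand_total:.1%}")  # e.g. F1 25 30.9%
```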
Figure 17 shows the BI distribution of the nine features.
Figure 17. Business Importance Distribution
6.2.2. The step to determine Risk Probability

The risk analysis was performed before system testing started, and was continuously updated during testing execution. It aims to calculate a risk probability for each feature. We followed four steps:

Step 1: List all risk factors based on past projects and experiences: set up the n risks in the rows and columns of an n*n matrix. In our case study, according to this Chinese organization's risk data from past similar projects, the four quality risk factors with the highest Risk Exposure are Personnel Proficiency, Size, Complexity, and Design Quality. Defects Proportion and Defects Density are usually used as hands-on metrics for quality risk identification during the testing process; together with the top four quality risk factors, they serve as the risk factors that determine feature quality risk in this case study.
Step 2: Determine risk weights according to their degree of impact on software quality: different risk factors have different degrees of impact on software quality under different organizational contexts, and it is more reasonable to assign them different weights before combining them into one risk probability number per feature. AHP (the Analytic Hierarchy Process) [89], a powerful and flexible multi-criteria decision-making method that has been applied to unstructured problems in a variety of decision-making situations, ranging from simple personal decisions to complex capital-intensive ones, is used to determine the weight of each risk factor. Based on their understanding of the risk factors and their knowledge and experience of each factor's relative impact on software quality in this organization's context, the testing manager collaborated with the development manager to determine the weights of each quality risk using the AHP method.
In this case study, the calculation of the quality risk weights is illustrated in Table 39. The number in each cell represents the pairwise relative importance: a value of 1, 3, 5, 7, or 9 in row i and column j means that the factor in row i is equally, moderately, strongly, very strongly, or extremely strongly more important than the factor in column j, respectively. To calculate the weights, each cell is divided by the sum of its column, and the results are then averaged across each row. The final averaged weights are listed in the bolded Weights column in Table 39; the weights sum to 1.

If we were able to determine the relative value of all risks precisely, the values would be perfectly consistent. If, for instance, we determine that Risk1 is much more important than Risk2, Risk2 somewhat more important than Risk3, and Risk3 slightly more important than Risk1, an inconsistency has occurred, and the accuracy of the result is decreased. The redundancy of the pairwise comparisons makes the AHP much less sensitive to judgment errors; it also lets one measure judgment errors by calculating the consistency index (CI) of the comparison matrix and then the consistency ratio (CR). As a general rule, a CR of 0.10 or less is considered acceptable [Saaty, 1980]. In this case study, we calculated the CR according to the steps in [Saaty, 1980]; the CR is 0.01, which means that our result is acceptable.
Table 39. Risk Factors' Weights Calculation-AHP

                       Personnel                        Design   Defects     Defects
                       Proficiency  Size   Complexity   Quality  Proportion  Density   Weights
Personnel Proficiency  1            1/3    3            3        1/3         1/5       0.09
Size                   3            1      3            3        1           1         0.19
Complexity             1/3          1/9    1            1        1/7         1/9       0.03
Design Quality         1/3          1/7    1            1        1/7         1/9       0.04
Defects Proportion     3            1      7            7        1           1         0.27
Defects Density        5            3      9            9        1           1         0.38
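The column-normalize-and-row-average computation and the consistency check can be sketched as follows using the Table 39 matrix; the exact CR depends on the random index used (1.24 for n = 6, per Saaty), so small differences from the reported 0.01 are possible.

```python
import numpy as np

# Pairwise comparison matrix from Table 39 (row/column order: Personnel
# Proficiency, Size, Complexity, Design Quality, Defects Proportion,
# Defects Density).
A = np.array([
    [1,   1/3, 3, 3, 1/3, 1/5],
    [3,   1,   3, 3, 1,   1  ],
    [1/3, 1/9, 1, 1, 1/7, 1/9],
    [1/3, 1/7, 1, 1, 1/7, 1/9],
    [3,   1,   7, 7, 1,   1  ],
    [5,   3,   9, 9, 1,   1  ],
])

# Divide each cell by its column sum, then average across each row.
weights = (A / A.sum(axis=0)).mean(axis=1)

# Consistency check per Saaty: lambda_max from A.w, then CI, then CR = CI/RI.
n = len(A)
lambda_max = (A @ weights / weights).mean()
ci = (lambda_max - n) / (n - 1)
ri = 1.24  # Saaty's random index for n = 6
print(np.round(weights, 2))   # approx. [0.09 0.19 0.03 0.04 0.27 0.38]
print(f"CR = {ci / ri:.2f}")  # should be <= 0.10 to be acceptable
```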
Step 3: Score each risk factor for each feature: the testing manager, in collaboration with the development manager, scores each risk factor for each feature. The estimate expresses the degree to which the risk factor is present for the feature: 1 means the factor is not present, and 9 means the factor is very strong. A distinction must be made between factor strength and the action to be taken: a 9 indicates factor strength, but does not indicate what should be done about it.

Initial Risks are the risk factors used to calculate the risk probability before system testing. Feedback Risks, such as Defects Proportion and Defects Density, are risk indicators used during the testing process that serve to monitor and control it.
Risks such as Personnel Proficiency, Complexity, and Design Quality are scored by the development manager based on their understanding of each feature and pre-defined scoring criteria. The organization has its own defined scoring criteria for each risk rating. For example, for Personnel Proficiency, years of experience with the application, platform, language, and tools serves as a surrogate measure, and the scoring criteria the organization adopts are: 1: more than 6 years; 3: more than 3 years; 5: more than 1 year; 7: more than 6 months; 9: less than 2 months. Use of intermediate scores (2, 4, 6, 8) was allowed.
More comprehensive measures of Personnel Proficiency could combine the COCOMO II [Boehm et al., 2000] personnel factors, e.g., ACAP (Analyst Capability), PCAP (Programmer Capability), PLEX (Platform Experience), and LTEX (Language and Tool Experience), with other outside factors that might influence personnel proficiency, e.g., a reasonable workload, and work spirit and passion from a psychological point of view.
Risks such as Size, Defects Proportion, and Defects Density are scored based on collected data. For example, if a feature's size is 6 KLOC and the largest feature's size is 10 KLOC, the feature's size risk is scored as 9 x (6/10), approximately 5.
Step 4: Calculate the risk probability for each feature: for each feature Fi, after each risk factor's score is obtained, the following formula is used to combine all the risk factors into the risk probability Pi of Fi:

P_i = (1/9) * SUM_j (W_j * R_ij)

where R_ij is Fi's risk value for the jth risk factor, W_j denotes the weight of the jth risk factor, and the division by 9 (the maximum score) normalizes the weighted sum to a probability. Table 40 gives the probability of the total initial risks for each feature before system testing.
Table 40. Quality Risk Probability Calculation (Before System Testing)

         Initial Risks                                    Feedback Risks
         Personnel                          Design        Defects     Defects
         Proficiency  Size   Complexity    Quality       Proportion  Density   Probability
Weights  0.09         0.19   0.03          0.04          0.27        0.38
F1       5            3      1             1             0           0         0.13
F2       4            9      5             2             0           0         0.26
F3       3            3      5             5             0           0         0.14
F4       5            4      7             5             0           0         0.19
F5       5            2      3             3             0           0         0.12
F6       5            2      5             6             0           0         0.14
F7       5            4      5             2             0           0         0.17
F8       1            2      1             1             0           0         0.06
F9       1            1      1             1             0           0         0.04
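A minimal sketch of this Step 4 combination, reproducing two rows of Table 40 (the division by 9 is the normalization of the 1-9 scores described above):

```python
# Combine weighted risk-factor scores (1-9 scale) into a per-feature risk
# probability. Weights are the Table 39 results; scores are Table 40 rows.
weights = [0.09, 0.19, 0.03, 0.04, 0.27, 0.38]
scores = {  # PP, Size, Complexity, DQ, Defects Proportion, Defects Density
    "F1": [5, 3, 1, 1, 0, 0],
    "F2": [4, 9, 5, 2, 0, 0],
}
for f, r in scores.items():
    p = sum(w * s for w, s in zip(weights, r)) / 9
    print(f, round(p, 2))  # F1 0.12 (0.13 with unrounded weights), F2 0.26
```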
Lessons Learned and Process Implications:

From the initial risk data collected, some potential problems were found in this organization:

Potential problem in task breakdown and allocation: feature F9 has the lowest risk on both Personnel Proficiency and Complexity, which implies that one of the most experienced developers was responsible for the least complex feature, while the most complex feature, F4, was developed by the least experienced developer. This suggests a potential task allocation problem in this organization. Generally, it is highly risky to let the least experienced staff do the most complex task, and a waste of resources to let the most experienced developer do the least complex task. In the future, the organization should consider a more reasonable and efficient task allocation strategy to mitigate this risk.
Potential insufficient design capability: ideally, the risk factors should be independent when they are combined to generate a risk probability, meaning that they should not have strong interrelations. Based on the data in Table 40, we performed a correlation analysis among the risk factors; almost no risk factors are strongly correlated (correlation coefficient > 0.8). It should be noted, however, that the correlation coefficient of 0.76 between Complexity and Design Quality is high, which means that as Complexity becomes an issue, Design Quality also becomes a risky problem. This could imply that the current designers or analysts are inadequate for their work. To mitigate this risk, the project manager should consider recruiting analysts with more experience in requirements, high-level design, and detailed design in the future.
Table 41. Correlation among Initial Risk Factors

                       Personnel
                       Proficiency   Size    Complexity   Design Quality
Personnel Proficiency  1
Size                   0.30          1
Complexity             0.56          0.48    1
Design Quality         0.44          -0.05   0.76         1
From Table 39, we can see that the feedback risk factors, Defects Proportion and Defects Density, received the largest weights when AHP was used to determine the risk factors' weights. This is reasonable, because the initial risk factors are mainly used to estimate the risk probability before system testing starts. Once system testing starts, the testing manager should be more concerned with each feature's actual, evolving quality, in order to find which features are the most fault-prone; Defects Proportion and Defects Density serve to provide this real quality information and feedback during system testing. This is also the reason the probabilities in Table 40 are low: the initial risks are assigned smaller weights, and there are no feedback risk factors before system testing starts.
6.2.3. The step to determine Testing Cost

The test manager estimates the relative cost of testing each feature, again on a scale from a low of 1 to a high of 9. The test manager estimates the cost ratings based on factors such as the development effort of the feature, the feature's complexity, and the quality risks, as shown in Table 42.
Table 42. Relative Testing Cost Estimation

      Cost   Cost %
F1    2      4.8%
F2    5      11.9%
F3    5      11.9%
F4    9      21.4%
F5    6      14.3%
F6    4      9.5%
F7    5      11.9%
F8    3      7.1%
F9    3      7.1%
Sum   42     100%
Figure 18. Testing Cost Estimation Distribution
A correlation analysis was done between the nine features' business importance and estimated testing cost, as shown in Table 43. The negative correlation indicates that the features that are most costly to test tend to have less business importance to key customers. Testing first the features with more business importance but less cost improves the testing efficiency and maximizes its ROI in the early stage of the testing phase.

Table 43. Correlation between Business Importance and Testing Cost

      BI      Cost
BI    1
Cost  -0.31   1
6.2.4. The step to determine Testing Priority

As with the scenario prioritization, after a feature passes testing, its probability of failure is reduced to 0, so the testing priority (TP) triggered by RRL is calculated as:

TP_i = (BI%_i x P_i) / Cost%_i
The testing priorities for the nine features are shown in Table 44; the testing order is F1, F2, F7, F6, F3, F4, F8, F5, F9.

Table 44. Value Priority Calculation

      BI %   Probability   Cost %   Priority
F1    30.9   0.13          4.8      0.81
F2    28.4   0.26          11.9     0.63
F7    9.9    0.17          11.9     0.14
F6    6.2    0.14          9.5      0.09
F3    6.2    0.14          11.9     0.07
F4    6.2    0.19          21.4     0.05
F8    4.9    0.06          7.1      0.04
F5    3.7    0.12          14.3     0.03
F9    3.7    0.04          7.1      0.02
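A minimal sketch reproducing the Table 44 ranking under the TP formula above, using the BI %, Probability, and Cost % columns of the table:

```python
# Testing Priority TP = BI% * risk probability / cost%, per feature.
features = {  # name: (BI %, risk probability, cost %)
    "F1": (30.9, 0.13, 4.8), "F2": (28.4, 0.26, 11.9),
    "F3": (6.2, 0.14, 11.9), "F4": (6.2, 0.19, 21.4),
    "F5": (3.7, 0.12, 14.3), "F6": (6.2, 0.14, 9.5),
    "F7": (9.9, 0.17, 11.9), "F8": (4.9, 0.06, 7.1),
    "F9": (3.7, 0.04, 7.1),
}
tp = {f: bi * p / c for f, (bi, p, c) in features.items()}
for f in sorted(tp, key=tp.get, reverse=True):
    print(f, round(tp[f], 2))
# Reproduces the order F1, F2, F7, F6, F3, F4, F8, F5, F9.
```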
6.3. Results

After applying the value-based prioritization strategy to determine the testing order of the nine features, the PBIE comparison between the value-based order and its inverse (the most inefficient one) is shown in Figure 19. The difference in APBIE between the two is 76.9% - 34.1% = 42.8%, which means the value-based testing order improves cost-effectiveness over the worst case by 42.8%. Other value-neutral (or partially value-based) situations' PBIE curves should lie between these two curves and are representative of the most common situations in reality; this further rejects hypothesis H-t1.
Figure 19. Comparison between Value-Based and Inverse order
In our case study, the test manager planned to execute four rounds of testing. During each round, the test groups focused on the 2-3 features with the highest current priority, and the other features were tested by automated tools. The testing result was as follows: after the first round, F1 and F2 satisfied the stop-test criteria; after the second round, F3, F6, and F7 satisfied the criteria; after the third round, F4 and F8; and the last round covered F5 and F9. The comparison between the initially estimated and the actual testing cost is shown in Figure 20.
Figure 20. Initially Estimated vs. Actual Testing Cost Comparison
(Figure 19 data: PBIE after each of the 9 features, value-based order: 30.8%, 59.2%, 69.1%, 75.3%, 81.4%, 87.6%, 92.5%, 96.2%, 99.9%; inverse order: 3.7%, 7.4%, 12.3%, 18.5%, 24.7%, 30.9%, 40.7%, 69.1%, 99.9%)

(Figure 20 data: cost percentage by testing rounds 1-4: estimated 16.7, 33.3, 28.6, 21.4; actual 19.8, 25.3, 30.3, 24.6)
If we regard the testing activity as an investment, its value is realized when features satisfy the stop-test criteria. The accumulated-BI-earned curve in Figure 22 resembles a production function, with higher pay-off at the earlier stage but diminishing returns later. From Figure 21 and Figure 22, we can see that when Round 1 testing finished, we had earned 59.2% of the BI of all features at a cost of only 19.8% of the whole testing effort, for an ROI as high as 1.99. During Round 2, we earned 22.2% BI at a cost of 25.3% effort, and the ROI turned negative, at -0.12. From Round 1 to Round 4, both the BI-earned line and the ROI line descend; Rounds 3 and 4 earn only 18.5% BI but cost 54.9% of the effort. This shows that the Round 1 testing is the most cost-effective. Testing the features with higher value priority first is especially useful when market pressure is very high: in such cases, one could stop testing after Round 1, once the ROI turns negative. However, in some cases, continuing to test may be worthwhile in terms of customer-perceived quality.
Figure 21. BI, Cost and ROI between Testing Rounds

            Start   Round 1   Round 2   Round 3   Round 4
BI Earned   0       59.2      22.2      11.1      7.4
Cost        0       19.8      25.3      30.3      24.6
Test_ROI    0       1.99      -0.12     -0.63     -0.70
Figure 22. Accumulated BI Earned During Testing Rounds
Consideration of Market Factors

Time to market can strongly influence the effort distribution of software development and project planning. As the testing phase is adjacent to software product transition and delivery, it is influenced even more by market pressure [Huang and Boehm, 2006]. Sometimes, under intense market competition, sacrificing some software quality to avoid further market-share erosion might be a good organizational strategy.

In our case study, we use a simple depreciation function of Time and Pressure Rate to model the market pressure's influence on Business Importance. Time represents the number of unit time cycles; a unit time cycle might be a year, a month, a week, or even a day, and for simplicity, in our case study the unit time cycle is a testing round. The Pressure Rate is estimated and provided by market or product managers, with the help of customers. It represents the percentage
of the software's initial value that depreciates during one unit time cycle. The fiercer the market competition, the larger the Pressure Rate. The longer the time and the larger the Pressure Rate, the smaller the present BI, and the larger the BI loss caused by market erosion. In our case study, because we calculate relative business importance, the initial total BI is 100(%); when Round n testing is over, the BI loss caused by market-share erosion is the initial total BI minus the present BI. On the other hand, the earlier the product enters the market, the larger the loss caused by poor quality. Finally, we can find a sweet spot (the minimum) in the combined risk exposure due to both unacceptable software quality and market erosion.
We assume three Pressure Rates, 1%, 4%, and 16%, standing for low, medium, and high market pressure, respectively, in Figure 23 to Figure 25; these can also be seen as three types of organizational context: high finance, commercial, and early start-up [Huang and Boehm, 2006]. When the market pressure is as low as 1%, in Figure 23, the total loss caused by quality and market erosion reaches its lowest point (the sweet spot) at the end of Round 4. When the Pressure Rate is 4%, the lowest point of total loss is at the end of Round 3, in Figure 24, which means we should stop testing and release the product even though F5 and F9 have not reached the stop-test criteria at the end of Round 3; this ensures the minimum loss. When the market Pressure Rate is as high as 16%, in Figure 25, we should stop testing at the end of Round 1.
Figure 23. BI Loss (Pressure Rate=1%)
Figure 24. BI Loss (Pressure Rate=4%)
Figure 25. BI Loss (Pressure Rate=16%)
Extension of the Testing Priority Value Function:

In this case study, we use a multi-objective multiplicative value function to determine the testing priority. An additive value function can also be used to determine the testing priority, as follows:

V(X_BI, X_C, X_RP) = W_BI x V(X_BI) + W_C x V(X_C) + W_RP x V(X_RP)

V(X_BI), V(X_C), and V(X_RP) are single-attribute value functions for Business Importance, Cost, and Risk Probability; W_BI, W_C, and W_RP are their respective relative weights, and the weighted sum is the multi-objective additive value function for testing priority. The single value functions for Business Importance and Risk Probability express increasing preference: the larger the Business Importance or Risk Probability, the higher the testing priority, as shown in the left part of Figure 26. The single value function for Testing Cost expresses decreasing preference: the larger the Cost, the lower the testing priority value, as shown in the right part of Figure 26.
Figure 26. Value Functions for “Business Importance” and “Testing Cost”
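To make the contrast concrete, the following minimal sketch shows both forms for one feature. The single-attribute value functions and the equal weights are illustrative assumptions; any increasing functions for Business Importance and Risk Probability, and a decreasing one for Cost, would fit the description above.

```python
# Multiplicative vs. additive testing-priority value functions for one
# feature, with Table 44-style inputs (BI %, risk probability, cost %).
# The normalizing constants are the column maxima from Table 44.
def v_inc(x, x_max):
    return x / x_max        # larger BI or risk => higher single value

def v_dec(x, x_max):
    return 1 - x / x_max    # larger cost => lower single value

def tp_multiplicative(bi, p, cost):
    return bi * p / cost

def tp_additive(bi, p, cost, w_bi=1/3, w_rp=1/3, w_c=1/3):
    return (w_bi * v_inc(bi, 30.9) + w_rp * v_inc(p, 0.26)
            + w_c * v_dec(cost, 21.4))

print(tp_multiplicative(30.9, 0.13, 4.8))  # F1: ~0.84 (Table 44 lists 0.81)
print(tp_additive(30.9, 0.13, 4.8))        # F1 under the additive form
```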
Extending the multiplicative value function to an additive one yields similar feature testing priority results [Li, 2009]. Whether the value function is multiplicative or additive, as long as it reasonably reflects the success-critical stakeholders' (SCSs') win-condition preferences, it should generate similar priority results. In our extension experiment, both dynamic prioritizations made the ROI of the testing investment peak at the early stage of testing, which is especially effective when the time to market is limited. This extension of the value function is also supported by value-based utility theory.
Chapter 7: Case Study IV-Prioritize Test Cases to be Executed
7.1. Background
This case study, which prioritizes test cases to be executed using the Value-Based, Dependency-Aware prioritization strategy, was carried out on 18 projects in the USC 2011 spring and fall semester software engineering courses. As an extension of the previous work on prioritizing testing features, this work prioritized test cases at a finer granularity, with added consideration of the dependencies among test cases. Besides, it tailored the Probability of Loss from the Risk Reduction Leverage (RRL) definition into a test case Failure Probability and used this as a trigger to shrink the regression test suite, excluding stable features to conserve scarce testing resources.
A project named “Project Paper Less” [USC_577b_Team01, 2011] with 28 test
cases is used as an example to investigate the improved testing efficiency.
Through Fall 2010 CSCI 577a, Team01 students had already developed good versions of the Operational Concept Description (OCD), System and Software Requirements Description (SSRD), System and Software Architecture Description (SSAD) and an Initial Prototype, together with various planning documents such as the Life Cycle Plan (LCP) and Quality Management Plan (QMP). In Spring 2011 CSCI 577b, they developed the Initial Operational Capability while concurrently generating the Test Plan and Cases (TPC); students are trained to write test cases according to the requirements in the SSRD, using the Equivalence Partitioning and Boundary Value Testing techniques [Ilene, 2003] to elaborate test cases. Their test cases in the TPC cover 100% of the requirements in the SSRD, and they had already done some informal unit testing and integration testing before the acceptance testing. They
follow the Value-based Testing Guideline [USC_577b_VBATG, 2011] to perform Value-based test case prioritization (TCP), execute their acceptance testing according to the prioritized order, record their testing results in the Value-based Testing Procedure and Results (VbTPR), and report discovered defects to the Bugzilla system [USC_CSSE_Bugzilla], where the defects are tracked until closure. Starting from the next section, the Value-based TCP steps are introduced within the context of Team01's project.
7.2. Case Study Design
7.2.1. The step to do Dependency Analysis
Most features in the SUT are not independent of each other; they typically have precedence or coupling constraints between them that require some features to be implemented before others, or some to be implemented together [Maurice et al., 2005]. Similarly for test cases: some test cases are required to be executed and passed before others can be executed, and the failure of some test cases can block others from being executed. Understanding the dependencies among test cases benefits test case prioritization and test planning; the dependencies also provide useful information for rating the business importance, failure probability, criticality and even testing cost that will be introduced in the following sections.
Based on the test cases in the TPC [USC_577b_Team01, 2011], testers were asked to generate dependency graphs for their test suites. These could be as simple as Team01's test case dependency tree in Figure 27, or much more complex; for example, one test case node may have more than one parent node. In Figure 27, the bracket associated with each test case has two placeholders for later filling in: one for the Testing Value (= Business Importance × Failure Probability / Testing Cost) and the other for the Criticality. The following sections introduce in detail how to rate those factors and how to use them for prioritization.
Figure 27. Dependency Graph with Risk Analysis
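Such a dependency tree, together with the per-node placeholders, can be captured in a small data structure; below is a minimal Python sketch (the class shape and the two sample nodes are illustrative, not drawn from Team01's actual TPC):

    from dataclasses import dataclass, field

    @dataclass
    class TestCase:
        tc_id: str
        business_importance: int      # 1 (VL) .. 5 (VH), from Table 45
        failure_probability: float    # 0.0 .. 1.0, from the Table 47 self-check
        testing_cost: float           # treated as uniform in this case study
        criticality: int              # 1 (VL) .. 5 (VH), from Table 46
        depends_on: list = field(default_factory=list)  # direct parent test cases

        @property
        def testing_value(self) -> float:
            # Testing Value = Business Importance * Failure Probability / Testing Cost
            return self.business_importance * self.failure_probability / self.testing_cost

    # Two illustrative nodes: a root login check and a dependent leaf.
    tc_login = TestCase("TC-01-01", 3, 0.3, 1.0, 5)
    tc_leaf = TestCase("TC-04-02", 4, 0.7, 1.0, 1, depends_on=["TC-01-01"])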
7.2.2. The step to determine Business Importance
As for testing, the business importance of a test case is mainly determined by the importance or value of its corresponding functions, components or features to the client. Besides, due to test case elaboration strategies such as Equivalence Partitioning and Boundary Value Testing, different test cases for the same feature are designed to test different aspects of that feature, with different importance as well. The first step in determining the Business Importance of a test case is to determine the BI of its relevant function/feature. From CSCI577a, students are educated and trained in how to do business case analysis for a software project and how to rate the relative Business Importance of each function/feature in a software system from the client's point of view, such as the importance of the software, product, component, or feature to his/her organization in terms of its Return on Investment [Boehm, 1981], as shown in Figure 28. A general mapping between function/feature BI rating ranges and the segments of the production function (investment, high-payoff, diminishing returns) is given in the box in Figure 28 for students' reference.
Basically, the slope of the curve represents the ROI of the function: the higher the slope, the higher the ROI, and so the higher the BI of the function. The BI of a function in the Investment segment is usually in the range from Very Low to Normal, since the early Investment segment involves development of infrastructure and architecture which does not directly generate benefits but is necessary for realizing the benefits in the High-payoff and Diminishing Returns segments. For "Project Paper Less", the Access Control and User Management features belong to the Investment segment. The main application functions of this project, such as the Case Management and Document Management features, are the core capabilities that the client most wants to have and fall within the High-payoff segment, so the BI of those functions is in the range from High to Very High. Because of the scope and schedule constraints of the course projects, these projects are usually small-scale, only require students to develop the core capabilities, and seldom include features that belong to the Diminishing Returns segment.
Figure 28. Typical production function for software product features [Boehm, 1981] (segment annotations: Investment BI: VL-N; High-payoff BI: H-VH; Diminishing returns BI: VL-N)
The business importance of a test case is determined by the business importance of its corresponding feature, function or module on one side, and by the criticality magnitude of the failure occurrence on the other. A guideline for rating a test case's Business Importance that considers both sides is shown in Table 45. The ratings for Business Importance run from VL to VH, with corresponding values from 1 to 5. For example, for the Login function in the Access Control module, the tester used the Equivalence Partitioning test case generation strategy to generate two test cases: one to test whether a valid user can log in, and the other to test whether an invalid user cannot log in. The Access Control feature belongs to the "Investment" segment, and the tester rated it as bringing "Normal" benefit to the client. If the first test case, which checks that a valid user can log in, fails, the Login function won't run, and this will block other functions, such as Case Management and Document Management, from being tested; so this test case should be rated "Normal" according to the guideline in Table 45. The other test case, which checks that an invalid user cannot log in, should be rated "Low": if it fails, login can still run (a valid user can still log in to test other functionalities without blocking them), so its criticality magnitude is relatively smaller than the first test case's and deserves the lower rating. This is just one example of differentiating the Business Importance of test cases elaborated by Equivalence Partitioning within the same feature; various other cases can likewise be differentiated in relative importance by considering the criticality magnitude of failure occurrence.
Table 45. Guideline for rating BI for test cases
VH:5 This test case is used to test the functionality that will bring the Very High benefit for the client, without passing it, the functionality won’t run
H:4
This test case is used to test the functionality that will bring the Very High benefit for the client, without passing it, the functionality can still run
This test case is used to test the functionality that will bring the High benefit for the client, without passing it, the functionality won't run
N:3
This test case is used to test the functionality that will bring the High benefit for the client, without passing it, the functionality can still run
This test case is used to test the functionality that will bring the Normal benefit for the client, without passing it, the functionality won’t run
L:2
This test case is used to test the functionality that will bring the Normal benefit for the client, without passing it, the functionality can still run
This test case is used to test the functionality that will bring the Low benefit for the client, without passing it, the functionality won’t run
VL:1
This test case is used to test the functionality that will bring the Low benefit for the client, without passing it, the functionality can still run
This test case is used to test the functionality that will bring the Very Low benefit for the client, without passing it, the functionality won’t run
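Table 45 follows a regular pattern that can be captured in a few lines; a minimal Python sketch (the 1-5 encoding mirrors the table's VL-VH values):

    # Sketch: Table 45 as a rule. A failure that stops the functionality rates
    # the test case at the feature's benefit level; a non-blocking failure rates
    # it one level lower. Levels: 1=VL, 2=L, 3=N, 4=H, 5=VH.
    def test_case_bi(feature_benefit: int, failure_blocks_functionality: bool) -> int:
        score = feature_benefit if failure_blocks_functionality else feature_benefit - 1
        return max(1, min(5, score))

    # The Login example from the text: Access Control rated Normal (3); the
    # valid-login case blocks the feature (N:3), the invalid-login case does not (L:2).
    assert test_case_bi(3, True) == 3 and test_case_bi(3, False) == 2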
As a result of rating all 28 test cases' Business Importance for "Project Paper Less", the distribution of ratings is shown in Figure 29: High and Very High business importance test cases make up more than half. This makes sense because most of the implemented features are core capabilities, though some "investment" capabilities necessary for those core ones are still needed.
Figure 29. Test Case BI Distribution of Team01 Project (VL 11%, L 21%, N 14%, H 50%, VH 4%)
7.2.3. The step to determine Criticality
Criticality, as mentioned in the step above, represents the impact magnitude of a failure occurrence and the influence it brings to the ongoing test. Combined with the Business Importance from the client's value perspective, these factors determine the size of the loss at risk. The empirical guideline for rating Criticality is in Table 46; the ratings run from VL to VH with values from 1 to 5. The rationale is that test cases with high Criticality should be passed as early as possible; otherwise, they would block other test cases from being executed and might delay the whole testing process if defects are not resolved soon enough.

Students are instructed to consult the dependency tree/graph when rating this factor. In the "Project Paper Less" test case dependency tree shown in Figure 27, TC-01-01, TC-03-01 and TC-04-01 are all rated Very High because they are on the "critical path" for executing all other test cases: if they fail, they would block most of the other test cases from being executed, and most of those blocked test cases have high Business Importance.
Most of the other test cases are tree leaves: if they fail, they won't block other test cases from being executed, and their Criticality is rated Very Low.
Table 46. Guideline for rating Criticality for test cases
VH:5 Block most (70%-100%) of the test cases, AND most of those blocked test cases have High Business Importance or above
H:4 Block most (70%-100%) of the test cases, OR most of those blocked test cases have High Business Importance or above
N:3 Block some (40%-70%) of the test cases, AND most of those blocked test cases have Normal Business Importance
L:2 Block a few (0%-40%) of the test cases, OR most of those blocked test cases have Normal Business Importance or below
VL:1 Won’t block any other test cases
7.2.4. The step to determine Failure Probability
The primary goal of testing is to reduce the uncertainty about software product quality before the product is finally delivered to the client. Testing without risk analysis is a waste of resources; uncertainty and risk analysis are the triggers for selecting a subset of the test suite, in order to focus testing resources on the most risky, fault-prone features. A set of self-check questions covering different aspects or factors that might cause a test case to fail is provided in Table 47 for students' reference in rating each test case's failure probability. Students rated each test case's Failure Probability based on those recommended factors, or on others they thought of themselves. The rating levels with numeric values are: Never Fail (0), Least Likely to Fail (0.3), Have no idea (0.5), Most Likely to Fail (0.7), Fail for sure (1).
Table 47. Self-check questions used for rating Failure Probability
Experience Did the test case fail before? -- People tend to repeat previous mistakes, and so does software: from previous observations (e.g., unit tests, performance at CCD, or informal random testing), a test case that failed before tends to fail again
Is the test case new? -- A test case that hasn't been tested before has a higher probability of failing
Change Impact Does any recent code change (delete/modify/add) have an impact on some features? -- If so, the test cases for these features have a higher probability of failing
Personnel Are the people responsible for this feature qualified? -- If not, the test cases for this feature tend to fail
Complexity Does the feature have a complex algorithm or I/O functions? -- If so, the test cases for this feature have a higher probability of failing
Dependencies Does this test case have many connections (either depending on other test cases or being depended on by them)? -- If so, this test case has a higher probability of failing
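As a rough illustration of how the self-check answers could feed a rating, the sketch below maps the count of "yes" answers onto the five levels; the thresholds are invented for illustration and are not part of the guideline:

    # Sketch: heuristic mapping from the number of Table 47 risk factors that
    # apply ("yes" answers) to a Failure Probability level. Thresholds are
    # illustrative only.
    def failure_probability(yes_count: int) -> float:
        levels = [(0, 0.0),   # Never Fail
                  (1, 0.3),   # Least Likely to Fail
                  (2, 0.5),   # Have no idea
                  (3, 0.7),   # Most Likely to Fail
                  (5, 1.0)]   # Fail for sure
        result = 0.0
        for threshold, value in levels:
            if yes_count >= threshold:
                result = value
        return result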
For "Project Paper Less", before the acceptance testing, the testers had already done a Core Capability Drive-through (CCD) for the core capabilities developed in the first increment, design-code reviews, unit testing and informal random testing, so they had already gained information and experience about the health status of the software system they developed. Based on this, they rated the Failure Probability for all 28 test cases. The distribution of the rating levels is shown in Figure 30; Never Fail test cases make up more than half, based on previous experience and observations. Execution of those Never Fail test cases should be deferred to the end of each testing round if resources are still available, or skipped entirely if time and testing resources are limited. In this way, quality risk analysis shrinks the test suite, choosing for execution only the subset of test cases that carry quality risks.
Figure 30. Failure Probability Distribution of Team01 Project (Never Fail: 15 cases, 54%; Least Likely to Fail: 6, 21%; Have no idea: 1, 4%; Most Likely to Fail: 6, 21%; Fail for sure: 0, 0%)
7.2.5. The step to determine Test Cost
Value-Based Software Engineering considers every activity as an investment. For test activities, the cost/effort of executing each test case should also be considered for TCP. However, estimating the effort to execute each test case is challenging [Deonandan et al., 2010], [Ferreira et al., 2010]; some practices simply suggest counting the number of steps needed to execute the test case. To simplify our experiment, students were asked to write test cases at the same granularity level, so that every test case has nearly the same number of steps as far as possible, and the cost of executing each test case is assumed to be the same.
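A minimal sketch of the step-count proxy mentioned above (the function name is illustrative; under this case study's uniform-granularity convention the cost is effectively constant):

    # Sketch: step count as a cheap proxy for test-execution cost.
    def testing_cost(test_case_steps: list) -> int:
        return max(1, len(test_case_steps))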
7.2.6. The step for Value-Based Test Case Prioritization
Once testers have rated the factors above for each test case, the Testing Value triggered by RRL is defined as:

$$\text{Testing Value} = \frac{\text{Business Importance} \times \text{Failure Probability}}{\text{Testing Cost}}$$
It is obvious from this definition that the Testing Value is proportional to Business Importance and Failure Probability and inversely proportional to Testing Cost. This allows test cases to be prioritized in terms of return on investment (ROI). Students were asked to fill in each test case node with its Testing Value number and Criticality rating, as shown in Figure 27. Executing the test cases with the highest Testing Value and highest Criticality first is our basic prioritization strategy. However, due to the dependencies among test cases, a common situation is that testers cannot jump directly to the test case with the highest Testing Value without first executing and passing some lower-value test cases on the critical path to it. For example, in Figure 27, TC-04-01 has the highest Testing Value (3.5) together with the highest Criticality rating (VH), but testers can't execute it until TC-01-01 and TC-03-01 on the critical path have been executed and passed. So the dependency factor must also be built into the value-based TCP algorithm. Some key concepts are introduced below to help understand it.
Passed: All steps in the test case generate the expected outputs, making the feature work accordingly.

Failed: At least one step in the test case generates an unexpected output that keeps the function from working, or the failure would for sure block other test cases from being executed (minor improvement suggestions don't belong to this category).

NA: The test case is not able to be executed. Candidate reasons: this test case depends on another test case which fails; or external factors, such as the testing environment (e.g., the pre-condition could not be satisfied, or required testing data is missing).

Dependencies Set: A test case's Dependencies Set is the set of test cases that this test case depends on, either directly or indirectly.

Ready-to-Test: A status of a test case. A test case is Ready-to-Test only if it has no dependencies or all test cases in its Dependencies Set have been "Passed".

Not-Tested-Yet: Another status of a test case. A test case is Not-Tested-Yet if it has not been tested so far.
The value-based, dependency-aware Test Case Prioritization algorithm is shown with a brief description in Figure 31. It is basically a variant of a greedy algorithm whose goal is always to select the Ready-to-Test case with the highest Testing Value and Criticality, and it proceeds as follows (see the sketch after this list):

Value First: Test the case with the highest Testing Value. If several test cases' Testing Values are the same, test the one with the highest Criticality.

Dependency Second: If the test case selected in the first step is not "Ready-to-Test", then at least one test case in its Dependencies Set is "Not-Tested-Yet". In this situation, prioritize the "Not-Tested-Yet" test cases in the Dependencies Set according to "Value First" and test them until all test cases in the Dependencies Set are "Passed"; the selected test case is then "Ready-to-Test".

Update the prioritization: After each round, update the Failure Probability based on observations from the previous testing rounds.
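A compact executable sketch of this greedy loop in Python follows, reusing the illustrative TestCase structure from the Section 7.2.1 sketch; `execute` is a caller-supplied function, and all names here are assumptions for illustration rather than the dissertation's implementation:

    # Sketch of the Value First / Dependency Second greedy loop. `cases` maps
    # test-case IDs to TestCase objects; `execute` runs one test case and
    # returns "Passed" or "Failed".
    def dependencies_set(tc_id: str, cases: dict) -> set:
        """Transitive closure of a test case's dependencies (Dependencies Set)."""
        seen, stack = set(), list(cases[tc_id].depends_on)
        while stack:
            dep = stack.pop()
            if dep not in seen:
                seen.add(dep)
                stack.extend(cases[dep].depends_on)
        return seen

    def prioritize_and_run(cases: dict, execute) -> list:
        status = {t: "Not-Tested-Yet" for t in cases}
        order = []

        def rank(t):       # Value First; Criticality breaks ties
            return (cases[t].testing_value, cases[t].criticality)

        def ready(t):      # Ready-to-Test: every direct dependency has Passed
            return all(status[d] == "Passed" for d in cases[t].depends_on)

        def blocked(t):    # a Failed/NA dependency means this case can never run
            return any(status[d] in ("Failed", "NA")
                       for d in dependencies_set(t, cases))

        while True:
            pending = [t for t, s in status.items()
                       if s == "Not-Tested-Yet" and not blocked(t)]
            if not pending:
                break
            target = max(pending, key=rank)          # Value First
            while not ready(target):                 # Dependency Second
                deps = [d for d in dependencies_set(target, cases)
                        if status[d] == "Not-Tested-Yet"]
                target = max(deps, key=rank)
            status[target] = execute(target)
            order.append(target)

        for t, s in status.items():                  # leftovers blocked by failures
            if s == "Not-Tested-Yet":
                status[t] = "NA"
        return order

On Figure 27's tree, this loop would aim at TC-04-01 (Testing Value 3.5, Criticality VH) first and descend through TC-01-01 and TC-03-01 on its critical path, matching the walk-through above.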
[Figure 31 flowchart summary: from the whole test case set, pick the case with the highest Testing Value (if tied, the one with higher Criticality); if it has dependencies that have not all passed, repeat the selection within its Dependencies Set until a Ready-to-Test case is found, then start to test; exclude "Passed" cases from further prioritization; on a failure, report it for resolution, and if it is not resolved, exclude the "Failed" case and the "NA" cases that depend on it from prioritization.]
Figure 31. In-Process Value-Based TCP Algorithm
For "Project Paper Less", the 15 Never Fail test cases are excluded from the subset selected for testing, as shaded in the dependency tree in Figure 27. It is not necessary to test them deliberately if testing effort or resources are limited, yet it is fine to test them at the end of the round if time is still available. According to the Value-Based TCP algorithm, the testing order for the remaining test cases is: