Learning near optimum inspection policies

[email protected] (WVU)
Zach Milton, WVU
Feb 5, 2008

CS, NC State

High assurance software requires extensive and expensive assessment. There are many forms of software assessment, ranging from manual
inspections to automatic formal methods. These assessment methods differ in their effectiveness and the effort required to apply them. Typically, the more effective methods are more expensive. Hence, project managers often "bias" the assessment resources and apply more effort where they think that extra effort might be most useful.

If most of the assessment effort explores project artifacts A,B,C,D, then that leaves a "blind spot" in E,F,G,H,I,.... Blind spots can compromise high assurance software. It is therefore important to
discuss the bias introduced by the inspection policy. To put the matter in a nutshell, we need to ask "how blinding is our bias?"

This talk contrasts three different kinds of "bias" in selecting what code modules to inspect:

1) manual methods such as "read the biggest thing first/last"

2) traditional data mining methods such as those advocated by the author and those deployed in NASA-funded inspection tools.

3) a new data miner called "WHICH"

We find that #1 usually outperforms #2. This result calls into question many years of research by the speaker (translation: "oh dear.....").

But we also find that #3 almost always outperforms #1 or #2 (translation: "phew!!").

In fact #3 works so well that we speculate that it could be used as a proxy for determining the actual number of defects remaining to be found, after inspecting Z% of the code.

ABOUT THE SPEAKER: Dr. Tim Menzies ([email protected]) has been working on advanced modeling and AI since 1986. He received his PhD from the University of New South Wales, Sydney, Australia and is the author of over 160 refereed papers. A former research chair for NASA, Dr. Menzies is now an associate professor at West Virginia University's Lane Department of Computer Science and Electrical Engineering. For more information, visit his web page at http://menzies.us.
Transcript
Page 1: Learning near optimum inspection policies

Learning near optimum inspection policies

[email protected] (WVU)

Zach Milton, WVU

Feb 5, 2008

Page 2: Learning near optimum inspection policies

The Briand Threshold

[Chart: % defective modules detected vs. % LOC read, running to (100,100). Goal: stay over the threshold line.]

Page 3: Learning near optimum inspection policies

“Manual Up”: the Koru Hypothesis

Smaller modules have disproportionately more defects. If so, then we'll find more bugs sooner if we read “manualUp” (i.e. read smallest modules first).

[Chart: % defective modules detected vs. % LOC read; the “Manual” curve climbs above the threshold toward (100,100).]

Page 4: Learning near optimum inspection policies

Optimum Detector

X% of the code is in defective modules. Some perfect oracle finds all the defective modules; when we inspect them manualUp, we find all the defects after reading X% of the code.

[Chart: % defective modules detected vs. % LOC read; the “optimal” curve reaches 100% of defects at X% LOC, above “Manual” and the threshold.]

Page 5: Learning near optimum inspection policies

Sub-optimum, useful automatic detector

Triggers on Y% of the code, not all of which is defective. Useful if above the manual and threshold curves.

[Chart: % defective modules detected vs. % LOC read; the “useful” curve lies above “Manual” and the threshold but below “optimal”, with markers at X% and Y% LOC.]

Page 6: Learning near optimum inspection policies

Comparing two detectors

• Report detector performance as area = AUC(detector) / AUC(optimal)
• 0 <= area <= 1 (larger is better)
• For 10 data sets, 10 randomizations of ordering, 3-way hold-outs (66% train, 33% test):
  – 300 numbers for each detector
  – compare with Mann-Whitney (99% confidence)

[Chart: “optimal”, “detector1”, and “detector 2” curves of % defective modules detected vs. % LOC read.]
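The area metric above can be sketched as follows, assuming a detector's performance curve is given as (% LOC read, % defects found) points; the function names and toy data are illustrative, not from the talk:

```python
# Sketch of the evaluation metric: score a detector by the area under its
# "% defective modules detected vs. % LOC read" curve, normalized by the
# area under the optimal detector's curve.

def auc(points):
    """Trapezoidal area under a curve of (x, y) points, x in [0, 100]."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def area(detector_curve, optimal_curve):
    """Normalized area: 0 <= area <= 1, larger is better."""
    return auc(detector_curve) / auc(optimal_curve)

# Toy example: the optimal detector finds every defect after reading 20%
# of the LOC; this detector finds defects linearly.
optimal = [(0, 0), (20, 100), (100, 100)]
detector = [(0, 0), (100, 100)]
print(round(area(detector, optimal), 3))  # 5000 / 9000 -> 0.556
```

Collecting this number over every data set, randomization, and hold-out gives the 300 values per detector that the talk compares with Mann-Whitney.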

Page 7: Learning near optimum inspection policies

Technical details

We don't know the trajectory from Y% read to 100% read. We'll make the most pessimistic assumption (so our results are better than what we report below).

Other assumptions:
• All bugs treated equally (no concept of defect severity)
• Inspections are some fixed percentage effective at recognizing defective modules (and since we report the ratio of two AUC curves, this cancels out)
• So these results are independent of inspection effectiveness

[Chart: a detector curve of % defective modules detected vs. % LOC read, with the unknown trajectory between Y% and 100% read.]
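One plausible reading of that pessimistic assumption (my interpretation; the slides do not spell it out): after the detector's last trigger at Y% LOC read, assume nothing further is found until all the code has been read. Sketched with the same illustrative (% LOC read, % defects found) curve format:

```python
# Hypothetical sketch: extend a detector curve pessimistically with a flat
# tail from its last trigger at Y% out to 100% LOC read. The flat tail can
# only lower the computed AUC, so real performance is at least as good as
# what this reports.

def pessimistic_extension(curve):
    """curve: list of (%LOC read, %defects found), last trigger at Y%."""
    y_loc, y_found = curve[-1]
    if y_loc < 100.0:
        # No further defects found until 100% of the code is read.
        curve = curve + [(100.0, y_found), (100.0, 100.0)]
    return curve

print(pessimistic_extension([(0.0, 0.0), (40.0, 60.0)]))
```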

Page 8: Learning near optimum inspection policies

Three classes of detectors

• Manual methods
  – Manual up (inspect smallest modules first)
  – Manual down (inspect largest first)
• Traditional learners
  – J48, NaiveBayes, RIPPER
• A new learner
  – Different versions of WHICH
  – E.g. WHICH2loc discretizes the log of the numbers into two bins and favors rules that select the least LOC
  – E.g. WHICH8 discretizes the log of the numbers into 8 bins
• For each learner
  – Take the modules selected via learning
  – Sort them by LOC
  – Inspect them smallest to largest
  – Track when we stumble over a module with defects
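The per-learner evaluation loop above can be sketched as follows; the function name and all data are invented for illustration:

```python
# Minimal sketch: take the modules a learner selected, sort them by LOC
# (smallest first), inspect in that order, and record the inspection curve.

def inspection_curve(selected, total_loc, total_defective):
    """selected: list of (loc, is_defective) for the flagged modules.
    Returns (%LOC read, %defective modules found) after each inspection."""
    curve = [(0.0, 0.0)]
    loc_read = found = 0
    for loc, defective in sorted(selected):  # smallest modules first
        loc_read += loc
        found += defective
        curve.append((100.0 * loc_read / total_loc,
                      100.0 * found / total_defective))
    return curve

# Toy data: three flagged modules in a 1000-LOC system with 4 defective modules.
print(inspection_curve([(200, 1), (50, 1), (120, 0)],
                       total_loc=1000, total_defective=4))
```

Feeding this curve into the normalized-area metric of the earlier slides scores the learner.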

Page 9: Learning near optimum inspection policies

What is WHICH?

• WHICH = our new idea
  – Technically: a stochastic best-first search, or SBFS
  – This type of search is implemented not with a tree, but with a stack
• Motto of WHICH: start as you mean to go on
  – If the learned theory is to be assessed via criteria “P”
  – Use “P” at every step of growing and pruning the theory
• Note: standard learners
  – Grow/prune via criteria “Q”, then assess the learned theory via criteria “P”

Page 10: Learning near optimum inspection policies

The logic of WHICH

• If the red path in the above tree is a current rule that is scoring very well (via “P”), and the blue path is another rule that is also scoring well, why not skip adding one conjunction at a time?
• Instead, combine the two paths so far and see if that works out better.
• This essentially skips part of the growing and moves right to a potentially more optimal solution.

Page 11: Learning near optimum inspection policies

WHICH Implementation

• Items in a stack are scored and sorted via criteria “P”.
• Two rules are selected from the stack at random, biased by their scores, and combined.
• The new rule is then scored and placed back in the stack, in sorted order.

[Stack: outlook=overcast; humidity=high; rain=true; humidity=low; rain=false; ... Combined rule: outlook=overcast AND rain=true]

Page 12: Learning near optimum inspection policies

WHICH Implementation, continued

• New rules that score high have a better chance of being combined.
• This leads to bigger rules over time.
• The process is repeated until either
  – a total number of picks is reached,
  – or a criterion is met (an early stopping condition).

[Stack: outlook=overcast; humidity=high; outlook=overcast AND rain=true; rain=true; humidity=low; rain=false. New rule: humidity=high AND outlook=overcast AND rain=true]

Page 13: Learning near optimum inspection policies

WHICH Summary

• WHICH initially creates a sorted stack of all attribute ranges in isolation.
• It then, based on score, randomly selects two rules from the stack, combines them, and places the new rule in the stack in sorted order.
• It continues to do this until a stopping criterion is met.
• WHICH supports both conjunctions and disjunctions.
• If the two rules selected contain different ranges of the same attribute, those ranges are OR'd together instead of AND'd:

outlook=sunny AND rain=true
combined with outlook=overcast
gives outlook = [ sunny OR overcast ] AND rain=true
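As a sketch, the loop summarized above might look like the following. The rule representation (attribute -> set of acceptable ranges), the scoring interface, and all data are my invention for illustration, not the authors' implementation; scores are assumed non-negative:

```python
import random

# Hypothetical, highly simplified sketch of WHICH: a stack of rules kept
# sorted by the assessment criterion "P" (the caller's score function);
# repeatedly pick two rules at random, biased by score, combine them, and
# reinsert the result in sorted order.

def combine(rule1, rule2):
    """Same attribute => OR the ranges; different attributes => AND."""
    merged = {attr: set(ranges) for attr, ranges in rule1.items()}
    for attr, ranges in rule2.items():
        merged.setdefault(attr, set()).update(ranges)
    return merged

def which(seed_rules, score, picks=100):
    stack = sorted(seed_rules, key=score, reverse=True)
    for _ in range(picks):
        # Score-biased random pick: high scorers are combined more often,
        # so rules tend to grow bigger over time.
        weights = [score(r) + 1e-9 for r in stack]  # assumes score >= 0
        r1, r2 = random.choices(stack, weights=weights, k=2)
        stack.append(combine(r1, r2))
        stack.sort(key=score, reverse=True)  # keep best-first order
    return stack[0]  # best rule found

# The slide's example: outlook=sunny AND rain=true, combined with
# outlook=overcast, gives outlook=[sunny OR overcast] AND rain=true.
print(combine({"outlook": {"sunny"}, "rain": {"true"}},
              {"outlook": {"overcast"}}))
```

A real implementation would also add the early-stopping condition from the previous slide rather than always running a fixed number of picks.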

Page 14: Learning near optimum inspection policies

Sample results

[Chart: curves for WHICH2, Manual down, Manual up, and the others.]

But how representative are these results?

Page 15: Learning near optimum inspection policies

Results type #1: 5/8 examples

WHICH > manual > traditional

Page 16: Learning near optimum inspection policies

“areas” in cm1 (min, 25%, 50%, 75%, max percentiles)
which2, 0.0, 57.4, 68.1, 71.5, 81.5
manualUp, 48.3, 57.4, 59.8, 65.3, 71.5
nBayes, 36.2, 46.0, 52.1, 59.1, 69.2
manualDown, 33.6, 40.3, 47.6, 49.3, 60.2
which8loc, 0.0, 0.0, 0.0, 0.0, 16.1
which8, 0.0, 0.0, 11.4, 26.2, 35.6
which4loc, 0.0, 0.0, 0.0, 0.0, 10.4
which4, 0.0, 0.0, 0.0, 41.2, 69.0
which2loc, 0.0, 0.0, 0.0, 0.0, 40.7
jRip, 0.0, 0.0, 5.8, 11.5, 24.1
j48, 0.0, 0.0, 0.1, 12.9, 33.3

#key, ties, win, loss, win-loss @ 99%
which2, 1, 9, 0, 9
manualUp, 1, 9, 0, 9
nBayes, 0, 8, 2, 6
manualDown, 0, 7, 3, 4
which8, 3, 3, 4, -1
which4, 3, 3, 4, -1
jRip, 3, 3, 4, -1
j48, 3, 3, 4, -1
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

1. Distributions of results

2. Statistical results comparing the distributions (which has the largest median ranked values?)
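The win/tie/loss bookkeeping behind these tables can be sketched as follows: each learner's "area" scores are compared against every other learner's, and a comparison that is not statistically significant counts as a tie. The talk uses Mann-Whitney at 99% confidence; the significance test below is a crude stand-in (median comparison), purely for illustration:

```python
from statistics import median

def significantly_different(xs, ys):
    # Stand-in for Mann-Whitney @ 99% confidence: just compare medians.
    return median(xs) != median(ys)

def win_tie_loss(results):
    """results: dict mapping learner name -> list of area scores.
    Returns ties/win/loss counts per learner, as in the slides' tables."""
    table = {k: {"ties": 0, "win": 0, "loss": 0} for k in results}
    for a in results:
        for b in results:
            if a == b:
                continue
            if not significantly_different(results[a], results[b]):
                table[a]["ties"] += 1
            elif median(results[a]) > median(results[b]):
                table[a]["win"] += 1
            else:
                table[a]["loss"] += 1
    return table
```

Sorting learners by win minus loss gives the ranking shown in each "#key" table.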

Page 17: Learning near optimum inspection policies

“areas” in KC1 (min, 25%, 50%, 75%, max percentiles)
which2, 71.4, 73.8, 76.0, 78.0, 81.8
manualUp, 64.5, 65.8, 67.6, 68.9, 70.0
nBayes, 54.9, 60.2, 61.9, 63.0, 67.7
which4, 0.0, 49.8, 52.9, 55.2, 60.5
manualDown, 39.7, 42.2, 43.3, 45.2, 47.7
j48, 11.6, 20.5, 27.8, 31.7, 40.1
jRip, 10.2, 17.3, 21.3, 25.2, 32.4
which8loc, 0.0, 0.0, 0.0, 1.0, 2.2
which8, 0.0, 0.0, 0.0, 2.0, 33.9
which4loc, 0.0, 0.0, 0.0, 0.0, 1.1
which2loc, 0.0, 0.0, 0.0, 0.0, 2.1

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
manualUp, 0, 9, 1, 8
nBayes, 0, 8, 2, 6
which4, 0, 7, 3, 4
manualDown, 0, 6, 4, 2
j48, 0, 5, 5, 0
jRip, 0, 4, 6, -2
which8loc, 1, 2, 7, -5
which8, 3, 0, 7, -7
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

Page 18: Learning near optimum inspection policies

“areas” in KC2 (min, 25%, 50%, 75%, max percentiles)
which2, 65.6, 76.0, 81.6, 84.6, 88.5
manualUp, 57.9, 65.4, 69.3, 71.0, 76.6
nBayes, 47.0, 54.8, 58.7, 61.0, 69.4
which4, 43.1, 52.5, 59.4, 66.8, 79.6
manualDown, 37.9, 43.1, 46.1, 52.3, 62.4
which8, 26.3, 36.5, 41.2, 47.6, 56.5
j48, 26.0, 36.1, 41.2, 45.9, 59.8
jRip, 22.2, 36.0, 42.2, 49.5, 65.2
which8loc, 0.0, 0.0, 0.0, 0.0, 5.9
which4loc, 0.0, 0.0, 0.0, 0.0, 2.9
which2loc, 0.0, 0.0, 0.0, 0.0, 3.1

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
manualUp, 0, 9, 1, 8
which4, 1, 7, 2, 5
nBayes, 1, 7, 2, 5
manualDown, 1, 5, 4, 1
jRip, 3, 3, 4, -1
which8, 2, 3, 5, -2
j48, 2, 3, 5, -2
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

Page 19: Learning near optimum inspection policies

“areas” in MW1_mod (min, 25%, 50%, 75%, max percentiles)
which2, 35.8, 57.4, 62.4, 70.8, 83.3
manualDown, 42.8, 52.1, 60.2, 63.7, 71.8
manualUp, 37.1, 44.0, 47.8, 51.9, 62.5
which8, 0.1, 35.6, 39.3, 47.6, 60.4
nBayes, 19.5, 33.1, 41.7, 47.7, 62.1
which4, 0.0, 25.8, 42.7, 49.8, 60.6
j48, 0.0, 10.0, 20.0, 24.3, 42.9
jRip, 0.0, 7.9, 15.8, 31.2, 49.4
which8loc, 0.0, 0.0, 0.0, 0.0, 10.5
which4loc, 0.0, 0.0, 0.0, 0.0, 10.4
which2loc, 0.0, 0.0, 0.0, 0.0, 25.6

#key, ties, win, loss, win-loss @ 99%
which2, 1, 9, 0, 9
manualDown, 1, 9, 0, 9
manualUp, 2, 6, 2, 4
which4, 3, 5, 2, 3
nBayes, 3, 5, 2, 3
which8, 2, 5, 3, 2
jRip, 1, 3, 6, -3
j48, 1, 3, 6, -3
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

manual down wins?

Page 20: Learning near optimum inspection policies

“areas” in PC1 (min, 25%, 50%, 75%, max percentiles)
which2, 0.0, 0.0, 65.0, 71.1, 81.8
manualUp, 52.1, 58.4, 60.6, 63.4, 71.6
nBayes, 36.4, 46.1, 51.5, 53.4, 60.9
manualDown, 32.3, 41.9, 44.6, 46.2, 55.3
j48, 3.1, 12.5, 19.2, 24.6, 41.5
jRip, 0.0, 11.0, 15.1, 23.2, 30.8
which8, 0.0, 9.1, 22.6, 30.7, 47.7
which8loc, 0.0, 0.0, 7.4, 12.7, 22.1
which4loc, 0.0, 0.0, 3.8, 14.8, 30.3
which4, 0.0, 0.0, 0.0, 50.3, 59.0
which2loc, 0.0, 0.0, 0.0, 9.7, 26.3

#key, ties, win, loss, win-loss @ 99%
manualUp, 1, 9, 0, 9
which2, 2, 8, 0, 8
nBayes, 1, 8, 1, 7
manualDown, 1, 6, 3, 3
which8, 3, 3, 4, -1
jRip, 3, 3, 4, -1
j48, 3, 3, 4, -1
which4, 7, 0, 3, -3
which8loc, 3, 0, 7, -7
which4loc, 3, 0, 7, -7
which2loc, 3, 0, 7, -7

Page 21: Learning near optimum inspection policies

Results type #2: 2/8 examples

Manual worse than (WHICH or traditional data miners)

Page 22: Learning near optimum inspection policies

“areas” in KC3_mod (min, 25%, 50%, 75%, max percentiles)
which2, 73.3, 82.4, 87.3, 90.5, 95.4
nBayes, 45.5, 59.2, 64.2, 69.6, 75.4
manualUp, 50.7, 57.5, 64.2, 68.1, 77.4
which4, 0.0, 40.5, 47.8, 58.6, 67.2
manualDown, 31.3, 39.5, 47.6, 55.6, 66.8
which8, 0.0, 36.2, 46.7, 52.7, 62.1
j48, 0.0, 13.6, 23.1, 28.9, 42.6
jRip, 0.0, 13.1, 17.7, 23.9, 54.2
which8loc, 0.0, 0.0, 0.0, 0.0, 43.0
which4loc, 0.0, 0.0, 0.0, 8.3, 19.7
which2loc, 0.0, 0.0, 6.6, 18.9, 39.9

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
nBayes, 1, 8, 1, 7
manualUp, 1, 8, 1, 7
which8, 2, 5, 3, 2
which4, 2, 5, 3, 2
manualDown, 2, 5, 3, 2
j48, 1, 3, 6, -3
jRip, 2, 2, 6, -4
which2loc, 2, 1, 7, -6
which4loc, 2, 0, 8, -8
which8loc, 1, 0, 9, -9

Page 23: Learning near optimum inspection policies

“areas” in PC3_mod (min, 25%, 50%, 75%, max percentiles)
which2, 70.6, 76.0, 79.3, 82.7, 88.4
nBayes, 58.8, 63.0, 67.4, 69.0, 75.4
which4, 56.2, 62.2, 65.3, 68.3, 77.5
manualDown, 48.9, 55.3, 57.5, 60.1, 65.2
manualUp, 43.1, 47.7, 49.9, 52.4, 59.0
j48, 0.0, 17.4, 22.7, 26.3, 36.5
which8, 0.0, 13.6, 31.9, 36.7, 43.7
jRip, 0.0, 6.3, 12.5, 19.4, 34.4
which4loc, 0.0, 2.1, 5.6, 9.8, 16.4
which8loc, 0.0, 0.0, 0.0, 4.1, 16.1
which2loc, 0.0, 0.0, 1.9, 6.6, 21.5

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
which4, 1, 8, 1, 7
nBayes, 1, 8, 1, 7
manualDown, 0, 7, 3, 4
manualUp, 0, 6, 4, 2
which8, 1, 4, 5, -1
j48, 1, 4, 5, -1
jRip, 0, 3, 7, -4
which4loc, 1, 1, 8, -7
which2loc, 2, 0, 8, -8
which8loc, 1, 0, 9, -9

manual down wins?

Page 24: Learning near optimum inspection policies

Once, Manual beats (WHICH or traditional data miners)

Page 25: Learning near optimum inspection policies

“areas” in MC2_mod (min, 25%, 50%, 75%, max percentiles)
manualUp, 63.3, 70.9, 74.3, 78.3, 80.4
nBayes, 21.4, 46.6, 55.9, 59.1, 79.1
manualDown, 29.7, 38.1, 42.8, 47.2, 57.4
j48, 21.9, 29.3, 43.7, 55.4, 69.7
jRip, 12.7, 17.0, 28.5, 35.2, 56.4
which8, 0.0, 11.2, 21.9, 27.4, 42.4
which8loc, 0.0, 0.0, 0.0, 0.0, 29.8
which4loc, 0.0, 0.0, 0.0, 5.6, 14.9
which4, 0.0, 0.0, 5.6, 25.3, 47.9
which2loc, 0.0, 0.0, 0.0, 0.0, 21.0
which2, 0.0, 0.0, 0.0, 40.8, 99.7

#key, ties, win, loss, win-loss @ 99%
manualUp, 0, 10, 0, 10
nBayes, 0, 9, 1, 8
manualDown, 1, 7, 2, 5
j48, 1, 7, 2, 5
jRip, 1, 5, 4, 1
which8, 3, 3, 4, -1
which4, 4, 1, 5, -4
which2, 5, 0, 5, -5
which4loc, 4, 0, 6, -6
which2loc, 4, 0, 6, -6
which8loc, 3, 0, 7, -7

Page 26: Learning near optimum inspection policies

Overall

WHICH2 > manual > traditional

Page 27: Learning near optimum inspection policies

Across all data sets (min, 25%, 50%, 75%, max percentiles)
which2, 0.0, 66.8, 77.6, 85.6, 99.7
manualUp, 37.1, 56.5, 63.7, 70.2, 80.4
nBayes, 19.5, 52.9, 61.2, 69.6, 82.4
manualDown, 29.7, 42.3, 46.4, 53.4, 71.8
which4, 0.0, 35.6, 53.7, 63.9, 96.7
which8, 0.0, 18.6, 35.5, 47.0, 92.5
j48, 0.0, 18.3, 27.9, 42.9, 72.0
jRip, 0.0, 13.3, 23.9, 39.7, 65.2
which8loc, 0.0, 0.0, 0.0, 6.7, 92.5
which4loc, 0.0, 0.0, 0.0, 9.8, 96.7
which2loc, 0.0, 0.0, 0.0, 11.2, 97.0

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
nBayes, 1, 8, 1, 7
manualUp, 1, 8, 1, 7
which4, 0, 7, 3, 4
manualDown, 0, 6, 4, 2
which8, 1, 4, 5, -1
j48, 1, 4, 5, -1
jRip, 0, 3, 7, -4
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

Page 28: Learning near optimum inspection policies

Conclusions

Page 29: Learning near optimum inspection policies

Overall

• Don't assess learners without a usage context
  – Here: context = “read less, find more”
• Some support for the Koru hypothesis
• Value of manual (up or down) questionable
  – Only outstandingly better in one data set
  – And worse than other methods in 4/10 data sets
• WHICH2
  – The general winner
  – Near optimum:
    • Min: 0%
    • Lower quartile: 67%
    • Median: 78%
    • 3rd quartile: 86%
    • Max: 99%

Still room for improvement

Page 30: Learning near optimum inspection policies

Early stopping rules (useful, a little interesting)

[Chart: “optimal” and “detector1” curves of % defective modules detected vs. % LOC read.]

Watch inspection rules to learn when enough is enough.
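A hypothetical early-stopping rule in this spirit: stop once the defect yield of the last few inspections falls below a cutoff. The window and cutoff parameters are invented for illustration, not from the talk:

```python
# Hypothetical sketch: watch inspection outcomes in order and stop when
# the recent hit rate drops below a cutoff ("when enough is enough").

def stop_point(outcomes, window=5, cutoff=0.1):
    """outcomes: 1 if the i-th inspected module had defects, else 0
    (in inspection order). Returns how many modules to inspect before
    stopping, or None if the yield never drops below the cutoff."""
    for i in range(window, len(outcomes) + 1):
        if sum(outcomes[i - window:i]) / window < cutoff:
            return i
    return None

print(stop_point([1, 1, 0, 1, 0, 0, 0, 0, 0, 0]))  # -> 9
```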

Page 31: Learning near optimum inspection policies

Learning the actual number of defects (very useful, very interesting)

[Chart: % defective modules detected vs. % LOC read. Curve1 = optimal, built from the real defects; curve2 = inspections.]

Q: Can we learn curve1 from watching the growth of curve2?

A: Maybe. WHICH2's (50%, 75%) percentiles = (79%, 86%), i.e. curve2 is getting pretty close to curve1.