Learning near optimum inspection policies

[email protected] (WVU)
Zach Milton, WVU
Feb 5, 2008

CS, NC State

High assurance software requires extensive and expensive assessment. There are many forms of software assessment, ranging from manual
inspections to automatic formal methods. These assessment methods differ in their effectiveness and the effort required to apply them. Typically, the more effective methods are more expensive. Hence, project managers often "bias" the assessment resources and apply more effort where they think that extra effort might be most useful.

If most of the assessment effort explores project artifacts A,B,C,D, then that leaves a "blind spot" in E,F,G,H,I,.... Blind spots can compromise high assurance software. It is therefore important to
discuss the bias introduced by the inspection policy. To put the matter in a nutshell, we need to ask "how blinding is our bias?"

This talk contrasts three different kinds of "bias" in selecting what code modules to inspect:

1) manual methods such as "read the biggest thing first/last"

2) traditional data mining methods such as those advocated by the author and those deployed in NASA-funded inspection tools.

3) a new data miner called "WHICH"

We find that #1 usually outperforms #2. This result calls into question many years of research by the speaker (translation: "oh dear.....").

But we also find that #3 almost always outperforms #1 or #2 (translation: "phew!!").

In fact #3 works so well that we speculate that it could be used as a proxy for determining the actual number of defects remaining to be found, after inspecting Z% of the code.

ABOUT THE SPEAKER: Dr. Tim Menzies ([email protected]) has been working on advanced modeling and AI since 1986. He received his PhD from the University of New South Wales, Sydney, Australia and is the author of over 160 refereed papers. A former research chair for NASA, Dr. Menzies is now an associate professor at West Virginia University's Lane Department of Computer Science and Electrical Engineering. For more information, visit his web page at http://menzies.us.
Transcript
Page 1: Learning near optimum inspection policies

Learning near optimum inspection policies

[email protected] (WVU)

Zach Milton, WVU

Feb 5, 2008

Page 2: Learning near optimum inspection policies

The Briand Threshold

[Chart: % defective modules detected vs. % LOC read, running to (100,100). Goal: stay over the threshold line.]

Page 3: Learning near optimum inspection policies

“Manual Up”: the Koru Hypothesis

Smaller modules have disproportionately more defects. If so, then we'll find more bugs sooner if we read “manualUp” (i.e. read smallest modules first).

[Chart: % defective modules detected vs. % LOC read; the “Manual” curve climbs above the threshold toward (100,100).]

Page 4: Learning near optimum inspection policies

Optimum Detector

X% of the code is in defective modules. Some perfect oracle finds all the defective modules; when we inspect them manualUp, we find all the defects after reading X% of the code.

[Chart: % defective modules detected vs. % LOC read; the “optimal” curve reaches 100% of defects at X% LOC, above “Manual” and the threshold.]

Page 5: Learning near optimum inspection policies

Sub-optimum, useful automatic detector

Triggers on Y% of the code, not all of which is defective. Useful if above the manual and threshold curves.

[Chart: % defective modules detected vs. % LOC read; the “useful” curve lies above “Manual” and the threshold but below “optimal”, with markers at X% and Y% LOC.]

Page 6: Learning near optimum inspection policies

Comparing two detectors

• Report detector performance as area = AUC(detector) / AUC(optimal)
• 0 <= area <= 1 (larger is better)
• For 10 data sets, 10 randomizations of ordering, 3-way hold-outs (66% train, 33% test):
  – 300 numbers for each detector
  – compare with Mann-Whitney (99% confidence)

[Chart: “optimal”, “detector1”, and “detector 2” curves of % defective modules detected vs. % LOC read.]
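The area metric above can be sketched as follows, assuming a detector's performance curve is given as (% LOC read, % defects found) points; the function names and toy data are illustrative, not from the talk:

```python
# Sketch of the evaluation metric: score a detector by the area under its
# "% defective modules detected vs. % LOC read" curve, normalized by the
# area under the optimal detector's curve.

def auc(points):
    """Trapezoidal area under a curve of (x, y) points, x in [0, 100]."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def area(detector_curve, optimal_curve):
    """Normalized area: 0 <= area <= 1, larger is better."""
    return auc(detector_curve) / auc(optimal_curve)

# Toy example: the optimal detector finds every defect after reading 20%
# of the LOC; this detector finds defects linearly.
optimal = [(0, 0), (20, 100), (100, 100)]
detector = [(0, 0), (100, 100)]
print(round(area(detector, optimal), 3))  # 5000 / 9000 -> 0.556
```

Collecting this number over every data set, randomization, and hold-out gives the 300 values per detector that the talk compares with Mann-Whitney.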

Page 7: Learning near optimum inspection policies

Technical details

We don't know the trajectory from Y% read to 100% read. We'll make the most pessimistic assumption (so our results are better than what we report below).

Other assumptions:
• All bugs treated equally (no concept of defect severity)
• Inspections are some fixed percentage effective at recognizing defective modules (and since we report the ratio of two AUC curves, this cancels out)
• So these results are independent of inspection effectiveness

[Chart: a detector curve of % defective modules detected vs. % LOC read, with the unknown trajectory between Y% and 100% read.]
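One plausible reading of that pessimistic assumption (my interpretation; the slides do not spell it out): after the detector's last trigger at Y% LOC read, assume nothing further is found until all the code has been read. Sketched with the same illustrative (% LOC read, % defects found) curve format:

```python
# Hypothetical sketch: extend a detector curve pessimistically with a flat
# tail from its last trigger at Y% out to 100% LOC read. The flat tail can
# only lower the computed AUC, so real performance is at least as good as
# what this reports.

def pessimistic_extension(curve):
    """curve: list of (%LOC read, %defects found), last trigger at Y%."""
    y_loc, y_found = curve[-1]
    if y_loc < 100.0:
        # No further defects found until 100% of the code is read.
        curve = curve + [(100.0, y_found), (100.0, 100.0)]
    return curve

print(pessimistic_extension([(0.0, 0.0), (40.0, 60.0)]))
```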

Page 8: Learning near optimum inspection policies

Three classes of detectors

• Manual methods
  – Manual up (inspect smallest modules first)
  – Manual down (inspect largest first)
• Traditional learners
  – J48, NaiveBayes, RIPPER
• A new learner
  – Different versions of WHICH
  – E.g. WHICH2loc discretizes the log of the numbers into two bins and favors rules that select the least LOC
  – E.g. WHICH8 discretizes the log of the numbers into 8 bins
• For each learner
  – Take the modules selected via learning
  – Sort them by LOC
  – Inspect them smallest to largest
  – Track when we stumble over a module with defects
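The per-learner evaluation loop above can be sketched as follows; the function name and all data are invented for illustration:

```python
# Minimal sketch: take the modules a learner selected, sort them by LOC
# (smallest first), inspect in that order, and record the inspection curve.

def inspection_curve(selected, total_loc, total_defective):
    """selected: list of (loc, is_defective) for the flagged modules.
    Returns (%LOC read, %defective modules found) after each inspection."""
    curve = [(0.0, 0.0)]
    loc_read = found = 0
    for loc, defective in sorted(selected):  # smallest modules first
        loc_read += loc
        found += defective
        curve.append((100.0 * loc_read / total_loc,
                      100.0 * found / total_defective))
    return curve

# Toy data: three flagged modules in a 1000-LOC system with 4 defective modules.
print(inspection_curve([(200, 1), (50, 1), (120, 0)],
                       total_loc=1000, total_defective=4))
```

Feeding this curve into the normalized-area metric of the earlier slides scores the learner.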

Page 9: Learning near optimum inspection policies

What is WHICH?

• WHICH = our new idea
  – Technically: a stochastic best-first search, or SBFS
  – This type of search is implemented not with a tree, but with a stack
• Motto of WHICH: start as you mean to go on
  – If the learned theory is to be assessed via criteria “P”
  – Use “P” at every step of growing and pruning the theory
• Note: standard learners
  – Grow/prune via criteria “Q”, then assess the learned theory via criteria “P”

Page 10: Learning near optimum inspection policies

The logic of WHICH

• If the red path in the above tree is a current rule that is scoring very well (via “P”), and the blue path is another rule that is also scoring well, why not skip adding one conjunction at a time?
• Instead, combine the two paths so far and see if that works out better.
• This essentially skips part of the growing and moves right to a potentially more optimal solution.

Page 11: Learning near optimum inspection policies

WHICH Implementation

• Items in a stack are scored and sorted via criteria “P”.
• Two rules are selected from the stack at random, biased by their scores, and combined.
• The new rule is then scored and placed back in the stack, in sorted order.

[Stack: outlook=overcast; humidity=high; rain=true; humidity=low; rain=false; ... Combined rule: outlook=overcast AND rain=true]

Page 12: Learning near optimum inspection policies

WHICH Implementation, continued

• New rules that score high have a better chance of being combined.
• This leads to bigger rules over time.
• The process is repeated until either
  – a total number of picks is reached,
  – or a criterion is met (an early stopping condition).

[Stack: outlook=overcast; humidity=high; outlook=overcast AND rain=true; rain=true; humidity=low; rain=false. New rule: humidity=high AND outlook=overcast AND rain=true]

Page 13: Learning near optimum inspection policies

WHICH Summary

• WHICH initially creates a sorted stack of all attribute ranges in isolation.
• It then, based on score, randomly selects two rules from the stack, combines them, and places the new rule in the stack in sorted order.
• It continues to do this until a stopping criterion is met.
• WHICH supports both conjunctions and disjunctions.
• If the two rules selected contain different ranges of the same attribute, those ranges are OR'd together instead of AND'd:

outlook=sunny AND rain=true
combined with outlook=overcast
gives outlook = [ sunny OR overcast ] AND rain=true
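As a sketch, the loop summarized above might look like the following. The rule representation (attribute -> set of acceptable ranges), the scoring interface, and all data are my invention for illustration, not the authors' implementation; scores are assumed non-negative:

```python
import random

# Hypothetical, highly simplified sketch of WHICH: a stack of rules kept
# sorted by the assessment criterion "P" (the caller's score function);
# repeatedly pick two rules at random, biased by score, combine them, and
# reinsert the result in sorted order.

def combine(rule1, rule2):
    """Same attribute => OR the ranges; different attributes => AND."""
    merged = {attr: set(ranges) for attr, ranges in rule1.items()}
    for attr, ranges in rule2.items():
        merged.setdefault(attr, set()).update(ranges)
    return merged

def which(seed_rules, score, picks=100):
    stack = sorted(seed_rules, key=score, reverse=True)
    for _ in range(picks):
        # Score-biased random pick: high scorers are combined more often,
        # so rules tend to grow bigger over time.
        weights = [score(r) + 1e-9 for r in stack]  # assumes score >= 0
        r1, r2 = random.choices(stack, weights=weights, k=2)
        stack.append(combine(r1, r2))
        stack.sort(key=score, reverse=True)  # keep best-first order
    return stack[0]  # best rule found

# The slide's example: outlook=sunny AND rain=true, combined with
# outlook=overcast, gives outlook=[sunny OR overcast] AND rain=true.
print(combine({"outlook": {"sunny"}, "rain": {"true"}},
              {"outlook": {"overcast"}}))
```

A real implementation would also add the early-stopping condition from the previous slide rather than always running a fixed number of picks.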

Page 14: Learning near optimum inspection policies

Sample results

[Chart: curves for WHICH2, Manual down, Manual up, and the others.]

But how representative are these results?

Page 15: Learning near optimum inspection policies

Results type #1: 5/8 examples

WHICH > manual > traditional

Page 16: Learning near optimum inspection policies

“areas” in cm1 (min, 25%, 50%, 75%, max percentiles)
which2, 0.0, 57.4, 68.1, 71.5, 81.5
manualUp, 48.3, 57.4, 59.8, 65.3, 71.5
nBayes, 36.2, 46.0, 52.1, 59.1, 69.2
manualDown, 33.6, 40.3, 47.6, 49.3, 60.2
which8loc, 0.0, 0.0, 0.0, 0.0, 16.1
which8, 0.0, 0.0, 11.4, 26.2, 35.6
which4loc, 0.0, 0.0, 0.0, 0.0, 10.4
which4, 0.0, 0.0, 0.0, 41.2, 69.0
which2loc, 0.0, 0.0, 0.0, 0.0, 40.7
jRip, 0.0, 0.0, 5.8, 11.5, 24.1
j48, 0.0, 0.0, 0.1, 12.9, 33.3

#key, ties, win, loss, win-loss @ 99%
which2, 1, 9, 0, 9
manualUp, 1, 9, 0, 9
nBayes, 0, 8, 2, 6
manualDown, 0, 7, 3, 4
which8, 3, 3, 4, -1
which4, 3, 3, 4, -1
jRip, 3, 3, 4, -1
j48, 3, 3, 4, -1
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

1. Distributions of results

2. Statistical results comparing the distributions (which has the largest median ranked values?)
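The win/tie/loss bookkeeping behind these tables can be sketched as follows: each learner's "area" scores are compared against every other learner's, and a comparison that is not statistically significant counts as a tie. The talk uses Mann-Whitney at 99% confidence; the significance test below is a crude stand-in (median comparison), purely for illustration:

```python
from statistics import median

def significantly_different(xs, ys):
    # Stand-in for Mann-Whitney @ 99% confidence: just compare medians.
    return median(xs) != median(ys)

def win_tie_loss(results):
    """results: dict mapping learner name -> list of area scores.
    Returns ties/win/loss counts per learner, as in the slides' tables."""
    table = {k: {"ties": 0, "win": 0, "loss": 0} for k in results}
    for a in results:
        for b in results:
            if a == b:
                continue
            if not significantly_different(results[a], results[b]):
                table[a]["ties"] += 1
            elif median(results[a]) > median(results[b]):
                table[a]["win"] += 1
            else:
                table[a]["loss"] += 1
    return table
```

Sorting learners by win minus loss gives the ranking shown in each "#key" table.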

Page 17: Learning near optimum inspection policies

“areas” in KC1 (min, 25%, 50%, 75%, max percentiles)
which2, 71.4, 73.8, 76.0, 78.0, 81.8
manualUp, 64.5, 65.8, 67.6, 68.9, 70.0
nBayes, 54.9, 60.2, 61.9, 63.0, 67.7
which4, 0.0, 49.8, 52.9, 55.2, 60.5
manualDown, 39.7, 42.2, 43.3, 45.2, 47.7
j48, 11.6, 20.5, 27.8, 31.7, 40.1
jRip, 10.2, 17.3, 21.3, 25.2, 32.4
which8loc, 0.0, 0.0, 0.0, 1.0, 2.2
which8, 0.0, 0.0, 0.0, 2.0, 33.9
which4loc, 0.0, 0.0, 0.0, 0.0, 1.1
which2loc, 0.0, 0.0, 0.0, 0.0, 2.1

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
manualUp, 0, 9, 1, 8
nBayes, 0, 8, 2, 6
which4, 0, 7, 3, 4
manualDown, 0, 6, 4, 2
j48, 0, 5, 5, 0
jRip, 0, 4, 6, -2
which8loc, 1, 2, 7, -5
which8, 3, 0, 7, -7
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

Page 18: Learning near optimum inspection policies

“areas” in KC2 (min, 25%, 50%, 75%, max percentiles)
which2, 65.6, 76.0, 81.6, 84.6, 88.5
manualUp, 57.9, 65.4, 69.3, 71.0, 76.6
nBayes, 47.0, 54.8, 58.7, 61.0, 69.4
which4, 43.1, 52.5, 59.4, 66.8, 79.6
manualDown, 37.9, 43.1, 46.1, 52.3, 62.4
which8, 26.3, 36.5, 41.2, 47.6, 56.5
j48, 26.0, 36.1, 41.2, 45.9, 59.8
jRip, 22.2, 36.0, 42.2, 49.5, 65.2
which8loc, 0.0, 0.0, 0.0, 0.0, 5.9
which4loc, 0.0, 0.0, 0.0, 0.0, 2.9
which2loc, 0.0, 0.0, 0.0, 0.0, 3.1

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
manualUp, 0, 9, 1, 8
which4, 1, 7, 2, 5
nBayes, 1, 7, 2, 5
manualDown, 1, 5, 4, 1
jRip, 3, 3, 4, -1
which8, 2, 3, 5, -2
j48, 2, 3, 5, -2
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

Page 19: Learning near optimum inspection policies

“areas” in MW1_mod (min, 25%, 50%, 75%, max percentiles)
which2, 35.8, 57.4, 62.4, 70.8, 83.3
manualDown, 42.8, 52.1, 60.2, 63.7, 71.8
manualUp, 37.1, 44.0, 47.8, 51.9, 62.5
which8, 0.1, 35.6, 39.3, 47.6, 60.4
nBayes, 19.5, 33.1, 41.7, 47.7, 62.1
which4, 0.0, 25.8, 42.7, 49.8, 60.6
j48, 0.0, 10.0, 20.0, 24.3, 42.9
jRip, 0.0, 7.9, 15.8, 31.2, 49.4
which8loc, 0.0, 0.0, 0.0, 0.0, 10.5
which4loc, 0.0, 0.0, 0.0, 0.0, 10.4
which2loc, 0.0, 0.0, 0.0, 0.0, 25.6

#key, ties, win, loss, win-loss @ 99%
which2, 1, 9, 0, 9
manualDown, 1, 9, 0, 9
manualUp, 2, 6, 2, 4
which4, 3, 5, 2, 3
nBayes, 3, 5, 2, 3
which8, 2, 5, 3, 2
jRip, 1, 3, 6, -3
j48, 1, 3, 6, -3
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

manual down wins?

Page 20: Learning near optimum inspection policies

“areas” in PC1 (min, 25%, 50%, 75%, max percentiles)
which2, 0.0, 0.0, 65.0, 71.1, 81.8
manualUp, 52.1, 58.4, 60.6, 63.4, 71.6
nBayes, 36.4, 46.1, 51.5, 53.4, 60.9
manualDown, 32.3, 41.9, 44.6, 46.2, 55.3
j48, 3.1, 12.5, 19.2, 24.6, 41.5
jRip, 0.0, 11.0, 15.1, 23.2, 30.8
which8, 0.0, 9.1, 22.6, 30.7, 47.7
which8loc, 0.0, 0.0, 7.4, 12.7, 22.1
which4loc, 0.0, 0.0, 3.8, 14.8, 30.3
which4, 0.0, 0.0, 0.0, 50.3, 59.0
which2loc, 0.0, 0.0, 0.0, 9.7, 26.3

#key, ties, win, loss, win-loss @ 99%
manualUp, 1, 9, 0, 9
which2, 2, 8, 0, 8
nBayes, 1, 8, 1, 7
manualDown, 1, 6, 3, 3
which8, 3, 3, 4, -1
jRip, 3, 3, 4, -1
j48, 3, 3, 4, -1
which4, 7, 0, 3, -3
which8loc, 3, 0, 7, -7
which4loc, 3, 0, 7, -7
which2loc, 3, 0, 7, -7

Page 21: Learning near optimum inspection policies

Results type #2: 2/8 examples

Manual worse than (WHICH or traditional data miners)

Page 22: Learning near optimum inspection policies

“areas” in KC3_mod (min, 25%, 50%, 75%, max percentiles)
which2, 73.3, 82.4, 87.3, 90.5, 95.4
nBayes, 45.5, 59.2, 64.2, 69.6, 75.4
manualUp, 50.7, 57.5, 64.2, 68.1, 77.4
which4, 0.0, 40.5, 47.8, 58.6, 67.2
manualDown, 31.3, 39.5, 47.6, 55.6, 66.8
which8, 0.0, 36.2, 46.7, 52.7, 62.1
j48, 0.0, 13.6, 23.1, 28.9, 42.6
jRip, 0.0, 13.1, 17.7, 23.9, 54.2
which8loc, 0.0, 0.0, 0.0, 0.0, 43.0
which4loc, 0.0, 0.0, 0.0, 8.3, 19.7
which2loc, 0.0, 0.0, 6.6, 18.9, 39.9

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
nBayes, 1, 8, 1, 7
manualUp, 1, 8, 1, 7
which8, 2, 5, 3, 2
which4, 2, 5, 3, 2
manualDown, 2, 5, 3, 2
j48, 1, 3, 6, -3
jRip, 2, 2, 6, -4
which2loc, 2, 1, 7, -6
which4loc, 2, 0, 8, -8
which8loc, 1, 0, 9, -9

Page 23: Learning near optimum inspection policies

“areas” in PC3_mod (min, 25%, 50%, 75%, max percentiles)
which2, 70.6, 76.0, 79.3, 82.7, 88.4
nBayes, 58.8, 63.0, 67.4, 69.0, 75.4
which4, 56.2, 62.2, 65.3, 68.3, 77.5
manualDown, 48.9, 55.3, 57.5, 60.1, 65.2
manualUp, 43.1, 47.7, 49.9, 52.4, 59.0
j48, 0.0, 17.4, 22.7, 26.3, 36.5
which8, 0.0, 13.6, 31.9, 36.7, 43.7
jRip, 0.0, 6.3, 12.5, 19.4, 34.4
which4loc, 0.0, 2.1, 5.6, 9.8, 16.4
which8loc, 0.0, 0.0, 0.0, 4.1, 16.1
which2loc, 0.0, 0.0, 1.9, 6.6, 21.5

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
which4, 1, 8, 1, 7
nBayes, 1, 8, 1, 7
manualDown, 0, 7, 3, 4
manualUp, 0, 6, 4, 2
which8, 1, 4, 5, -1
j48, 1, 4, 5, -1
jRip, 0, 3, 7, -4
which4loc, 1, 1, 8, -7
which2loc, 2, 0, 8, -8
which8loc, 1, 0, 9, -9

manual down wins?

Page 24: Learning near optimum inspection policies

Once, Manual beats (WHICH or traditional data miners)

Page 25: Learning near optimum inspection policies

“areas” in MC2_mod (min, 25%, 50%, 75%, max percentiles)
manualUp, 63.3, 70.9, 74.3, 78.3, 80.4
nBayes, 21.4, 46.6, 55.9, 59.1, 79.1
manualDown, 29.7, 38.1, 42.8, 47.2, 57.4
j48, 21.9, 29.3, 43.7, 55.4, 69.7
jRip, 12.7, 17.0, 28.5, 35.2, 56.4
which8, 0.0, 11.2, 21.9, 27.4, 42.4
which8loc, 0.0, 0.0, 0.0, 0.0, 29.8
which4loc, 0.0, 0.0, 0.0, 5.6, 14.9
which4, 0.0, 0.0, 5.6, 25.3, 47.9
which2loc, 0.0, 0.0, 0.0, 0.0, 21.0
which2, 0.0, 0.0, 0.0, 40.8, 99.7

#key, ties, win, loss, win-loss @ 99%
manualUp, 0, 10, 0, 10
nBayes, 0, 9, 1, 8
manualDown, 1, 7, 2, 5
j48, 1, 7, 2, 5
jRip, 1, 5, 4, 1
which8, 3, 3, 4, -1
which4, 4, 1, 5, -4
which2, 5, 0, 5, -5
which4loc, 4, 0, 6, -6
which2loc, 4, 0, 6, -6
which8loc, 3, 0, 7, -7

Page 26: Learning near optimum inspection policies

Overall

WHICH2 > manual > traditional

Page 27: Learning near optimum inspection policies

Across all data sets (min, 25%, 50%, 75%, max percentiles)
which2, 0.0, 66.8, 77.6, 85.6, 99.7
manualUp, 37.1, 56.5, 63.7, 70.2, 80.4
nBayes, 19.5, 52.9, 61.2, 69.6, 82.4
manualDown, 29.7, 42.3, 46.4, 53.4, 71.8
which4, 0.0, 35.6, 53.7, 63.9, 96.7
which8, 0.0, 18.6, 35.5, 47.0, 92.5
j48, 0.0, 18.3, 27.9, 42.9, 72.0
jRip, 0.0, 13.3, 23.9, 39.7, 65.2
which8loc, 0.0, 0.0, 0.0, 6.7, 92.5
which4loc, 0.0, 0.0, 0.0, 9.8, 96.7
which2loc, 0.0, 0.0, 0.0, 11.2, 97.0

#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
nBayes, 1, 8, 1, 7
manualUp, 1, 8, 1, 7
which4, 0, 7, 3, 4
manualDown, 0, 6, 4, 2
which8, 1, 4, 5, -1
j48, 1, 4, 5, -1
jRip, 0, 3, 7, -4
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8

Page 28: Learning near optimum inspection policies

Conclusions

Page 29: Learning near optimum inspection policies

Overall

• Don't assess learners without a usage context
  – Here: context = “read less, find more”
• Some support for the Koru hypothesis
• Value of manual (up or down) questionable
  – Only outstandingly better in one data set
  – And worse than other methods in 4/10 data sets
• WHICH2
  – The general winner
  – Near optimum:
    • Min: 0%
    • Lower quartile: 67%
    • Median: 78%
    • 3rd quartile: 86%
    • Max: 99%

Still room for improvement

Page 30: Learning near optimum inspection policies

Early stopping rules (useful, a little interesting)

[Chart: “optimal” and “detector1” curves of % defective modules detected vs. % LOC read.]

Watch inspection rules to learn when enough is enough.
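A hypothetical early-stopping rule in this spirit: stop once the defect yield of the last few inspections falls below a cutoff. The window and cutoff parameters are invented for illustration, not from the talk:

```python
# Hypothetical sketch: watch inspection outcomes in order and stop when
# the recent hit rate drops below a cutoff ("when enough is enough").

def stop_point(outcomes, window=5, cutoff=0.1):
    """outcomes: 1 if the i-th inspected module had defects, else 0
    (in inspection order). Returns how many modules to inspect before
    stopping, or None if the yield never drops below the cutoff."""
    for i in range(window, len(outcomes) + 1):
        if sum(outcomes[i - window:i]) / window < cutoff:
            return i
    return None

print(stop_point([1, 1, 0, 1, 0, 0, 0, 0, 0, 0]))  # -> 9
```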

Page 31: Learning near optimum inspection policies

Learning the actual number of defects (very useful, very interesting)

[Chart: % defective modules detected vs. % LOC read. Curve1 = optimal, built from the real defects; curve2 = inspections.]

Q: Can we learn curve1 from watching the growth of curve2?

A: Maybe. WHICH2's (50%, 75%) percentiles = (79%, 86%), i.e. curve2 is getting pretty close to curve1.