Inductive Learning in Less Than One Sequential Data Scan

Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson); Shaw-hwa Lo (Columbia University)

Transcript
Page 1:

Inductive Learning in Less Than One Sequential Data Scan

Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson)

Shaw-hwa Lo (Columbia University)

Page 2:

Problems

Many inductive algorithms are main-memory based. When the dataset is bigger than available memory, the algorithm will "thrash". Efficiency is very low when thrashing happens.

For algorithms that are not memory-based: do we need to see every piece of data? Probably not. Plotting a full overfitting curve to find out is not practical.

Page 3:

Basic Idea: One Scan Algorithm

[Diagram: the dataset is read sequentially and divided into batches (Batch 1 through Batch 4); the learning algorithm trains one model from each batch, producing an ensemble of models.]
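To make the picture concrete, here is a minimal sketch (not the authors' implementation) of the one-scan training loop in Python; the helper name train_base_model and the batch_size value are illustrative assumptions.

```python
def one_scan_train(data_stream, train_base_model, batch_size=10000):
    """Train an ensemble with a single sequential pass over the data.

    data_stream: an iterator over training examples, read sequentially.
    train_base_model: any batch learner, e.g. a decision-tree inducer.
    """
    ensemble, batch = [], []
    for example in data_stream:
        batch.append(example)
        if len(batch) == batch_size:
            ensemble.append(train_base_model(batch))  # one base model per batch
            batch = []                                # the batch is never revisited
    if batch:                                         # last, possibly smaller, batch
        ensemble.append(train_base_model(batch))
    return ensemble
```

Each batch is discarded as soon as its model is built, so only one batch needs to fit in memory at a time.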

Page 4:

Loss and Benefit

Loss function: used to evaluate performance.
Benefit matrix: the inverse of a loss function.
Traditional 0-1 loss: b[x, x] = 1 and b[x, y] = 0 for y != x.
Cost-sensitive loss (credit card fraud, with a $90 overhead to investigate a fraud):
b[fraud, fraud] = $tranamt - $90
b[fraud, nonfraud] = $0
b[nonfraud, fraud] = -$90
b[nonfraud, nonfraud] = $0

Page 5:

Probabilistic Modeling

p(l|x) is the probability that x is an instance of class l.
e(l'|x) = sum over all classes l of b[l, l'] * p(l|x) is the expected benefit of predicting class l' for x.
Optimal decision: predict the class l* that maximizes e(l'|x), i.e. l* = argmax over l' of e(l'|x).

Page 6:

Example

p(fraud|x) = 0.5 and tranamt = $200:

e(fraud|x) = b[fraud, fraud] p(fraud|x) + b[nonfraud, fraud] p(nonfraud|x) = (200 - 90) x 0.5 + (-90) x 0.5 = $10

e(nonfraud|x) = b[fraud, nonfraud] p(fraud|x) + b[nonfraud, nonfraud] p(nonfraud|x) = 0 x 0.5 + 0 x 0.5 = $0 (always)

Predict fraud, since we expect to get $10 back.
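The benefit matrix and decision rule from the previous slides can be written down directly. The sketch below is illustrative (the dictionary-based representation is our assumption, not the paper's) and reproduces the $10 example above.

```python
def expected_benefit(benefit, probs, prediction):
    """e(prediction|x) = sum over true labels l of b[l, prediction] * p(l|x)."""
    return sum(benefit[(true_label, prediction)] * p
               for true_label, p in probs.items())

def optimal_decision(benefit, probs):
    """Predict the label with the highest expected benefit."""
    labels = {pred for (_, pred) in benefit}
    return max(labels, key=lambda pred: expected_benefit(benefit, probs, pred))

# The fraud example above: tranamt = $200, p(fraud|x) = 0.5.
tranamt = 200.0
benefit = {("fraud", "fraud"): tranamt - 90, ("fraud", "nonfraud"): 0.0,
           ("nonfraud", "fraud"): -90.0,     ("nonfraud", "nonfraud"): 0.0}
probs = {"fraud": 0.5, "nonfraud": 0.5}
print(expected_benefit(benefit, probs, "fraud"))  # 10.0
print(optimal_decision(benefit, probs))           # fraud
```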

Page 7:

Combining Multiple Models

Individual benefits: each base model k produces its own estimate e_k(l'|x).
Averaged benefits: E(l'|x) = (1/K) * sum over k of e_k(l'|x).
Optimal decision: predict the class l* that maximizes the averaged benefit E(l'|x).
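A minimal sketch of the averaging step, assuming each base model exposes a predict_proba(x) -> {label: probability} interface; that interface and the function names are illustrative assumptions, not the paper's API.

```python
def averaged_benefit(models, benefit, x, prediction):
    """Average the expected benefit e_k(prediction|x) over all base models."""
    total = 0.0
    for model in models:
        probs = model.predict_proba(x)          # assumed per-model class probabilities
        total += sum(benefit[(true_label, prediction)] * p
                     for true_label, p in probs.items())
    return total / len(models)

def ensemble_decision(models, benefit, x, labels):
    """Predict the label with the highest averaged expected benefit."""
    return max(labels, key=lambda pred: averaged_benefit(models, benefit, x, pred))
```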

Page 8:

How about accuracy?

Page 9:

Do we need all K models?

We stop learning as soon as k (< K) models already achieve the same accuracy as all K models would, with confidence p.

As a result, the dataset is scanned less than once.

The decision of when to stop is based on statistical sampling.

Page 10:

Less than one scan

[Diagram: as in the one-scan algorithm, the data is divided into batches (Batch 1 through Batch 4) and the algorithm trains one model per batch. After each new model it asks "accurate enough?": if No, it trains on the next batch; if Yes, it stops before reading the remaining batches.]

Page 11:

Hoeffding’s inequality

Consider a random variable whose value lies in a range of size R (for a variable bounded in [a, b], R = b - a). After n observations, its sample mean is y. What is the error of y, with confidence p, regardless of the distribution?

Hoeffding's inequality: with confidence p = 1 - delta, the true mean lies within y +/- epsilon, where epsilon = sqrt( R^2 * ln(1/delta) / (2n) ).
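As a numeric illustration (not taken from the paper), the bound can be computed directly; hoeffding_epsilon is our own helper name, and the range, number of observations, and confidence in the example are assumed values.

```python
import math

def hoeffding_epsilon(value_range, n, confidence):
    """Hoeffding error bound: with probability `confidence`, the true mean of a
    variable bounded in a range of size `value_range`, estimated from n
    observations, lies within +/- epsilon of the sample mean."""
    delta = 1.0 - confidence
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Example: benefits bounded in a $200 range, 64 observations, 99.7% confidence.
print(hoeffding_epsilon(200.0, 64, 0.997))  # about 42.6
```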

Page 12:

When can we stop?

Using the first k models, let e1 be the highest averaged expected benefit for x, with Hoeffding error eps1, and let e2 be the second highest averaged expected benefit, with Hoeffding error eps2.

The winning label stays the best choice with confidence p (i.e. the k-model prediction matches the full K-model prediction) iff

e1 - eps1 > e2 + eps2.
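A sketch of this stopping check, assuming we keep, for every label, the list of expected benefits e_k(label|x) produced by the k models built so far; the function name and data layout are ours, not the paper's.

```python
import math

def confident_to_stop(benefit_samples_by_label, confidence, value_range):
    """True if the top label's lower bound beats the runner-up's upper bound.

    benefit_samples_by_label: {label: [e_k(label|x) for each of the k models]}.
    """
    def mean_and_eps(samples):
        n = len(samples)
        delta = 1.0 - confidence
        eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
        return sum(samples) / n, eps

    stats = sorted((mean_and_eps(s) for s in benefit_samples_by_label.values()),
                   key=lambda t: t[0], reverse=True)
    (best_mean, best_eps), (second_mean, second_eps) = stats[0], stats[1]
    return best_mean - best_eps > second_mean + second_eps
```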

Page 13:

Less Than One Scan Algorithm

Iterate this check on every instance read from a validation set, until every instance has the same prediction as the full ensemble with confidence p.
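Putting the pieces together, here is a minimal sketch of the less-than-one-scan loop; it is illustrative only, and the function names and interfaces (train_base_model, accurate_enough) are assumptions rather than the paper's code.

```python
def less_than_one_scan(batches, validation_stream, train_base_model,
                       accurate_enough):
    """Sketch of the less-than-one-scan training loop.

    batches: sequential training batches, each read at most once.
    validation_stream: validation examples, read sequentially and only once;
        only the current example is kept in memory.
    accurate_enough(ensemble, x): True when the current ensemble's prediction
        on x already matches the full ensemble's prediction with the required
        confidence (e.g. the Hoeffding-based check on expected benefits).
    """
    batches = iter(batches)
    ensemble = []
    for x in validation_stream:               # one validation example at a time
        while not (ensemble and accurate_enough(ensemble, x)):
            try:
                ensemble.append(train_base_model(next(batches)))
            except StopIteration:             # all batches used: full one scan
                return ensemble
    return ensemble                            # unread batches are never scanned
```

The validation_stream can be a separate holdout set or, as the later slides explain, the training data itself; in the latter case a batch that has already been used for training is not used again for validation.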

Page 14:

Validation Set

If the confidence check fails on one example x, we do not need to examine another one, so we only need to keep one example in memory at a time.

If the first k base models' prediction on x is the same as that of all K models, it is very likely that k+1 models will also agree with the K models at the same confidence.

Page 15:

Validation Set

At any time, we only need to keep one data item x from the validation set in memory. It is read sequentially from the validation set, and the validation set is read only once.

What can serve as a validation set? The training set itself, or a separate holdout set.

Page 16:

Amount of Data Scan

Training set: scanned at most once. Validation set: scanned once.

When the training set is used as the validation set: once we decide to train a model from a batch, that batch is not used for validation again.

How much of the data is used to train models? Less than one full scan.

Page 17:

Experiments

Donation dataset:

Total benefits = total charity donated minus the overhead of sending solicitations.

Page 18:

Experiment Setup

Inductive learners: C4.5, RIPPER, and Naive Bayes (NB).

Number of base models: {8, 16, 32, 64, 128, 256}; the reported results are their average.

Page 19:

Baseline Results (with C4.5)

Single model: $13292.7. Complete one scan: $14702.9 (the average over ensembles of {8, 16, 32, 64, 128, 256} models).

The one-scan ensemble is actually about $1410 higher than the single model.

Page 20:

Less-than-one scan (with C4.5)

Full one scan: $14702. Less-than-one scan: $14828, actually a little higher (by $126).

How much of the data is scanned at 99.7% confidence? About 71%.

Page 21:

Other datasets

Credit card fraud detection:

Total benefits = recovered fraud amount minus the overhead of investigation.

Page 22:

Results

Baseline single model: $733980 (with curtailed probability).

One scan ensemble: $804964. Less than one scan: $804914. Amount of data scanned: 64%.

Page 23:

Smoothing effect.

Page 24:

Related Work

Ensembles:
Meta-learning (Chan and Stolfo): 2 scans.
Bagging (Breiman) and AdaBoost (Freund and Schapire): multiple scans.

Use of Hoeffding's inequality:
Aggregate queries (Hellerstein et al.).
Streaming decision trees (Hulten and Domingos): a single decision tree, less than one scan.

Scalable decision trees:
SPRINT (Shafer et al.): multiple scans.
BOAT (Gehrke et al.): 2 scans.

Page 25:

Conclusion

Both “one scan” and “less than one scan” achieve accuracy similar to or higher than that of the single model.

“Less than one scan” uses approximately 60% to 90% of the data for training, without loss of accuracy.