Inductive Learning in Less Than One Sequential Data Scan

Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson); Shaw-hwa Lo (Columbia University)

Transcript
Page 1:

Inductive Learning in Less Than One Sequential Data Scan

Wei Fan, Haixun Wang, and Philip S. Yu (IBM T.J. Watson)

Shaw-hwa Lo (Columbia University)

Page 2:

Problems

Many inductive algorithms are main-memory based. When the dataset is bigger than available memory, the algorithm will "thrash". Efficiency is very low when thrashing happens.

For algorithms that are not memory-based: do we need to see every piece of data? Probably not. Plotting a full overfitting curve to find out is not practical.

Page 3:

Basic Idea: One Scan Algorithm

[Diagram: the dataset is read sequentially and divided into batches (Batch 1 through Batch 4); the learning algorithm trains one model from each batch, producing an ensemble of models.]
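To make the picture concrete, here is a minimal sketch (not the authors' implementation) of the one-scan training loop in Python; the helper name train_base_model and the batch_size value are illustrative assumptions.

```python
def one_scan_train(data_stream, train_base_model, batch_size=10000):
    """Train an ensemble with a single sequential pass over the data.

    data_stream: an iterator over training examples, read sequentially.
    train_base_model: any batch learner, e.g. a decision-tree inducer.
    """
    ensemble, batch = [], []
    for example in data_stream:
        batch.append(example)
        if len(batch) == batch_size:
            ensemble.append(train_base_model(batch))  # one base model per batch
            batch = []                                # the batch is never revisited
    if batch:                                         # last, possibly smaller, batch
        ensemble.append(train_base_model(batch))
    return ensemble
```

Each batch is discarded as soon as its model is built, so only one batch needs to fit in memory at a time.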

Page 4:

Loss and Benefit

Loss function: used to evaluate performance.
Benefit matrix: the inverse of a loss function.
Traditional 0-1 loss: b[x, x] = 1 and b[x, y] = 0 for y != x.
Cost-sensitive loss (credit card fraud, with a $90 overhead to investigate a fraud):
b[fraud, fraud] = $tranamt - $90
b[fraud, nonfraud] = $0
b[nonfraud, fraud] = -$90
b[nonfraud, nonfraud] = $0

Page 5:

Probabilistic Modeling

p(l|x) is the probability that x is an instance of class l.
e(l'|x) = sum over all classes l of b[l, l'] * p(l|x) is the expected benefit of predicting class l' for x.
Optimal decision: predict the class l* that maximizes e(l'|x), i.e. l* = argmax over l' of e(l'|x).

Page 6:

Example

p(fraud|x) = 0.5 and tranamt = $200:

e(fraud|x) = b[fraud, fraud] p(fraud|x) + b[nonfraud, fraud] p(nonfraud|x) = (200 - 90) x 0.5 + (-90) x 0.5 = $10

e(nonfraud|x) = b[fraud, nonfraud] p(fraud|x) + b[nonfraud, nonfraud] p(nonfraud|x) = 0 x 0.5 + 0 x 0.5 = $0 (always)

Predict fraud, since we expect to get $10 back.
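The benefit matrix and decision rule from the previous slides can be written down directly. The sketch below is illustrative (the dictionary-based representation is our assumption, not the paper's) and reproduces the $10 example above.

```python
def expected_benefit(benefit, probs, prediction):
    """e(prediction|x) = sum over true labels l of b[l, prediction] * p(l|x)."""
    return sum(benefit[(true_label, prediction)] * p
               for true_label, p in probs.items())

def optimal_decision(benefit, probs):
    """Predict the label with the highest expected benefit."""
    labels = {pred for (_, pred) in benefit}
    return max(labels, key=lambda pred: expected_benefit(benefit, probs, pred))

# The fraud example above: tranamt = $200, p(fraud|x) = 0.5.
tranamt = 200.0
benefit = {("fraud", "fraud"): tranamt - 90, ("fraud", "nonfraud"): 0.0,
           ("nonfraud", "fraud"): -90.0,     ("nonfraud", "nonfraud"): 0.0}
probs = {"fraud": 0.5, "nonfraud": 0.5}
print(expected_benefit(benefit, probs, "fraud"))  # 10.0
print(optimal_decision(benefit, probs))           # fraud
```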

Page 7:

Combining Multiple Models

Individual benefits: each base model k produces its own estimate e_k(l'|x).
Averaged benefits: E(l'|x) = (1/K) * sum over k of e_k(l'|x).
Optimal decision: predict the class l* that maximizes the averaged benefit E(l'|x).
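A minimal sketch of the averaging step, assuming each base model exposes a predict_proba(x) -> {label: probability} interface; that interface and the function names are illustrative assumptions, not the paper's API.

```python
def averaged_benefit(models, benefit, x, prediction):
    """Average the expected benefit e_k(prediction|x) over all base models."""
    total = 0.0
    for model in models:
        probs = model.predict_proba(x)          # assumed per-model class probabilities
        total += sum(benefit[(true_label, prediction)] * p
                     for true_label, p in probs.items())
    return total / len(models)

def ensemble_decision(models, benefit, x, labels):
    """Predict the label with the highest averaged expected benefit."""
    return max(labels, key=lambda pred: averaged_benefit(models, benefit, x, pred))
```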

Page 8:

How about accuracy?

Page 9:

Do we need all K models?

We stop learning as soon as k (< K) models already achieve the same accuracy as all K models would, with confidence p.

As a result, the dataset is scanned less than once.

The decision of when to stop is based on statistical sampling.

Page 10:

Less than one scan

[Diagram: as in the one-scan algorithm, the data is divided into batches (Batch 1 through Batch 4) and the algorithm trains one model per batch. After each new model it asks "accurate enough?": if No, it trains on the next batch; if Yes, it stops before reading the remaining batches.]

Page 11:

Hoeffding’s inequality

Consider a random variable whose value lies in a range of size R (for a variable bounded in [a, b], R = b - a). After n observations, its sample mean is y. What is the error of y, with confidence p, regardless of the distribution?

Hoeffding's inequality: with confidence p = 1 - delta, the true mean lies within y +/- epsilon, where epsilon = sqrt( R^2 * ln(1/delta) / (2n) ).
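As a numeric illustration (not taken from the paper), the bound can be computed directly; hoeffding_epsilon is our own helper name, and the range, number of observations, and confidence in the example are assumed values.

```python
import math

def hoeffding_epsilon(value_range, n, confidence):
    """Hoeffding error bound: with probability `confidence`, the true mean of a
    variable bounded in a range of size `value_range`, estimated from n
    observations, lies within +/- epsilon of the sample mean."""
    delta = 1.0 - confidence
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Example: benefits bounded in a $200 range, 64 observations, 99.7% confidence.
print(hoeffding_epsilon(200.0, 64, 0.997))  # about 42.6
```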

Page 12:

When can we stop?

Using the first k models, let e1 be the highest averaged expected benefit for x, with Hoeffding error eps1, and let e2 be the second highest averaged expected benefit, with Hoeffding error eps2.

The winning label stays the best choice with confidence p (i.e. the k-model prediction matches the full K-model prediction) iff

e1 - eps1 > e2 + eps2.
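A sketch of this stopping check, assuming we keep, for every label, the list of expected benefits e_k(label|x) produced by the k models built so far; the function name and data layout are ours, not the paper's.

```python
import math

def confident_to_stop(benefit_samples_by_label, confidence, value_range):
    """True if the top label's lower bound beats the runner-up's upper bound.

    benefit_samples_by_label: {label: [e_k(label|x) for each of the k models]}.
    """
    def mean_and_eps(samples):
        n = len(samples)
        delta = 1.0 - confidence
        eps = math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
        return sum(samples) / n, eps

    stats = sorted((mean_and_eps(s) for s in benefit_samples_by_label.values()),
                   key=lambda t: t[0], reverse=True)
    (best_mean, best_eps), (second_mean, second_eps) = stats[0], stats[1]
    return best_mean - best_eps > second_mean + second_eps
```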

Page 13:

Less Than One Scan Algorithm

Iterate this check on every instance read from a validation set, until every instance has the same prediction as the full ensemble with confidence p.
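Putting the pieces together, here is a minimal sketch of the less-than-one-scan loop; it is illustrative only, and the function names and interfaces (train_base_model, accurate_enough) are assumptions rather than the paper's code.

```python
def less_than_one_scan(batches, validation_stream, train_base_model,
                       accurate_enough):
    """Sketch of the less-than-one-scan training loop.

    batches: sequential training batches, each read at most once.
    validation_stream: validation examples, read sequentially and only once;
        only the current example is kept in memory.
    accurate_enough(ensemble, x): True when the current ensemble's prediction
        on x already matches the full ensemble's prediction with the required
        confidence (e.g. the Hoeffding-based check on expected benefits).
    """
    batches = iter(batches)
    ensemble = []
    for x in validation_stream:               # one validation example at a time
        while not (ensemble and accurate_enough(ensemble, x)):
            try:
                ensemble.append(train_base_model(next(batches)))
            except StopIteration:             # all batches used: full one scan
                return ensemble
    return ensemble                            # unread batches are never scanned
```

The validation_stream can be a separate holdout set or, as the later slides explain, the training data itself; in the latter case a batch that has already been used for training is not used again for validation.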

Page 14:

Validation Set

If the confidence check fails on one example x, we do not need to examine another one, so we only need to keep one example in memory at a time.

If the first k base models' prediction on x is the same as that of all K models, it is very likely that k+1 models will also agree with the K models at the same confidence.

Page 15:

Validation Set

At any time, we only need to keep one data item x from the validation set in memory. It is read sequentially from the validation set, and the validation set is read only once.

What can serve as a validation set? The training set itself, or a separate holdout set.

Page 16:

Amount of Data Scan

Training set: scanned at most once. Validation set: scanned once.

When the training set is used as the validation set: once we decide to train a model from a batch, that batch is not used for validation again.

How much of the data is used to train models? Less than one full scan.

Page 17:

Experiments

Donation dataset:

Total benefits = total charity donated minus the overhead of sending solicitations.

Page 18:

Experiment Setup

Inductive learners: C4.5, RIPPER, and Naive Bayes (NB).

Number of base models: {8, 16, 32, 64, 128, 256}; the reported results are their average.

Page 19:

Baseline Results (with C4.5)

Single model: $13292.7. Complete one scan: $14702.9 (the average over ensembles of {8, 16, 32, 64, 128, 256} models).

The one-scan ensemble is actually about $1410 higher than the single model.

Page 20:

Less-than-one scan (with C4.5)

Full one scan: $14702. Less-than-one scan: $14828, actually a little higher (by $126).

How much of the data is scanned at 99.7% confidence? About 71%.

Page 21:

Other datasets

Credit card fraud detection:

Total benefits = recovered fraud amount minus the overhead of investigation.

Page 22:

Results

Baseline single model: $733980 (with curtailed probability).

One scan ensemble: $804964. Less than one scan: $804914. Amount of data scanned: 64%.

Page 23:

Smoothing effect.

Page 24:

Related Work

Ensembles:
Meta-learning (Chan and Stolfo): 2 scans.
Bagging (Breiman) and AdaBoost (Freund and Schapire): multiple scans.

Use of Hoeffding's inequality:
Aggregate queries (Hellerstein et al.).
Streaming decision trees (Hulten and Domingos): a single decision tree, less than one scan.

Scalable decision trees:
SPRINT (Shafer et al.): multiple scans.
BOAT (Gehrke et al.): 2 scans.

Page 25:

Conclusion

Both “one scan” and “less than one scan” achieve accuracy similar to or higher than that of the single model.

“Less than one scan” uses approximately 60% to 90% of the data for training, without loss of accuracy.