Fast Prediction of New Feature Utility

Hoyt Koepke, Misha Bilenko

Feb 23, 2016

Transcript
Page 1: Fast Prediction of New Feature Utility

Fast Prediction of New Feature Utility

Hoyt Koepke, Misha Bilenko

Page 2: Fast Prediction of New Feature Utility

Machine Learning in Practice

To improve accuracy, we can improve:
– Training
– Supervision
– Features

A typical workflow:
• Problem formulated as a prediction task
• Implement learner, get supervision
• Design, refine features
• Train, validate, ship

Page 3: Fast Prediction of New Feature Utility

Improving Accuracy By Improving

• Training
– Algorithms, objectives/losses, hyper-parameters, …

• Supervision
– Cleaning, labeling, sampling, semi-supervised

• Representation: refine/induce/add new features
– Most ML engineering for mature applications happens here!
– Process: let’s try this new extractor/data stream/transform/…
• Manual or automatic [feature induction: Della Pietra et al. ’97]

Page 4: Fast Prediction of New Feature Utility

Evaluating New Features

• Standard procedure:
– Add features, re-run train/test/CV, hope accuracy improves (sketched below)

• In many applications, this is costly
– Computationally: full re-training is expensive
– Monetarily: there is a cost per feature-value (must check on a small sample)
– Logistically: infrastructure is pipelined, non-trivial, under-documented

• Goal: efficiently check whether a new feature can improve accuracy without retraining
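
A minimal sketch of this standard baseline, assuming scikit-learn; the helper name and model choice are mine, not from the slides:

```python
# Standard (costly) evaluation: retrain with the candidate feature
# added and compare cross-validated accuracy.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def utility_by_retraining(X, y, new_feature, cv=5):
    """CV-accuracy gain from appending `new_feature` to `X`."""
    base = cross_val_score(GradientBoostingClassifier(), X, y, cv=cv).mean()
    X_aug = np.column_stack([X, new_feature])
    aug = cross_val_score(GradientBoostingClassifier(), X_aug, y, cv=cv).mean()
    return aug - base  # > 0 suggests the feature helps
```

Every candidate feature triggers a full round of training here, which is exactly the cost the talk wants to avoid.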

Page 5: Fast Prediction of New Feature Utility

Feature Relevance vs. Feature Selection

• Selection objective: removing existing features

• Relevance objective: decide if a new feature is worth adding

• Most feature selection methods either use re-training or estimate each feature’s standalone association with the label

• Feature relevance requires estimating the feature’s incremental utility given the current predictor (contrasted below)
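
One way to make the contrast concrete; the notation here is reconstructed, not verbatim from the slides:

```latex
% Selection: score an existing feature x_j on its own,
% e.g. by mutual information with the label (one common choice):
\mathrm{score}_{\mathrm{sel}}(x_j) = I(x_j;\, y)

% Relevance: ask whether a new feature f adds signal beyond the
% current predictor h, i.e. whether expected loss can be reduced:
\exists\, h' \;\text{ s.t. }\;
\mathbb{E}\big[\ell\big(y,\, h'(x, f)\big)\big]
< \mathbb{E}\big[\ell\big(y,\, h(x)\big)\big]
```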

Page 6: Fast Prediction of New Feature Utility

Formalizing New Feature Relevance

• Supervised learning setting
– Training set $\{(x_i, y_i)\}_{i=1}^n$
– Current predictor $h$, trained to (locally) minimize expected loss $\mathbb{E}[\ell(y, h(x))]$
– New feature $f(x)$

Page 7: Fast Prediction of New Feature Utility

Formalizing New Feature Relevance

• Supervised learning setting
– Training set $\{(x_i, y_i)\}_{i=1}^n$
– Current predictor $h$, trained to (locally) minimize expected loss $\mathbb{E}[\ell(y, h(x))]$
– New feature $f(x)$

• Hypothesis: can a better predictor be learned with the new feature?

$$\exists\, h' \;\text{ s.t. }\; \mathbb{E}\big[\ell(y, h'(x, f))\big] < \mathbb{E}\big[\ell(y, h(x))\big]$$

• Too general. Instead, let’s test an additive form:

$$\exists\, \psi \;\text{ s.t. }\; \mathbb{E}\big[\ell(y, h(x) + \psi(f(x)))\big] < \mathbb{E}\big[\ell(y, h(x))\big]$$

For efficiency, we can just test a scalar step along the feature (a squared-loss example follows):

$$\exists\, \eta \neq 0 \;\text{ s.t. }\; \mathbb{E}\big[\ell(y, h(x) + \eta f(x))\big] < \mathbb{E}\big[\ell(y, h(x))\big]$$
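
For intuition, the squared-loss special case works out cleanly; this worked example is consistent with the test above but not verbatim from the slides:

```latex
% Squared loss: \ell(y, u) = \tfrac{1}{2}(y - u)^2, so the loss
% gradient at the current predictor is the negative residual:
\frac{\partial \ell(y, u)}{\partial u}\Big|_{u = h(x)} = h(x) - y

% A small step h + \eta f changes the expected loss, to first order, by
\eta\, \mathbb{E}\big[(h(x) - y)\, f(x)\big]

% so some \eta \neq 0 decreases the loss exactly when the new feature
% correlates with the residuals y - h(x).
```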

Page 8: Fast Prediction of New Feature Utility

Hypothesis Test for New Feature Relevance

• We want to test whether $f$ has incremental signal:

$$\exists\, \eta \neq 0 \;\text{ s.t. }\; \mathbb{E}\big[\ell(y, h(x) + \eta f(x))\big] < \mathbb{E}\big[\ell(y, h(x))\big]$$

• Intuition: the loss gradient tells us how to improve the predictor
• Consider the functional loss gradient, evaluated at the current predictor:

$$g(x, y) = \frac{\partial \ell(y, u)}{\partial u}\Big|_{u = h(x)}$$

– Since $h$ is locally optimal, the gradient is orthogonal to every direction available to the current model: no descent direction exists there

• Theorem: under reasonable assumptions, the hypothesis above is equivalent to:

$$\rho = \mathrm{corr}\big(f(x),\, -g(x, y)\big) > 0$$

where $g$ is the loss gradient just defined (a code sketch follows).
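
A minimal sketch of this correlation score in code; the function names are mine, and the logistic case assumes the gradient is taken with respect to the logit:

```python
# Correlate a candidate feature with the negative loss gradient
# of the current predictor -- no retraining involved.
import numpy as np

def loss_gradient(y, pred, loss="squared"):
    """d loss / d prediction, evaluated at the current predictions."""
    if loss == "squared":    # l(y, u) = 0.5 * (y - u)^2
        return pred - y
    if loss == "logistic":   # y in {0,1}, pred = sigmoid(logit); gradient w.r.t. the logit
        return pred - y
    raise ValueError(f"unknown loss: {loss}")

def relevance_correlation(feature, y, pred, loss="squared"):
    """Correlation between the new feature and the negative gradient."""
    g = loss_gradient(y, pred, loss)
    return np.corrcoef(feature, -g)[0, 1]
```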

Page 9: Fast Prediction of New Feature Utility

Hypothesis Test for New Feature Relevance

$$\rho = \mathrm{corr}\big(f(x),\, -g(x, y)\big) > 0$$

• Intuition: can $f$ yield a descent direction in functional space?
• Why this is cool:

Testing new feature relevance for a broad class of losses ⟺ testing correlation between the feature and the normalized loss gradient


Page 11: Fast Prediction of New Feature Utility

Testing Correlation to Loss Gradient

• We don’t have a consistent test for $\rho > 0$ directly
…but $\mathbb{E}[g] = 0$ ($h$ is locally optimal), so the above is equivalent to:

$$\exists\, \psi \;\text{ s.t. }\; \mathrm{corr}\big(\psi(f(x)),\, -g(x, y)\big) > 0$$

…for which we can design a consistent bootstrap test!

• Intuition
– We need to test if we can train a regressor from the feature to the negative gradient
– We want the test to be as powerful as possible and to work on small samples
Q: How do we distinguish between true correlation and overfitting?
A: We correct by the correlation obtained from independent bootstrap samples (illustrated below)
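
A toy demonstration of why the correction is needed; this demo is my own, not from the talk:

```python
# Even a pure-noise feature shows positive in-sample correlation
# after fitting a flexible regressor -- pure overfitting, which is
# what the bootstrap null distribution corrects for.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
neg_grad = rng.normal(size=n)        # stand-in for -g on a small sample
noise_feature = rng.normal(size=n)   # feature with no true signal

reg = DecisionTreeRegressor(max_depth=3).fit(noise_feature[:, None], neg_grad)
pred = reg.predict(noise_feature[:, None])
print(np.corrcoef(pred, neg_grad)[0, 1])  # noticeably > 0 despite no signal
```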

Page 12: Fast Prediction of New Feature Utility

New Feature Relevance: Algorithm

(1) Train a best-fit regressor from the new feature to the negative loss gradient
– Compute the correlation between predictions and targets

(2) Repeat B times:
a) Draw independent bootstrap samples of the feature and of the gradient
b) Train a best-fit regressor, compute the correlation

(3) Score: the correlation from (1), corrected by the null correlations from (2)

(A runnable sketch of these steps follows.)
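
A minimal end-to-end sketch of the three steps; the variable names, regressor choice, and mean-subtraction correction are my assumptions, not specifics from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def feature_relevance_score(feature, neg_grad, n_boot=50, seed=0):
    """Correlation to the negative loss gradient, corrected by a bootstrap null."""
    rng = np.random.default_rng(seed)
    X = np.asarray(feature)[:, None]
    y = np.asarray(neg_grad)
    n = len(y)

    def fit_corr(Xs, ys):
        pred = DecisionTreeRegressor(max_depth=3).fit(Xs, ys).predict(Xs)
        return np.corrcoef(pred, ys)[0, 1]

    observed = fit_corr(X, y)               # step (1): best-fit regressor

    null = []                               # step (2): bootstrap null
    for _ in range(n_boot):
        i = rng.integers(0, n, size=n)      # resample the feature...
        j = rng.integers(0, n, size=n)      # ...independently of the gradient
        null.append(fit_corr(X[i], y[j]))

    return observed - np.mean(null)         # step (3): corrected score
```

Scores near zero indicate no detectable incremental signal; clearly positive scores flag features worth a full retraining run.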


Page 14: Fast Prediction of New Feature Utility

Connection to Boosting

• AnyBoost/gradient boosting uses the same additive form:
– a single step $h(x) + \eta f(x)$ vs. boosting’s repeated steps $h(x) + \sum_t \eta_t f_t(x)$
– Gradient vs. coordinate descent in functional space

• AnyBoost/GB: a generalization to repeated, greedy updates

• This work: a consistent hypothesis test for the feasibility of such a step
– Statistical stopping criteria for boosting?
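
The connection in code: one gradient-boosting round fits a weak learner to the same negative gradient that the relevance test scores. This is a generic sketch for squared loss, not the talk’s implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_round(X, y, pred, learning_rate=0.1):
    """One functional-gradient step: fit the residuals, move along the fit."""
    neg_grad = y - pred  # for squared loss, -g is simply the residual
    weak = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
    return pred + learning_rate * weak.predict(X), weak
```

Boosting takes the step unconditionally; the relevance test instead asks whether a statistically significant step exists for a given feature.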

Page 15: Fast Prediction of New Feature Utility

Experimental Validation

• Natural methodology: compare to full re-training
• For each feature $f$:
– Actual utility: accuracy gain from re-training with $f$ added
– Predicted utility: the relevance score, computed without re-training

• We are mainly interested in high-utility features
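
Sketch of this validation loop, reusing the helpers sketched earlier in this transcript; `candidate_features`, `X`, `y`, and `pred` are hypothetical placeholders:

```python
# Compare the fast relevance score against the slow retraining baseline.
results = []
for name, feature in candidate_features.items():   # hypothetical dict of candidates
    predicted = feature_relevance_score(feature, -loss_gradient(y, pred))
    actual = utility_by_retraining(X, y, feature)   # full retraining (slow)
    results.append((name, predicted, actual))
# Agreement between the two columns, especially among high-utility
# features, is what the experiments measure.
```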

Page 16: Fast Prediction of New Feature Utility

Datasets

• WebSearch: each “feature” is a signal source
– E.g., the “Body” source defines all features that depend on the document body
• Signal source examples: AnchorText, ClickLog, etc.

Page 17: Fast Prediction of New Feature Utility

Results: Adult

Page 18: Fast Prediction of New Feature Utility

Results: Housing

Page 19: Fast Prediction of New Feature Utility

Results: WebSearch

Page 20: Fast Prediction of New Feature Utility

Comparison to Feature Selection

Page 21: Fast Prediction of New Feature Utility

New Feature Relevance: Summary

• Evaluating new features by re-training can be costly
– Computationally, financially, logistically

• Fast alternative: testing correlation to the loss gradient
• Black-box algorithm: regression for (almost) any loss!
• Just one approach, lots of future work:
– Alternatives to hypothesis testing: info-theory, optimization, …
– Semi-supervised methods
– Back to feature selection?
– Removing black-box assumptions