Transcript
Page 1: Boosting

Boosting

Shai Raffaeli

Seminar in mathematical biology

http://www1.cs.columbia.edu/~freund/

Page 2: Boosting

Toy Example

• Computer receives telephone call

• Measures Pitch of voice

• Decides gender of caller

[Diagram: a human voice signal is classified as Male or Female.]

Page 3: Boosting

Generative modeling

[Figure: probability vs. voice pitch; two fitted distributions, one per gender, with parameters (mean1, var1) and (mean2, var2).]

Page 4: Boosting

Discriminative approach

[Figure: no. of mistakes vs. voice pitch; the discriminative approach picks the pitch threshold directly by the number of mistakes it makes.]

Page 5: Boosting

Ill-behaved data

[Figure: for ill-behaved data, the fitted parameters mean1 and mean2 suggest a poor threshold, while choosing the threshold by the no. of mistakes on voice pitch still works.]

Page 6: Boosting

Traditional Statistics vs. Machine Learning

[Diagram: Statistics maps Data to an Estimated world state; Decision Theory maps that estimate to Actions/Predictions; Machine Learning maps Data directly to Predictions/Actions.]

Page 7: Boosting
Page 8: Boosting

A weighted training set

Feature vectors

Binary labels {-1,+1}

Positive weights

(x1, y1, w1), (x2, y2, w2), …, (xm, ym, wm)

Page 9: Boosting

A weak learner

The weak learner receives a weighted training set

(x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn)

(feature vectors xi, binary labels yi in {-1,+1}, non-negative weights wi that sum to 1) and outputs a weak rule h, which maps instances x1, x2, x3, …, xn to predicted labels.

The weak requirement: h must do slightly better than random guessing on the weighted set, i.e. its weighted training error is at most 1/2 − γ for some γ > 0. (A code sketch of such a learner follows.)
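A concrete example of such a weak learner is a decision stump chosen by exhaustive search over features and thresholds. This is a minimal sketch of our own (not the slide's code); the weak requirement then amounts to checking that the returned error is at most 1/2 − γ for some γ > 0.

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Exhaustive search over one-feature threshold rules ("decision stumps").
    X: (n, d) feature matrix; y: labels in {-1, +1}; w: non-negative weights summing to 1.
    Returns the stump with the smallest weighted training error, and that error."""
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):                       # feature to threshold on
        for thresh in np.unique(X[:, j]):    # candidate thresholds
            for sign in (+1, -1):            # direction of the rule
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = w[pred != y].sum()     # weighted error of this stump
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    rule = lambda X_new: np.where(X_new[:, j] > thresh, sign, -sign)
    return rule, best_err
```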

Page 10: Boosting

The boosting process

Start with uniform weights: (x1, y1, 1/n), …, (xn, yn, 1/n); the weak learner returns h1.

Re-weight the examples and call the weak learner again to get h2; repeat on the re-weighted sets (x1, y1, w1), …, (xn, yn, wn) to get h3, h4, …, hT.

Final rule: Sign[ α1·h1 + α2·h2 + … + αT·hT ]

Page 11: Boosting

Adaboost

• Binary labels y ∈ {-1,+1}

• margin(x,y) = y · Σt αt ht(x)

• P(x,y) = (1/Z) exp(-margin(x,y))

• Given ht, we choose αt to minimize Σ(x,y) exp(-margin(x,y))

Page 12: Boosting

Adaboost

$$F_0(x) \equiv 0$$

For t = 1 … T:

$$w_i^t = \exp\big(-y_i F_{t-1}(x_i)\big)$$

Get $h_t$ from the weak learner

$$\alpha_t = \tfrac{1}{2}\ln\frac{\sum_{i:\,h_t(x_i)=1,\; y_i=1} w_i^t}{\sum_{i:\,h_t(x_i)=1,\; y_i=-1} w_i^t}$$

$$F_{t+1} = F_t + \alpha_t h_t$$

Freund, Schapire 1997
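Read literally, this update fits in a few lines of code. Below is an illustrative sketch of our own (reusing the stump_weak_learner sketch from Page 9), not the authors' implementation. The slide's sums run over examples with ht(xi)=1, which is the right formula for "specialist" rules that fire only on part of the space; for weak rules valued in {-1,+1}, as used here, the equivalent line-search formula splits the weight into correctly and incorrectly classified parts.

```python
import numpy as np

def adaboost(X, y, T, weak_learner):
    """Minimal AdaBoost loop following the slide:
    F_0 = 0;  w_i^t = exp(-y_i F_{t-1}(x_i));  F_{t+1} = F_t + alpha_t h_t.
    X: (n, d) array; y: numpy array of labels in {-1, +1}."""
    n = len(y)
    F = np.zeros(n)                    # F_{t-1} evaluated on the training points
    rules, alphas = [], []
    for t in range(T):
        w = np.exp(-y * F)
        w /= w.sum()                   # normalize: non-negative weights summing to 1
        h, _ = weak_learner(X, y, w)
        pred = h(X)
        w_plus = w[pred == y].sum()    # weight of correctly classified examples
        w_minus = w[pred != y].sum()   # weight of misclassified examples
        alpha = 0.5 * np.log(w_plus / (w_minus + 1e-12))
        F += alpha * pred
        rules.append(h)
        alphas.append(alpha)
    # Final rule: Sign[ alpha_1 h_1 + ... + alpha_T h_T ]
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, rules)))
```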

Page 13: Boosting

Main property of adaboost

• If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the in-sample (training) error of the final rule decreases exponentially fast in Σt γt² (the exact bound is given below).
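For reference (the formula on the original slide is an image), the standard statement of this training-error bound from Freund & Schapire 1997 is:

$$\Pr_{i\sim\text{train}}\big[\mathrm{sign}(F_T(x_i)) \ne y_i\big] \;\le\; \prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)$$

where γt is the advantage of ht over random guessing (weighted error 1/2 − γt).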

Page 14: Boosting

Adaboost as gradient descent

• Discriminator class: a linear discriminator in the space of “weak hypotheses”

• Original goal: find the hyperplane with the smallest number of mistakes – known to be an NP-hard problem (no algorithm that runs in time polynomial in d, where d is the dimension of the space).

• Computational method: Use exponential loss as a surrogate, perform gradient descent.
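One useful way to see the coordinate step: given ht, the coefficient αt is an exact line search on the exponential loss. A short derivation of our own, assuming ht(x) ∈ {−1,+1}:

$$\sum_i e^{-y_i\left(F_{t-1}(x_i)+\alpha h_t(x_i)\right)} \;=\; \sum_i w_i^t\, e^{-\alpha\, y_i h_t(x_i)} \;=\; e^{-\alpha} W_+ + e^{\alpha} W_-,\qquad W_{\pm}=\sum_{i:\,y_i h_t(x_i)=\pm 1} w_i^t$$

$$\frac{d}{d\alpha}\left(e^{-\alpha} W_+ + e^{\alpha} W_-\right)=0 \;\Longrightarrow\; \alpha_t=\tfrac12\ln\frac{W_+}{W_-}$$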

Page 15: Boosting

Margins view

x, w ∈ Rⁿ; y ∈ {-1,+1}. Prediction = sign(w·x). Margin = y·(w·x).

[Figure: + and − examples in the plane with weight vector w; projecting the examples onto w, correct predictions have positive margin and mistakes have negative margin. Plot: cumulative # of examples vs. margin, with mistakes to the left of zero and correct predictions to the right.]

Page 16: Boosting

Adaboost et al.

[Figure: loss as a function of the margin y·(w·x): the 0-1 loss (mistakes to the left of zero, correct to the right) and its surrogates – the Adaboost loss exp(-y·(w·x)), the Logitboost loss, and the Brownboost loss.]
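A small sketch of our own comparing these losses as functions of the margin; the Brownboost loss, which depends on an additional time parameter, is omitted:

```python
import numpy as np

def zero_one_loss(margin):
    """0-1 loss: 1 for mistakes (margin <= 0), 0 for correct predictions."""
    return 1.0 if margin <= 0 else 0.0

def adaboost_loss(margin):
    """Adaboost's exponential loss: exp(-y (w . x))."""
    return np.exp(-margin)

def logitboost_loss(margin):
    """Logitboost's logistic loss (up to scaling): log(1 + exp(-margin))."""
    return np.log1p(np.exp(-margin))

for m in np.linspace(-2.0, 2.0, 9):
    print(f"margin={m:+.1f}  0-1={zero_one_loss(m):.0f}  "
          f"exp={adaboost_loss(m):.3f}  logistic={logitboost_loss(m):.3f}")
```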

Page 17: Boosting

One coordinate at a time

• Adaboost performs gradient descent on the exponential loss.

• Adds one coordinate ("weak learner") at each iteration.

• Weak learning in binary classification = slightly better than random guessing.

• Weak learning in regression – unclear.

• Uses example weights to communicate the gradient direction to the weak learner.

• Solves a computational problem.

Page 18: Boosting

What is a good weak learner?

• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.

• Small enough to allow exhaustive search for the minimal weighted training error.

• Small enough to avoid over-fitting.

• Should be able to calculate the predicted label very efficiently.

• Rules can be "specialists" – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).

Page 19: Boosting

Decision Trees

[Figure: a decision tree and the partition of the (X,Y) plane it induces. Root split X>3: no → predict -1; yes → split Y>5: no → predict -1, yes → predict +1.]

Page 20: Boosting

Decision tree as a sum

[Figure: the same tree rewritten as a sum of prediction values: a root prediction of -0.2; the X>3 split adds -0.1 (no) or +0.1 (yes); under the yes branch, the Y>5 split adds -0.3 (no) or +0.2 (yes). The output is the sign of the sum along the path, reproducing the labels -1, -1, +1.]

Page 21: Boosting

An alternating decision tree

[Figure: the sum-of-predictions tree extended with one more splitter node attached directly to the root: Y<1 adds +0.7 (yes) or 0.0 (no). The prediction is the sign of the sum of all prediction nodes along the paths consistent with the example, giving the region labels +1, -1, -1, +1 in the (X,Y) plane.]

Page 22: Boosting

Example: Medical Diagnostics

• Cleve dataset from the UC Irvine database.

• Heart disease diagnostics (+1 = healthy, -1 = sick).

• 13 features from tests (real-valued and discrete).

• 303 instances.

Page 23: Boosting

ADtree for the Cleveland heart-disease diagnostics problem

Page 24: Boosting

Cross-validated accuracy

Learning algorithm | Number of splits | Average test error | Test error variance
ADtree             | 6                | 17.0%              | 0.6%
C5.0               | 27               | 27.2%              | 0.5%
C5.0 + boosting    | 446              | 20.2%              | 0.5%
Boost Stumps       | 16               | 16.5%              | 0.8%

Page 25: Boosting
Page 26: Boosting

Curious phenomenon

Boosting decision trees

Using <10,000 training examples we fit >2,000,000 parameters

Page 27: Boosting

Explanation using margins

[Figure: distribution of training-example margins, with the 0-1 loss shown for reference.]

Page 28: Boosting

Explanation using margins

[Figure: distribution of training-example margins and the 0-1 loss.]

No examples with small margins!!

Page 29: Boosting

Experimental Evidence

Page 30: Boosting

Theorem

For any convex combination of weak rules and any threshold θ:

probability of mistake ≤ fraction of training examples with small margin (≤ θ) + a term depending on the size of the training sample and the VC dimension of the weak rules.

No dependence on the number of weak rules that are combined!!!

Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998 (precise statement on Page 63).

Page 31: Boosting

Suggested optimization problem

[Figure: the suggested optimization problem, trading off the margin θ against the complexity term involving the sample size m and the VC dimension d.]

Page 32: Boosting

Idea of Proof

Page 33: Boosting
Page 34: Boosting

Applications of Boosting

• Academic research

• Applied research

• Commercial deployment

Page 35: Boosting

Academic research

% test error rates:

Database  | Other            | Boosting | Error reduction
Cleveland | 27.2 (DT)        | 16.5     | 39%
Promoters | 22.0 (DT)        | 11.8     | 46%
Letter    | 13.8 (DT)        | 3.5      | 74%
Reuters 4 | 5.8, 6.0, 9.8    | 2.95     | ~60%
Reuters 8 | 11.3, 12.1, 13.4 | 7.4      | ~40%

Page 36: Boosting

Applied research

• "AT&T, How may I help you?"

• Classify voice requests

• Voice -> text -> category

• Fourteen categories: area code, AT&T service, billing credit, calling card, collect, competitor, dial assistance, directory, how to dial, person to person, rate, third party, time charge, time

Schapire, Singer, Gorin 98

Page 37: Boosting

Examples:

• Yes I'd like to place a collect call long distance please

• Operator I need to make a call but I need to bill it to my office

• Yes I'd like to place a call on my master card please

• I just called a number in Sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off my bill

Categories assigned: collect, third party, billing credit, calling card

Page 38: Boosting

Weak rules generated by "boostexter"

[Figure: each weak rule tests whether a particular word occurs in the utterance and outputs a score for each category (e.g. calling card, collect call, third party), one set of scores if the word occurs and another if it does not.]
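As a rough sketch of such a word-based weak rule (our own illustration, not the actual BoosTexter code; the words, categories, and score values below are hypothetical):

```python
def word_rule(word, scores_if_present, scores_if_absent):
    """A weak text rule: returns per-category scores depending on whether
    `word` occurs in the utterance. Scores are dicts keyed by category."""
    def h(utterance):
        present = word in utterance.lower().split()
        return scores_if_present if present else scores_if_absent
    return h

# Hypothetical rule in the spirit of the slide:
collect_rule = word_rule(
    "collect",
    scores_if_present={"collect": +1.2, "calling card": -0.4, "third party": -0.3},
    scores_if_absent={"collect": -0.6, "calling card": +0.1, "third party": +0.1},
)
print(collect_rule("Yes I'd like to place a collect call long distance please"))
```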

Page 39: Boosting

Results

• 7844 training examples – hand transcribed

• 1000 test examples – hand / machine transcribed

• Accuracy with 20% rejected:
  – Machine transcribed: 75%
  – Hand transcribed: 90%

Page 40: Boosting

Commercial deployment

• Distinguish business/residence customers

• Using statistics from call-detail records

• Alternating decision trees
  – Similar to boosting decision trees, but more flexible
  – Combines very simple rules
  – Can over-fit; cross validation is used to stop

Freund, Mason, Rogers, Pregibon, Cortes 2000

Page 41: Boosting

Summary

• Boosting is a computational method for learning accurate classifiers.

• Resistance to over-fitting is explained by margins.

• Underlying explanation – large "neighborhoods" of good classifiers.

• Boosting has been applied successfully to a variety of classification problems.

Page 42: Boosting

Gene Regulation

• Regulatory proteins bind to the non-coding regulatory sequence of a gene to control its rate of transcription.

[Figure: DNA with binding sites upstream of a gene; regulators bind at the binding sites; the measurable quantity is the mRNA transcript.]

Page 43: Boosting

From mRNA to Protein

[Figure: the mRNA transcript passes through the nucleus wall, is translated by the ribosome, and the resulting protein sequence folds into a protein.]

Page 44: Boosting

Protein Transcription Factors

[Figure: a regulator (transcription factor).]

Page 45: Boosting

Genome-wide Expression Data

Page 46: Boosting

• Microarrays measure mRNA transcript expression levels for all of the ~6000 yeast genes at once.

• Very noisy data.

• Rough time slice over all compartments of many cells.

• Protein expression is not observed.

Page 47: Boosting

Partial "Parts List" for Yeast

Many known and putative:
– Transcription factors
– Signaling molecules that activate transcription factors
– Known and putative binding site "motifs"
– In yeast, regulatory sequence = 500 bp upstream region

[Figure: an upstream region with motifs (M) bound by transcription factors (TF) and signaling molecules (SM).]

Page 48: Boosting

GeneClass: Problem Formulation

Predict target gene regulatory response from regulator activity and binding site data.

[Figure: microarray image; "parent" gene expression levels R1, R2, R3, R4, …, Rp; target gene expression for genes G1, G2, G3, G4, …, Gt; binding sites (motifs) in the upstream region of each target gene.]

M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, C. Leslie.Predicting Genetic Regulatory Response Using Classification. ISMB 2004.

Page 49: Boosting

Role of quantization

• Expression is quantized into three classes: -1 (down), 0 (no change), +1 (up).

• By quantizing expression into three classes we reduce noise but maintain most of the signal.

• Weighting the +1/-1 examples linearly with the expression level performs slightly better.
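A minimal sketch of this quantization step (the cutoff value is our own placeholder, not the threshold used in the paper):

```python
import numpy as np

def quantize_expression(log_ratios, cutoff=0.5):
    """Map log expression ratios to {-1, 0, +1}:
    below -cutoff -> -1 (down), above +cutoff -> +1 (up), otherwise 0."""
    log_ratios = np.asarray(log_ratios)
    return np.where(log_ratios > cutoff, 1, np.where(log_ratios < -cutoff, -1, 0))

print(quantize_expression([-1.3, -0.2, 0.1, 0.9]))  # -> [-1  0  0  1]
```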

Page 50: Boosting

Problem setup

• Data point = (target gene, microarray experiment) pair

• Input features:
  – Parent state ∈ {-1, 0, +1}
  – Motif presence ∈ {0, 1}

• Predicted output:
  – Target gene label ∈ {-1, +1}

Page 51: Boosting

Boosting with Alternating Decision Trees (ADTs)

• Use boosting to build a single ADT, a margin-based generalization of a decision tree.

[Figure: an ADT fragment with a splitter node ("Is motif MIG1 present AND parent XBP1 up?") and prediction nodes.]

F(x) is given by the sum of the prediction nodes along all paths consistent with x.
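A minimal sketch of how such an ADT is evaluated, summing the prediction nodes along all paths consistent with x (the node structure and values below are illustrative, not the learned tree):

```python
class ADTNode:
    """A prediction node, optionally followed by splitter children.
    Each splitter is (condition, yes_child, no_child)."""
    def __init__(self, value, splitters=None):
        self.value = value
        self.splitters = splitters or []

def adt_score(node, x):
    """Sum prediction-node values along all paths consistent with x."""
    total = node.value
    for condition, yes_child, no_child in node.splitters:
        child = yes_child if condition(x) else no_child
        total += adt_score(child, x)
    return total

# Illustrative tree in the spirit of the slide (values are made up):
root = ADTNode(-0.2, splitters=[
    (lambda x: x["MIG1_motif"] == 1 and x["XBP1"] == +1,   # splitter node
     ADTNode(+0.4), ADTNode(-0.1)),
])
x = {"MIG1_motif": 1, "XBP1": +1}
print(adt_score(root, x))   # -0.2 + 0.4 = 0.2 -> predict sign(+0.2) = +1
```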

Page 52: Boosting

Statistical Validation

• 10-fold cross-validation experiments, ~50,000 (gene, microarray) training examples.

• Significant correlation between the prediction score and the true log expression ratio on held-out data.

• Prediction accuracy on +1/-1 labels: 88.5%.
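The paper's ADT learner is not a standard library component; as a rough stand-in, the same 10-fold protocol with scikit-learn's boosted stumps on synthetic data of the same shape would look like this (illustrative only, not the GeneClass code or data):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the (gene, microarray) examples:
# parent states in {-1, 0, +1} and motif presence in {0, 1}.
rng = np.random.default_rng(0)
X = np.hstack([rng.integers(-1, 2, size=(5000, 20)),   # parent states
               rng.integers(0, 2, size=(5000, 30))])   # motif presence
y = np.sign(X[:, 0] + X[:, 5] + 0.5 * rng.normal(size=5000))
y[y == 0] = 1                                           # labels in {-1, +1}

clf = AdaBoostClassifier(n_estimators=200)              # boosted stumps by default
scores = cross_val_score(clf, X, y, cv=10)              # 10-fold cross-validation
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```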

Page 53: Boosting

Biological Interpretation: From correlation to causation

• Good prediction only implies correlation.

• To infer causation we need to integrate additional knowledge.

• Comparative case studies: train on similar conditions (stresses), test on related experiments.

• Extract significant features from the learned model:
  – Iteration score (IS): the boosting iteration at which a feature first appears; identifies significant motifs and motif-parent pairs.
  – Abundance score (AS): the number of nodes in the ADT containing a feature; identifies important regulators.

• In silico knock-outs: remove a significant regulator and retrain.

Page 54: Boosting

Case Study: Heat Shock and Osmolarity

Training set: heat shock, osmolarity, amino acid starvation

Test set: stationary phase, simultaneous heat shock + osmolarity

Results:
– Test error = 9.3%.
– Supports the Gasch hypothesis: the heat shock and osmolarity pathways are independent and additive.
– High scoring parents (AS): USV1 (stationary phase and heat shock), PPT1 (osmolarity response), GAC1 (response to heat).

Page 55: Boosting

Case Study: Heat Shock and Osmolarity

Results:

– High scoring binding sites (IS):
  MSN2/MSN4 STRE element
  Heat shock related: HSF1 and RAP1 binding sites
  Osmolarity/glycerol pathways: CAT8, MIG1, GCN4
  Amino acid starvation: GCN4, CHA4, MET31

– High scoring motif-parent pair (IS):
  TPK1~STRE pair (a kinase that regulates MSN2 via cellular localization) – an indirect effect.

[Figure: three ways a regulator-motif pair can arise – direct binding (the TF binds the motif), indirect effect (a parent acts through a TF that binds the motif), and co-occurrence.]

Page 56: Boosting

Case Study: In silico knockout

• Training and test sets: same as the heat shock and osmolarity case study.

• Knockout: remove USV1 from the regulator list and retrain.

• Results:
  – Test error: 12% (an increase from 9%).
  – Identify putative downstream targets of USV1: target genes whose label changes from correct to incorrect.
  – GO annotation analysis reveals putative functions: nucleoside transport, cell-wall organization and biogenesis, heat-shock protein activity.
  – The putative functions match those identified in the wet-lab USV1 knockout (Segal et al., 2003).

Page 57: Boosting

Conclusions: Gene Regulation

• New predictive model for the study of gene regulation:
  – First gene regulation model to make quantitative predictions.
  – Uses actual expression levels – no clustering.
  – Strong prediction accuracy on held-out experiments.
  – Interpretable hypotheses: significant regulators, binding motifs, regulator-motif pairs.

• New methodology for biological analysis: comparative training/test studies, in silico knockouts.

Page 58: Boosting

Summary

• Boosting is an efficient and flexible method for constructing complex and accurate classifiers.

• Correlation -> Causation : still a hard problem, requires domain specific expertise and integration of data sources.

Page 59: Boosting

Improvement suggestions...

• The use of binary labels simplifies the algorithm, but doesn't reflect reality.

• “Confusion table”.

Page 60: Boosting

The End.

Page 61: Boosting

Large margins

$$\mathrm{margin}_{F_T}(x,y) \;=\; \frac{y\sum_{t=1}^{T}\alpha_t h_t(x)}{\sum_{t=1}^{T}|\alpha_t|} \;=\; y\,f_T(x)\;\in\;[-1,+1]$$

$$\mathrm{margin}_{F_T}(x,y) > 0 \;\Longleftrightarrow\; f_T \text{ classifies } (x,y) \text{ correctly}$$

Thesis: large margins => reliable predictions.

Very similar to SVM.

Page 62: Boosting

Experimental Evidence

Page 63: Boosting

Theorem (Schapire, Freund, Bartlett & Lee / Annals of Statistics 1998)

H: a set of binary functions with VC-dimension d

$$C = \Big\{ \textstyle\sum_i \alpha_i h_i \;\Big|\; h_i \in H,\ \alpha_i \ge 0,\ \textstyle\sum_i \alpha_i = 1 \Big\}$$

$$T = \{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\};\quad T \sim D^m$$

For all $c \in C$ and all $\theta > 0$, with probability $1-\delta$ w.r.t. $T \sim D^m$:

$$P_{(x,y)\sim D}\big[\mathrm{sign}(c(x)) \ne y\big] \;\le\; P_{(x,y)\sim T}\big[\mathrm{margin}_c(x,y) \le \theta\big] \;+\; \tilde{O}\!\left(\frac{1}{\theta}\sqrt{\frac{d}{m}}\right) + O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right)$$

No dependence on the number of combined functions!!!

Page 64: Boosting

Idea of Proof