Page 1: Title

Weight Annealing
Data Perturbation for Escaping Local Maxima in Learning

Gal Elidan, Matan Ninio, Nir Friedman
Hebrew University
{galel,ninio,nir}@cs.huji.ac.il

Dale Schuurmans
University of Waterloo
[email protected]

Page 2: The Learning Problem

Score(h, D, w):

  Density estimation:    sum_m w_m log P(X[m] | h) + Pen(h)
  Classification:        sum_m w_m P(h(X[m]) = C[m] | X[m]) + Pen(h)
  Logistic regression:   -sum_m w_m log(1 + exp(-y[m] h·x[m])) + Pen(h)

Learning task: search for h* = argmax_h Score(h, D)

[Figure: DATA with instance weights feeds the learner, which outputs a Hypothesis and Score(h, D); with weights the task becomes h* = argmax_h Score(h, D, w)]

Optimization is hard! Typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM.
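To make the weighted score concrete, here is a minimal sketch (not from the slides) of the density-estimation case: a weighted log-likelihood plus a penalty term. The callable log_prob and the value penalty are placeholders for whatever hypothesis family is being learned.

import numpy as np

def weighted_score(log_prob, penalty, X, w):
    # Weighted density-estimation score:
    #   Score(h, D, w) = sum_m w[m] * log P(X[m] | h) + Pen(h)
    # log_prob(x): log P(x | h) under the current hypothesis h (placeholder).
    # penalty:     Pen(h), e.g. a complexity penalty (placeholder).
    return float(np.sum(np.asarray(w) * np.array([log_prob(x) for x in X])) + penalty)

Setting all w[m] = 1 recovers the unweighted score that the search ultimately tries to maximize.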

Page 3: Escaping Local Maxima

Local methods converge to (one of many) local optima.

Existing escape methods perturb the steps of the local search: TABU search, random restarts, simulated annealing.

[Figure: Score vs. h landscape; the search gets stuck at a local maximum ("Stuck here")]

Page 4: Weight Perturbation

Our idea: perturbation of instance weights.

Puts stronger emphasis on a subset of the instances.

Allows the learning procedure to escape local maxima.

[Figure: DATA with weights W -> perturb -> DATA with perturbed weights W]

Page 5: Iterative Procedure

[Figure: loop alternating LOCAL SEARCH over hypotheses (h, Score) and REWEIGHT of the data weights W]

Benefits:
- Generality: a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes

Page 6: Iterative Procedure

Two methods for reweighting:
- Random: sample random weights
- Adversarial: directed reweighting

To maximize the original goal, slowly diminish the magnitude of the perturbations (a sketch of the full loop follows below).
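A minimal sketch (illustrative only, not the authors' exact pseudocode) of the overall loop: alternate local search and reweighting while the perturbation temperature decays toward zero. The helpers local_search(data, weights, init) and reweight(w_star, h, data, temp) are assumed interfaces for the chosen learner and for the random or adversarial reweighting step.

import numpy as np

def weight_annealing(data, local_search, reweight, n_iters=40, t0=1.0, cooling=0.9):
    # Original (unperturbed) instance weights: uniform over the data.
    w_star = np.ones(len(data)) / len(data)
    w, h, temp = w_star.copy(), None, t0
    for _ in range(n_iters):
        h = local_search(data, w, init=h)      # optimize hypothesis on the weighted data
        temp *= cooling                        # anneal: shrink the perturbation magnitude
        w = reweight(w_star, h, data, temp)    # draw new (random or adversarial) weights
    # Final fine-tuning pass on the original, unperturbed weights.
    return local_search(data, w_star, init=h)

Because the search procedure itself is untouched, the same loop can wrap greedy hill-climbing over Bayesian-network structures, EM, or gradient ascent over PSSM parameters.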

Page 7: Random Reweighting

When hot, the model can "go" almost anywhere and local maxima are bypassed. When cold, the search fine-tunes to find an optimum with respect to the original data.

[Figure: sampling distribution P(W) over weight vectors; the mean is the original weight W* and the variance is proportional to the temperature; successive samples W_t, W_t+1, W_t+2 shrink in distance from the original W as the temperature drops]
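A minimal sketch of the random reweighting step, under the assumption that the perturbation is applied multiplicatively in log-space (to keep weights positive) with variance proportional to the temperature; the exact sampling distribution used in the paper may differ.

import numpy as np

def random_reweight(w_star, temperature, rng=None):
    # Sample weights centered (in log-space) on the original weights w_star;
    # the noise variance scales with the temperature, so the perturbation
    # vanishes as the system cools.
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, np.sqrt(temperature), size=len(w_star))
    w = np.asarray(w_star) * np.exp(noise)
    return w / w.sum() * np.sum(w_star)   # renormalize to the original total weight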

Page 8: Weight Annealing Data Perturbation for Escaping Local Maxima in Learning Gal Elidan, Matan Ninio, Nir Friedman Hebrew University {galel,ninio,nir}@cs.huji.ac.il.

Wt

Adversarial ReweightingIdea: Challenge model by increasing w of “bad” (low scoring) instances

W*

Wt+1

Converge towards original distribution by constraining

distance from W*

Challenge the model by emphasizing bad samples

(minimize the score using W)A min-max game between re-weighting and optimizer

tw

t Scoretempηww exp*1

Kivinen & Warmuth
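A minimal sketch of the adversarial step matching the update above, assuming the per-instance scores Score(h_t : x[m]) are available from the learner; low-scoring ("bad") instances receive larger weights, and the update collapses back to w* as the temperature drops.

import numpy as np

def adversarial_reweight(w_star, per_instance_scores, eta, temperature):
    # Exponentiated update: w_m proportional to w*_m * exp(-eta * temp * Score(h_t : x[m])).
    scores = np.asarray(per_instance_scores)
    w = np.asarray(w_star) * np.exp(-eta * temperature * scores)
    return w / w.sum() * np.sum(w_star)   # renormalize to the original total weight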

Page 9: Learning Bayesian Networks

A Bayesian network (BN) is a compact representation of a joint distribution. Learning a BN is a density estimation problem.

[Figure: the Alarm network (37 nodes, e.g. PCWP, CO, HRBP, CATECHOL, SAO2, VENTLUNG, HYPOVOLEMIA, BP, ...), learned from DATA with instance weights]

Learning task: find the structure + parameters that maximize the score.

Page 10: Structure Search Results

Super-exponential combinatorial search space. Search uses local ops: add/remove/reverse edge. Optimize the Bayesian Dirichlet (BDe) score.

Alarm network: 37 variables, 1000 samples.

[Figure: test log-loss per instance (about -15.5 to -15.15) over iterations, from HOT to COLD, for Random annealing and Adversary runs, compared with the BASELINE and the TRUE STRUCTURE]

With similar running time: Random is superior to random re-starts; a single Adversary run competes with Random.
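For concreteness, a minimal sketch (not from the slides) of the neighbor set explored by the structure search: all edge sets reachable from the current DAG by a single add/remove/reverse operation. Acyclicity testing and the (weighted) BDe scoring of each neighbor are left as assumed helpers.

from itertools import permutations

def structure_neighbors(edges, nodes, is_acyclic):
    # edges: set of directed edges (u, v); yields candidate edge sets one local op away.
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                     # remove edge u -> v
            reversed_edges = (edges - {(u, v)}) | {(v, u)}
            if is_acyclic(reversed_edges):
                yield reversed_edges                   # reverse edge to v -> u
        elif (v, u) not in edges:
            added = edges | {(u, v)}
            if is_acyclic(added):
                yield added                            # add edge u -> v

Greedy hill-climbing moves to the neighbor with the highest weighted score and stops when no neighbor improves; the annealing loop perturbs the instance weights between such climbs.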

Page 11: Search with Missing Values

Missing values introduce many local maxima. Structural EM (SEM) combines structure search and parameter estimation.

Alarm network: 37 variables, 1000 samples.

[Figure: cumulative plot ("percent at least this good") of test log-loss per instance (about -15.1 to -14.96) for RANDOM and ADVERSARY runs, compared with the BASELINE and the GENERATING MODEL; 90% of Random runs are better than the baseline]

With similar running time: over 90% of Random runs are better than standard SEM, and the Adversary run is best. The distance to the true generating model is halved!

Page 12: Weight Annealing Data Perturbation for Escaping Local Maxima in Learning Gal Elidan, Matan Ninio, Nir Friedman Hebrew University {galel,ninio,nir}@cs.huji.ac.il.

Real-life datasets 6 real-life examples with and without missing values

201512

VariablesSamples

36446

30300

70200

36546

13100

StockSoybean

RosettaAudio

Soy-MPromoter

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

Lo

g-l

oss

/ i

nst

ance

on

tes

t d

ata

BASELINE

Adversary20-80% Random

With similar running time: Adversary is efficient and preferable Random takes longer for inferior results

Page 13: Learning Sequence Motifs

DNA promoter sequences, e.g.:

ATCTAGCTGAGAATGCACACTGATCGAGCCCCACCATATTCTTCGGACTGCGCTATATAGACTGCAACTAGTAGAGCTCTGCTAGAAACATTACTAAGCTCTATGACTGCCGATTGCGCCGTTTGGGCGTCTGAGCTCTTTGCTCTTGACTTCCGCTTATTGATATTATCTCTCTTGCTCGTGACTGCTTTATTGTGGGGGGGACTGCTGATTATGCTGCTCATAGGAGAGACTGCGAGAGTCGTCGTAGGACTGCGTCGTCGTGATGATGCTGCTGATCGATCGGACTGCCTAGCTAGTAGATCGATGTGACTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCGACTGCTCGAGAGGAAGTATATATGACTGCGCGCGCCGCGCGCCGGACTGCTTTATCCAGCTGATGCATGCATGCTAGTAGACTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCGACTGCAGTCGATCGATGCTAGTTATTGGATGCGACTGAACTCGTAGCTGTAGTTATT

Represent using a motif: a Position Specific Scoring Matrix (PSSM), e.g.:

     pos 1   pos 2   pos 3   pos 4   pos 5
A    0.97    0       0       0.02    0
C    0       0.01    0.99    0       0.2
G    0       0.99    0.1     0       0.8
T    0.03    0       0       0.98    0

Weighted score (Segal et al., RECOMB 2002):

Score(S, w) = sum_{n=1..N} w_n log logistic( sum_i exp( sum_{j=1..K} θ_{j, S_n[i+j]} ) )

The score is highly non-linear: optimization is hard!
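An illustrative sketch of a weighted motif score in the spirit of the formula above (the exact model of Segal et al. may differ): per-window PSSM match scores are combined through exp and a logistic, and each sequence's contribution is multiplied by its weight. The dictionary theta, keyed by (position, letter), is an assumed representation of the K-wide PSSM.

import numpy as np

def motif_score(theta, sequences, w, K):
    # theta[(j, letter)] holds the PSSM entry for position j (0..K-1) and that letter.
    total = 0.0
    for w_n, seq in zip(w, sequences):
        window_scores = [sum(theta[(j, seq[i + j])] for j in range(K))
                         for i in range(len(seq) - K + 1)]
        z = np.sum(np.exp(window_scores))                 # soft count of motif matches
        total += w_n * np.log(1.0 / (1.0 + np.exp(-z)))   # log of the logistic
    return total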

Page 14: Real-life Motif Results

Construct a PSSM: find the parameters θ that maximize the score. Experiments on 9 transcription factors (motifs): ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI6.

PSSM: 4 letters x 20 positions, 550 samples.

[Figure: test log-loss relative to the BASELINE for each motif (range about -5 to 50); bars for Adversary and the 20-80% range of Random runs]

With similar running time: both methods are better than standard ascent; Adversary is efficient and best in 6/9 cases.

Page 15: Simulated Annealing

Simulated annealing: allow "bad" moves with some probability P(move) = f(temp, ΔScore).

[Figure: Score vs. h landscape with occasional downhill moves accepted]

Wasteful propose, evaluate, reject cycle. Needs a long time to escape local maxima. WORSE than the baseline on Bayesian networks!
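For contrast, a minimal sketch of the acceptance rule being criticized here, in the standard Metropolis form (an assumption, not taken from the slides): every proposed move must be generated and evaluated before it can be rejected, which is the wasteful cycle noted above.

import math, random

def sa_accept(delta_score, temperature, rng=random):
    # Always accept improving moves; accept worsening moves with
    # probability exp(delta_score / temperature), which shrinks as temp -> 0.
    if delta_score >= 0:
        return True
    return rng.random() < math.exp(delta_score / temperature)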

Page 16: Summary and Future Work

General method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP, ...

Promising empirical results: approach the "achievable" maximum.

The BIG challenge: THEORETICAL INSIGHTS

Page 17: Adversary ≠ Boosting

             Adversary                             Boosting
Output:      Single hypothesis                     An ensemble
Weights:     Converge to original distribution     Diverge from original distribution
Learning:    h_t+1 depends on h_t                  h_t+1 depends only on w_t+1

The same comparison holds for Random vs. Bagging/Bootstrap.

Page 18: Other Annealing Methods

Simulated annealing: allow "bad" moves with some probability P(move) = f(temp, ΔScore). Not good on Bayesian networks!

Deterministic annealing: change the scenery by changing the family of h, moving from simple hypotheses to complex hypotheses. Not naturally applicable!

[Figure: Score vs. h landscapes for simulated annealing and for deterministic annealing (simple vs. complex hypothesis families)]