Page 1: Title

Weight Annealing
Data Perturbation for Escaping Local Maxima in Learning

Gal Elidan, Matan Ninio, Nir Friedman
Hebrew University
{galel,ninio,nir}@cs.huji.ac.il

Dale Schuurmans
University of Waterloo
[email protected]

Page 2: The Learning Problem

Score(h, D, w):

  Density estimation:    sum_m w_m log P(X[m] | h) + Pen(h)
  Classification:        sum_m w_m P(h(X[m]) = C[m] | X[m]) + Pen(h)
  Logistic regression:   -sum_m w_m log(1 + exp(-y[m] h·x[m])) + Pen(h)

Learning task: search for h* = argmax_h Score(h, D)

[Figure: DATA with instance weights feeds the learner, which outputs a Hypothesis and Score(h, D); with weights the task becomes h* = argmax_h Score(h, D, w)]

Optimization is hard! Typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM.
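To make the weighted score concrete, here is a minimal sketch (not from the slides) of the density-estimation case: a weighted log-likelihood plus a penalty term. The callable log_prob and the value penalty are placeholders for whatever hypothesis family is being learned.

import numpy as np

def weighted_score(log_prob, penalty, X, w):
    # Weighted density-estimation score:
    #   Score(h, D, w) = sum_m w[m] * log P(X[m] | h) + Pen(h)
    # log_prob(x): log P(x | h) under the current hypothesis h (placeholder).
    # penalty:     Pen(h), e.g. a complexity penalty (placeholder).
    return float(np.sum(np.asarray(w) * np.array([log_prob(x) for x in X])) + penalty)

Setting all w[m] = 1 recovers the unweighted score that the search ultimately tries to maximize.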

Page 3: Escaping Local Maxima

Local methods converge to (one of many) local optima.

Existing escape methods perturb the steps of the local search: TABU search, random restarts, simulated annealing.

[Figure: Score vs. h landscape; the search gets stuck at a local maximum ("Stuck here")]

Page 4: Weight Perturbation

Our idea: perturbation of instance weights.

Puts stronger emphasis on a subset of the instances.

Allows the learning procedure to escape local maxima.

[Figure: DATA with weights W -> perturb -> DATA with perturbed weights W]

Page 5: Iterative Procedure

[Figure: loop alternating LOCAL SEARCH over hypotheses (h, Score) and REWEIGHT of the data weights W]

Benefits:
- Generality: a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes

Page 6: Iterative Procedure

Two methods for reweighting:
- Random: sample random weights
- Adversarial: directed reweighting

To maximize the original goal, slowly diminish the magnitude of the perturbations (a sketch of the full loop follows below).
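A minimal sketch (illustrative only, not the authors' exact pseudocode) of the overall loop: alternate local search and reweighting while the perturbation temperature decays toward zero. The helpers local_search(data, weights, init) and reweight(w_star, h, data, temp) are assumed interfaces for the chosen learner and for the random or adversarial reweighting step.

import numpy as np

def weight_annealing(data, local_search, reweight, n_iters=40, t0=1.0, cooling=0.9):
    # Original (unperturbed) instance weights: uniform over the data.
    w_star = np.ones(len(data)) / len(data)
    w, h, temp = w_star.copy(), None, t0
    for _ in range(n_iters):
        h = local_search(data, w, init=h)      # optimize hypothesis on the weighted data
        temp *= cooling                        # anneal: shrink the perturbation magnitude
        w = reweight(w_star, h, data, temp)    # draw new (random or adversarial) weights
    # Final fine-tuning pass on the original, unperturbed weights.
    return local_search(data, w_star, init=h)

Because the search procedure itself is untouched, the same loop can wrap greedy hill-climbing over Bayesian-network structures, EM, or gradient ascent over PSSM parameters.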

Page 7: Random Reweighting

When hot, the model can "go" almost anywhere and local maxima are bypassed. When cold, the search fine-tunes to find an optimum with respect to the original data.

[Figure: sampling distribution P(W) over weight vectors; the mean is the original weight W* and the variance is proportional to the temperature; successive samples W_t, W_t+1, W_t+2 shrink in distance from the original W as the temperature drops]
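A minimal sketch of the random reweighting step, under the assumption that the perturbation is applied multiplicatively in log-space (to keep weights positive) with variance proportional to the temperature; the exact sampling distribution used in the paper may differ.

import numpy as np

def random_reweight(w_star, temperature, rng=None):
    # Sample weights centered (in log-space) on the original weights w_star;
    # the noise variance scales with the temperature, so the perturbation
    # vanishes as the system cools.
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, np.sqrt(temperature), size=len(w_star))
    w = np.asarray(w_star) * np.exp(noise)
    return w / w.sum() * np.sum(w_star)   # renormalize to the original total weight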

Page 8: Weight Annealing Data Perturbation for Escaping Local Maxima in Learning Gal Elidan, Matan Ninio, Nir Friedman Hebrew University {galel,ninio,nir}@cs.huji.ac.il.

Wt

Adversarial ReweightingIdea: Challenge model by increasing w of “bad” (low scoring) instances

W*

Wt+1

Converge towards original distribution by constraining

distance from W*

Challenge the model by emphasizing bad samples

(minimize the score using W)A min-max game between re-weighting and optimizer

tw

t Scoretempηww exp*1

Kivinen & Warmuth
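A minimal sketch of the adversarial step matching the update above, assuming the per-instance scores Score(h_t : x[m]) are available from the learner; low-scoring ("bad") instances receive larger weights, and the update collapses back to w* as the temperature drops.

import numpy as np

def adversarial_reweight(w_star, per_instance_scores, eta, temperature):
    # Exponentiated update: w_m proportional to w*_m * exp(-eta * temp * Score(h_t : x[m])).
    scores = np.asarray(per_instance_scores)
    w = np.asarray(w_star) * np.exp(-eta * temperature * scores)
    return w / w.sum() * np.sum(w_star)   # renormalize to the original total weight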

Page 9: Learning Bayesian Networks

A Bayesian network (BN) is a compact representation of a joint distribution. Learning a BN is a density estimation problem.

[Figure: the Alarm network (37 nodes, e.g. PCWP, CO, HRBP, CATECHOL, SAO2, VENTLUNG, HYPOVOLEMIA, BP, ...), learned from DATA with instance weights]

Learning task: find the structure + parameters that maximize the score.

Page 10: Structure Search Results

Super-exponential combinatorial search space. Search uses local ops: add/remove/reverse edge. Optimize the Bayesian Dirichlet (BDe) score.

Alarm network: 37 variables, 1000 samples.

[Figure: test log-loss per instance (about -15.5 to -15.15) over iterations, from HOT to COLD, for Random annealing and Adversary runs, compared with the BASELINE and the TRUE STRUCTURE]

With similar running time: Random is superior to random re-starts; a single Adversary run competes with Random.
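For concreteness, a minimal sketch (not from the slides) of the neighbor set explored by the structure search: all edge sets reachable from the current DAG by a single add/remove/reverse operation. Acyclicity testing and the (weighted) BDe scoring of each neighbor are left as assumed helpers.

from itertools import permutations

def structure_neighbors(edges, nodes, is_acyclic):
    # edges: set of directed edges (u, v); yields candidate edge sets one local op away.
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                     # remove edge u -> v
            reversed_edges = (edges - {(u, v)}) | {(v, u)}
            if is_acyclic(reversed_edges):
                yield reversed_edges                   # reverse edge to v -> u
        elif (v, u) not in edges:
            added = edges | {(u, v)}
            if is_acyclic(added):
                yield added                            # add edge u -> v

Greedy hill-climbing moves to the neighbor with the highest weighted score and stops when no neighbor improves; the annealing loop perturbs the instance weights between such climbs.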

Page 11: Search with Missing Values

Missing values introduce many local maxima. Structural EM (SEM) combines structure search and parameter estimation.

Alarm network: 37 variables, 1000 samples.

[Figure: cumulative plot ("percent at least this good") of test log-loss per instance (about -15.1 to -14.96) for RANDOM and ADVERSARY runs, compared with the BASELINE and the GENERATING MODEL; 90% of Random runs are better than the baseline]

With similar running time: over 90% of Random runs are better than standard SEM, and the Adversary run is best. The distance to the true generating model is halved!

Page 12: Weight Annealing Data Perturbation for Escaping Local Maxima in Learning Gal Elidan, Matan Ninio, Nir Friedman Hebrew University {galel,ninio,nir}@cs.huji.ac.il.

Real-life datasets 6 real-life examples with and without missing values

201512

VariablesSamples

36446

30300

70200

36546

13100

StockSoybean

RosettaAudio

Soy-MPromoter

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

Lo

g-l

oss

/ i

nst

ance

on

tes

t d

ata

BASELINE

Adversary20-80% Random

With similar running time: Adversary is efficient and preferable Random takes longer for inferior results

Page 13: Learning Sequence Motifs

DNA promoter sequences, e.g.:

ATCTAGCTGAGAATGCACACTGATCGAGCCCCACCATATTCTTCGGACTGCGCTATATAGACTGCAACTAGTAGAGCTCTGCTAGAAACATTACTAAGCTCTATGACTGCCGATTGCGCCGTTTGGGCGTCTGAGCTCTTTGCTCTTGACTTCCGCTTATTGATATTATCTCTCTTGCTCGTGACTGCTTTATTGTGGGGGGGACTGCTGATTATGCTGCTCATAGGAGAGACTGCGAGAGTCGTCGTAGGACTGCGTCGTCGTGATGATGCTGCTGATCGATCGGACTGCCTAGCTAGTAGATCGATGTGACTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCGACTGCTCGAGAGGAAGTATATATGACTGCGCGCGCCGCGCGCCGGACTGCTTTATCCAGCTGATGCATGCATGCTAGTAGACTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCGACTGCAGTCGATCGATGCTAGTTATTGGATGCGACTGAACTCGTAGCTGTAGTTATT

Represent using a motif: a Position Specific Scoring Matrix (PSSM), e.g.:

     pos 1   pos 2   pos 3   pos 4   pos 5
A    0.97    0       0       0.02    0
C    0       0.01    0.99    0       0.2
G    0       0.99    0.1     0       0.8
T    0.03    0       0       0.98    0

Weighted score (Segal et al., RECOMB 2002):

Score(S, w) = sum_{n=1..N} w_n log logistic( sum_i exp( sum_{j=1..K} θ_{j, S_n[i+j]} ) )

The score is highly non-linear: optimization is hard!
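An illustrative sketch of a weighted motif score in the spirit of the formula above (the exact model of Segal et al. may differ): per-window PSSM match scores are combined through exp and a logistic, and each sequence's contribution is multiplied by its weight. The dictionary theta, keyed by (position, letter), is an assumed representation of the K-wide PSSM.

import numpy as np

def motif_score(theta, sequences, w, K):
    # theta[(j, letter)] holds the PSSM entry for position j (0..K-1) and that letter.
    total = 0.0
    for w_n, seq in zip(w, sequences):
        window_scores = [sum(theta[(j, seq[i + j])] for j in range(K))
                         for i in range(len(seq) - K + 1)]
        z = np.sum(np.exp(window_scores))                 # soft count of motif matches
        total += w_n * np.log(1.0 / (1.0 + np.exp(-z)))   # log of the logistic
    return total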

Page 14: Real-life Motif Results

Construct a PSSM: find the parameters θ that maximize the score. Experiments on 9 transcription factors (motifs): ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI6.

PSSM: 4 letters x 20 positions, 550 samples.

[Figure: test log-loss relative to the BASELINE for each motif (range about -5 to 50); bars for Adversary and the 20-80% range of Random runs]

With similar running time: both methods are better than standard ascent; Adversary is efficient and best in 6/9 cases.

Page 15: Simulated Annealing

Simulated annealing: allow "bad" moves with some probability P(move) = f(temp, ΔScore).

[Figure: Score vs. h landscape with occasional downhill moves accepted]

Wasteful propose, evaluate, reject cycle. Needs a long time to escape local maxima. WORSE than the baseline on Bayesian networks!
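For contrast, a minimal sketch of the acceptance rule being criticized here, in the standard Metropolis form (an assumption, not taken from the slides): every proposed move must be generated and evaluated before it can be rejected, which is the wasteful cycle noted above.

import math, random

def sa_accept(delta_score, temperature, rng=random):
    # Always accept improving moves; accept worsening moves with
    # probability exp(delta_score / temperature), which shrinks as temp -> 0.
    if delta_score >= 0:
        return True
    return rng.random() < math.exp(delta_score / temperature)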

Page 16: Summary and Future Work

General method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP, ...

Promising empirical results: approach the "achievable" maximum.

The BIG challenge: THEORETICAL INSIGHTS

Page 17: Adversary ≠ Boosting

             Adversary                             Boosting
Output:      Single hypothesis                     An ensemble
Weights:     Converge to original distribution     Diverge from original distribution
Learning:    h_t+1 depends on h_t                  h_t+1 depends only on w_t+1

The same comparison holds for Random vs. Bagging/Bootstrap.

Page 18: Other Annealing Methods

Simulated annealing: allow "bad" moves with some probability P(move) = f(temp, ΔScore). Not good on Bayesian networks!

Deterministic annealing: change the scenery by changing the family of h, moving from simple hypotheses to complex hypotheses. Not naturally applicable!

[Figure: Score vs. h landscapes for simulated annealing and for deterministic annealing (simple vs. complex hypothesis families)]