Structured Perceptrons & Structural SVMs
CS 159: Advanced Topics in Machine Learning
4/6/2017

Transcript
Page 1: Structured Perceptrons & Structural SVMs

Structured Perceptrons & Structural SVMs

CS 159: Advanced Topics in Machine Learning

4/6/2017

Page 2: Recall: Sequence Prediction

Recall: Sequence Prediction

• Input: x = (x_1, …, x_M)
• Predict: y = (y_1, …, y_M)
  – Each y_i is one of L labels.

• x = "Fish Sleep" → y = (N, V)
• x = "The Dog Ate My Homework" → y = (D, N, V, D, N)
• x = "The Fox Jumped Over The Fence" → y = (D, N, V, P, D, N)

POS Tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)

Page 3: Recall: 1st Order HMM

Recall: 1st Order HMM

• x = (x_1, x_2, x_3, x_4, …, x_M)  (sequence of words)
• y = (y_1, y_2, y_3, y_4, …, y_M)  (sequence of POS tags)

• P(x_j | y_j): probability of state y_j generating x_j
• P(y_{j+1} | y_j): probability of state y_j transitioning to y_{j+1}
• P(y_1 | y_0): y_0 is defined to be the Start state
• P(End | y_M): prior probability of y_M being the final state
  – Not always used

Page 4: HMM Graphical Model Representation

HMM Graphical Model Representation

[Figure: chain-structured graphical model Y_0 → Y_1 → Y_2 → … → Y_M → Y_End, where each state Y_i emits an observation X_i; the Start node Y_0 and End node Y_End are optional.]

P(x, y) = P(End | y_M) \prod_{i=1}^{M} P(y_i | y_{i-1}) \prod_{i=1}^{M} P(x_i | y_i)

(The P(End | y_M) factor is optional.)

Page 5: Most Common Prediction Problem

Most Common Prediction Problem

• Given an input sentence, predict the POS tag sequence.

• Solve using Viterbi
  – Special case of the max-product algorithm

h(x) = argmax_y P(y | x) = argmax_y log P(y | x)

log P(y | x) = \sum_j [ log P(x_j | y_j) + log P(y_j | y_{j-1}) ]

Page 6: Simple Example

Simple Example

• x = "Fish Sleep"
• y = (N, V)

F(y, x) ≡ log P(y | x) = \sum_j [ log P(x_j | y_j) + log P(y_j | y_{j-1}) ]

Transition log-probabilities, log P(y_j | y_{j-1}):

               P(N|·)   P(V|·)
  P(·|N)         -2        1
  P(·|V)          2       -2
  P(·|Start)      1       -1

Emission log-probabilities, log P(x_j | y_j):

               P(·|N)   P(·|V)
  P(Fish|·)       2        1
  P(Sleep|·)      1        0
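To make the Viterbi step from Page 5 concrete, here is a minimal Python sketch (names such as `trans`, `emit`, and `viterbi` are mine) that decodes this example with the max-product recursion in log space; it recovers y = (N, V) with score 4:

```python
# Log-probabilities transcribed from the two tables above.
trans = {('Start','N'): 1, ('Start','V'): -1,
         ('N','N'): -2,    ('N','V'): 1,
         ('V','N'): 2,     ('V','V'): -2}
emit  = {('N','Fish'): 2, ('V','Fish'): 1, ('N','Sleep'): 1, ('V','Sleep'): 0}
TAGS = ['N', 'V']

def viterbi(words):
    """Max-product in log space: argmax_y sum_j [log P(x_j|y_j) + log P(y_j|y_{j-1})]."""
    # delta[t] = best score of any tag prefix ending in tag t.
    delta = {t: trans[('Start', t)] + emit[(t, words[0])] for t in TAGS}
    backptrs = []
    for word in words[1:]:
        new_delta, ptr = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: delta[p] + trans[(p, t)])
            new_delta[t] = delta[best_prev] + trans[(best_prev, t)] + emit[(t, word)]
            ptr[t] = best_prev
        delta = new_delta
        backptrs.append(ptr)
    # Backtrace from the best final tag.
    tag = max(TAGS, key=lambda t: delta[t])
    path = [tag]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[tag]

print(viterbi(['Fish', 'Sleep']))   # (['N', 'V'], 4)
```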

Page 7: New Notation

New Notation

• "Unary Features"
• "Pairwise Transition Features"

F(y, x) ≡ \sum_{j=1}^{M} w^T φ_j(y^j, y^{j-1} | x)

φ_j(a, b | x) = [ φ1_j(a | x) ; φ2(a, b) ],   w = [ w1 ; w2 ]

φ1_j(a | x) = [ 1[(a = Noun) ∧ (x^j = 'Fish')],
                1[(a = Noun) ∧ (x^j = 'Sleep')],
                1[(a = Verb) ∧ (x^j = 'Fish')],
                1[(a = Verb) ∧ (x^j = 'Sleep')] ]^T

φ2(a, b) = [ 1[(a = Noun) ∧ (b = Start)],
             1[(a = Noun) ∧ (b = Noun)],
             1[(a = Noun) ∧ (b = Verb)],
             1[(a = Verb) ∧ (b = Start)],
             1[(a = Verb) ∧ (b = Noun)],
             1[(a = Verb) ∧ (b = Verb)] ]^T

Page 8: New Notation

New Notation: duplicate the word features for each label.

φ1_j(a | x) as on Page 7: the first two entries are the Noun class features, the last two the Verb class features. Evaluated on x = "Fish Sleep":

φ1_1(Noun | "Fish Sleep") = [1, 0, 0, 0]^T
φ1_2(Noun | "Fish Sleep") = [0, 1, 0, 0]^T
φ1_1(Verb | "Fish Sleep") = [0, 0, 1, 0]^T
φ1_2(Verb | "Fish Sleep") = [0, 0, 0, 1]^T

In general, stack a base feature vector φ1(x^j) once per label:

φ1_j(a | x) = [ 1[a = 1] φ1(x^j) ; … ; 1[a = L] φ1(x^j) ]

Page 9: New Notation

New Notation: one feature for every transition.

φ2(a, b) as on Page 7. Evaluated:

φ2(Noun, Start) = [1, 0, 0, 0, 0, 0]^T
φ2(Verb, Start) = [0, 0, 0, 1, 0, 0]^T
φ2(Verb, Noun)  = [0, 0, 0, 0, 1, 0]^T

Page 10: New Notation

F(y, x) ≡ \sum_{j=1}^{M} w^T φ_j(y^j, y^{j-1} | x),   φ_j(a, b | x) = [ φ1_j(a | x) ; φ2_j(a, b) ],   w = [ w1 ; w2 ]

Reading the weights off the old notation (the log-probability tables of Page 6):

w1 = [2, 1, 1, 0]^T
  (old notation: log P(Fish|N) = 2, log P(Sleep|N) = 1, log P(Fish|V) = 1, log P(Sleep|V) = 0)

w2 = [1, -2, 2, -1, 1, -2]^T
  (old notation: log P(N|Start) = 1, log P(N|N) = -2, log P(N|V) = 2, log P(V|Start) = -1, log P(V|N) = 1, log P(V|V) = -2)

Page 11: Recap: 1st Order Sequential Model

Recap: 1st Order Sequential Model

• Input: x = (x_1, …, x_M)
• Predict: y = (y_1, …, y_M)
  – Each y_i is one of L labels.  (POS Tags: Det, Noun, Verb, Adj, Adv, Prep; L = 6)

• Linear model w.r.t. the pairwise features φ_j(a, b | x)
• Prediction via maximizing F:

h(x) = argmax_y F(y, x) = argmax_y w^T Ψ(y, x)

where Ψ(y, x) = \sum_j φ_j(y^j, y^{j-1} | x) encodes the structure.

Page 12: Example

x = "Fish Sleep", y = (N, V)
(w1, w2, φ1_j, φ2 as defined on Pages 7-10.)

Prediction: argmax_y F(y, x)

F(y = (N,V), x = "Fish Sleep")
  = w1^T φ1_1(N, x) + w2^T φ2(N, Start) + w1^T φ1_2(V, x) + w2^T φ2(V, N)
  = w_{1,1} + w_{2,1} + w_{1,4} + w_{2,5}
  = 2 + 1 + 0 + 1 = 4

  y      F(y, x)
  (N,N)  2 + 1 + 1 - 2 = 2
  (N,V)  2 + 1 + 0 + 1 = 4
  (V,N)  1 - 1 + 1 + 2 = 3
  (V,V)  1 - 1 + 0 - 2 = -2
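The same table can be reproduced from the feature-vector notation. A minimal sketch, assuming the φ1/φ2 orderings and the w1, w2 values of Pages 7-10 (the helper names are mine):

```python
import numpy as np

TAGS, WORDS, PREV = ['Noun', 'Verb'], ['Fish', 'Sleep'], ['Start', 'Noun', 'Verb']
w1 = np.array([2., 1., 1., 0.])               # Page 10: unary weights
w2 = np.array([1., -2., 2., -1., 1., -2.])    # Page 10: transition weights

def phi1(a, word):
    # Unary features, ordered (Noun,Fish), (Noun,Sleep), (Verb,Fish), (Verb,Sleep).
    return np.array([float(a == t and word == v) for t in TAGS for v in WORDS])

def phi2(a, b):
    # Transition features, ordered (Noun|Start), (Noun|Noun), ..., (Verb|Verb).
    return np.array([float(a == t and b == p) for t in TAGS for p in PREV])

def F(y, x):
    # F(y, x) = sum_j [ w1^T phi1_j(y^j | x) + w2^T phi2(y^j, y^{j-1}) ]
    score, prev = 0.0, 'Start'
    for tag, word in zip(y, x):
        score += w1 @ phi1(tag, word) + w2 @ phi2(tag, prev)
        prev = tag
    return score

x = ('Fish', 'Sleep')
for y in [('Noun','Noun'), ('Noun','Verb'), ('Verb','Noun'), ('Verb','Verb')]:
    print(y, F(y, x))   # 2.0, 4.0, 3.0, -2.0: matches the table above
```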

Page 13: Why New Notation?

Why New Notation?

• Easier to reason about:
  – Computing predictions
  – Learning (linear model!)
  – Extensions (just generalize φ)

φ_j(a, b | x) = [ φ1_j(a | x) ; φ2(a, b) ],   w = [ w1 ; w2 ]   (φ1_j and φ2 as on Page 7)

Page 14: Generalizes Multiclass

Generalizes Multiclass

• Stack weight vectors for each class:

w = [ w_1 ; w_2 ; … ; w_L ],   Ψ(y, x) = [ 1[y = 1] x ; 1[y = 2] x ; … ; 1[y = L] x ]

F(y, x) ≡ w^T Ψ(y, x)

h(x) = argmax_y w^T Ψ(y, x) = argmax_y w_y^T x
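A quick numerical check of this equivalence is below; the sizes and random weights are illustrative only, and `Psi` mirrors the stacked feature map above:

```python
import numpy as np

L, d = 3, 4                                   # toy sizes: L classes, d features
rng = np.random.default_rng(0)
W = rng.normal(size=(L, d))                   # one weight vector w_y per class
w = W.reshape(-1)                             # the stacked weight vector above

def Psi(y, x):
    # Joint feature map: x copied into the y-th block, zeros elsewhere.
    out = np.zeros(L * d)
    out[y*d:(y+1)*d] = x
    return out

x = rng.normal(size=d)
print(np.allclose([w @ Psi(y, x) for y in range(L)], W @ x))                 # True
print(int(np.argmax(W @ x)) == max(range(L), key=lambda y: w @ Psi(y, x)))   # True
```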

Page 15: Learning for Structured Prediction

Learning for Structured Prediction

Page 16: Perceptron Learning Algorithm

Perceptron Learning Algorithm (Linear Classification Model)

Training set: S = {(x_i, y_i)}_{i=1}^{N},   y ∈ {+1, -1}
Prediction: h(x) = sign(w^T x)

• w_1 = 0
• For t = 1, …   (go through the training set in arbitrary order, e.g., randomly)
  – Receive example (x, y)
  – If h(x | w_t) = y:
    • w_{t+1} = w_t
  – Else:
    • w_{t+1} = w_t + y x
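A runnable sketch of the algorithm on a made-up separable dataset (all names are illustrative; the constant last coordinate stands in for the bias term):

```python
import numpy as np

def perceptron(S, epochs=10, seed=0):
    """Binary perceptron: on a mistake, w <- w + y*x (labels y in {-1,+1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(len(S[0][0]))
    for _ in range(epochs):
        for i in rng.permutation(len(S)):      # arbitrary order, here random
            x, y = S[i]
            if np.sign(w @ x) != y:            # h(x|w) != y; sign(0) counts as a mistake
                w = w + y * x
    return w

# Toy separable data; the constant last coordinate plays the role of a bias.
S = [(np.array([2., 1., 1.]), 1), (np.array([-1., -2., 1.]), -1),
     (np.array([1., 2., 1.]), 1), (np.array([-2., -1., 1.]), -1)]
w = perceptron(S)
print([int(np.sign(w @ x)) for x, _ in S])     # [1, -1, 1, -1]
```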

Page 17: Structured Perceptron

Structured Perceptron (Linear Classification Model)

Training set: S = {(x_i, y_i)},   y_i structured
Prediction: h(x) = argmax_{y'} w^T Ψ(y', x)   ← the only thing that changes!

• w_1 = 0
• For t = 1, …   (go through the training set in arbitrary order, e.g., randomly)
  – Receive example (x, y)
  – If h(x | w_t) = y:
    • w_{t+1} = w_t
  – Else:
    • w_{t+1} = w_t + Ψ(y, x) - Ψ(h(x | w_t), x)

(The update adds the features of the true structure and subtracts those of the mistaken prediction, matching the update used in Collins' proof on the next page; in the binary case it reduces to w_t + y x.)
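A sketch of the structured version on the toy "Fish Sleep" problem, using brute-force enumeration in place of Viterbi for the argmax (all helper names and the toy data are mine):

```python
import numpy as np
from itertools import product

TAGS, WORDS, PREV = ['N', 'V'], ['Fish', 'Sleep'], ['Start', 'N', 'V']

def psi(y, x):
    """Psi(y,x) = sum_j phi_j(y^j, y^{j-1} | x): unary block first,
    pairwise-transition block second (orderings as on Pages 7-9)."""
    v = np.zeros(10)
    prev = 'Start'
    for tag, word in zip(y, x):
        v[:4] += [float(tag == t and word == w) for t in TAGS for w in WORDS]
        v[4:] += [float(tag == t and prev == p) for t in TAGS for p in PREV]
        prev = tag
    return v

def h(w, x):
    # argmax over all tag sequences; brute force here, Viterbi in practice.
    return max(product(TAGS, repeat=len(x)), key=lambda y: w @ psi(y, x))

def structured_perceptron(S, dim=10, epochs=5):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in S:
            y_hat = h(w, x)
            if y_hat != y:                      # mistake: move toward the truth,
                w += psi(y, x) - psi(y_hat, x)  # away from the wrong prediction
    return w

S = [(('Fish', 'Sleep'), ('N', 'V'))]
w = structured_perceptron(S)
print(h(w, ('Fish', 'Sleep')))                  # ('N', 'V')
```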

Page 18: Structured Perceptron

Structured Perceptron

"Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms"
Michael Collins, EMNLP 2002. http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf

NP Chunking Results:
  Method              F-Measure   Numits
  Perc, avg, cc=0     93.53       13
  Perc, noavg, cc=0   93.04       35
  Perc, avg, cc=5     93.33       9
  Perc, noavg, cc=5   91.88       39
  ME, cc=0            92.34       900
  ME, cc=5            92.65       200

POS Tagging Results:
  Method              Error rate/%   Numits
  Perc, avg, cc=0     2.93           10
  Perc, noavg, cc=0   3.68           20
  Perc, avg, cc=5     3.03           6
  Perc, noavg, cc=5   4.04           17
  ME, cc=0            3.4            100
  ME, cc=5            3.28           200

Figure 4: Results for various methods on the part-of-speech tagging and chunking tasks on development data. All scores are error percentages. Numits is the number of training iterations at which the best score is achieved. Perc is the perceptron algorithm, ME is the maximum-entropy method. Avg/noavg is the perceptron with or without averaged parameter vectors. cc=5 means only features occurring 5 times or more in training are included; cc=0 means all features in training are included.

4.3 Results

We applied both maximum-entropy models and the perceptron algorithm to the two tagging problems. We tested several variants for each algorithm on the development set, to gain some understanding of how the algorithms' performance varied with various parameter settings, and to allow optimization of free parameters so that the comparison on the final test set is a fair one. For both methods, we tried the algorithms with feature count cut-offs set at 0 and 5 (i.e., we ran experiments with all features in training data included, or with all features occurring 5 times or more included; (Ratnaparkhi 96) uses a count cut-off of 5). In the perceptron algorithm, the number of iterations T over the training set was varied, and the method was tested with both averaged and unaveraged parameter vectors (i.e., with α_s^{T,n} and γ_s^{T,n}, as defined in section 2.5, for a variety of values for T). In the maximum entropy model the number of iterations of training using Generalized Iterative Scaling was varied.

Figure 4 shows results on development data on the two tasks. The trends are fairly clear: averaging improves results significantly for the perceptron method, as does including all features rather than imposing a count cut-off of 5. In contrast, the ME models' performance suffers when all features are included. The best perceptron configuration gives improvements over the maximum-entropy models in both cases: an improvement in F-measure from 92.65% to 93.53% in chunking, and a reduction from 3.28% to 2.93% error rate in POS tagging. In looking at the results for different numbers of iterations on development data we found that averaging not only improves the best result, but also gives much greater stability of the tagger (the non-averaged variant has much greater variance in its scores).

As a final test, the perceptron and ME taggers were applied to the test sets, with the optimal parameter settings on development data. On POS tagging the perceptron algorithm gave 2.89% error compared to 3.28% error for the maximum-entropy model (an 11.9% relative reduction in error). In NP chunking the perceptron algorithm achieves an F-measure of 93.63%, in contrast to an F-measure of 93.29% for the ME model (a 5.1% relative reduction in error).

5 Proofs of the Theorems

This section gives proofs of theorems 1 and 2. The proofs are adapted from proofs for the classification case in (Freund & Schapire 99).

Proof of Theorem 1: Let ᾱ^k be the weights before the k'th mistake is made. It follows that ᾱ^1 = 0. Suppose the k'th mistake is made at the i'th example. Take z to be the output proposed at this example, z = argmax_{y ∈ GEN(x_i)} Φ(x_i, y) · ᾱ^k. It follows from the algorithm updates that ᾱ^{k+1} = ᾱ^k + Φ(x_i, y_i) - Φ(x_i, z). We take inner products of both sides with the vector U:

U · ᾱ^{k+1} = U · ᾱ^k + U · Φ(x_i, y_i) - U · Φ(x_i, z) ≥ U · ᾱ^k + δ

where the inequality follows because of the property of U assumed in Eq. 3. Because ᾱ^1 = 0, and therefore U · ᾱ^1 = 0, it follows by induction on k that for all k, U · ᾱ^{k+1} ≥ kδ. Because U · ᾱ^{k+1} ≤ ||U|| ||ᾱ^{k+1}||, it follows that ||ᾱ^{k+1}|| ≥ kδ.

We also derive an upper bound for ||ᾱ^{k+1}||^2:

||ᾱ^{k+1}||^2 = ||ᾱ^k||^2 + ||Φ(x_i, y_i) - Φ(x_i, z)||^2 + 2 ᾱ^k · (Φ(x_i, y_i) - Φ(x_i, z)) ≤ ||ᾱ^k||^2 + R^2

Page 19: Limitations of Perceptron

Limitations of Perceptron

• Not all mistakes are created equal
  – One POS tag wrong is as bad as five!
  – Even worse for more complicated problems

Page 20: Comparison

Comparison

Table 2: Results of various algorithms on the named entity recognition task.
  Method   HMM    CRF    Perceptron   SVM
  Error    9.36   5.17   5.94         5.08

Table 3: Results for various SVM formulations on the named entity recognition task (ε = 0.01, C = 1).
  Method    Train Err   Test Err   Const      Avg Loss
  SVM2      0.2±0.1     5.1±0.6    2824±106   1.02±0.01
  SVM△s2    0.4±0.4     5.1±0.8    2626±225   1.10±0.08
  SVM△m2    0.3±0.2     5.1±0.7    2628±119   1.17±0.12

The label set in this corpus consists of non-name and the beginning and continuation of person names, organizations, locations and miscellaneous names, resulting in a total of |Σ| = 9 different labels. In the setup followed in Altun et al. (2003), the joint feature map Ψ(x,y) is the histogram of state transitions plus a set of features describing the emissions. An adapted version of the Viterbi algorithm is used to solve the argmax in line 6. For both perceptron and SVM a second degree polynomial kernel was used.

The results given in Table 2 for the zero-one loss compare the generative HMM with conditional random fields (CRF) (Lafferty et al., 2001), Collins' perceptron and the SVM algorithm. All discriminative learning methods substantially outperform the standard HMM. In addition, the SVM performs slightly better than the perceptron and CRFs, demonstrating the benefit of a large-margin approach. Table 3 shows that all SVM formulations perform comparably, attributed to the fact that the vast majority of the support label sequences end up having Hamming distance 1 to the correct label sequence. Notice that for 0-1 loss functions all three SVM formulations are equivalent.

5.3 Sequence Alignment

To analyze the behavior of the algorithm for sequence alignment, we constructed a synthetic dataset according to the following sequence and local alignment model. The native sequence and the decoys are generated by drawing randomly from a 20 letter alphabet Σ = {1, …, 20} so that letter c ∈ Σ has probability c/210. Each sequence has length 50, and there are 10 decoys per native sequence. To generate the homologous sequence, we generate an alignment string of length 30 consisting of 4 characters "match", "substitute", "insert", "delete". For simplicity of illustration, substitutions are always c → (c mod 20) + 1. In the following experiments, matches occur with probability 0.2, substitutions with 0.4, insertion with 0.2, deletion with 0.2. The homologous sequence is created by applying the alignment string to a randomly selected substring of the native. The shortening of the sequences through insertions and deletions is padded by additional random characters.

We model this problem using local sequence alignment with the Smith-Waterman algorithm. Table 4 shows the test error rates (i.e., the percentage of times a decoy is selected instead of the homologous sequence) depending on the number of training examples. The results are averaged over 10 train/test samples. The model contains 400 parameters in the substitution matrix Π and a cost δ for "insert/delete". We train this model using the SVM2 and compare against a generative …

"Large Margin Methods for Structured and Interdependent Output Variables"
Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun. Journal of Machine Learning Research, Volume 6, Pages 1453-1484.

Page 21: Hamming Loss

Hamming Loss

• Hamming Loss:  ℓ(y, x, F) = \sum_{j=1}^{M} 1[ h(x)^j ≠ y^j ]

• True y = (D, N, V, D, N)
  – y' = (D, N, V, N, N)
  – y'' = (V, D, N, V, V)

y'' has much worse Hamming loss (loss of 5 vs. loss of 1).

But the Hamming loss is not continuous! We need to define a continuous surrogate, playing the role that hinge loss plays for the 0/1 loss.
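The loss itself is a one-liner; evaluating it on the example above reproduces the losses of 1 and 5:

```python
def hamming_loss(y_true, y_pred):
    # Number of positions where the predicted tag differs from the true tag.
    return sum(t != p for t, p in zip(y_true, y_pred))

y  = ('D', 'N', 'V', 'D', 'N')
y1 = ('D', 'N', 'V', 'N', 'N')
y2 = ('V', 'D', 'N', 'V', 'V')
print(hamming_loss(y, y1), hamming_loss(y, y2))   # 1 5
```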

Page 22: Original Hinge Loss (Support Vector Machine)

Original Hinge Loss (Support Vector Machine)

[Figure: loss plotted against f(x) for target y: the 0/1 loss is a step function, the hinge loss its continuous upper bound.]

argmin_{w,b,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i

s.t. ∀i: y_i (w^T x_i - b) ≥ 1 - ξ_i
     ∀i: ξ_i ≥ 0

ℓ(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)) = ξ_i

Page 23: Property of Hinge Loss

Property of Hinge Loss

argmin_{w,b,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i

s.t. ∀i: y_i (w^T x_i - b) ≥ 1 - ξ_i,   ∀i: ξ_i ≥ 0

ℓ(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)) = ξ_i

[Figure: the hinge loss lies on or above the 0/1 loss everywhere.]

Hinge loss = continuous upper bound on 0/1 loss:

h(x) = argmax_{y ∈ {-1,+1}} y f(x) = sign(f(x))   ⇒   ξ_i ≥ 1[ h(x_i) ≠ y_i ]

Page 24: Hamming Hinge Loss (Structural SVM)

Hamming Hinge Loss (Structural SVM)

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0
(sometimes normalized by M)

ℓ(y_i, x_i, F) = max_{y'} { \sum_j 1[ y'^j ≠ y_i^j ] - ( F(y_i, x_i) - F(y', x_i) ) } = ξ_i

Continuous upper bound on Hamming loss! For the learned predictor h(x) = argmax_y F(y, x):

F(y_i, x_i) - F(h(x_i), x_i) ≤ 0   ⇒   ξ_i ≥ \sum_j 1[ h(x_i)^j ≠ y_i^j ]

Page 25: Hamming Hinge Loss

Hamming Hinge Loss (normalized by sequence length M_i)

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ (1/M_i) \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0

Suppose for an incorrect y' = h(x_i):
[Figure: bar chart with Score(y_i) = 1.5, Score(y') = 1.6, Loss(y') = 0.5.]

Then: ξ_i = 0.6 ≥ 0.5 = (1/M_i) \sum_j 1[ h(x_i)^j ≠ y_i^j ]

Slack variable upper bounds Hamming loss!

Page 26: Structural SVM

Structural SVM

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i  ("slack"),   ∀i: ξ_i ≥ 0
(sometimes normalized by M)

Consider the prediction of the learned model, y' = argmax_y F(y, x):
• If y' = y_i:  ξ_i ≥ 0
• If y' ≠ y_i:  F(y_i, x_i) - F(y', x_i) ≤ 0  ⇒  ξ_i ≥ \sum_j 1[ y'^j ≠ y_i^j ]

Slack is a continuous upper bound on Hamming loss!

Page 27: Reduction to Independent Multiclass

Reduction to Independent Multiclass

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0

Suppose there are no pairwise features:

F(y, x) ≡ \sum_{j=1}^{M} w^T φ_j(y^j | x),   φ_j(y^j | x) = [ 1[y^j = 1] φ1(x^j) ; … ; 1[y^j = L] φ1(x^j) ]
(stack the features φ1(x^j) L times)

Then the constraints decompose into a multiclass hinge loss per token:

∀i, j, a:  w_{y_i^j}^T φ1(x^j) - w_a^T φ1(x^j) ≥ 1 - ξ_{ij}

Page 28: Example 1

Example 1

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  1            3                          2
  (V,V)  1            3                          1

ξ_i = 0

Page 29: Example 2

Example 2

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  4            -1                         1
  (N,V)  3            0                          0
  (V,N)  0            3                          2
  (V,V)  1            2                          1

ξ_i = 2

Page 30: Example 3

Example 3

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  3            1                          2
  (V,V)  1            3                          1

ξ_i = 1
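The three slacks can be verified by enumerating y' in the max-formulation from Page 24. A small sketch; the score dictionaries restate the three example tables:

```python
def slack(scores, y_true):
    """Optimal xi_i = max_{y'} [ Hamming(y', y_i) - (F(y_i, x_i) - F(y', x_i)) ],
    which is >= 0 because y' = y_true contributes 0."""
    f_true = scores[y_true]
    return max(sum(a != b for a, b in zip(y, y_true)) - (f_true - f)
               for y, f in scores.items())

y_true = ('N', 'V')
ex1 = {('N','N'): 2, ('N','V'): 4, ('V','N'): 1, ('V','V'): 1}   # Example 1
ex2 = {('N','N'): 4, ('N','V'): 3, ('V','N'): 0, ('V','V'): 1}   # Example 2
ex3 = {('N','N'): 2, ('N','V'): 4, ('V','N'): 3, ('V','V'): 1}   # Example 3
print(slack(ex1, y_true), slack(ex2, y_true), slack(ex3, y_true))  # 0 2 1
```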

Page 31: When is Slack Positive?

When is Slack Positive?

• Whenever the margin is not big enough!

ξ_i = max_{y'} { \sum_j 1[ y'^j ≠ y_i^j ] - ( F(y_i, x_i) - F(y', x_i) ) } = ℓ(y_i, x_i, F)

(Verify that the above definition is ≥ 0: taking y' = y_i makes the bracketed term 0.)

ξ_i > 0   ⟺   ∃ y': F(y_i, x_i) - F(y', x_i) < \sum_j 1[ y'^j ≠ y_i^j ]

Page 32: Structural SVM Geometric Interpretation

Structural SVM Geometric Interpretation

F(y, x) = w^T Ψ(y, x)

[Figure: each labeling y maps x to a high-dimensional point Ψ(y, x); w separates the correct labeling y_i from the alternatives y'.]

F(y_i, x_i) - F(y', x_i) ≤ 0   ⇒   ξ_i ≥ \sum_j 1[ y'^j ≠ y_i^j ]

Size of margin vs. size of margin violations (C controls the trade-off).
(Margin scaled by Hamming loss.)

Page 33: Structural SVM Training

Structural SVM Training

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0

• Strictly convex optimization problem
  – Same form as standard SVM optimization
  – Easy, right?

• Intractable number of constraints (often exponentially many)!

Page 34: Structural SVM Training

Structural SVM Training

• The trick is to not enumerate all constraints.

• Only solve the SVM objective over a small subset of constraints (the working set).
  – Efficient!

• But some constraints might be violated:

∀y': F(y_i, x_i) ≥ F(y', x_i) + \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i

Page 35: Example

Example

Constraints (over the working set only): ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

Working set solved over:
  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0

Full constraint set:
  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  3            1                          2
  (V,V)  1            3                          1

ξ_i = 0, yet the omitted (V,N) constraint is violated: its margin of 1 is less than its loss of 2.

Page 36: Approximate Hinge Loss

Approximate Hinge Loss

• Choose tolerance ε > 0:

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ∀i: ξ_i ≥ 0

Consider the prediction of the learned model, y' = argmax_y F(y, x):
• If y' = y_i:  ξ_i ≥ 0
• If y' ≠ y_i:  F(y_i, x_i) - F(y', x_i) ≤ 0  ⇒  ξ_i ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ε

Slack is a continuous upper bound on (Hamming loss - ε)!

Page 37: Example

Example (ε = 1)

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  3            1                          2
  (V,V)  1            3                          1

ξ_i = 0: with the constraints relaxed by ε = 1, even (V,N) is satisfied (1 ≥ 2 - 0 - 1).

Page 38: Structural SVM Training

Structural SVM Training

• STEP 0: Specify tolerance ε.

• STEP 1: Solve the SVM objective using only the working set of constraints W (initially empty). The trained model is w.

• STEP 2: Using w, find the y' whose constraint is most violated.

• STEP 3: If the constraint is violated by more than ε, add it to W.

• Repeat STEPs 1-3 until no additional constraints are added. Return the most recent model w trained in STEP 1.

*This is known as a "cutting plane" method.

Constraint violation formula:

(1/M_i) \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ( F(y_i, x_i) - F(y', x_i) ) ≥ ε
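A compact sketch of STEPs 1-3 on the toy tagging problem. One loud assumption: the restricted QP of STEP 1 is solved here only approximately, by subgradient descent, where a real implementation would call an exact QP solver; all helper names are mine:

```python
import numpy as np
from itertools import product

TAGS, WORDS, PREV = ['N', 'V'], ['Fish', 'Sleep'], ['Start', 'N', 'V']

def psi(y, x):
    # Joint feature map Psi(y,x) = sum_j [phi1_j ; phi2_j] for the toy tagger.
    v = np.zeros(10)
    prev = 'Start'
    for tag, word in zip(y, x):
        v[:4] += [float(tag == t and word == w) for t in TAGS for w in WORDS]
        v[4:] += [float(tag == t and prev == p) for t in TAGS for p in PREV]
        prev = tag
    return v

def hamming(yp, y):
    return float(sum(a != b for a, b in zip(yp, y)))

def solve_restricted(examples, W, dim, C, steps=2000, lr=0.005):
    # STEP 1 (approximation): subgradient descent on the objective restricted to
    # the working sets; a real implementation would use an exact QP solver here.
    w, N = np.zeros(dim), len(examples)
    for _ in range(steps):
        g = w.copy()                                # gradient of (1/2)||w||^2
        for i, (x, y) in enumerate(examples):
            Wi = sorted(W[i])
            if not Wi:
                continue
            terms = [hamming(yp, y) - w @ (psi(y, x) - psi(yp, x)) for yp in Wi]
            k = int(np.argmax(terms))
            if terms[k] > 0:                        # active hinge for example i
                g += (C / N) * (psi(Wi[k], x) - psi(y, x))
        w -= lr * g
    return w

def cutting_plane(examples, dim, C=10.0, eps=0.1, max_rounds=20):
    W = [set() for _ in examples]                   # working sets, initially empty
    w = np.zeros(dim)
    for _ in range(max_rounds):
        added = False
        for i, (x, y) in enumerate(examples):
            # STEP 2: most violated constraint via loss-augmented inference
            # (brute force here; Viterbi in general, see Page 50).
            y_star = max(product(TAGS, repeat=len(y)),
                         key=lambda yp: w @ psi(yp, x) + hamming(yp, y))
            xi = max([0.0] + [hamming(yp, y) - w @ (psi(y, x) - psi(yp, x))
                              for yp in W[i]])
            viol = hamming(y_star, y) - xi - w @ (psi(y, x) - psi(y_star, x))
            if viol > eps:                          # STEP 3
                W[i].add(y_star)
                added = True
        if not added:                               # nothing violated by more than eps
            return w
        w = solve_restricted(examples, W, dim, C)   # STEP 1 on the working sets
    return w

x, y = ('Fish', 'Sleep'), ('N', 'V')
w = cutting_plane([(x, y)], dim=10)
print(max(product(TAGS, repeat=2), key=lambda yp: w @ psi(yp, x)))  # expected: ('N', 'V')
```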

Page 39: Example

Example (choose ε = 0.1)

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y' ∈ W_i: F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ∀i: ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

Init: W_i = ∅.  Solve: ξ_i = 0 (no constraints yet, so all scores are 0).

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0            0                          1      1
  (N,V)  0            0                          0      0
  (V,N)  0            0                          2      2
  (V,V)  0            0                          1      1

Page 40: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

Update: W_i = {(V,N)} (the most violated constraint).  Solve: ξ_i = 0.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0            0                          1      1
  (N,V)  0            0                          0      0
  (V,N)  0            0                          2      2
  (V,V)  0            0                          1      1

Page 41: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

W_i = {(V,N)}.  Solve: ξ_i = 0.5.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0.7          0.2                        1      0.2
  (N,V)  0.9          0                          0      0
  (V,N)  -0.6         1.5                        2      0
  (V,V)  0            0.9                        1      0.4

Page 42: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

Update: W_i = {(V,N), (N,N)}.  ξ_i = 0.5.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0.7          0.2                        1      0.2
  (N,V)  0.9          0                          0      0
  (V,N)  -0.6         1.5                        2      0
  (V,V)  0            0.9                        1      0.4

Page 43: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

W_i = {(V,N), (N,N)}.  Solve: ξ_i = 0.55.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0.55         0.45                       1      0
  (N,V)  1            0                          0      0
  (V,N)  -0.65        1.65                       2      0
  (V,V)  -0.05        0.95                       1      0.05

No constraint is violated by more than ε.

Page 44: Geometric Example

Geometric Example

Naïve SVM Problem
• Exponential constraints
• Most are dominated by a small set of "important" constraints

Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.

*This is known as a "cutting plane" method.

Page 45: Geometric Example
(Same text as Page 44; the figure shows the next cutting-plane iteration.)

Page 46: Geometric Example
(Same text as Page 44; the figure shows a further iteration.)

Page 47: Geometric Example
(Same text as Page 44; the figure shows the final approximation.)

Page 48: Linear Convergence Rate

Linear Convergence Rate

• Guarantee for any ε > 0:

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ∀i: ξ_i ≥ 0

• Terminates after O(1/ε) iterations.

Proof found in: http://www.cs.cornell.edu/people/tj/publications/joachims_etal_09a.pdf

Page 49: Finding Most Violated Constraint

Finding Most Violated Constraint

• A constraint is violated when:

F(y', x_i) - F(y_i, x_i) + \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i > 0

• Finding the most violated constraint reduces to:

argmax_{y'} F(y', x_i) + \sum_j 1[ y'^j ≠ y_i^j ]    ("loss-augmented inference")

• Highly related to prediction:

argmax_y F(y, x_i)

Page 50: "Augmented" Scoring Function

"Augmented" Scoring Function

Goal: argmax_{y'} F(y', x_i) + \sum_j 1[ y'^j ≠ y_i^j ],   where F(y, x_i) ≡ \sum_{j=1}^{M} w^T φ_j(y^j, y^{j-1} | x_i)

Define an augmented score with one additional unary feature:

F̃(y, x_i, y_i) ≡ \sum_{j=1}^{M} w̃^T φ̃_j(y^j, y^{j-1} | x_i, y_i)

φ̃_j(a, b | x_i, y_i) = [ φ_j(a, b | x_i) ; 1[a ≠ y_i^j] ],   w̃ = [ w ; 1 ]

The goal becomes argmax_{y'} F̃(y', x_i, y_i). Solve using Viterbi!
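A sketch of loss-augmented Viterbi: the Hamming term enters as an extra unary potential, exactly as in the augmented feature map above (helper names are mine; the tables are the toy model from Page 6):

```python
def loss_augmented_viterbi(words, y_true, trans, emit, tags):
    """Viterbi over the augmented score: the Hamming term 1[y_j != y_true_j]
    is folded into the unary potential of each position."""
    def unary(j, t):
        return emit[(t, words[j])] + (1.0 if t != y_true[j] else 0.0)
    delta = {t: trans[('Start', t)] + unary(0, t) for t in tags}
    backptrs = []
    for j in range(1, len(words)):
        new_delta, ptr = {}, {}
        for t in tags:
            p = max(tags, key=lambda q: delta[q] + trans[(q, t)])
            new_delta[t] = delta[p] + trans[(p, t)] + unary(j, t)
            ptr[t] = p
        delta = new_delta
        backptrs.append(ptr)
    tag = max(tags, key=lambda t: delta[t])
    path = [tag]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[tag]

# With the Page 6 model and y_i = (N, V), the most violated constraint is (V, N):
trans = {('Start','N'): 1, ('Start','V'): -1, ('N','N'): -2, ('N','V'): 1,
         ('V','N'): 2, ('V','V'): -2}
emit  = {('N','Fish'): 2, ('V','Fish'): 1, ('N','Sleep'): 1, ('V','Sleep'): 0}
print(loss_augmented_viterbi(['Fish', 'Sleep'], ('N', 'V'), trans, emit, ['N', 'V']))
# (['V', 'N'], 5.0): F(y', x) = 3 plus Hamming loss 2
```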

Page 51: Structural SVM Recipe

Structural SVM Recipe

• Feature map: Ψ(y, x)

• Inference: h(x) = argmax_y F(y, x) ≡ w^T Ψ(y, x)

• Loss function: Δ_i(y)

• Loss-augmented inference: argmax_y w^T Ψ(y, x) + Δ_i(y)   (most violated constraint)