Structured Perceptrons & Structural SVMs
CS 159: Advanced Topics in Machine Learning
4/6/2017

Transcript
Page 1: Structured Perceptrons & Structural SVMs

Structured Perceptrons & Structural SVMs

CS 159: Advanced Topics in Machine Learning

4/6/2017

Page 2: Recall: Sequence Prediction

Recall: Sequence Prediction

• Input: x = (x_1, …, x_M)
• Predict: y = (y_1, …, y_M)
  – Each y_i is one of L labels.

• x = "Fish Sleep" → y = (N, V)
• x = "The Dog Ate My Homework" → y = (D, N, V, D, N)
• x = "The Fox Jumped Over The Fence" → y = (D, N, V, P, D, N)

POS Tags: Det, Noun, Verb, Adj, Adv, Prep (L = 6)

Page 3: Recall: 1st Order HMM

Recall: 1st Order HMM

• x = (x_1, x_2, x_3, x_4, …, x_M)  (sequence of words)
• y = (y_1, y_2, y_3, y_4, …, y_M)  (sequence of POS tags)

• P(x_j | y_j): probability of state y_j generating x_j
• P(y_{j+1} | y_j): probability of state y_j transitioning to y_{j+1}
• P(y_1 | y_0): y_0 is defined to be the Start state
• P(End | y_M): prior probability of y_M being the final state
  – Not always used

Page 4: HMM Graphical Model Representation

HMM Graphical Model Representation

[Figure: chain-structured graphical model Y_0 → Y_1 → Y_2 → … → Y_M → Y_End, where each state Y_i emits an observation X_i; the Start node Y_0 and End node Y_End are optional.]

P(x, y) = P(End | y_M) \prod_{i=1}^{M} P(y_i | y_{i-1}) \prod_{i=1}^{M} P(x_i | y_i)

(The P(End | y_M) factor is optional.)

Page 5: Most Common Prediction Problem

Most Common Prediction Problem

• Given an input sentence, predict the POS tag sequence.

• Solve using Viterbi
  – Special case of the max-product algorithm

h(x) = argmax_y P(y | x) = argmax_y log P(y | x)

log P(y | x) = \sum_j [ log P(x_j | y_j) + log P(y_j | y_{j-1}) ]

Page 6: Simple Example

Simple Example

• x = "Fish Sleep"
• y = (N, V)

F(y, x) ≡ log P(y | x) = \sum_j [ log P(x_j | y_j) + log P(y_j | y_{j-1}) ]

Transition log-probabilities, log P(y_j | y_{j-1}):

               P(N|·)   P(V|·)
  P(·|N)         -2        1
  P(·|V)          2       -2
  P(·|Start)      1       -1

Emission log-probabilities, log P(x_j | y_j):

               P(·|N)   P(·|V)
  P(Fish|·)       2        1
  P(Sleep|·)      1        0
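To make the Viterbi step from Page 5 concrete, here is a minimal Python sketch (names such as `trans`, `emit`, and `viterbi` are mine) that decodes this example with the max-product recursion in log space; it recovers y = (N, V) with score 4:

```python
# Log-probabilities transcribed from the two tables above.
trans = {('Start','N'): 1, ('Start','V'): -1,
         ('N','N'): -2,    ('N','V'): 1,
         ('V','N'): 2,     ('V','V'): -2}
emit  = {('N','Fish'): 2, ('V','Fish'): 1, ('N','Sleep'): 1, ('V','Sleep'): 0}
TAGS = ['N', 'V']

def viterbi(words):
    """Max-product in log space: argmax_y sum_j [log P(x_j|y_j) + log P(y_j|y_{j-1})]."""
    # delta[t] = best score of any tag prefix ending in tag t.
    delta = {t: trans[('Start', t)] + emit[(t, words[0])] for t in TAGS}
    backptrs = []
    for word in words[1:]:
        new_delta, ptr = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: delta[p] + trans[(p, t)])
            new_delta[t] = delta[best_prev] + trans[(best_prev, t)] + emit[(t, word)]
            ptr[t] = best_prev
        delta = new_delta
        backptrs.append(ptr)
    # Backtrace from the best final tag.
    tag = max(TAGS, key=lambda t: delta[t])
    path = [tag]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[tag]

print(viterbi(['Fish', 'Sleep']))   # (['N', 'V'], 4)
```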

Page 7: New Notation

New Notation

• "Unary Features"
• "Pairwise Transition Features"

F(y, x) ≡ \sum_{j=1}^{M} w^T φ_j(y^j, y^{j-1} | x)

φ_j(a, b | x) = [ φ1_j(a | x) ; φ2(a, b) ],   w = [ w1 ; w2 ]

φ1_j(a | x) = [ 1[(a = Noun) ∧ (x^j = 'Fish')],
                1[(a = Noun) ∧ (x^j = 'Sleep')],
                1[(a = Verb) ∧ (x^j = 'Fish')],
                1[(a = Verb) ∧ (x^j = 'Sleep')] ]^T

φ2(a, b) = [ 1[(a = Noun) ∧ (b = Start)],
             1[(a = Noun) ∧ (b = Noun)],
             1[(a = Noun) ∧ (b = Verb)],
             1[(a = Verb) ∧ (b = Start)],
             1[(a = Verb) ∧ (b = Noun)],
             1[(a = Verb) ∧ (b = Verb)] ]^T

Page 8: New Notation

New Notation: duplicate the word features for each label.

φ1_j(a | x) as on Page 7: the first two entries are the Noun class features, the last two the Verb class features. Evaluated on x = "Fish Sleep":

φ1_1(Noun | "Fish Sleep") = [1, 0, 0, 0]^T
φ1_2(Noun | "Fish Sleep") = [0, 1, 0, 0]^T
φ1_1(Verb | "Fish Sleep") = [0, 0, 1, 0]^T
φ1_2(Verb | "Fish Sleep") = [0, 0, 0, 1]^T

In general, stack a base feature vector φ1(x^j) once per label:

φ1_j(a | x) = [ 1[a = 1] φ1(x^j) ; … ; 1[a = L] φ1(x^j) ]

Page 9: New Notation

New Notation: one feature for every transition.

φ2(a, b) as on Page 7. Evaluated:

φ2(Noun, Start) = [1, 0, 0, 0, 0, 0]^T
φ2(Verb, Start) = [0, 0, 0, 1, 0, 0]^T
φ2(Verb, Noun)  = [0, 0, 0, 0, 1, 0]^T

Page 10: New Notation

F(y, x) ≡ \sum_{j=1}^{M} w^T φ_j(y^j, y^{j-1} | x),   φ_j(a, b | x) = [ φ1_j(a | x) ; φ2_j(a, b) ],   w = [ w1 ; w2 ]

Reading the weights off the old notation (the log-probability tables of Page 6):

w1 = [2, 1, 1, 0]^T
  (old notation: log P(Fish|N) = 2, log P(Sleep|N) = 1, log P(Fish|V) = 1, log P(Sleep|V) = 0)

w2 = [1, -2, 2, -1, 1, -2]^T
  (old notation: log P(N|Start) = 1, log P(N|N) = -2, log P(N|V) = 2, log P(V|Start) = -1, log P(V|N) = 1, log P(V|V) = -2)

Page 11: Recap: 1st Order Sequential Model

Recap: 1st Order Sequential Model

• Input: x = (x_1, …, x_M)
• Predict: y = (y_1, …, y_M)
  – Each y_i is one of L labels.  (POS Tags: Det, Noun, Verb, Adj, Adv, Prep; L = 6)

• Linear model w.r.t. the pairwise features φ_j(a, b | x)
• Prediction via maximizing F:

h(x) = argmax_y F(y, x) = argmax_y w^T Ψ(y, x)

where Ψ(y, x) = \sum_j φ_j(y^j, y^{j-1} | x) encodes the structure.

Page 12: Example

x = "Fish Sleep", y = (N, V)
(w1, w2, φ1_j, φ2 as defined on Pages 7-10.)

Prediction: argmax_y F(y, x)

F(y = (N,V), x = "Fish Sleep")
  = w1^T φ1_1(N, x) + w2^T φ2(N, Start) + w1^T φ1_2(V, x) + w2^T φ2(V, N)
  = w_{1,1} + w_{2,1} + w_{1,4} + w_{2,5}
  = 2 + 1 + 0 + 1 = 4

  y      F(y, x)
  (N,N)  2 + 1 + 1 - 2 = 2
  (N,V)  2 + 1 + 0 + 1 = 4
  (V,N)  1 - 1 + 1 + 2 = 3
  (V,V)  1 - 1 + 0 - 2 = -2
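The same table can be reproduced from the feature-vector notation. A minimal sketch, assuming the φ1/φ2 orderings and the w1, w2 values of Pages 7-10 (the helper names are mine):

```python
import numpy as np

TAGS, WORDS, PREV = ['Noun', 'Verb'], ['Fish', 'Sleep'], ['Start', 'Noun', 'Verb']
w1 = np.array([2., 1., 1., 0.])               # Page 10: unary weights
w2 = np.array([1., -2., 2., -1., 1., -2.])    # Page 10: transition weights

def phi1(a, word):
    # Unary features, ordered (Noun,Fish), (Noun,Sleep), (Verb,Fish), (Verb,Sleep).
    return np.array([float(a == t and word == v) for t in TAGS for v in WORDS])

def phi2(a, b):
    # Transition features, ordered (Noun|Start), (Noun|Noun), ..., (Verb|Verb).
    return np.array([float(a == t and b == p) for t in TAGS for p in PREV])

def F(y, x):
    # F(y, x) = sum_j [ w1^T phi1_j(y^j | x) + w2^T phi2(y^j, y^{j-1}) ]
    score, prev = 0.0, 'Start'
    for tag, word in zip(y, x):
        score += w1 @ phi1(tag, word) + w2 @ phi2(tag, prev)
        prev = tag
    return score

x = ('Fish', 'Sleep')
for y in [('Noun','Noun'), ('Noun','Verb'), ('Verb','Noun'), ('Verb','Verb')]:
    print(y, F(y, x))   # 2.0, 4.0, 3.0, -2.0: matches the table above
```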

Page 13: Why New Notation?

Why New Notation?

• Easier to reason about:
  – Computing predictions
  – Learning (linear model!)
  – Extensions (just generalize φ)

φ_j(a, b | x) = [ φ1_j(a | x) ; φ2(a, b) ],   w = [ w1 ; w2 ]   (φ1_j and φ2 as on Page 7)

Page 14: Generalizes Multiclass

Generalizes Multiclass

• Stack weight vectors for each class:

w = [ w_1 ; w_2 ; … ; w_L ],   Ψ(y, x) = [ 1[y = 1] x ; 1[y = 2] x ; … ; 1[y = L] x ]

F(y, x) ≡ w^T Ψ(y, x)

h(x) = argmax_y w^T Ψ(y, x) = argmax_y w_y^T x
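A quick numerical check of this equivalence is below; the sizes and random weights are illustrative only, and `Psi` mirrors the stacked feature map above:

```python
import numpy as np

L, d = 3, 4                                   # toy sizes: L classes, d features
rng = np.random.default_rng(0)
W = rng.normal(size=(L, d))                   # one weight vector w_y per class
w = W.reshape(-1)                             # the stacked weight vector above

def Psi(y, x):
    # Joint feature map: x copied into the y-th block, zeros elsewhere.
    out = np.zeros(L * d)
    out[y*d:(y+1)*d] = x
    return out

x = rng.normal(size=d)
print(np.allclose([w @ Psi(y, x) for y in range(L)], W @ x))                 # True
print(int(np.argmax(W @ x)) == max(range(L), key=lambda y: w @ Psi(y, x)))   # True
```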

Page 15: Learning for Structured Prediction

Learning for Structured Prediction

Page 16: Perceptron Learning Algorithm

Perceptron Learning Algorithm (Linear Classification Model)

Training set: S = {(x_i, y_i)}_{i=1}^{N},   y ∈ {+1, -1}
Prediction: h(x) = sign(w^T x)

• w_1 = 0
• For t = 1, …   (go through the training set in arbitrary order, e.g., randomly)
  – Receive example (x, y)
  – If h(x | w_t) = y:
    • w_{t+1} = w_t
  – Else:
    • w_{t+1} = w_t + y x
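A runnable sketch of the algorithm on a made-up separable dataset (all names are illustrative; the constant last coordinate stands in for the bias term):

```python
import numpy as np

def perceptron(S, epochs=10, seed=0):
    """Binary perceptron: on a mistake, w <- w + y*x (labels y in {-1,+1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(len(S[0][0]))
    for _ in range(epochs):
        for i in rng.permutation(len(S)):      # arbitrary order, here random
            x, y = S[i]
            if np.sign(w @ x) != y:            # h(x|w) != y; sign(0) counts as a mistake
                w = w + y * x
    return w

# Toy separable data; the constant last coordinate plays the role of a bias.
S = [(np.array([2., 1., 1.]), 1), (np.array([-1., -2., 1.]), -1),
     (np.array([1., 2., 1.]), 1), (np.array([-2., -1., 1.]), -1)]
w = perceptron(S)
print([int(np.sign(w @ x)) for x, _ in S])     # [1, -1, 1, -1]
```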

Page 17: Structured Perceptron

Structured Perceptron (Linear Classification Model)

Training set: S = {(x_i, y_i)},   y_i structured
Prediction: h(x) = argmax_{y'} w^T Ψ(y', x)   ← the only thing that changes!

• w_1 = 0
• For t = 1, …   (go through the training set in arbitrary order, e.g., randomly)
  – Receive example (x, y)
  – If h(x | w_t) = y:
    • w_{t+1} = w_t
  – Else:
    • w_{t+1} = w_t + Ψ(y, x) - Ψ(h(x | w_t), x)

(The update adds the features of the true structure and subtracts those of the mistaken prediction, matching the update used in Collins' proof on the next page; in the binary case it reduces to w_t + y x.)
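A sketch of the structured version on the toy "Fish Sleep" problem, using brute-force enumeration in place of Viterbi for the argmax (all helper names and the toy data are mine):

```python
import numpy as np
from itertools import product

TAGS, WORDS, PREV = ['N', 'V'], ['Fish', 'Sleep'], ['Start', 'N', 'V']

def psi(y, x):
    """Psi(y,x) = sum_j phi_j(y^j, y^{j-1} | x): unary block first,
    pairwise-transition block second (orderings as on Pages 7-9)."""
    v = np.zeros(10)
    prev = 'Start'
    for tag, word in zip(y, x):
        v[:4] += [float(tag == t and word == w) for t in TAGS for w in WORDS]
        v[4:] += [float(tag == t and prev == p) for t in TAGS for p in PREV]
        prev = tag
    return v

def h(w, x):
    # argmax over all tag sequences; brute force here, Viterbi in practice.
    return max(product(TAGS, repeat=len(x)), key=lambda y: w @ psi(y, x))

def structured_perceptron(S, dim=10, epochs=5):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in S:
            y_hat = h(w, x)
            if y_hat != y:                      # mistake: move toward the truth,
                w += psi(y, x) - psi(y_hat, x)  # away from the wrong prediction
    return w

S = [(('Fish', 'Sleep'), ('N', 'V'))]
w = structured_perceptron(S)
print(h(w, ('Fish', 'Sleep')))                  # ('N', 'V')
```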

Page 18: Structured Perceptron

Structured Perceptron

"Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms"
Michael Collins, EMNLP 2002. http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf

NP Chunking Results:
  Method              F-Measure   Numits
  Perc, avg, cc=0     93.53       13
  Perc, noavg, cc=0   93.04       35
  Perc, avg, cc=5     93.33       9
  Perc, noavg, cc=5   91.88       39
  ME, cc=0            92.34       900
  ME, cc=5            92.65       200

POS Tagging Results:
  Method              Error rate/%   Numits
  Perc, avg, cc=0     2.93           10
  Perc, noavg, cc=0   3.68           20
  Perc, avg, cc=5     3.03           6
  Perc, noavg, cc=5   4.04           17
  ME, cc=0            3.4            100
  ME, cc=5            3.28           200

Figure 4: Results for various methods on the part-of-speech tagging and chunking tasks on development data. All scores are error percentages. Numits is the number of training iterations at which the best score is achieved. Perc is the perceptron algorithm, ME is the maximum-entropy method. Avg/noavg is the perceptron with or without averaged parameter vectors. cc=5 means only features occurring 5 times or more in training are included; cc=0 means all features in training are included.

4.3 Results

We applied both maximum-entropy models and the perceptron algorithm to the two tagging problems. We tested several variants for each algorithm on the development set, to gain some understanding of how the algorithms' performance varied with various parameter settings, and to allow optimization of free parameters so that the comparison on the final test set is a fair one. For both methods, we tried the algorithms with feature count cut-offs set at 0 and 5 (i.e., we ran experiments with all features in training data included, or with all features occurring 5 times or more included; (Ratnaparkhi 96) uses a count cut-off of 5). In the perceptron algorithm, the number of iterations T over the training set was varied, and the method was tested with both averaged and unaveraged parameter vectors (i.e., with α_s^{T,n} and γ_s^{T,n}, as defined in section 2.5, for a variety of values for T). In the maximum entropy model the number of iterations of training using Generalized Iterative Scaling was varied.

Figure 4 shows results on development data on the two tasks. The trends are fairly clear: averaging improves results significantly for the perceptron method, as does including all features rather than imposing a count cut-off of 5. In contrast, the ME models' performance suffers when all features are included. The best perceptron configuration gives improvements over the maximum-entropy models in both cases: an improvement in F-measure from 92.65% to 93.53% in chunking, and a reduction from 3.28% to 2.93% error rate in POS tagging. In looking at the results for different numbers of iterations on development data we found that averaging not only improves the best result, but also gives much greater stability of the tagger (the non-averaged variant has much greater variance in its scores).

As a final test, the perceptron and ME taggers were applied to the test sets, with the optimal parameter settings on development data. On POS tagging the perceptron algorithm gave 2.89% error compared to 3.28% error for the maximum-entropy model (an 11.9% relative reduction in error). In NP chunking the perceptron algorithm achieves an F-measure of 93.63%, in contrast to an F-measure of 93.29% for the ME model (a 5.1% relative reduction in error).

5 Proofs of the Theorems

This section gives proofs of theorems 1 and 2. The proofs are adapted from proofs for the classification case in (Freund & Schapire 99).

Proof of Theorem 1: Let ᾱ^k be the weights before the k'th mistake is made. It follows that ᾱ^1 = 0. Suppose the k'th mistake is made at the i'th example. Take z to be the output proposed at this example, z = argmax_{y ∈ GEN(x_i)} Φ(x_i, y) · ᾱ^k. It follows from the algorithm updates that ᾱ^{k+1} = ᾱ^k + Φ(x_i, y_i) - Φ(x_i, z). We take inner products of both sides with the vector U:

U · ᾱ^{k+1} = U · ᾱ^k + U · Φ(x_i, y_i) - U · Φ(x_i, z) ≥ U · ᾱ^k + δ

where the inequality follows because of the property of U assumed in Eq. 3. Because ᾱ^1 = 0, and therefore U · ᾱ^1 = 0, it follows by induction on k that for all k, U · ᾱ^{k+1} ≥ kδ. Because U · ᾱ^{k+1} ≤ ||U|| ||ᾱ^{k+1}||, it follows that ||ᾱ^{k+1}|| ≥ kδ.

We also derive an upper bound for ||ᾱ^{k+1}||^2:

||ᾱ^{k+1}||^2 = ||ᾱ^k||^2 + ||Φ(x_i, y_i) - Φ(x_i, z)||^2 + 2 ᾱ^k · (Φ(x_i, y_i) - Φ(x_i, z)) ≤ ||ᾱ^k||^2 + R^2

Page 19: Limitations of Perceptron

Limitations of Perceptron

• Not all mistakes are created equal
  – One POS tag wrong is as bad as five!
  – Even worse for more complicated problems

Page 20: Comparison

Comparison

Table 2: Results of various algorithms on the named entity recognition task.
  Method   HMM    CRF    Perceptron   SVM
  Error    9.36   5.17   5.94         5.08

Table 3: Results for various SVM formulations on the named entity recognition task (ε = 0.01, C = 1).
  Method    Train Err   Test Err   Const      Avg Loss
  SVM2      0.2±0.1     5.1±0.6    2824±106   1.02±0.01
  SVM△s2    0.4±0.4     5.1±0.8    2626±225   1.10±0.08
  SVM△m2    0.3±0.2     5.1±0.7    2628±119   1.17±0.12

The label set in this corpus consists of non-name and the beginning and continuation of person names, organizations, locations and miscellaneous names, resulting in a total of |Σ| = 9 different labels. In the setup followed in Altun et al. (2003), the joint feature map Ψ(x,y) is the histogram of state transitions plus a set of features describing the emissions. An adapted version of the Viterbi algorithm is used to solve the argmax in line 6. For both perceptron and SVM a second degree polynomial kernel was used.

The results given in Table 2 for the zero-one loss compare the generative HMM with conditional random fields (CRF) (Lafferty et al., 2001), Collins' perceptron and the SVM algorithm. All discriminative learning methods substantially outperform the standard HMM. In addition, the SVM performs slightly better than the perceptron and CRFs, demonstrating the benefit of a large-margin approach. Table 3 shows that all SVM formulations perform comparably, attributed to the fact that the vast majority of the support label sequences end up having Hamming distance 1 to the correct label sequence. Notice that for 0-1 loss functions all three SVM formulations are equivalent.

5.3 Sequence Alignment

To analyze the behavior of the algorithm for sequence alignment, we constructed a synthetic dataset according to the following sequence and local alignment model. The native sequence and the decoys are generated by drawing randomly from a 20 letter alphabet Σ = {1, …, 20} so that letter c ∈ Σ has probability c/210. Each sequence has length 50, and there are 10 decoys per native sequence. To generate the homologous sequence, we generate an alignment string of length 30 consisting of 4 characters "match", "substitute", "insert", "delete". For simplicity of illustration, substitutions are always c → (c mod 20) + 1. In the following experiments, matches occur with probability 0.2, substitutions with 0.4, insertion with 0.2, deletion with 0.2. The homologous sequence is created by applying the alignment string to a randomly selected substring of the native. The shortening of the sequences through insertions and deletions is padded by additional random characters.

We model this problem using local sequence alignment with the Smith-Waterman algorithm. Table 4 shows the test error rates (i.e., the percentage of times a decoy is selected instead of the homologous sequence) depending on the number of training examples. The results are averaged over 10 train/test samples. The model contains 400 parameters in the substitution matrix Π and a cost δ for "insert/delete". We train this model using the SVM2 and compare against a generative …

"Large Margin Methods for Structured and Interdependent Output Variables"
Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, Yasemin Altun. Journal of Machine Learning Research, Volume 6, Pages 1453-1484.

Page 21: Hamming Loss

Hamming Loss

• Hamming Loss:  ℓ(y, x, F) = \sum_{j=1}^{M} 1[ h(x)^j ≠ y^j ]

• True y = (D, N, V, D, N)
  – y' = (D, N, V, N, N)
  – y'' = (V, D, N, V, V)

y'' has much worse Hamming loss (loss of 5 vs. loss of 1).

But the Hamming loss is not continuous! We need to define a continuous surrogate, playing the role that hinge loss plays for the 0/1 loss.
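The loss itself is a one-liner; evaluating it on the example above reproduces the losses of 1 and 5:

```python
def hamming_loss(y_true, y_pred):
    # Number of positions where the predicted tag differs from the true tag.
    return sum(t != p for t, p in zip(y_true, y_pred))

y  = ('D', 'N', 'V', 'D', 'N')
y1 = ('D', 'N', 'V', 'N', 'N')
y2 = ('V', 'D', 'N', 'V', 'V')
print(hamming_loss(y, y1), hamming_loss(y, y2))   # 1 5
```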

Page 22: Original Hinge Loss (Support Vector Machine)

Original Hinge Loss (Support Vector Machine)

[Figure: loss plotted against f(x) for target y: the 0/1 loss is a step function, the hinge loss its continuous upper bound.]

argmin_{w,b,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i

s.t. ∀i: y_i (w^T x_i - b) ≥ 1 - ξ_i
     ∀i: ξ_i ≥ 0

ℓ(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)) = ξ_i

Page 23: Property of Hinge Loss

Property of Hinge Loss

argmin_{w,b,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i

s.t. ∀i: y_i (w^T x_i - b) ≥ 1 - ξ_i,   ∀i: ξ_i ≥ 0

ℓ(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)) = ξ_i

[Figure: the hinge loss lies on or above the 0/1 loss everywhere.]

Hinge loss = continuous upper bound on 0/1 loss:

h(x) = argmax_{y ∈ {-1,+1}} y f(x) = sign(f(x))   ⇒   ξ_i ≥ 1[ h(x_i) ≠ y_i ]

Page 24: Hamming Hinge Loss (Structural SVM)

Hamming Hinge Loss (Structural SVM)

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0
(sometimes normalized by M)

ℓ(y_i, x_i, F) = max_{y'} { \sum_j 1[ y'^j ≠ y_i^j ] - ( F(y_i, x_i) - F(y', x_i) ) } = ξ_i

Continuous upper bound on Hamming loss! For the learned predictor h(x) = argmax_y F(y, x):

F(y_i, x_i) - F(h(x_i), x_i) ≤ 0   ⇒   ξ_i ≥ \sum_j 1[ h(x_i)^j ≠ y_i^j ]

Page 25: Hamming Hinge Loss

Hamming Hinge Loss (normalized by sequence length M_i)

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ (1/M_i) \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0

Suppose for an incorrect y' = h(x_i):
[Figure: bar chart with Score(y_i) = 1.5, Score(y') = 1.6, Loss(y') = 0.5.]

Then: ξ_i = 0.6 ≥ 0.5 = (1/M_i) \sum_j 1[ h(x_i)^j ≠ y_i^j ]

Slack variable upper bounds Hamming loss!

Page 26: Structural SVM

Structural SVM

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i  ("slack"),   ∀i: ξ_i ≥ 0
(sometimes normalized by M)

Consider the prediction of the learned model, y' = argmax_y F(y, x):
• If y' = y_i:  ξ_i ≥ 0
• If y' ≠ y_i:  F(y_i, x_i) - F(y', x_i) ≤ 0  ⇒  ξ_i ≥ \sum_j 1[ y'^j ≠ y_i^j ]

Slack is a continuous upper bound on Hamming loss!

Page 27: Reduction to Independent Multiclass

Reduction to Independent Multiclass

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0

Suppose there are no pairwise features:

F(y, x) ≡ \sum_{j=1}^{M} w^T φ_j(y^j | x),   φ_j(y^j | x) = [ 1[y^j = 1] φ1(x^j) ; … ; 1[y^j = L] φ1(x^j) ]
(stack the features φ1(x^j) L times)

Then the constraints decompose into a multiclass hinge loss per token:

∀i, j, a:  w_{y_i^j}^T φ1(x^j) - w_a^T φ1(x^j) ≥ 1 - ξ_{ij}

Page 28: Example 1

Example 1

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  1            3                          2
  (V,V)  1            3                          1

ξ_i = 0

Page 29: Example 2

Example 2

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  4            -1                         1
  (N,V)  3            0                          0
  (V,N)  0            3                          2
  (V,V)  1            2                          1

ξ_i = 2

Page 30: Example 3

Example 3

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  3            1                          2
  (V,V)  1            3                          1

ξ_i = 1
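The three slacks can be verified by enumerating y' in the max-formulation from Page 24. A small sketch; the score dictionaries restate the three example tables:

```python
def slack(scores, y_true):
    """Optimal xi_i = max_{y'} [ Hamming(y', y_i) - (F(y_i, x_i) - F(y', x_i)) ],
    which is >= 0 because y' = y_true contributes 0."""
    f_true = scores[y_true]
    return max(sum(a != b for a, b in zip(y, y_true)) - (f_true - f)
               for y, f in scores.items())

y_true = ('N', 'V')
ex1 = {('N','N'): 2, ('N','V'): 4, ('V','N'): 1, ('V','V'): 1}   # Example 1
ex2 = {('N','N'): 4, ('N','V'): 3, ('V','N'): 0, ('V','V'): 1}   # Example 2
ex3 = {('N','N'): 2, ('N','V'): 4, ('V','N'): 3, ('V','V'): 1}   # Example 3
print(slack(ex1, y_true), slack(ex2, y_true), slack(ex3, y_true))  # 0 2 1
```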

Page 31: When is Slack Positive?

When is Slack Positive?

• Whenever the margin is not big enough!

ξ_i = max_{y'} { \sum_j 1[ y'^j ≠ y_i^j ] - ( F(y_i, x_i) - F(y', x_i) ) } = ℓ(y_i, x_i, F)

(Verify that the above definition is ≥ 0: taking y' = y_i makes the bracketed term 0.)

ξ_i > 0   ⟺   ∃ y': F(y_i, x_i) - F(y', x_i) < \sum_j 1[ y'^j ≠ y_i^j ]

Page 32: Structural SVM Geometric Interpretation

Structural SVM Geometric Interpretation

F(y, x) = w^T Ψ(y, x)

[Figure: each labeling y maps x to a high-dimensional point Ψ(y, x); w separates the correct labeling y_i from the alternatives y'.]

F(y_i, x_i) - F(y', x_i) ≤ 0   ⇒   ξ_i ≥ \sum_j 1[ y'^j ≠ y_i^j ]

Size of margin vs. size of margin violations (C controls the trade-off).
(Margin scaled by Hamming loss.)

Page 33: Structural SVM Training

Structural SVM Training

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ∀i: ξ_i ≥ 0

• Strictly convex optimization problem
  – Same form as standard SVM optimization
  – Easy, right?

• Intractable number of constraints (often exponentially many)!

Page 34: Structural SVM Training

Structural SVM Training

• The trick is to not enumerate all constraints.

• Only solve the SVM objective over a small subset of constraints (the working set).
  – Efficient!

• But some constraints might be violated:

∀y': F(y_i, x_i) ≥ F(y', x_i) + \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i

Page 35: Example

Example

Constraints (over the working set only): ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

Working set solved over:
  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0

Full constraint set:
  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  3            1                          2
  (V,V)  1            3                          1

ξ_i = 0, yet the omitted (V,N) constraint is violated: its margin of 1 is less than its loss of 2.

Page 36: Approximate Hinge Loss

Approximate Hinge Loss

• Choose tolerance ε > 0:

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ∀i: ξ_i ≥ 0

Consider the prediction of the learned model, y' = argmax_y F(y, x):
• If y' = y_i:  ξ_i ≥ 0
• If y' ≠ y_i:  F(y_i, x_i) - F(y', x_i) ≤ 0  ⇒  ξ_i ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ε

Slack is a continuous upper bound on (Hamming loss - ε)!

Page 37: Example

Example (ε = 1)

Constraints: ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss
  (N,N)  2            2                          1
  (N,V)  4            0                          0
  (V,N)  3            1                          2
  (V,V)  1            3                          1

ξ_i = 0: with the constraints relaxed by ε = 1, even (V,N) is satisfied (1 ≥ 2 - 0 - 1).

Page 38: Structural SVM Training

Structural SVM Training

• STEP 0: Specify tolerance ε.

• STEP 1: Solve the SVM objective using only the working set of constraints W (initially empty). The trained model is w.

• STEP 2: Using w, find the y' whose constraint is most violated.

• STEP 3: If the constraint is violated by more than ε, add it to W.

• Repeat STEPs 1-3 until no additional constraints are added. Return the most recent model w trained in STEP 1.

*This is known as a "cutting plane" method.

Constraint violation formula:

(1/M_i) \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ( F(y_i, x_i) - F(y', x_i) ) ≥ ε
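A compact sketch of STEPs 1-3 on the toy tagging problem. One loud assumption: the restricted QP of STEP 1 is solved here only approximately, by subgradient descent, where a real implementation would call an exact QP solver; all helper names are mine:

```python
import numpy as np
from itertools import product

TAGS, WORDS, PREV = ['N', 'V'], ['Fish', 'Sleep'], ['Start', 'N', 'V']

def psi(y, x):
    # Joint feature map Psi(y,x) = sum_j [phi1_j ; phi2_j] for the toy tagger.
    v = np.zeros(10)
    prev = 'Start'
    for tag, word in zip(y, x):
        v[:4] += [float(tag == t and word == w) for t in TAGS for w in WORDS]
        v[4:] += [float(tag == t and prev == p) for t in TAGS for p in PREV]
        prev = tag
    return v

def hamming(yp, y):
    return float(sum(a != b for a, b in zip(yp, y)))

def solve_restricted(examples, W, dim, C, steps=2000, lr=0.005):
    # STEP 1 (approximation): subgradient descent on the objective restricted to
    # the working sets; a real implementation would use an exact QP solver here.
    w, N = np.zeros(dim), len(examples)
    for _ in range(steps):
        g = w.copy()                                # gradient of (1/2)||w||^2
        for i, (x, y) in enumerate(examples):
            Wi = sorted(W[i])
            if not Wi:
                continue
            terms = [hamming(yp, y) - w @ (psi(y, x) - psi(yp, x)) for yp in Wi]
            k = int(np.argmax(terms))
            if terms[k] > 0:                        # active hinge for example i
                g += (C / N) * (psi(Wi[k], x) - psi(y, x))
        w -= lr * g
    return w

def cutting_plane(examples, dim, C=10.0, eps=0.1, max_rounds=20):
    W = [set() for _ in examples]                   # working sets, initially empty
    w = np.zeros(dim)
    for _ in range(max_rounds):
        added = False
        for i, (x, y) in enumerate(examples):
            # STEP 2: most violated constraint via loss-augmented inference
            # (brute force here; Viterbi in general, see Page 50).
            y_star = max(product(TAGS, repeat=len(y)),
                         key=lambda yp: w @ psi(yp, x) + hamming(yp, y))
            xi = max([0.0] + [hamming(yp, y) - w @ (psi(y, x) - psi(yp, x))
                              for yp in W[i]])
            viol = hamming(y_star, y) - xi - w @ (psi(y, x) - psi(y_star, x))
            if viol > eps:                          # STEP 3
                W[i].add(y_star)
                added = True
        if not added:                               # nothing violated by more than eps
            return w
        w = solve_restricted(examples, W, dim, C)   # STEP 1 on the working sets
    return w

x, y = ('Fish', 'Sleep'), ('N', 'V')
w = cutting_plane([(x, y)], dim=10)
print(max(product(TAGS, repeat=2), key=lambda yp: w @ psi(yp, x)))  # expected: ('N', 'V')
```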

Page 39: Example

Example (choose ε = 0.1)

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y' ∈ W_i: F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ∀i: ξ_i ≥ 0

x_i = "Fish Sleep", y_i = (N, V)

Init: W_i = ∅.  Solve: ξ_i = 0 (no constraints yet, so all scores are 0).

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0            0                          1      1
  (N,V)  0            0                          0      0
  (V,N)  0            0                          2      2
  (V,V)  0            0                          1      1

Page 40: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

Update: W_i = {(V,N)} (the most violated constraint).  Solve: ξ_i = 0.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0            0                          1      1
  (N,V)  0            0                          0      0
  (V,N)  0            0                          2      2
  (V,V)  0            0                          1      1

Page 41: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

W_i = {(V,N)}.  Solve: ξ_i = 0.5.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0.7          0.2                        1      0.2
  (N,V)  0.9          0                          0      0
  (V,N)  -0.6         1.5                        2      0
  (V,V)  0            0.9                        1      0.4

Page 42: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

Update: W_i = {(V,N), (N,N)}.  ξ_i = 0.5.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0.7          0.2                        1      0.2
  (N,V)  0.9          0                          0      0
  (V,N)  -0.6         1.5                        2      0
  (V,V)  0            0.9                        1      0.4

Page 43: Example

Example (ε = 0.1; same QP over working sets W_i as before)

x_i = "Fish Sleep", y_i = (N, V)

W_i = {(V,N), (N,N)}.  Solve: ξ_i = 0.55.

Constraint violation: Loss - Slack - (F(y, x) - F(y', x)) = Viol.

  y'     F(y', x_i)   F(y_i, x_i) - F(y', x_i)   Loss   Viol.
  (N,N)  0.55         0.45                       1      0
  (N,V)  1            0                          0      0
  (V,N)  -0.65        1.65                       2      0
  (V,V)  -0.05        0.95                       1      0.05

No constraint is violated by more than ε.

Page 44: Geometric Example

Geometric Example

Naïve SVM Problem
• Exponential constraints
• Most are dominated by a small set of "important" constraints

Structural SVM Approach
• Repeatedly finds the next most violated constraint…
• …until the set of constraints is a good approximation.

*This is known as a "cutting plane" method.

Page 45: Geometric Example
(Same text as Page 44; the figure shows the next cutting-plane iteration.)

Page 46: Geometric Example
(Same text as Page 44; the figure shows a further iteration.)

Page 47: Geometric Example
(Same text as Page 44; the figure shows the final approximation.)

Page 48: Linear Convergence Rate

Linear Convergence Rate

• Guarantee for any ε > 0:

argmin_{w,ξ} (1/2) w^T w + (C/N) \sum_i ξ_i
s.t. ∀i, y': F(y_i, x_i) - F(y', x_i) ≥ \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i - ε,   ∀i: ξ_i ≥ 0

• Terminates after O(1/ε) iterations.

Proof found in: http://www.cs.cornell.edu/people/tj/publications/joachims_etal_09a.pdf

Page 49: Finding Most Violated Constraint

Finding Most Violated Constraint

• A constraint is violated when:

F(y', x_i) - F(y_i, x_i) + \sum_j 1[ y'^j ≠ y_i^j ] - ξ_i > 0

• Finding the most violated constraint reduces to:

argmax_{y'} F(y', x_i) + \sum_j 1[ y'^j ≠ y_i^j ]    ("loss-augmented inference")

• Highly related to prediction:

argmax_y F(y, x_i)

Page 50: "Augmented" Scoring Function

"Augmented" Scoring Function

Goal: argmax_{y'} F(y', x_i) + \sum_j 1[ y'^j ≠ y_i^j ],   where F(y, x_i) ≡ \sum_{j=1}^{M} w^T φ_j(y^j, y^{j-1} | x_i)

Define an augmented score with one additional unary feature:

F̃(y, x_i, y_i) ≡ \sum_{j=1}^{M} w̃^T φ̃_j(y^j, y^{j-1} | x_i, y_i)

φ̃_j(a, b | x_i, y_i) = [ φ_j(a, b | x_i) ; 1[a ≠ y_i^j] ],   w̃ = [ w ; 1 ]

The goal becomes argmax_{y'} F̃(y', x_i, y_i). Solve using Viterbi!
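A sketch of loss-augmented Viterbi: the Hamming term enters as an extra unary potential, exactly as in the augmented feature map above (helper names are mine; the tables are the toy model from Page 6):

```python
def loss_augmented_viterbi(words, y_true, trans, emit, tags):
    """Viterbi over the augmented score: the Hamming term 1[y_j != y_true_j]
    is folded into the unary potential of each position."""
    def unary(j, t):
        return emit[(t, words[j])] + (1.0 if t != y_true[j] else 0.0)
    delta = {t: trans[('Start', t)] + unary(0, t) for t in tags}
    backptrs = []
    for j in range(1, len(words)):
        new_delta, ptr = {}, {}
        for t in tags:
            p = max(tags, key=lambda q: delta[q] + trans[(q, t)])
            new_delta[t] = delta[p] + trans[(p, t)] + unary(j, t)
            ptr[t] = p
        delta = new_delta
        backptrs.append(ptr)
    tag = max(tags, key=lambda t: delta[t])
    path = [tag]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[tag]

# With the Page 6 model and y_i = (N, V), the most violated constraint is (V, N):
trans = {('Start','N'): 1, ('Start','V'): -1, ('N','N'): -2, ('N','V'): 1,
         ('V','N'): 2, ('V','V'): -2}
emit  = {('N','Fish'): 2, ('V','Fish'): 1, ('N','Sleep'): 1, ('V','Sleep'): 0}
print(loss_augmented_viterbi(['Fish', 'Sleep'], ('N', 'V'), trans, emit, ['N', 'V']))
# (['V', 'N'], 5.0): F(y', x) = 3 plus Hamming loss 2
```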

Page 51: Structural SVM Recipe

Structural SVM Recipe

• Feature map: Ψ(y, x)

• Inference: h(x) = argmax_y F(y, x) ≡ w^T Ψ(y, x)

• Loss function: Δ_i(y)

• Loss-augmented inference: argmax_y w^T Ψ(y, x) + Δ_i(y)   (most violated constraint)