Spectral Learning Methods for Finite-State Machines, with Applications to Natural Language Processing

Borja Balle, Xavier Carreras, Franco Luque, Ariadna Quattoni

September 2011
Overview
Probabilistic Transducers
- Model input-output relations with hidden states
- As a conditional distribution Pr[y | x] over strings
- With certain independence assumptions
[Figure: graphical model with hidden states H1, ..., H4, inputs X1, ..., X4 and outputs Y1, ..., Y4]
- Used in many applications: NLP, biology, ...
- Hard to learn in general; usually the EM algorithm is used
Overview
Spectral Learning Probabilistic Transducers
Our contribution:
- Fast learning algorithm for probabilistic FSTs
- With PAC-style theoretical guarantees
- Based on an Observable Operator Model for FSTs
- Using spectral methods (Chang '96, Mossel-Roch '05, Hsu et al. '09, Siddiqi et al. '10)
- Performing better than EM in experiments with real data
Observable Operators for FST
Deriving Observable Operator Models
Given aligned sequences (x, y) ∈ (X × Y)^t (i.e. |x| = |y|), the model computes the conditional probability

Pr[y | x] = Σ_{h ∈ H^t} Pr[y, h | x]                (marginalize states)
          = Σ_{h_{t+1} ∈ H} Pr[y, h_{t+1} | x]      (independence assumptions)
          = 1^⊤ α_{t+1}                             (vector form, α_{t+1} ∈ R^m)
          = 1^⊤ A_{x_t}^{y_t} α_t                   (forward-backward equations)
          = 1^⊤ A_{x_t}^{y_t} ⋯ A_{x_1}^{y_1} α     (induction on t)
The choice of an operator A_a^b depends only on observable symbols
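As a sketch, the chain of operator products above is a simple forward pass over the aligned pair (x, y). The function name and the dict-of-operators layout below are illustrative assumptions, not part of the original formulation:

```python
import numpy as np

def cond_prob(y, x, A, alpha):
    """Evaluate Pr[y | x] = 1^T A_{x_t}^{y_t} ... A_{x_1}^{y_1} alpha.

    A maps a pair (input symbol a, output symbol b) to an m x m operator
    A_a^b; alpha is the length-m initial state vector."""
    state = alpha
    for a, b in zip(x, y):          # aligned sequences: |x| = |y|
        state = A[(a, b)] @ state   # alpha_{s+1} = A_{x_s}^{y_s} alpha_s
    return np.ones(len(alpha)) @ state   # 1^T alpha_{t+1}
```

With m = 1 the operators reduce to scalars, so the result is just the product of per-position probabilities, which makes the recursion easy to sanity-check by hand.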
Observable Operators for FST
Observable Operator Model Parameters
Given X = {a_1, ..., a_k}, Y = {b_1, ..., b_l}, H = {c_1, ..., c_m}, then

Pr[y | x] = 1^⊤ A_{x_t}^{y_t} ⋯ A_{x_1}^{y_1} α with parameters:

A_a^b = T_a D_b ∈ R^{m×m}                                        (factorized operator, with D_b the diagonal matrix of emission probabilities of b: D_b(j, j) = Pr[Y_s = b | H_s = c_j])
T_a(i, j) = Pr[H_s = c_i | X_{s-1} = a, H_{s-1} = c_j] ∈ R^{m×m} (state transition)
O(i, j) = Pr[Y_s = b_i | H_s = c_j] ∈ R^{l×m}                    (collected emissions)
α(i) = Pr[H_1 = c_i] ∈ R^m                                       (initial probabilities)
The choice of an operator A_a^b depends only on observable symbols ...
... but the operator parameters are conditioned on hidden states
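The factorized form of the operators can be sketched directly from the parameter tables; here `build_operators` is an illustrative name, and the convention D_b = diag(O(b, ·)) is the standard OOM reading of the factorization, stated as an assumption:

```python
import numpy as np

def build_operators(T, O):
    """Build A_a^b = T_a D_b for all symbol pairs.

    T[a] is the m x m transition matrix for input symbol a (columns sum to 1);
    O is l x m with O[i, j] = Pr[Y_s = b_i | H_s = c_j] (columns sum to 1);
    D_b = diag(O[b]) holds the emission probabilities of b in every state."""
    k, m, _ = T.shape
    l = O.shape[0]
    return {(a, b): T[a] @ np.diag(O[b]) for a in range(k) for b in range(l)}
```

A useful consistency check: summing A_a^b over all outputs b gives back T_a, since the diagonal emission matrices sum to the identity.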
Observable Operators for FST
A Learnable Set of Observable Operators
Note that for any invertible Q ∈ R^{m×m}

Pr[y | x] = 1^⊤ Q^{-1} (Q A_{x_t}^{y_t} Q^{-1}) ⋯ (Q A_{x_1}^{y_1} Q^{-1}) Q α
Idea (subspace identification methods for linear systems, '80s)
Find a basis for the state space such that operators in the new basis are related to observable quantities
Following multiplicity automata and spectral HMM learning . . .
Observable Operators for FST
A Learnable Set of Observable Operators
Find a basis Q where operators can be expressed in terms of unigram, bigram and trigram probabilities:

ρ(i) = Pr[Y_1 = b_i] ∈ R^l
P(i, j) = Pr[Y_1 = b_j, Y_2 = b_i] ∈ R^{l×l}
P_a^b(i, j) = Pr[Y_1 = b_j, Y_2 = b, Y_3 = b_i | X_2 = a] ∈ R^{l×l}

Theorem (ρ, P and P_a^b are sufficient statistics)
Let P = U Σ V^* be a thin SVD decomposition; then Q = U^⊤ O yields (under certain assumptions)

Q α = U^⊤ ρ
1^⊤ Q^{-1} = ρ^⊤ (U^⊤ P)^+
Q A_a^b Q^{-1} = (U^⊤ P_a^b)(U^⊤ P)^+
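The theorem can be checked numerically on a synthetic model. The sketch below assumes a single input symbol (so the conditioning on X_2 = a is vacuous), in which case the exact statistics implied by the model factor as P = O T diag(α) O^⊤ and P^b = O T D_b T diag(α) O^⊤; these factorizations and the random-model setup are assumptions made for the check, not claims from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
m, l = 2, 3  # hidden states, output symbols; single input symbol for simplicity

# Random stochastic model: columns of T and O sum to 1, alpha is a distribution.
T = rng.random((m, m)); T /= T.sum(axis=0)
O = rng.random((l, m)); O /= O.sum(axis=0)
alpha = rng.random(m); alpha /= alpha.sum()

# Exact statistics implied by the model.
rho = O @ alpha                                   # rho(i) = Pr[Y1 = b_i]
P = O @ T @ np.diag(alpha) @ O.T                  # P(i,j) = Pr[Y1=b_j, Y2=b_i]
Pb = [O @ T @ np.diag(O[b]) @ T @ np.diag(alpha) @ O.T for b in range(l)]

U = np.linalg.svd(P)[0][:, :m]                    # top m left singular vectors
Q = U.T @ O                                       # the theorem's basis change
for b in range(l):
    A_b = T @ np.diag(O[b])                       # operator A^b = T D_b
    lhs = Q @ A_b @ np.linalg.inv(Q)
    rhs = (U.T @ Pb[b]) @ np.linalg.pinv(U.T @ P)
    assert np.allclose(lhs, rhs)                  # Q A^b Q^{-1} = (U^T P^b)(U^T P)^+
assert np.allclose(Q @ alpha, U.T @ rho)          # Q alpha = U^T rho
```

The key rank assumption is visible here: the identities hold because O and P have rank m, so Q = U^⊤ O is invertible.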
Learning Observable Operator Models
Spectral Learning Algorithm
Given
- Input alphabet X and output alphabet Y
- Number of hidden states m
- Training sample S = {(x^1, y^1), ..., (x^n, y^n)}

Do
- Compute unigram ρ, bigram P and trigram P_a^b relative frequencies in S
- Perform SVD on P and take U with the top m left singular vectors
- Return operators computed using ρ, P, P_a^b and U

In Time
- O(n) to compute relative frequencies
- O(|Y|^3) to compute the SVD
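The three steps above fit in a short sketch. The function name, the symbol-index encoding of the sample, and the array layout of the trigram table are assumptions for illustration; the statistics and the SVD step follow the slide:

```python
import numpy as np

def spectral_learn(samples, n_in, n_out, m):
    """Sketch of the spectral learning algorithm.

    samples: list of (x, y) pairs of equal-length symbol-index sequences."""
    n = len(samples)
    rho = np.zeros(n_out)                        # unigram: Pr[Y1 = b_i]
    P = np.zeros((n_out, n_out))                 # bigram: P(i,j) = Pr[Y1=b_j, Y2=b_i]
    Pba = np.zeros((n_in, n_out, n_out, n_out))  # trigram, indexed [a, b, i, j]
    cnt = np.zeros(n_in)                         # occurrences of X2 = a
    for x, y in samples:                         # O(n): relative frequencies
        rho[y[0]] += 1
        if len(y) >= 2:
            P[y[1], y[0]] += 1
        if len(y) >= 3:
            Pba[x[1], y[1], y[2], y[0]] += 1
            cnt[x[1]] += 1
    rho /= n
    P /= n
    for a in range(n_in):                        # condition the trigram on X2 = a
        if cnt[a] > 0:
            Pba[a] /= cnt[a]
    U = np.linalg.svd(P)[0][:, :m]               # top m left singular vectors of P
    pinv_UP = np.linalg.pinv(U.T @ P)
    alpha_hat = U.T @ rho                        # estimate of Q alpha
    beta_hat = rho @ pinv_UP                     # estimate of 1^T Q^{-1}
    A = {(a, b): (U.T @ Pba[a, b]) @ pinv_UP     # estimates of Q A_a^b Q^{-1}
         for a in range(n_in) for b in range(n_out)}
    return alpha_hat, beta_hat, A
```

Note that the hidden-state parameters T, O, α are never estimated directly: only the observable statistics ρ, P and P_a^b touch the data.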
Learning Observable Operator Models
PAC-Style Result
- Input distribution D_X over X* with λ = E[|X|], μ = min_a Pr[X_2 = a]
- Conditional distributions D_{Y|x} on Y* given x ∈ X*, modeled by an FST with m states (satisfying certain rank assumptions)
- Sampling i.i.d. from the joint distribution D_X ⊗ D_{Y|X}

Theorem
For any 0 < ε, δ < 1, if the algorithm receives a sample of size

n ≥ O( (λ^2 m |Y|) / (ε^4 μ σ_O^2 σ_P^4) · log(|X|/δ) )

(σ_O and σ_P are the m-th singular values of O and P in the target), then with probability at least 1 − δ the hypothesis D̂_{Y|x} satisfies

E_X Σ_{y ∈ Y*} | D̂_{Y|X}(y) − D_{Y|X}(y) | ≤ ε

(the L1 distance between the joint distributions D_X ⊗ D_{Y|X} and D_X ⊗ D̂_{Y|X})
Experimental Evaluation
Synthetic Experiments
Goal: compare against baselines when the learning assumptions hold
Conclusion
Summary of Contributions
- Fast spectral method for learning input-output OOMs
- Strong theoretical guarantees with few assumptions on the input distribution
- Outperforms previous spectral algorithms on FSTs
- New tools for inducing sequential latent structure for parsing
- Faster and better than EM in some real tasks
Conclusion
Future (actually, current) work
- Problem: for large alphabets, we might need unrealistically large samples to compute the necessary statistics.
  Solution: apply smoothing techniques.
- Problem: in many real tasks symbols are not atomic (e.g. words).
  Solution: define OOM models that handle features of the symbols (e.g. morphology, POS tags).
- Problem: combining multiple models.
  Solution: boosting.