Graph-based Dependency Parsing (Chu-Liu-Edmonds algorithm) Sam Thomson (with thanks to Swabha Swayamdipta) University of Washington, CSE 490u February 22, 2017
Page 2: Dependency Parsing

Outline

- Dependency trees
- Three main approaches to parsing
- Chu-Liu-Edmonds algorithm
- Arc scoring / Learning

Page 3: Dependency Parsing

Dependency Parsing - Output

Page 4: Dependency Parsing

Dependency Parsing

TurboParser output from http://demo.ark.cs.cmu.edu/parse?sentence=I%20ate%20the%20fish%20with%20a%20fork.

Page 5: Dependency Parsing

Dependency Parsing - Output Structure

A parse is an arborescence (aka directed rooted tree):

- Directed [Labeled] Graph
- Acyclic
- Single Root
- Connected and Spanning: ∃ directed path from root to every other word
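The properties above can be checked mechanically. The following is a minimal sketch, assuming a parse is stored as a list `heads` where `heads[v]` is the parent of word `v` and index 0 is ROOT; the function name and this encoding are illustrative choices, not something the slides prescribe.

```python
def is_arborescence(heads, root=0):
    """heads[v] is the parent of node v; heads[root] must be None.

    True when every non-root node has one parent and following parents
    from any node reaches the root without revisiting a node.
    """
    n = len(heads)
    if heads[root] is not None:
        return False
    for v in range(n):
        if v == root:
            continue
        seen = set()
        u = v
        while u != root:
            if u in seen or heads[u] is None or not 0 <= heads[u] < n:
                return False  # cycle, missing parent, or bad index
            seen.add(u)
            u = heads[u]
    return True


# ROOT=0, I=1, ate=2, the=3, fish=4 (heads assigned by hand for illustration)
good = is_arborescence([None, 2, 0, 4, 2])
bad = is_arborescence([None, 2, 1, 4, 2])  # 1 -> 2 -> 1 is a cycle
print(good, bad)
```

Storing one parent per node enforces "exactly one incoming edge" by construction, so only acyclicity and reachability from ROOT remain to be checked.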

Page 6: Dependency Parsing

Projective / Non-projective

- Some parses are projective: edges don’t cross
- Most English sentences are projective, but non-projectivity is common in other languages (e.g. Czech, Hindi)

[Figure: a non-projective sentence in English, and one in Czech]

Examples from Non-projective Dependency Parsing using Spanning Tree Algorithms McDonald et al., EMNLP ’05
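"Edges don't cross" can be tested directly. Below is a small sketch, assuming the same hypothetical `heads` list encoding (position 0 is ROOT, placed to the left of the sentence); two arcs cross when exactly one endpoint of one arc lies strictly inside the span of the other.

```python
def is_projective(heads):
    """heads[i] is the position of word i's head; heads[0] (ROOT) is None."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if h is not None]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # the second arc starts inside the first and ends outside it
            if l1 < l2 < r1 < r2:
                return False
    return True


projective = is_projective([None, 2, 0, 4, 2])
crossing = is_projective([None, 3, 0, 2, 2])  # arcs (0,2) and (1,3) cross
print(projective, crossing)
```

The quadratic loop is only meant to mirror the definition; it is not how a chart parser enforces projectivity.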

Page 7: Dependency Parsing

Dependency Parsing - Approaches

Page 8: Dependency Parsing

Dependency Parsing Approaches

- Chart (Eisner, CKY)
  - $O(n^3)$
  - Only produces projective parses
- Shift-reduce
  - $O(n)$ (fast!), but inexact
  - “Pseudo-projective” trick can capture some non-projectivity
- Graph-based (MST)
  - $O(n^2)$ for arc-factored
  - Can produce projective and non-projective parses

Page 11: Dependency Parsing

Graph-based Dependency Parsing

Page 12: Dependency Parsing

Arc-Factored Model

Every possible labeled directed edge e between every pair of nodes gets a score, score(e).

[Figure: the graph $G = \langle V, E \rangle$ with a score on each edge; $O(n^2)$ edges]

Example from Non-projective Dependency Parsing using Spanning Tree Algorithms McDonald et al., EMNLP ’05

Page 14: Dependency Parsing

Arc-Factored Model

Best parse is:

$$A^* = \arg\max_{\substack{A \subseteq G \\ \text{s.t. } A \text{ an arborescence}}} \sum_{e \in A} \mathrm{score}(e)$$

The Chu-Liu-Edmonds algorithm finds this argmax.

Example from Non-projective Dependency Parsing using Spanning Tree Algorithms McDonald et al., EMNLP ’05
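To make the objective concrete, here is a brute-force sketch that tries every assignment of one parent per node and keeps the best cycle-free one. It is exponential and only useful as a reference on tiny graphs. The score table is hypothetical: it reuses the edge scores of the worked example later in these slides, with the sources of the two edges into node 3 (scores 4 and 5) chosen arbitrarily.

```python
from itertools import product

ROOT = 0


def has_cycle(heads):
    """heads maps each non-ROOT node to its chosen parent."""
    for start in heads:
        seen, u = set(), start
        while u != ROOT:
            if u in seen:
                return True
            seen.add(u)
            u = heads[u]
    return False


def brute_force_parse(nodes, score):
    """Enumerate all head assignments; return the best arborescence."""
    best, best_total = None, float("-inf")
    candidates = [[h for h in [ROOT] + nodes if (h, d) in score] for d in nodes]
    for choice in product(*candidates):
        heads = dict(zip(nodes, choice))
        if has_cycle(heads):
            continue
        total = sum(score[(h, d)] for d, h in heads.items())
        if total > best_total:
            best, best_total = heads, total
    return best, best_total


# (head, dependent) -> score; illustrative values only
score = {(0, 1): 5, (0, 2): 1, (0, 3): 1, (1, 2): 11, (2, 1): 10,
         (1, 3): 4, (2, 3): 5, (3, 1): 9, (3, 2): 8}
heads, total = brute_force_parse([1, 2, 3], score)
print(heads, total)
```

On this table two different trees reach the same best total, so only the total is stable across tie-breaking choices.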

Page 17: Dependency Parsing

Chu-Liu-Edmonds

Chu and Liu ’65, On the Shortest Arborescence of a Directed Graph, Science Sinica

Edmonds ’67, Optimum Branchings, JRNBS

Page 18: Dependency Parsing

Chu-Liu-Edmonds - Intuition

Every non-ROOT node needs exactly 1 incoming edge.

In fact, every connected component that doesn’t contain ROOT needs exactly 1 incoming edge.

- Greedily pick an incoming edge for each node.
- If this forms an arborescence, great!
- Otherwise, it will contain a cycle C.
- Arborescences can’t have cycles, so we can’t keep every edge in C. One edge in C must get kicked out.
- C also needs an incoming edge.
- Choosing an incoming edge for C determines which edge to kick out.
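The first three steps of this intuition (pick greedily, then look for a cycle) can be sketched as follows. The function names and the `(head, dependent) -> score` dictionary are assumptions made for illustration, and the score values are the same hypothetical table used to mirror the worked example.

```python
def greedy_heads(nodes, score, root=0):
    """Pick the highest-scoring incoming edge for every non-root node."""
    heads = {}
    for d in nodes:
        heads[d] = max((h for h in [root] + nodes if (h, d) in score),
                       key=lambda h: score[(h, d)])
    return heads


def find_cycle(heads, root=0):
    """Return the nodes of one cycle among the chosen edges, or None."""
    for start in heads:
        path, u = [], start
        while u != root and u not in path:
            path.append(u)
            u = heads[u]
        if u != root:
            return path[path.index(u):]
    return None


score = {(0, 1): 5, (0, 2): 1, (0, 3): 1, (1, 2): 11, (2, 1): 10,
         (1, 3): 4, (2, 3): 5, (3, 1): 9, (3, 2): 8}
heads = greedy_heads([1, 2, 3], score)
cycle = find_cycle(heads)
print(heads, cycle)
```

Here nodes 1 and 2 pick each other, so the greedy choice is not an arborescence and the cycle has to be resolved by choosing an edge that enters it.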

Page 26: Dependency Parsing

Chu-Liu-Edmonds - Recursive (Inefficient) Definition

def maxArborescence(V, E, ROOT):
    """ returns best arborescence as a map from each node to its parent """
    for v in V \ ROOT:
        bestInEdge[v] ← arg max_{e ∈ inEdges[v]} e.score
    if bestInEdge contains a cycle C:
        # build a new graph where C is contracted into a single node
        vC ← new Node()
        V′ ← V ∪ {vC} \ C
        E′ ← {adjust(e) for e ∈ E \ C}
        A ← maxArborescence(V′, E′, ROOT)
        return {e.original for e ∈ A} ∪ C \ {A[vC].kicksOut}
    # each node got a parent without creating any cycles
    return bestInEdge

def adjust(e):
    e′ ← copy(e)
    e′.original ← e
    if e.dest ∈ C:
        e′.dest ← vC
        e′.kicksOut ← bestInEdge[e.dest]
        e′.score ← e.score − e′.kicksOut.score
    elif e.src ∈ C:
        e′.src ← vC
    return e′
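The pseudocode above can be turned into runnable Python fairly directly. The following is one possible sketch, not the author's implementation: the `Edge` class, the tuple used as the contracted node, and the snake_case names are choices made here. It assumes every non-ROOT node has at least one incoming edge. The example graph uses the edge scores from the worked example that follows; the sources of edges e and f (V1 and V2 respectively) are an assumption, since the figure is not available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Edge:
    src: object
    dest: object
    score: float
    original: Optional["Edge"] = None   # edge this one was adjusted from
    kicks_out: Optional["Edge"] = None  # cycle edge displaced if chosen


def find_cycle(best_in, root):
    for start in best_in:
        path, u = [], start
        while u != root and u not in path:
            path.append(u)
            u = best_in[u].src
        if u != root:
            return set(path[path.index(u):])
    return None


def max_arborescence(nodes, edges, root):
    """Return {node: chosen incoming Edge} for the best arborescence."""
    best_in = {}
    for v in nodes:
        if v != root:
            best_in[v] = max((e for e in edges if e.dest == v and e.src != v),
                             key=lambda e: e.score)
    cycle = find_cycle(best_in, root)
    if cycle is None:
        return best_in  # each node got a parent without creating a cycle
    v_c = ("contracted", frozenset(cycle))  # stands in for the whole cycle
    new_edges = []
    for e in edges:
        if e.src in cycle and e.dest in cycle:
            continue  # edges inside the cycle are not carried over
        if e.dest in cycle:
            k = best_in[e.dest]
            new_edges.append(Edge(e.src, v_c, e.score - k.score, e, k))
        elif e.src in cycle:
            new_edges.append(Edge(v_c, e.dest, e.score, e))
        else:
            new_edges.append(Edge(e.src, e.dest, e.score, e))
    new_nodes = [v for v in nodes if v not in cycle] + [v_c]
    sub = max_arborescence(new_nodes, new_edges, root)
    result = {v: best_in[v] for v in cycle}   # start from the cycle edges
    for e in sub.values():
        # the edge entering v_c overwrites (kicks out) one cycle edge
        result[e.original.dest] = e.original
    return result


ROOT = "ROOT"
edges = [
    Edge(ROOT, "V1", 5), Edge(ROOT, "V2", 1), Edge(ROOT, "V3", 1),   # a, b, c
    Edge("V1", "V2", 11), Edge("V1", "V3", 4), Edge("V2", "V3", 5),  # d, e, f
    Edge("V2", "V1", 10), Edge("V3", "V1", 9), Edge("V3", "V2", 8),  # g, h, i
]
tree = max_arborescence([ROOT, "V1", "V2", "V3"], edges, ROOT)
parents = {v: e.src for v, e in tree.items()}
print(parents)
```

Like the pseudocode, this version rebuilds the edge list at every level of recursion, so it is meant for reading rather than speed.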

Page 27: Dependency Parsing

Chu-Liu-Edmonds

Consists of two stages:

- Contracting (everything before the recursive call)
- Expanding (everything after the recursive call)

Page 28: Dependency Parsing

Chu-Liu-Edmonds - Preprocessing

- Remove every edge incoming to ROOT
  - This ensures that ROOT is in fact the root of any solution
- For every ordered pair of nodes, vi, vj, remove all but the highest-scoring edge from vi to vj
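Both preprocessing steps fit in a few lines. This sketch assumes edges arrive as `(src, dest, label, score)` tuples, so that parallel edges between the same ordered pair differ only in label; that tuple layout and the sample edges are illustrative.

```python
def preprocess(edges, root):
    """Return {(src, dest): (label, score)} with one edge per ordered pair."""
    best = {}
    for src, dest, label, score in edges:
        if dest == root:
            continue  # nothing may point into ROOT
        key = (src, dest)
        if key not in best or score > best[key][1]:
            best[key] = (label, score)  # keep only the best parallel edge
    return best


edges = [
    ("ROOT", "ate", "root", 9.0),
    ("ate", "ROOT", "dep", 3.0),    # enters ROOT: dropped
    ("ate", "fish", "dobj", 7.0),
    ("ate", "fish", "nsubj", 2.0),  # lower-scoring parallel edge: dropped
]
kept = preprocess(edges, "ROOT")
print(kept)
```

Keeping the label next to the winning score means the labeled parse can be read off after decoding the unlabeled graph.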

Page 29: Dependency Parsing

Chu-Liu-Edmonds - Contracting Stage

- For each non-ROOT node v, set bestInEdge[v] to be its highest scoring incoming edge.
- If a cycle C is formed:
  - contract the nodes in C into a new node vC
  - edges outgoing from any node in C now get source vC
  - edges incoming to any node in C now get destination vC
  - For each node u in C, and for each edge e incoming to u from outside of C:
    - set e.kicksOut to bestInEdge[u], and
    - set e.score to be e.score − e.kicksOut.score.
- Repeat until every non-ROOT node has an incoming edge and no cycles are formed

Page 30: Dependency Parsing

An Example - Contracting Stage

[Figure: graph over ROOT, V1, V2, V3 with edge scores a: 5, b: 1, c: 1, d: 11, e: 4, f: 5, g: 10, h: 9, i: 8]

bestInEdge (filled in one node at a time): V1: g, V2: d, V3: (not yet chosen)

kicksOut: empty for all edges a through i

Page 33: Dependency Parsing

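An Example - Contracting Stage

The cycle formed by g and d (between V1 and V2) is contracted into a new node V4. Edges entering the cycle are rescored:

a: 5 − 10, b: 1 − 11, h: 9 − 10, i: 8 − 11 (c: 1, e: 4, f: 5 are unchanged)

bestInEdge: V1: g, V2: d, V3: (not yet chosen)

kicksOut: a: g; b: d; h: g; i: d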

Page 34: Dependency Parsing

An Example - Contracting Stage

[Figure: contracted graph over ROOT, V3, V4 with edge scores a: −5, b: −10, c: 1, e: 4, f: 5, h: −1, i: −3]

bestInEdge (filled in one node at a time): V1: g, V2: d, V3: f, V4: h

kicksOut: a: g; b: d; h: g; i: d

Page 37: Dependency Parsing

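An Example - Contracting Stage

The cycle formed by f and h (between V3 and V4) is contracted into a new node V5. Edges entering the cycle are rescored:

a: −5 − (−1), b: −10 − (−1), c: 1 − 5

bestInEdge: V1: g, V2: d, V3: f, V4: h, V5: (not yet chosen)

kicksOut: a: g, h; b: d, h; c: f; h: g; i: d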

Page 38: Dependency Parsing

An Example - Contracting Stage

[Figure: contracted graph over ROOT and V5 with edge scores a: −4, b: −9, c: −4]

bestInEdge: V1: g, V2: d, V3: f, V4: h, V5: a

kicksOut: a: g, h; b: d, h; c: f; e: f; h: g; i: d

Page 40: Dependency Parsing

Chu-Liu-Edmonds - Expanding Stage

After the contracting stage, every contracted node will have exactly one bestInEdge. This edge will kick out one edge inside the contracted node, breaking the cycle.

- Go through each bestInEdge e in the reverse order that we added them
- lock down e, and remove every edge in kicksOut(e) from bestInEdge.
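The two bullets above can be replayed on the tables from the worked example. In this sketch the `best_in_edge` dictionary lists nodes in the order their entries were added during contraction (Python dictionaries preserve insertion order), and `kicks_out` copies the final kicksOut table; the `expand` function itself is an illustrative helper.

```python
best_in_edge = {"V1": "g", "V2": "d", "V3": "f", "V4": "h", "V5": "a"}
kicks_out = {"a": {"g", "h"}, "b": {"d", "h"}, "c": {"f"},
             "e": {"f"}, "h": {"g"}, "i": {"d"}}


def expand(best_in_edge, kicks_out):
    """Walk bestInEdge in reverse insertion order, locking down survivors."""
    removed, locked = set(), []
    for node in reversed(list(best_in_edge)):
        e = best_in_edge[node]
        if e in removed:
            continue  # already kicked out by an edge locked down earlier
        locked.append(e)
        removed |= kicks_out.get(e, set())
    return locked


print(expand(best_in_edge, kicks_out))  # ['a', 'f', 'd']
```

Locking down a first removes g and h, so when the walk reaches V4 and V1 their original choices are skipped, leaving a, f, and d.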

Page 41: Dependency Parsing

An Example - Expanding Stage

[Figure: contracted graph over ROOT and V5 with edge scores a: −4, b: −9, c: −4]

Locking down a (the bestInEdge of V5) removes g and h, the edges it kicks out:

bestInEdge: V1: a (g removed), V2: d, V3: f, V4: a (h removed), V5: a

kicksOut: a: g, h; b: d, h; c: f; e: f; h: g; i: d

Page 43: Dependency Parsing

An Example - Expanding Stage

[Figure: contracted graph over ROOT, V3, V4 with edge scores a: −5, b: −10, c: 1, e: 4, f: 5, h: −1, i: −3]

bestInEdge: V1: a (g removed), V2: d, V3: f, V4: a (h removed), V5: a

kicksOut: a: g, h; b: d, h; c: f; e: f; h: g; i: d

Page 45: Dependency Parsing

An Example - Expanding Stage

[Figure: original graph over ROOT, V1, V2, V3 with edge scores a: 5, b: 1, c: 1, d: 11, e: 4, f: 5, g: 10, h: 9, i: 8]

bestInEdge: V1: a (g removed), V2: d, V3: f, V4: a (h removed), V5: a

kicksOut: a: g, h; b: d, h; c: f; e: f; h: g; i: d

Final arborescence: a, d, f.

Page 47: Dependency Parsing

Chu-Liu-Edmonds - Notes

- This is a greedy algorithm with a clever form of delayed back-tracking to recover from inconsistent decisions (cycles).
- CLE is exact: it always recovers the optimal arborescence.

Page 49: Dependency Parsing

Chu-Liu-Edmonds - Notes

- Efficient (wrong) implementation:
  Tarjan ’77, Finding Optimum Branchings*, Networks
  *corrected in Camerini et al. ’79, A note on finding optimum branchings, Networks
  Not recursive. Uses a union-find (a.k.a. disjoint-set) data structure to keep track of collapsed nodes.
- Even more efficient:
  Gabow et al. ’86, Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and Directed Graphs, Combinatorica
  Uses a Fibonacci heap to keep incoming edges sorted.
  Finds cycles by following bestInEdge instead of randomly visiting nodes.
  Describes how to constrain ROOT to have only one outgoing edge.
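For readers unfamiliar with union-find, here is a minimal sketch of the data structure and of how it could track collapsed nodes. It is a generic disjoint-set with path halving, not a reproduction of the cited implementations; the class and the node names are illustrative.

```python
class DisjointSet:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        # path halving: shorten the chain to the representative as we walk
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
        return ra


ds = DisjointSet(["ROOT", "V1", "V2", "V3"])
ds.union("V1", "V2")  # collapse the first cycle
inside_first = ds.find("V1") == ds.find("V2")
v3_separate = ds.find("V3") != ds.find("V1")
ds.union("V3", "V1")  # collapse the second cycle
all_merged = len({ds.find(v) for v in ["V1", "V2", "V3"]}) == 1
print(inside_first, v3_separate, all_merged)
```

With this structure an edge lies inside a collapsed node exactly when `find(src) == find(dest)`, so contraction never has to rewrite the edge list.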

Page 51: Dependency Parsing

Arc Scoring / Learning

Page 52: Dependency Parsing

Arc Scoring

Features can look at source (head), destination (child), and arc label. For example:

- number of words between head and child,
- sequence of POS tags between head and child,
- is head to the left or right of child?
- vector state of a recurrent neural net at head and child,
- vector embedding of label,
- etc.
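The discrete features in this list are easy to sketch; the neural ones (RNN states, label embeddings) are left out here. The feature names, the linear `arc_score` function, and the weights below are all hypothetical and only show how a feature map might feed score(e).

```python
def arc_features(tags, head, child, label):
    """Discrete features of one candidate arc; tags[0] belongs to ROOT."""
    lo, hi = sorted((head, child))
    return {
        "distance": hi - lo - 1,                    # words between head and child
        "tags_between": " ".join(tags[lo + 1:hi]),  # POS tags between them
        "head_is_left": head < child,
        "label": label,
    }


def arc_score(features, weights):
    """Linear score: sum the weights of the (name, value) pairs that fire."""
    return sum(weights.get((k, v), 0.0) for k, v in features.items())


tags = ["ROOT", "PRP", "VBD", "DT", "NN"]  # ROOT I ate the fish
feats = arc_features(tags, head=2, child=4, label="dobj")
weights = {("label", "dobj"): 1.5, ("distance", 1): 0.5}  # made-up weights
print(feats, arc_score(feats, weights))
```

Because each arc is scored on its own, all $O(n^2)$ scores can be computed before decoding starts.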

Page 53: Dependency Parsing

Learning

Recall that when we have a parameterized model, and we have a decoder that can make predictions given that model...

we can use structured perceptron, or structured hinge loss:

$$L_\theta(x_i, y_i) = \max_{y \in \mathcal{Y}} \{ \mathrm{score}_\theta(y) + \mathrm{cost}(y, y_i) \} - \mathrm{score}_\theta(y_i)$$
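As a rough illustration of the hinge loss, the sketch below evaluates it by brute force on a three-word graph. It assumes an arc-factored score and a Hamming cost (the number of words whose head differs from the gold tree); the slides do not specify the cost function, and in practice the max would be computed by a cost-augmented run of the decoder rather than by enumeration. The score table is the same hypothetical one used to mirror the worked example.

```python
from itertools import product


def tree_score(heads, score):
    return sum(score[(h, d)] for d, h in heads.items())


def hamming_cost(heads, gold):
    return sum(heads[d] != gold[d] for d in gold)


def is_tree(heads, root=0):
    for start in heads:
        seen, u = set(), start
        while u != root:
            if u in seen:
                return False
            seen.add(u)
            u = heads[u]
    return True


def hinge_loss(nodes, score, gold, root=0):
    """max over trees of (score + cost), minus the score of the gold tree."""
    best = float("-inf")
    for choice in product([root] + nodes, repeat=len(nodes)):
        heads = dict(zip(nodes, choice))
        if any((h, d) not in score for d, h in heads.items()):
            continue  # uses an edge that does not exist
        if not is_tree(heads, root):
            continue
        best = max(best, tree_score(heads, score) + hamming_cost(heads, gold))
    return best - tree_score(gold, score)


score = {(0, 1): 5, (0, 2): 1, (0, 3): 1, (1, 2): 11, (2, 1): 10,
         (1, 3): 4, (2, 3): 5, (3, 1): 9, (3, 2): 8}
gold = {1: 0, 2: 1, 3: 2}
loss = hinge_loss([1, 2, 3], score, gold)
print(loss)
```

The loss is positive here even though the gold tree is a highest-scoring tree, because another tree ties its score while differing in two heads; the cost term asks for a margin over such competitors.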
