Discontinuous Parsing with an Efficient and Accurate DOP Model Andreas van Cranenburgh Rens Bod Huygens ING Institute for Logic, Language and Computation Royal Netherlands Academy of Arts and Sciences University of Amsterdam November 27, 2013 IWPT 2013, Nara, Japan
Page 1

Discontinuous Parsing with an Efficient and Accurate DOP Model

Andreas van Cranenburgh, Rens Bod

Huygens ING, Royal Netherlands Academy of Arts and Sciences
Institute for Logic, Language and Computation, University of Amsterdam

November 27, 2013

IWPT 2013, Nara, Japan

Page 2

This talk

Parsing with ...
- discontinuous constituents:
  Linear Context-Free Rewriting Systems (LCFRS)
- treebank fragments:
  Data-Oriented Parsing (DOP)
  Tree-Substitution Grammar (TSG)

Page 3

Discontinuous constituents

Example:
- Why did the chicken cross the road?
- The chicken crossed the road to get to the other side.

Page 4

Discontinuous trees

[Tree diagram for "Why did the chicken cross the road?": ROOT dominates SBARQ, which dominates SQ. SQ dominates VBD (did), NP (DT the, NN chicken) and a discontinuous VP. The VP dominates WHADVP (WRB Why), VB (cross) and NP (DT the, NN road). The final token is the punctuation mark "?" with tag ".".]

Figure: A discontinuous tree not found in the Penn treebank.

Page 5

Discontinuous constituents

Motivation:
- Flexible word order
- Capture argument structure
- Combine information from constituency & dependency structures
- Information is available in treebanks (German, Dutch, English after conversion).

Page 6

Discontinuous trees

[Tree diagram repeated from page 4: the discontinuous tree for "Why did the chicken cross the road?".]

Figure: A discontinuous tree not found in the Penn treebank.

Context-Free Grammar (CFG)
NP(ab) → DT(a) NN(b)

Page 7

Discontinuous trees

[Tree diagram repeated from page 4: the discontinuous tree for "Why did the chicken cross the road?".]

Figure: A discontinuous tree not found in the Penn treebank.

Linear Context-Free Rewriting System (LCFRS)
VP2(a,bc) → WHADVP(a) VB(b) NP(c)

Page 8

Discontinuous trees

[Tree diagram repeated from page 4: the discontinuous tree for "Why did the chicken cross the road?".]

Figure: A discontinuous tree not found in the Penn treebank.

Linear Context-Free Rewriting System (LCFRS)
VP2(a,bc) → WHADVP(a) VB(b) NP(c)
SQ(abcd) → VBD(b) NP(c) VP2(a,d)
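
As a rough illustration of how the two LCFRS rules compose the example sentence, the following Python sketch treats each rule as a function over tuples of strings, one string per component. The function names and the tuple representation are assumptions made for this example and are not taken from the parser.

```python
# Each non-terminal yields a tuple of strings, one per component.
# VP2 has fan-out 2, so it yields two components; all others yield one.

def vp2(whadvp, vb, np):
    """VP2(a, bc) -> WHADVP(a) VB(b) NP(c)"""
    (a,), (b,), (c,) = whadvp, vb, np
    return (a, b + " " + c)

def sq(vbd, np, vp):
    """SQ(abcd) -> VBD(b) NP(c) VP2(a, d)"""
    (b,), (c,) = vbd, np
    a, d = vp  # the two components of the discontinuous VP
    return (" ".join([a, b, c, d]),)

vp = vp2(("Why",), ("cross",), ("the road",))
sentence = sq(("did",), ("the chicken",), vp)
print(vp)        # ('Why', 'cross the road')
print(sentence)  # ('Why did the chicken cross the road',)
```

The SQ rule places the first VP component before "did the chicken" and the second after it, which is exactly the interleaving a CFG rule cannot express.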

Page 9

Linear Context-Free Rewriting Systems

- Mildly context-sensitive grammar formalism
- Can be parsed with a tabular parsing algorithm
- Agenda-based probabilistic parser for LCFRS (Kallmeyer & Maier 2010); extended to produce k-best derivations
- Parsing a binarized LCFRS has polynomial complexity:

  $O(n^{3\varphi})$

  where $\varphi$ is the maximum number of components covered by a non-terminal (fan-out).

Kallmeyer & Maier (2010). Data-driven parsing with probabilistic linear context-free rewriting systems.
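
To get a feeling for the bound, the small calculation below evaluates n^(3φ) for a 40-word sentence and a few fan-out values. It is only worked arithmetic: constant factors are ignored and the function name is made up for this example. Fan-out 1 corresponds to the familiar cubic bound for context-free parsing.

```python
def parsing_bound(n, fanout):
    """Value of n^(3 * fanout) for a sentence of n words."""
    return n ** (3 * fanout)

for fanout in (1, 2, 3):
    print(fanout, parsing_bound(40, fanout))
# 1 64000
# 2 4096000000
# 3 262144000000000
```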

Page 10

But ...

[Plot: average CPU time (seconds, 0 to 1200) against sentence length (0 to 40) for PLCFRS.]

Negra dev. set, gold tags

Page 11

PCFG approximation of PLCFRS

[Figure: two trees over the pre-terminals A X C Y D and the words a b c b d. One has a single node B under S; in the other, B is replaced by the two components B*1 and B*2.]

- Transformation is reversible
- Increased independence assumption:
  ⇒ every component is a new node
- Language is a superset of original PLCFRS
  ⇒ coarser, overgenerating PCFG (‘split-PCFG’)

Boyd (2007). Discontinuity revisited.
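
The splitting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a node is described by its label and the set of word positions it covers, and that B covers the two separated occurrences of b (positions 1 and 3, counting from 0), as the labels B*1 and B*2 suggest. All function names are hypothetical.

```python
def components(positions):
    """Group word positions into maximal contiguous runs."""
    runs = []
    for pos in sorted(positions):
        if runs and pos == runs[-1][-1] + 1:
            runs[-1].append(pos)
        else:
            runs.append([pos])
    return runs

def split_node(label, positions):
    """Return one (label, run) pair per component of a node."""
    runs = components(positions)
    if len(runs) == 1:
        return [(label, runs[0])]  # continuous node: left unchanged
    return [("%s*%d" % (label, i), run) for i, run in enumerate(runs, 1)]

def merge_label(label):
    """Undo the split on a label: 'B*2' -> 'B'."""
    return label.split("*")[0]

print(split_node("B", {1, 3}))    # [('B*1', [1]), ('B*2', [3])]
print(split_node("S", range(5)))  # [('S', [0, 1, 2, 3, 4])]
print(merge_label("B*2"))         # B
```

Because the component index is kept in the label, the original discontinuous node can be restored, which is what makes the transformation reversible.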

Page 12

Coarse-to-fine pipeline

G0: Split-PCFG
G1: PLCFRS
G2: a large grammar

[Diagram annotations: "Treebank grammars", "Mildly context-sensitive"]

Prune parsing with $G_{m+1}$ by only considering items in the k-best $G_m$ derivations.
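
The pruning step can be sketched as a whitelist. This is a simplified sketch under stated assumptions: a derivation is taken to be a set of (label, span) items, spans are continuous for brevity, and the `project` argument stands in for whatever maps a fine label back to its coarse label. None of these names come from the parser.

```python
def whitelist_from_kbest(kbest_derivations):
    """Collect every (label, span) item used in any of the k-best derivations."""
    allowed = set()
    for derivation in kbest_derivations:
        allowed.update(derivation)
    return allowed

def prune(candidate_items, allowed, project=lambda label: label):
    """Keep fine-stage items whose projected label and span occur in the whitelist."""
    return [(label, span) for label, span in candidate_items
            if (project(label), span) in allowed]

# Hypothetical coarse-stage output: two derivations of a 3-word sentence.
kbest = [
    {("S", (0, 3)), ("NP", (0, 1)), ("VP", (1, 3))},
    {("S", (0, 3)), ("NP", (0, 2)), ("VP", (2, 3))},
]
allowed = whitelist_from_kbest(kbest)
fine_items = [("NP", (0, 1)), ("NP", (1, 3)), ("VP", (1, 3))]
print(prune(fine_items, allowed))  # [('NP', (0, 1)), ('VP', (1, 3))]
```

A larger k admits more items into the next stage, trading speed for a larger search space.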

Page 13

With coarse-to-fine

[Plot: average CPU time (seconds, 0 to 1200) against sentence length (0 to 40) for three series: PLCFRS (k=10,000), Split-PCFG, and PLCFRS.]

Negra dev. set, gold tags

Page 14

Data-Oriented Parsing

Treebank grammar
trees ⇒ productions + rel. frequencies
⇒ problematic independence assumptions

Data-Oriented Parsing (DOP)
trees ⇒ fragments + rel. frequencies
fragments are arbitrarily sized chunks from the corpus

consider all possible fragments from treebank ...
and “let the statistics decide”

Scha (1990): Lang. theory and lang. tech.; competence and performance
Bod (1992): A computational model of language performance

Page 15

DOP fragments

[Figure: all fragments of the discontinuous tree for "is Gatsby rich", in which S dominates NP (Gatsby) and a discontinuous VP2 that dominates VB (is) and ADJ (rich). Shown are the fragments rooted in S (the full tree and its variants with some or all words and sub-constituents left open), the fragments rooted in VP2, and the single-word fragments NP (Gatsby), VB (is) and ADJ (rich).]

$$P(f) = \frac{\mathrm{count}(f)}{\sum_{f' \in F} \mathrm{count}(f')} \quad \text{where } F = \{\, f' \mid \mathrm{root}(f') = \mathrm{root}(f) \,\}$$

Note: discontinuous frontier non-terminals mark destination of components
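
The relative frequency estimate can be sketched in a few lines. The counts below are invented for illustration, and the bracketed string encoding of fragments and both function names are assumptions of this example.

```python
from collections import Counter, defaultdict

def root_label(fragment):
    """Root label of a bracketed fragment such as '(NP Gatsby)'."""
    return fragment[1:].split()[0].rstrip(")")

def fragment_probabilities(counts):
    """P(f) = count(f) divided by the total count of fragments with the same root."""
    totals = defaultdict(int)
    for fragment, n in counts.items():
        totals[root_label(fragment)] += n
    return {f: n / totals[root_label(f)] for f, n in counts.items()}

# Hypothetical fragment counts, not taken from any treebank.
counts = Counter({"(NP Gatsby)": 3, "(NP Daisy)": 1, "(VB is)": 4})
probs = fragment_probabilities(counts)
print(probs["(NP Gatsby)"])  # 0.75
print(probs["(VB is)"])      # 1.0
```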

Page 16

DOP derivation

[Figure: two derivations of the tree for "is Gatsby rich". In the first, the fragment with S over VP2 (VB, ADJ rich) and NP is combined with VB (is) and NP (Gatsby): P(d) = 0.2. In the second, the fragment with S over VP2 (VB is, ADJ) and NP is combined with NP (Gatsby) and ADJ (rich): P(d) = 0.3.]

Derivations for this tree: P(t) = 0.5

$$P(d) = P(f_1 \circ \cdots \circ f_n) = \prod_{f \in d} p(f)$$

$$P(t) = P(d_1) + \cdots + P(d_n) = \sum_{d \in D(t)} \prod_{f \in d} p(f)$$
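
The two formulas can be checked numerically. In the sketch below only the derivation probabilities 0.2 and 0.3 and their sum 0.5 come from the slide; the individual fragment probabilities are hypothetical values chosen so that the products match, and the string names for fragments are just labels for this example.

```python
from math import prod

# Hypothetical fragment probabilities (only the products are from the slide).
p = {
    "S(VP2(VB ADJ(rich)) NP)": 0.25,
    "S(VP2(VB(is) ADJ) NP)": 0.375,
    "VB(is)": 0.8,
    "NP(Gatsby)": 1.0,
    "ADJ(rich)": 0.8,
}
d1 = ["S(VP2(VB ADJ(rich)) NP)", "VB(is)", "NP(Gatsby)"]
d2 = ["S(VP2(VB(is) ADJ) NP)", "NP(Gatsby)", "ADJ(rich)"]

def derivation_probability(derivation):
    """P(d): product of the probabilities of the fragments in d."""
    return prod(p[f] for f in derivation)

def tree_probability(derivations):
    """P(t): sum of P(d) over the derivations that yield t."""
    return sum(derivation_probability(d) for d in derivations)

print(round(derivation_probability(d1), 3))  # 0.2
print(round(derivation_probability(d2), 3))  # 0.3
print(round(tree_probability([d1, d2]), 3))  # 0.5
```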

Page 17

DOP implementation issues

Exponential number of fragments due to all-fragments assumption

- Can use DOP reduction (Goodman 2003); weight of fragments spread over many productions
- Can restrict number of fragments by depth or frontier nodes etc.,
  ⇒ but: not data-oriented!

Goodman (2003): Efficient parsing of DOP with PCFG-reductions
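
The growth in the number of fragments can be made concrete with a short count. The sketch assumes trees are nested tuples of the form (label, child, ...) with words as plain strings; at every non-terminal child a fragment either stops (leaving a substitution site) or continues with any fragment of that child, which gives the product below. The function names are made up for this example.

```python
def count_rooted(node):
    """Number of fragments rooted at this node."""
    result = 1
    for child in node[1:]:
        if not isinstance(child, str):         # words offer no choice
            result *= count_rooted(child) + 1  # cut here, or continue with a fragment of the child
    return result

def count_all(node):
    """Number of fragments rooted at this node or at any node below it."""
    total = count_rooted(node)
    for child in node[1:]:
        if not isinstance(child, str):
            total += count_all(child)
    return total

tree = ("S", ("VP2", ("VB", "is"), ("ADJ", "rich")), ("NP", "Gatsby"))
print(count_rooted(tree), count_all(tree))  # 10 17
```

Even this three-word tree has 17 fragments, 10 of them rooted in S; since the counts multiply at every branching node, the total grows exponentially with tree size.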

Page 18

Double-DOP

- Extract fragments that occur at least twice in treebank
- For every pair of trees, extract maximal overlapping fragments
- Can be extracted in linear average time
- Number of fragments is small enough to parse with directly

Sangati & Zuidema (2011). Accurate parsing w/compact TSGs: Double-DOP
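
The idea of extracting maximal overlapping fragments from a pair of trees can be sketched as follows. This is a deliberately naive version that compares all node pairs (quadratic, not the linear average time method mentioned above), handles continuous trees only, and does not track counts. Trees are nested tuples (label, child, ...) with words as strings, and a one-element tuple such as ('NP',) marks a substitution site; all of this is an assumption of the example.

```python
def production(node):
    """Label plus the labels or words of the immediate children."""
    return (node[0], tuple(c if isinstance(c, str) else (c[0],) for c in node[1:]))

def nodes(tree, path=()):
    """Yield (path, node) pairs in pre-order, parents before children."""
    yield path, tree
    for i, child in enumerate(tree[1:]):
        if not isinstance(child, str):
            yield from nodes(child, path + (i,))

def common_fragment(a, pa, b, pb, covered):
    """Largest fragment shared by nodes a and b, which have equal productions."""
    covered.add((pa, pb))
    children = []
    for i, (ca, cb) in enumerate(zip(a[1:], b[1:])):
        if isinstance(ca, str):
            children.append(ca)                # shared word
        elif production(ca) == production(cb):
            children.append(common_fragment(ca, pa + (i,), cb, pb + (i,), covered))
        else:
            children.append((ca[0],))          # trees diverge: substitution site
    return (a[0],) + tuple(children)

def maximal_fragments(t1, t2):
    """Shared fragments that are not part of a larger shared fragment."""
    covered, result = set(), []
    for pa, a in nodes(t1):
        for pb, b in nodes(t2):
            if (pa, pb) not in covered and production(a) == production(b):
                result.append(common_fragment(a, pa, b, pb, covered))
    return result

t1 = ("S", ("NP", "Gatsby"), ("VP", ("VB", "is"), ("ADJ", "rich")))
t2 = ("S", ("NP", "Daisy"), ("VP", ("VB", "is"), ("ADJ", "happy")))
print(maximal_fragments(t1, t2))
# [('S', ('NP',), ('VP', ('VB', 'is'), ('ADJ',)))]
```

The two trees share one maximal fragment; the VP and VB fragments contained in it are not reported separately because their node pairs are already covered.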

Page 19

From fragments to grammar

- Fragments mapped to unique rules, relative frequencies as probabilities
- Remove internal nodes, leaving the root node, substitution sites & terminals:
  X → X1 ... Xn
- Reconstruct derivations after parsing

[Figure: the fragment with S over VP2 (VB, ADJ rich) and NP is flattened to the rule S → VB NP rich.]

Sangati & Zuidema (2011). Accurate parsing w/compact TSGs: Double-DOP
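
The flattening step can be sketched for the fragment in the figure. The representation is an assumption of this example: an internal node is (label, [children]), while a substitution site or word is (symbol, position), with the position giving its place in the sentence so that discontinuous material ends up in the right order. The sketch does not show how clashes between fragments that flatten to the same rule are kept apart.

```python
def frontier(node):
    """Collect (position, symbol) for all substitution sites and words."""
    label, rest = node
    if isinstance(rest, int):
        return [(rest, label)]
    return [item for child in rest for item in frontier(child)]

def flatten(fragment):
    """Drop internal nodes: keep the root and the frontier in sentence order."""
    root = fragment[0]
    return (root, tuple(symbol for _, symbol in sorted(frontier(fragment))))

# S over VP2 (VB, ADJ rich) and NP; positions follow "is Gatsby rich".
fragment = ("S", [("VP2", [("VB", 0), ("ADJ", [("rich", 2)])]), ("NP", 1)])
rule = flatten(fragment)
print(rule)  # ('S', ('VB', 'NP', 'rich'))

# Remember which fragment each rule came from, to rebuild derivations after parsing.
backtransform = {rule: fragment}
```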

Page 20

Preprocessing

- Remove function labels
- Binarize w/markovization (h=1, v=1)
- Simple unknown word model
  - Rare words replaced by features (model 4 from Stanford parser)
  - Reserve probability mass for unseen (tag, word) pairs

Page 21

Results w/Double-DOP

                 F1 %
DOP reduction    74.3
Double-DOP

(Negra dev set ≤ 40 words, gold tags)

Page 22

Results w/Double-DOP

                 F1 %
DOP reduction    74.3
Double-DOP       76.3

(Negra dev set ≤ 40 words, gold tags)

Also: parsing 3× faster, grammar 3× smaller

Page 23

Results w/Double-DOP

                 k=50    k=5000
                 F1 %    F1 %
DOP reduction    74.3    73.5
Double-DOP       76.3

(Negra dev set ≤ 40 words, gold tags)

What if we reduce pruning?

Page 24

Results w/Double-DOP

                 k=50    k=5000
                 F1 %    F1 %
DOP reduction    74.3    73.5
Double-DOP       76.3    77.7

(Negra dev set ≤ 40 words, gold tags)

What if we reduce pruning?
⇒ For Double-DOP, performance does not deteriorate with expanded search space.

Page 25

Main Results: test sets

Parser, treebank         |w|     POS     F1      EX
GERMAN
vanCra2012, Negra        ≤ 40    100     72.3    33.2
#KaMa2013, Negra         ≤ 30    100     75.8
this paper, Negra        ≤ 40    100     76.8    40.5
this paper, Negra        ≤ 40    96.3    74.8    38.7
HaNi2008, Tiger          ≤ 40    97.0    75.3    32.6
this paper, Tiger        ≤ 40    97.6    78.8    40.8

KaMa: Kallmeyer & Maier (2013) [different test set];
vanCra: van Cranenburgh (2012); HaNi: Hall & Nivre (2008).

Page 26

Main Results: test sets

Parser, treebank         |w|     POS     F1      EX
ENGLISH
#EvKa2011, disc. wsj     < 25    100     79.0
this paper, disc. wsj    ≤ 40    96.6    85.6    31.3
SaZu2011, wsj            ≤ 40            87.9    33.7
DUTCH
this paper, Alpino       ≤ 40    85.2    65.9    23.1
this paper, Lassy        ≤ 40    94.6    77.0    35.2

EvKa: Evang & Kallmeyer (2011) [different test set];
SaZu: Sangati & Zuidema (2011).

Page 28

Can DOP handle discontinuity without LCFRS?

Split-PCFG ⇒ PLCFRS ⇒ PLCFRS Double-DOP: 77.7 % F1, 41.5 % EX
Split-PCFG ⇒ Split-Double-DOP: 78.1 % F1, 42.0 % EX

Answer: Yes!

Fragments can capture discontinuous contexts

Page 30

Conclusions

- Multilingual results for discontinuous parsing, w/automatic assignment of tags
- All fragments vs. selected fragments
  - Explicit representation of recurring fragments with Double-DOP leads to better sample of derivations than parsing with all fragments
- Not necessary to parse beyond CFG!
  ⇒ Increase amount of context through fragments / labels
- LCFRS could be exploited for other things than discontinuity: adjunction, synchronous parsing, ...

Page 34

THE END

Code: http://github.com/andreasvc/disco-dop

Page 35

Wait . . . there’s more

BACKUP SLIDES

Page 36

Efficiency (Negra dev set)

[Plot: CPU time (seconds, 0 to 40) against number of words (10 to 40) for three series: dop, plcfrs, pcfg.]

Page 37

Binarization

- mark heads of constituents
- head-outward binarization (parse head first)
- no parent annotation: v = 1
- horizontal Markovization: h = 1

[Figure: an n-ary constituent X with children A B C D E F, next to its binarized version over the same children, with intermediate nodes labeled XA, XB, XF, XE, XD and XC$.]

Klein & Manning (2003): Accurate unlexicalized parsing.

Page 38

Parser setup

traincorpus='wsj02-21.export',
testcorpus='wsj24.export',
corpusdir='../../dptb',
stages=[
    # coarse stage: split-PCFG
    dict(
        name='pcfg', mode='pcfg',
        split=True, markorigin=True,
    ),
    # PLCFRS, pruned with the k-best split-PCFG derivations
    dict(
        name='plcfrs', mode='plcfrs',
        prune=True, splitprune=True, k=10000,
    ),
    # Double-DOP, pruned with the k-best PLCFRS derivations
    dict(
        name='dop', mode='plcfrs',
        prune=True, k=5000,
        dop=True, usedoubledop=True, m=10000,
        estimator='dop1', objective='mpp',
    ),
],
[...]

Page 39

Web-based interface