Unsupervised Dependency Parsing. David Mareček, Institute of Formal and Applied Linguistics, Charles University in Prague. Doctoral thesis defense, September 26, 2012.
Page 1

Unsupervised Dependency Parsing

David Mareček

Institute of Formal and Applied Linguistics, Charles University in Prague

Doctoral thesis defense, September 26, 2012

Page 2

Outline

Unsupervised dependency parsing: What is it? What is it good for?

My work: the reducibility feature, the dependency model, a Gibbs sampling algorithm for dependency trees, and results.

Page 3

Supervised: the parser is learned on a manually annotated treebank.

Unsupervised: no treebanks, no language-specific linguistic rules; only corpora without manual tree annotations.

Semi-supervised: something in the middle.

Dependency parsing

My grandmother plays computer games .

PRP$ NN VBZ NN NNS .

Page 4

Unsupervised dependency parsing

Induction of linguistic structure directly from a text corpus, based on language-independent linguistic assumptions about dependencies; sometimes called "grammar induction".

We can use it for any language and domain. We do not need any new manually annotated treebanks. It is independent of any particular linguistic theory.

We can tune it with respect to the final application. E.g. in machine translation: we do not know which structure is best for a particular language pair; it can differ from the structures used in treebanks.

It’s a challenge... Children do not use treebanks when learning their mother tongue. Could machines do it as well?

Page 5

REDUCIBILITY

Page 6

Reducibility

Definition: A word (or a sequence of words) in a sentence is reducible if it can be removed from the sentence without violating its correctness.

Some conference participants missed the last bus yesterday.

Some participants missed the last bus yesterday. ("conference" removed: REDUCIBLE)

Some conference participants the last bus yesterday. ("missed" removed: NOT REDUCIBLE)

Page 7

Hypothesis

If a word (or sequence of words) is reducible in a particular sentence, it is a leaf (or a subtree) in its dependency structure.

[Figure: the dependency tree of the example sentence. "missed" is the root; "participants", "bus", and "yesterday" depend on it; "Some" and "conference" hang under "participants", and "the" and "last" under "bus".]

Page 8

It mostly holds across languages. Problems occur mainly with function words:

PREPOSITIONAL PHRASES: They are at the conference.

DETERMINERS: I am in the pub.

AUXILIARY VERBS: I have been sitting there.

Let’s try to recognize reducible words automatically...


Page 9

Recognition of reducible words

We remove the word from the sentence.

But how can we automatically recognize whether the rest of the sentence is correct or not? Hardly... (we don’t have any grammar yet)

If we have a large corpus, we can search for the reduced sentence:
it is in the corpus -> it is (possibly) grammatical
it is not in the corpus -> we do not know

We will find only a few words reducible... very low recall
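A minimal sketch of this brute-force search, assuming the corpus is a list of tokenized sentences (the function and variable names are illustrative, not from the thesis):

```python
def find_reducible_words(corpus):
    """A word is (possibly) reducible if the sentence without it is
    found elsewhere in the corpus; otherwise we simply do not know."""
    seen = {tuple(sent) for sent in corpus}   # index whole sentences
    reducible = []
    for sent in corpus:
        for i in range(len(sent)):
            rest = tuple(sent[:i] + sent[i + 1:])
            if rest in seen:                  # found -> (possibly) grammatical
                reducible.append((tuple(sent), i))
    return reducible
```

As the slide notes, exact full-sentence matches are rare, so this alone finds very few reducible words.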

Page 10

Other possibilities?

Could we take a smaller context than the whole sentence? This does not work at all for free word-order languages.

Why not use part-of-speech tags instead of words? DT NN VBZ IN DT NN . -> DT NN VBZ DT NN . ... but the preposition IN should not be reducible.

Solution: we use the very sparse set of reducible words found in the corpus to estimate "reducibility scores" for PoS tags (or PoS tag sequences).

Page 11

Computing reducibility scores

For each possible PoS unigram, bigram, and trigram:
Find all its occurrences in the corpus.
For each such occurrence, remove the respective words and search for the rest of the sentence in the corpus. If it occurs at least once elsewhere in the corpus, the occurrence is proclaimed reducible.
Reducibility of a PoS n-gram = the relative number of its reducible occurrences.

Corpus:
I saw her .  (PRP VBD PRP .)
She was sitting on the balcony and wearing a blue dress .  (PRP VBD VBG IN DT NN CC VBG DT JJ NN .)
I saw her in the theater .  (PRP VBD PRP IN DT NN .)

R'("IN DT NN") = 1/2: the trigram IN DT NN occurs twice. Removing "in the theater" leaves "I saw her .", which is found elsewhere in the corpus, so that occurrence is reducible; removing "on the balcony" leaves a sentence not found in the corpus.

Page 12

Computing reducibility scores

• r(g) ... number of reducible occurrences of a PoS n-gram g
• c(g) ... number of all its occurrences

Reducibility score of g: R(g) = r(g) / c(g), the relative number of reducible occurrences, computed by the procedure on the previous slide.
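A sketch of the whole estimation, assuming tagged sentences given as lists of (word, tag) pairs; any smoothing of rare n-grams applied in the thesis is omitted here:

```python
from collections import defaultdict

def reducibility_scores(tagged_corpus, max_n=3):
    """Estimate R(g) = r(g) / c(g) for every PoS unigram, bigram,
    and trigram g found in the corpus."""
    seen = {tuple(w for w, _ in sent) for sent in tagged_corpus}
    r = defaultdict(int)   # r(g): reducible occurrences
    c = defaultdict(int)   # c(g): all occurrences
    for sent in tagged_corpus:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                g = tuple(tags[i:i + n])
                c[g] += 1
                # Remove the words under g; is the rest found elsewhere?
                rest = tuple(words[:i] + words[i + n:])
                if rest and rest in seen:
                    r[g] += 1
    return {g: r[g] / c[g] for g in c}
```

On the toy corpus from the previous slide this yields R(("IN", "DT", "NN")) = 0.5, matching the worked example.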

Page 13

Examples of reducibility scores

Reducibility scores of the English PoS tags induced from the English Wikipedia corpus

Page 14

Examples of reducibility scores

Reducibility scores of Czech PoS tags (1st and 2nd positions of the PDT tag)

Page 15

DEPENDENCY TREE MODEL

Page 16

Dependency tree model

Consists of four submodels: the edge model, the fertility model, the distance model, and the reducibility model.

Simplification: we use only PoS tags, not word forms (except for computing the reducibility scores).

Page 17

Edge model

P(dependent tag | edge direction, parent tag)
"Rich get richer" principle on dependency edges
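The "rich get richer" behaviour suggests a Chinese-restaurant-process style estimate (the CRP analogy appears explicitly in the answers at the end). A hedged sketch, where the base distribution and the concentration parameter alpha are my assumptions, not values from the thesis:

```python
from collections import defaultdict

class CRPModel:
    """'Rich get richer' estimate of P(outcome | context), e.g.
    P(dependent tag | edge direction, parent tag) for the edge model."""

    def __init__(self, base_prob, alpha=1.0):
        self.base_prob = base_prob          # P0(outcome), e.g. uniform over tags
        self.alpha = alpha                  # concentration (assumed hyperparameter)
        self.counts = defaultdict(int)      # (context, outcome) -> count
        self.totals = defaultdict(int)      # context -> count

    def prob(self, context, outcome):
        # The more often an edge was generated before, the more probable it is.
        c = self.counts[(context, outcome)]
        t = self.totals[context]
        return (c + self.alpha * self.base_prob) / (t + self.alpha)

    def update(self, context, outcome, delta=1):
        self.counts[(context, outcome)] += delta
        self.totals[context] += delta
```

The fertility model on the next slide can be estimated with the same mechanism, with the parent tag as the context and the pair (number of left children, number of right children) as the outcome.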

Page 18

Fertility model

P(number of left and right children | parent tag)
"Rich get richer" principle

Page 19

Distance model

Longer edges are less probable.
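A tiny sketch; the inverse-distance form and the exponent are assumptions for illustration, not the thesis's exact parametrization:

```python
def distance_weight(distance, beta=1.0):
    """Distance submodel sketch: longer edges get smaller weight.
    Assumed form: weight proportional to 1 / |distance|^beta, where
    distance >= 1 is the node-to-parent distance; a full model would
    normalize these weights into a probability distribution."""
    return 1.0 / (abs(distance) ** beta)
```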

Page 20

Reducibility model

The probability of a subtree is proportional to its reducibility score.
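A sketch of the lookup this submodel needs, reusing the scores computed earlier; the floor value for unseen n-grams is an assumption:

```python
def subtree_weight(subtree_tags, scores, floor=1e-3):
    """Reducibility submodel sketch: a subtree's weight is the
    reducibility score R(g) of the PoS n-gram g it spans, so subtrees
    that look reducible are preferred; unseen n-grams get an assumed
    small floor value instead of zero."""
    return max(scores.get(tuple(subtree_tags), 0.0), floor)
```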

Page 21

Probability of treebank

The probability of the whole treebank, which we want to maximize: a multiplication over all submodels and all words in the corpus.
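How the four submodels might multiply out over the corpus, sketched in log space and reusing the sketches above; the tree interface (nodes, distances, subtree tags) is assumed for illustration, not the thesis's exact factorization:

```python
import math

def treebank_log_prob(treebank, edge_model, fertility_model, scores):
    """Log-probability of the whole treebank: the product of the four
    submodels over all words, computed as a sum of logarithms."""
    lp = 0.0
    for tree in treebank:                        # assumed tree objects
        for node in tree.nodes:
            lp += math.log(edge_model.prob((node.direction, node.parent_tag),
                                           node.tag))
            lp += math.log(fertility_model.prob(node.tag,
                                                (node.n_left, node.n_right)))
            lp += math.log(distance_weight(node.distance_to_parent))
            lp += math.log(subtree_weight(node.subtree_tags, scores))
    return lp
```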

Page 22

GIBBS SAMPLING OF DEPENDENCY TREES

Page 23

Gibbs sampling

Initialization: a random projective dependency tree is generated for each sentence.

Sampling: small changes to the dependency structures are made over many iterations across the treebank. Each small change is chosen randomly with respect to the probability distribution over the resulting treebanks.

Decoding: final trees are built according to the last 100 samples.
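The three phases as a sketch; `random_projective_tree` and the tree interface are assumed helpers, and the small-change step itself is sketched two slides below:

```python
import random

def gibbs_parse(sentences, model, n_iter=200, keep_last=100):
    """Random projective initialization, many iterations of small
    changes, and collection of the last samples for MST decoding."""
    trees = [random_projective_tree(s) for s in sentences]  # assumed helper
    samples = []
    for it in range(n_iter):
        for tree in trees:
            node = random.choice(tree.non_root_nodes())     # assumed interface
            small_change(tree, node, model)                 # sketched below
        if it >= n_iter - keep_last:
            samples.append([tree.edge_set() for tree in trees])
    return samples
```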

Page 24

Gibbs sampling – bracketing notation

Each projective dependency tree can be expressed by a unique bracketing. Each bracket pair belongs to one node and delimits its descendants from the rest of the sentence. Each bracketed segment contains just one word that is not embedded deeper; this node is the head of the segment.

[Figure: the projective dependency tree over the tag sequence DT NN VB RB IN DT JJ NN that corresponds to the bracketing below; VB is the root.]

(((DT) NN) VB (RB) (IN ((DT) (JJ) NN)))
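A sketch converting a projective tree, given as a head array, into this bracketing; the head array below is my reconstruction of the tree in the figure, chosen so that the output reproduces the slide's bracketing:

```python
def tree_to_brackets(heads, tags):
    """Convert a projective dependency tree to its unique bracketing.
    heads[i] is the index of word i's parent (-1 marks the root)."""
    children = [[] for _ in heads]
    root = None
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)

    def bracket(i):
        # Each bracket pair delimits a node's descendants; the one word
        # not embedded deeper is the head of the segment.
        left = [bracket(c) for c in children[i] if c < i]
        right = [bracket(c) for c in children[i] if c > i]
        return "(" + " ".join(left + [tags[i]] + right) + ")"

    return bracket(root)

tags = ["DT", "NN", "VB", "RB", "IN", "DT", "JJ", "NN"]
heads = [1, 2, -1, 2, 2, 7, 7, 4]      # reconstructed example tree
print(tree_to_brackets(heads, tags))
# (((DT) NN) VB (RB) (IN ((DT) (JJ) NN)))
```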

Page 25

Gibbs sampling – small change

Choose one non-root node and remove its bracket. Then add another bracket that does not violate projectivity.

Example: removing the IN node's bracket from
(((DT) NN) VB (RB) (IN ((DT) (JJ) NN)))
leaves
(((DT) NN) VB (RB) IN ((DT) (JJ) NN))
and the candidate brackets, with the probabilities of the treebanks they would produce, are:

(IN ((DT) (JJ) NN))        0.0012
((RB) IN ((DT) (JJ) NN))   0.0009
((RB) IN)                  0.0011
(((DT) NN) VB (RB))        0.0023
(((DT) NN) VB)             0.0018
(VB (RB))                  0.0004
(VB)                       0.0016
(IN)                       0.0006
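A sketch of this step, assuming a tree object that can enumerate the projective candidate brackets; the weights are treebank probabilities like the ones listed above:

```python
import random

def small_change(tree, node, model):
    """Remove the chosen node's bracket, then sample a replacement
    bracket in proportion to the resulting treebank's probability."""
    tree.remove_bracket(node)                         # assumed interface
    candidates = tree.projective_brackets()           # all non-violating options
    weights = [model.prob_with_bracket(tree, b) for b in candidates]
    chosen = random.choices(candidates, weights=weights, k=1)[0]
    tree.add_bracket(chosen)
```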

Page 26

Gibbs sampling - decoding

After 200 iterations, we run the MST algorithm. Edge weights = occurrences of the individual edges in the treebank during the last 100 sampling iterations. The output trees may be non-projective.
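A sketch of the decoding for one sentence using networkx's Edmonds implementation (the non-projective Chu-Liu-Edmonds algorithm mentioned in the answers); the edge-set input format is an assumption:

```python
from collections import Counter
import networkx as nx

def decode_sentence(sampled_edge_sets):
    """Edge weight = how often the directed edge (head, dependent)
    occurred in the last 100 samples; the maximum spanning
    arborescence may be non-projective."""
    counts = Counter(edge for sample in sampled_edge_sets for edge in sample)
    g = nx.DiGraph()
    for (head, dep), c in counts.items():
        g.add_edge(head, dep, weight=c)
    best = nx.maximum_spanning_arborescence(g, attr="weight")
    return sorted(best.edges())
```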

Page 27

EXPERIMENTS AND EVALUATION

Page 28

Data

Inference and evaluation: CoNLL 2006/2007 test data; HamleDT treebanks (30 languages).

Estimating reducibility scores: Wikipedia corpus (W2C); 85 million tokens for English ... 3 million tokens for Japanese.

Page 29

Experiments

Different languages

Different combinations and variants of models

Supervised / unsupervised PoS tags (POS, CPOS, number of classes)

Including / excluding punctuation from training / from evaluation

Different decoding methods

Different evaluation metrics: DAS, UAS, NED

Page 30

Results

Reducibility model is very useful

Reducibility model   English   German   Czech
without              25.2      23.4     22.4
with                 45.2      38.0     43.8

For some languages, I achieved better results when using unsupervised PoS tags instead of supervised ones

Many mistakes are in punctuation

...

Page 31

Results

Page 32

Conclusions

I have introduced the reducibility feature, which is useful in unsupervised dependency parsing.

Reducibility scores for individual PoS tag n-grams are computed on a large corpus; the inference itself is done on smaller data.

I have proposed an algorithm for sampling projective dependency trees.

Better results for 15 out of 20 treebanks compared to the 2011 state of the art.

Future work: employ lexicalized models; improve reducibility (a different treatment of function words); parallel unsupervised parsing for machine translation.

Page 33

Thank you !

Page 34

ANSWERS

Page 35

Answers to A. Soegaard’s questions

The aim of the parsing may be:
To be able to parse any language using all the data resources available (McDonald, Petrov, ...)
To induce a grammar without using any manually annotated data (Spitkovsky, Blunsom, ...)

For a completely unsupervised solution, I should use unsupervised PoS tagging as well; I would not know which words are verbs, which are nouns, ...

Hyperparameter tuning and evaluation: in future work, it should be extrinsic (on a final application, e.g. MT); in my thesis, the only possibility was to evaluate against existing treebanks.

Page 36

Answers to A. Soegaard’s questions [2]

Decoding: I chose maximum-spanning-tree decoding, non-projective (the Chu-Liu-Edmonds algorithm); the results using annealing were not very different. I have not tested the projective (Eisner's) algorithm.

Comparing results with other work: many papers report results only on sentences not longer than 10 words.

Turkish 2006 data are missing: I did not have these data available.

Page 37

Answers to F. Jurčíček’s questions

(2) Chinese restaurant process: treebank generation ~ a Chinese restaurant process.

(4), (7) What is the history? When generating a treebank, a new dependency edge is generated based on the previously generated edges. When sampling a new treebank, new edges are sampled based on all the other edges in the treebank (exchangeability).

(5) Are the distance and reducibility models really unsupervised? Unsupervised – we do not need any labeled data. Language independent – they work for all languages. Are the properties of distance and reducibility assumptions, or did we observe them from data? The repeatability of edges could be observed from data as well.

Page 38

Answers to F. Jurčíček’s questions [2]

(7) Probability of a dependency relation: the proposed sampling algorithm can change more than one edge at once (to preserve treeness). The probability of the rest of the treebank is equal for all the candidates.

(7) Dependencies in the same tree are not i.i.d. That is true, and I am aware of it. The effect of this dependence is negligible given the very large number of sentences.

(8) Small changes: described as removing one bracket and adding another. As a result, more than one edge may change in a single sample.