Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. Dan Garrette, Jason Mielens, and Jason Baldridge. Proceedings of ACL 2013.

Jan 03, 2016


Page 1

Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

Dan Garrette, Jason Mielens, and Jason Baldridge

Proceedings of ACL 2013

Page 2

Semi-Supervised Training

HMM with Expectation-Maximization (EM)

Need:

Large raw corpus

Tag dictionary

[Kupiec, 1992][Merialdo, 1994]
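To make the two requirements concrete, here is a minimal sketch of EM training for a bigram HMM in which the tag dictionary restricts the tags each word may take. It is an illustration under stated assumptions, not the authors' implementation: the function names `normalize` and `train_hmm_em`, the boundary symbols, the uniform initialisation, and the toy corpus are all hypothetical, and no smoothing is applied.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary symbols

def normalize(counts):
    """Convert (context, outcome) counts into conditional probabilities."""
    totals = defaultdict(float)
    for (context, _), c in counts.items():
        totals[context] += c
    probs = defaultdict(float)  # pairs that were never counted get probability 0
    for (context, outcome), c in counts.items():
        if totals[context] > 0:
            probs[context, outcome] = c / totals[context]
    return probs

def train_hmm_em(sentences, tag_dict, tags, iterations=10):
    """EM for a bigram HMM; tag_dict limits which tags a word may take."""
    def allowed(word):
        return sorted(tag_dict.get(word, tags))  # unlisted words may take any tag

    vocab = {w for sent in sentences for w in sent}
    # Uniform initialisation over whatever the tag dictionary permits.
    trans = normalize({(a, b): 1.0 for a in [START] + tags for b in tags + [END]})
    emit = normalize({(t, w): 1.0 for w in vocab for t in allowed(w)})
    for _ in range(iterations):
        trans_counts, emit_counts = defaultdict(float), defaultdict(float)
        for sent in filter(None, sentences):  # skip empty sentences
            n, cand = len(sent), [allowed(w) for w in sent]
            # Forward probabilities.
            alpha = [{t: trans[START, t] * emit[t, sent[0]] for t in cand[0]}]
            for i in range(1, n):
                alpha.append({t: emit[t, sent[i]] * sum(alpha[i - 1][s] * trans[s, t]
                                                        for s in cand[i - 1])
                              for t in cand[i]})
            # Backward probabilities.
            beta = [{} for _ in range(n)]
            beta[-1] = {t: trans[t, END] for t in cand[-1]}
            for i in range(n - 2, -1, -1):
                beta[i] = {t: sum(trans[t, u] * emit[u, sent[i + 1]] * beta[i + 1][u]
                                  for u in cand[i + 1])
                           for t in cand[i]}
            z = sum(alpha[-1][t] * beta[-1][t] for t in cand[-1])
            if z == 0:
                continue
            # E-step: accumulate expected emission and transition counts.
            for i in range(n):
                for t in cand[i]:
                    gamma = alpha[i][t] * beta[i][t] / z
                    emit_counts[t, sent[i]] += gamma
                    if i == 0:
                        trans_counts[START, t] += gamma
                    if i == n - 1:
                        trans_counts[t, END] += gamma
                    if i > 0:
                        for s in cand[i - 1]:
                            trans_counts[s, t] += (alpha[i - 1][s] * trans[s, t]
                                                   * emit[t, sent[i]] * beta[i][t] / z)
        # M-step: re-estimate the parameters from the expected counts.
        trans, emit = normalize(trans_counts), normalize(emit_counts)
    return trans, emit

# Toy usage: "barks" is missing from the tag dictionary, so any tag is allowed for it.
raw_corpus = [["the", "dog", "walks"], ["the", "dog", "barks"], ["a", "dog", "walks"]]
tag_dictionary = {"the": {"DT"}, "a": {"DT"}, "dog": {"NN"}, "walks": {"VBZ", "NNS"}}
trans, emit = train_hmm_em(raw_corpus, tag_dictionary, ["DT", "NN", "VBZ", "NNS"])
```

The tag dictionary enters only through `allowed`: a word never receives emission mass from a tag the dictionary rules out, which is what keeps EM from drifting on a small raw corpus.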

Page 3

Previous Work

Supervised Learning

• Provides high accuracy for POS tagging (Manning, 2011).

• Performs poorly when little supervision is available.

Semi-Supervised Learning

• Done by training sequence models such as HMMs using the EM algorithm.

• Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

Page 4

Previous Work

Goldberg et al. (2008)

• Manually constructed a lexicon for Hebrew to train an HMM tagger.

• The lexicon was developed over a long period of time by expert lexicographers.

Täckström et al. (2013)

• Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.

• Large parallel corpora are required.

Page 5

Low-Resource Languages

6,900 languages in the world

~30 have non-negligible quantities of data

No million-word corpus for any endangered language

[Maxwell and Hughes, 2006][Abney and Bird, 2010]

Page 6

Low-Resource Languages

Kinyarwanda (KIN): Niger-Congo. Morphologically rich.

Malagasy (MLG): Austronesian. Spoken in Madagascar.

Also, English

Page 7

Collecting Annotations

• Supervised training is not an option.

• Semi-supervised training:

• Annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks:

• Type supervision.

• Token supervision.
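The two kinds of annotation can be pictured as two data structures. The sketch below is only an illustration: the variable names, the helper `tag_dict_from_tokens`, and the English example words are assumptions, not the annotation format used in the study.

```python
from collections import defaultdict

# Type supervision: the annotator lists the possible tags for a word type,
# which directly yields tag dictionary entries.
type_annotations = {"the": {"DT"}, "dog": {"NN"}, "walks": {"VBZ", "NNS"}}

# Token supervision: the annotator tags every word of running text.
token_annotations = [[("the", "DT"), ("dog", "NN"), ("walks", "VBZ")]]

def tag_dict_from_tokens(tagged_sentences):
    """Collect the tags observed for each word in token-annotated sentences."""
    tag_dict = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_dict[word].add(tag)
    return dict(tag_dict)

observed = tag_dict_from_tokens(token_annotations)
```

Note the difference in coverage: the type entry for "walks" lists both of its tags, while the token-derived entry contains only the tag that happened to occur in the annotated sentence.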

Page 8

Tag Dict Generalization

These annotations are too sparse!

Generalize to the entire vocabulary

Page 9

Tag Dict Generalization

Haghighi and Klein (2006) do this with a vector space.

We don’t have enough raw data.

Das and Petrov (2011) do this with a parallel corpus.

We don’t have a parallel corpus.

Page 10

Tag Dict Generalization

Strategy: Label Propagation

• Connect annotations to raw corpus tokens

• Push tag labels to entire corpus

[Talukdar and Crammer, 2009]
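The idea of pushing labels through a graph can be sketched with a deliberately simplified propagation rule: annotated nodes stay fixed, and every other node repeatedly takes the normalised sum of its neighbours' tag distributions. This is not the algorithm of Talukdar and Crammer (2009); the function `propagate`, the unweighted edges, and the fixed iteration count are assumptions made for illustration.

```python
from collections import defaultdict

def propagate(edges, seeds, iterations=10):
    """Spread tag distributions from seed nodes across an undirected graph."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Seed nodes start with their annotation; all other nodes start empty.
    labels = {node: dict(seeds.get(node, {})) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node in neighbours:
            if node in seeds:  # annotated nodes are clamped to their seed label
                updated[node] = dict(seeds[node])
                continue
            scores = defaultdict(float)
            for nb in neighbours[node]:
                for tag, p in labels[nb].items():
                    scores[tag] += p
            total = sum(scores.values())
            updated[node] = {t: s / total for t, s in scores.items()} if total else {}
        labels = updated
    return labels

# Toy graph: tokens link to their type and to a shared context feature.
edges = [("TYPE_the", "TOK_the_1"), ("TYPE_the", "TOK_the_4"),
         ("TYPE_dog", "TOK_dog_2"), ("PREV_the", "TOK_dog_2"),
         ("PREV_the", "TOK_thug_5"), ("TYPE_thug", "TOK_thug_5")]
seeds = {"TYPE_the": {"DT": 1.0}, "TYPE_dog": {"NN": 1.0}}
labels = propagate(edges, seeds)
```

In this toy graph the unannotated word "thug" picks up the NN label because it shares the context feature PREV_the with the annotated word "dog".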

Page 11

Morphological Transducers

• Finite-state transducers (FSTs) are used for morphological analysis.

• An FST accepts a word type and produces a set of morphological features.

• Power of FSTs: they can analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.
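The affix-guessing behaviour can be imitated with a toy affix stripper. This is not a finite-state transducer and not the analyzers used in the study; the affix tables, feature names, and the function `analyze` are hypothetical, and English affixes are used only because they are easy to read.

```python
# Hypothetical affix tables mapping a surface affix to a feature name.
PREFIXES = {"un": "NEG", "re": "REP"}
SUFFIXES = {"ing": "PROG", "ed": "PAST", "s": "PL_OR_3SG"}

def analyze(word, min_stem=2):
    """Strip at most one known prefix and one known suffix; return (stem, features)."""
    stem, features = word, set()
    # Try longer affixes first so that "ing" is preferred over "s", for example.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem:
            stem = stem[len(prefix):]
            features.add(PREFIXES[prefix])
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem:
            stem = stem[:-len(suffix)]
            features.add(SUFFIXES[suffix])
            break
    return stem, features

stem, features = analyze("rewalked")
```

Because the stem is guessed rather than looked up, the same routine returns features for words that were never seen before, which is the property the slide highlights.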

Page 12

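Tag Dict Generalization

[Figure: label propagation graph. Token nodes: TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5, TOK_the_9. Type nodes: TYPE_the, TYPE_thug, TYPE_dog. Context feature nodes: PREV_<b>, PREV_the, NEXT_thug, NEXT_walks. Affix feature nodes: PRE1_t, PRE2_th, SUF1_e, SUF1_g, PRE1_d, PRE2_do.]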
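The node names above suggest how such a graph can be built from raw text. The sketch below is one plausible construction, assuming that each token node links to its type node and to its previous-word and next-word features, and that each type node links to its prefix and suffix features. The function `build_graph`, the use of `<b>` at both sentence boundaries, and the running token index are assumptions.

```python
def build_graph(sentences, max_affix=2):
    """Return the set of (node, feature_node) edges for a tokenised corpus."""
    edges = set()
    position = 0  # running token index across the whole corpus
    for sentence in sentences:
        padded = ["<b>"] + sentence + ["<b>"]  # <b> marks a sentence boundary
        for i, word in enumerate(sentence, start=1):
            position += 1
            token_node = f"TOK_{word}_{position}"
            type_node = f"TYPE_{word}"
            edges.add((token_node, type_node))
            edges.add((token_node, f"PREV_{padded[i - 1]}"))
            edges.add((token_node, f"NEXT_{padded[i + 1]}"))
            # Affix features hang off the type node, up to max_affix characters.
            for k in range(1, min(max_affix, len(word)) + 1):
                edges.add((type_node, f"PRE{k}_{word[:k]}"))
                edges.add((type_node, f"SUF{k}_{word[-k:]}"))
    return edges

graph = build_graph([["the", "dog", "walks"], ["the", "thug", "walks"]])
```

With this construction "the" and "thug" share the nodes PRE1_t and PRE2_th, so label information can flow between word types that look alike as well as between tokens that occur in similar contexts.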

Page 13

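Tag Dict Generalization

Type Annotations: the/DT, dog/NN

[Figure: the LP graph with type nodes TYPE_the, TYPE_thug, TYPE_dog; context nodes PREV_<b>, PREV_the, NEXT_walks; affix nodes PRE1_t, PRE2_th, SUF1_g; and token nodes TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5. The type annotations are shown beside the graph.]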

Page 14

Tag Dict Generalization

Type Annotations: the, dog

[Figure: the same LP graph, with the type annotations now placed on the type nodes: TYPE_the is labelled DT and TYPE_dog is labelled NN.]

Page 15

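Tag Dict Generalization

Type Annotations: the, dog

Token Annotations: the/DT dog/NN walks/VBZ

[Figure: the same LP graph, with type nodes TYPE_the, TYPE_thug, TYPE_dog; context nodes PREV_<b>, PREV_the, NEXT_walks; affix nodes PRE1_t, PRE2_th, SUF1_g; and token nodes TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5. The token annotations are shown beside the graph.]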

Page 16

Tag Dict Generalization

Type Annotations: the, dog

Token Annotations: the dog walks

[Figure: the same LP graph, with the token annotations now placed on token nodes: TOK_the_4 is labelled DT and TOK_dog_2 is labelled NN.]

Page 17

Model Minimization

[Ravi et al., 2010; Garrette and Baldridge, 2012]

• The LP graph has a node for each corpus token.

• Each node is labelled with a distribution over POS tags.

• The graph therefore provides a corpus of sentences labelled with noisy tag distributions.

• Greedily seek the minimal set of tag bigrams that describes the raw corpus.

• Then train an HMM with EM.
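The greedy step can be sketched as a set-cover problem: every pair of adjacent tokens must be explained by some tag bigram, and at each round the bigram that explains the most still-unexplained pairs is chosen. This is a simplified illustration only. The cited methods go further (for instance by using the noisy tag weights and ensuring complete tag paths), and the function `greedy_min_bigrams`, the boundary symbol, and the tie-breaking rule are assumptions.

```python
from itertools import product

def greedy_min_bigrams(candidate_tags):
    """candidate_tags: one list per sentence, holding a set of possible tags per token."""
    # Map each tag bigram to the adjacent-token positions it could explain.
    covers = {}
    for s, sentence in enumerate(candidate_tags):
        padded = [{"<b>"}] + sentence + [{"<b>"}]  # <b> marks a sentence boundary
        for i in range(len(padded) - 1):
            for bigram in product(padded[i], padded[i + 1]):
                covers.setdefault(bigram, set()).add((s, i))
    uncovered = set().union(*covers.values()) if covers else set()
    chosen = []
    while uncovered:
        # Pick the bigram covering the most uncovered positions;
        # sorting makes ties resolve deterministically.
        best = max(sorted(covers), key=lambda b: len(covers[b] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

candidates = [[{"DT"}, {"NN"}, {"VBZ", "NNS"}],
              [{"DT"}, {"NN", "VB"}, {"VBZ", "NNS"}]]
bigrams = greedy_min_bigrams(candidates)
```

Restricting the HMM's transitions to the chosen bigrams gives EM a much smaller and less noisy model to start from.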

Page 18

Overall Accuracy

[Figure: bar chart of tagging accuracy (y-axis: Accuracy, 0% to 100%) for three settings: KIN using all types; MLG using half types and half tokens; ENG using all types and the maximal amount of data.]

All of these values were achieved using both FST and affix LP features.

Page 19

Results

Page 20

Types versus Tokens

Page 21

Mixing Type and Token Annotations

Page 22

Morphological Analysis

Page 23

Annotator Experience

Page 24

Conclusion

• Type annotations are the most useful input from a linguist.

• We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a non-native linguist.