Labeled Grammar Induction with Minimal Supervision
Yonatan Bisk, Christos Christodoulopoulos, and Julia Hockenmaier
University of Illinois at Urbana-Champaign

References:
Creutz & Lagus 2006. Morfessor in the Morpho Challenge. Proc. of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, 2006.
Christodoulopoulos et al. 2011. A Bayesian mixture model of PoS induction using multiple features. Proc. of EMNLP 2011.
Bisk & Hockenmaier 2012. Simple Robust Grammar Induction with Combinatory Categorial Grammars. Proc. of AAAI 2012.
Bisk & Hockenmaier 2013. An HDP Model for Inducing Combinatory Categorial Grammars. Trans. of the ACL, 2013.
Bisk & Hockenmaier 2015. Probing the Linguistic Strengths and Limitations of Unsupervised Grammar Induction. Proc. of ACL 2015.

Most approaches to grammar induction assume that gold POS tags are available to the induction system. POS tags are arbitrary, relatively clean clusters, which we replace with induced clusters.

[Figure: three analyses of "John ate sushi with chopsticks"]
Dependency Grammar Induction: unlabeled dependencies from POS tags (NNP VBD NN IN NNS).
CCG Induction (Bisk & Hockenmaier 2013; 2015): labeled dependencies from POS tags (NNP VBD NN IN NNS), with arcs labeled (S\N)/N Arg1, (S\N)/N Arg2, (S\S)/N Arg1, and (S\S)/N Arg2.
This paper: labeled dependencies without POS tags, from induced clusters (C3 C21 C15 C7 C29), with arcs labeled (S\N)/N Arg1, (S\N)/N Arg2, (S\S)/N Arg1, and (S\S)/N Arg2.

1. Induce and Label Clusters: Noun, Verb, Other
We use the Bayesian Mixture of Multinomials model (BMMM) of Christodoulopoulos et al. 2011 to induce word clusters. BMMM performs a type-based clustering based on token-level features and automatically inferred morphology [Morfessor (Creutz & Lagus 2006)]. Based on the Universal POS tags of the three most common words, clusters are labeled as N(oun), V(erb) or O(ther).

Example clusters:
N: shares, sales, business, companies, prices, investors, them, people, bonds, stocks, earnings, officials, income, rates, markets, analysis, products, funds, operations, growth, banks, issues, costs, concern, traders, him, assets, loans, firms, results, here, …
V: ’s, is, was, are, has, were, had, rose, fell, ’re, ended, expects, whose, ’ve, remains, gained, owns, includes, became, jumped, holds, takes, provides, climbed, grew, gets, operates, sells, tumbled, seeks, becomes, begins, eased, allowed, helps, …
O: the, its, their, his, these, our, Robert, my, your, every, His, Hurricane, Sir, Their, Freddie, Dean, Du, Tom, Jim, Remic, Roger, Gary, Ronald, Kenneth, Alex, Bruce, Litigation, Jay, Alfred, Ad, CS, Andrew, negotiable, Thrift, Patrick, Allied, Speaker, …

Data:
• CCGbanks: English and Chinese
• Dependency Corpora: 10 PASCAL Challenge Languages
Metric: Recall from majority vote cluster labeling from 3 annotated words per cluster.
[Figure: Noun Recall, Verb Recall, and Other Recall (axis 0 to 100) for English, Chinese, and PASCAL]
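The labeling step in section 1 (a vote over the Universal POS tags of a cluster's three most common words) and the recall metric could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `label_cluster`, `label_recall`, and `COARSE`, the mapping of every tag other than NOUN and VERB to O, the fallback to O when no label has a strict majority, and the token-level reading of recall are all assumptions.

```python
from collections import Counter

# Assumed mapping from Universal POS tags to coarse labels; all other tags become "O".
COARSE = {"NOUN": "N", "VERB": "V"}


def label_cluster(word_counts, universal_tag, k=3):
    """Label a cluster N, V, or O by voting over its k most frequent words."""
    top_words = [w for w, _ in Counter(word_counts).most_common(k)]
    if not top_words:
        return "O"
    votes = Counter(COARSE.get(universal_tag[w], "O") for w in top_words)
    label, count = votes.most_common(1)[0]
    # Assumption: without a strict majority, fall back to "O".
    return label if count > len(top_words) / 2 else "O"


def label_recall(token_clusters, gold_labels, cluster_label, target):
    """Fraction of tokens with gold label `target` whose cluster carries that label."""
    relevant = [c for c, g in zip(token_clusters, gold_labels) if g == target]
    if not relevant:
        return 0.0
    hits = sum(cluster_label[c] == target for c in relevant)
    return hits / len(relevant)


# Hypothetical counts and tags, for illustration only.
counts = {"shares": 120, "sales": 95, "business": 80, "here": 4}
tags = {"shares": "NOUN", "sales": "NOUN", "business": "NOUN", "here": "ADV"}
print(label_cluster(counts, tags))
```

Only the three most frequent words of each cluster need an annotated tag, which is the sense in which the supervision is minimal.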
2. Induce a Grammar and Learn Labeled Dependencies
CCG Induction: Nouns can have the CCG category N. Verbs can have the CCG category S and may take adjacent nouns as arguments (S\N, S/N, (S\N)/N, etc.). Any word can act as a modifier (X|X) of an adjacent N or S.
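The constraints above can be illustrated with a simplified, single-pass sketch. It is not the induction procedure of Bisk & Hockenmaier, which this poster does not spell out; the function name `induce_categories`, the single pass, and the example labels are assumptions made for illustration.

```python
def induce_categories(labels):
    """Propose CCG categories for each word from coarse labels 'N', 'V', 'O'.

    Simplified single pass: atomic categories first, then modifiers and
    verb arguments licensed by the atomic categories of adjacent words.
    """
    # Nouns can be N, verbs can be S; other words start with no atomic category.
    atomic = [{"N"} if lab == "N" else {"S"} if lab == "V" else set() for lab in labels]
    cats = [set(a) for a in atomic]
    for i, lab in enumerate(labels):
        left = atomic[i - 1] if i > 0 else set()
        right = atomic[i + 1] if i + 1 < len(labels) else set()
        # Any word may modify an adjacent N or S: X/X looks right, X\X looks left.
        for x in right:
            cats[i].add(f"{x}/{x}")
        for x in left:
            cats[i].add(f"{x}\\{x}")
        # Verbs may take adjacent nouns as arguments.
        if lab == "V":
            if "N" in left:
                cats[i].add("S\\N")
            if "N" in right:
                cats[i].add("S/N")
            if "N" in left and "N" in right:
                cats[i].add("(S\\N)/N")
    return cats


# Hypothetical labels for "John ate sushi", for illustration only.
for word, cs in zip(["John", "ate", "sushi"], induce_categories(["N", "V", "N"])):
    print(word, sorted(cs))
```

Applying the same step repeatedly to the categories proposed in an earlier pass would yield more complex categories such as (N/N)/(N/N).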
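[Figure: induced analysis of "Hertz equipment is a major supplier of rental equipment"]

Word       Cluster  Category
Hertz      C3       N/N
equipment  C29      N
is         C21      S\N
a          C16      (S\S)/N
major      C9       N/N
supplier   C29      N
of         C7       (N\N)/N
rental     C9       N/N
equipment  C29      N

Labeled dependency arcs: S\N arg1; N/N arg1 (three arcs); (S\S)/N arg1 and arg2; (N\N)/N arg1 and arg2.

[Figure: category labels for the same sentence, listed in rows R0 (atomic N and S), R1 (N/N, N\N, S/S, S\S, S\N), and R2 (e.g. (N/N)/N, (N\N)/N, (S\S)/N, (S\N)\N, (N/N)/(N/N))]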
We train a parsing model (Bisk & Hockenmaier 2013; 2015) on the induced parse forests. The parser returns CCG derivations and hence labeled dependencies.
3. Parsing Evaluation
[Figure: Directed Attachments on Dependency Treebanks (axis 0 to 80) for Czech, English, CHILDES, Portuguese, Dutch, Basque, Swedish, Slovene, Danish, Arabic, and Average; This Work vs. Bisk & Hockenmaier 2013 (Gold POS)]
[Figure: Labeled F1 on CCGbank (axis 0 to 40) for English and Chinese; This Work vs. Bisk & Hockenmaier 2015 (Gold POS)]
Bisk & Hockenmaier 2015 produce labeled dependencies with an unsupervised CCG system based on gold POS tags. We show that performance degrades only slightly (by less than one third on average) when the gold POS tags are replaced with induced word clusters.
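For reference, the two evaluation measures named in this panel, and one reading of the "less than one third" comparison, can be sketched as below. The function names and all numbers are hypothetical, and treating the degradation as a relative drop from the gold-POS score is an assumption.

```python
def directed_attachment(gold_heads, pred_heads):
    """Fraction of words whose predicted head index equals the gold head index."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


def labeled_f1(gold_deps, pred_deps):
    """F1 over sets of labeled dependencies, e.g. (head, category, slot, dependent)."""
    gold, pred = set(gold_deps), set(pred_deps)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def relative_drop(gold_pos_score, induced_score):
    """Share of the gold-POS score lost when induced clusters are used instead."""
    return (gold_pos_score - induced_score) / gold_pos_score


# Hypothetical head indices and scores, for illustration only.
print(directed_attachment([2, 0, 2], [2, 0, 1]))
print(relative_drop(30.0, 24.0))
```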
Analysis & Future Work
Every language poses its own challenges. In panel 2 we see that identifying verbs proves difficult in Chinese. Additionally, in panel 4 we find the largest gaps in languages with rich morphology. Better clustering or feedback from the syntax may help address these issues.