Embracing Language Diversity

Nov 23, 2021

dariahiddleston
Page 1: Embracing Language Diversity

Embracing Language Diversity

Page 2: Embracing Language Diversity

More than 4,000 live languages

Most are resource-poor

Key Questions

Can we improve monolingual performance by exploiting multilingual connections?

Page 3: Embracing Language Diversity

Multilingual Learning

Linguistic Motivation:

Languages related structurally and genetically

But differ systematically in patterns of expression and ambiguity

Goal:

• Induce individual language structures

• Induce cross-lingual connections

Page 4: Embracing Language Diversity

Motivation for Multilingual Learning

• Learn from differences in lexical ambiguity

fish/poissons [N] vs. fish/pêcher [V]

• Learn from differences in structural ambiguity

(1) determiner “les” signals noun

Page 6: Embracing Language Diversity

Multilingual Learning for POS Tagging

Input: Untagged bilingual parallel corpus

Goal: Induce a POS tagger for each language (test on monolingual data)

Page 7: Embracing Language Diversity

Two Monolingual HMMs

Page 8: Embracing Language Diversity

Align Words

Page 9: Embracing Language Diversity

Merge Nodes of Aligned Words
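
The construction on the last three slides (two monolingual HMMs, word alignment, merged nodes for aligned words) can be pictured with a minimal Python sketch. It only builds the node structure and performs no inference. The function name `build_merged_nodes`, the one-to-one alignment assumption, and the example sentence pair are illustrative assumptions, not taken from the slides.

```python
def build_merged_nodes(src_tokens, tgt_tokens, alignments):
    """Build the tag-variable nodes of a merged bilingual sequence model.

    Hypothetical helper: `alignments` is a list of (src_index, tgt_index)
    pairs, assumed one-to-one. Each aligned pair shares a single node;
    every unaligned word keeps a node of its own.
    """
    aligned_src = {i for i, _ in alignments}
    aligned_tgt = {j for _, j in alignments}
    # One merged node per aligned word pair.
    nodes = [(i, j) for i, j in alignments]
    # Unaligned source words: node with no target side.
    nodes += [(i, None) for i in range(len(src_tokens)) if i not in aligned_src]
    # Unaligned target words: node with no source side.
    nodes += [(None, j) for j in range(len(tgt_tokens)) if j not in aligned_tgt]
    return nodes


# Illustrative sentence pair (an assumption, not from the slides).
english = ["I", "love", "fish"]
french = ["J'", "adore", "les", "poissons"]
alignments = [(0, 0), (1, 1), (2, 3)]
nodes = build_merged_nodes(english, french, alignments)
```

Here "fish" and "poissons" share one node, so the French side, where the determiner "les" signals a noun, can inform the tag of the ambiguous English word.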

Page 10: Embracing Language Diversity

Performance

Page 11: Embracing Language Diversity

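Bilingual Tagging Performance: Serbian

[Figure: tagging accuracy for Serbian (y-axis 65 to 81), comparing the monolingual model (Mono) with bilingual models paired with HU, RO, SL, CS, BG, ET, and EN]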

Page 12: Embracing Language Diversity

Part-of-Speech Tagging Accuracy

Page 13: Embracing Language Diversity

The More The Merrier!

Page 14: Embracing Language Diversity

Beyond Multilingual Tagging

Morphology, Parsing, POS Tagging

NN AC CC DTNN

Page 15: Embracing Language Diversity

Proposed Research

• Learn from non-parallel corpus

Benefit from the world’s wealth of language resources

• Move towards language-neutral semantic representation

he / הוא / وہ: num singular, person 3rd, animacy yes

smells / מריח / سونگھتا ہے: num singular, transitive yes, time present

flowers / پھول / פרחים: num plural, animacy no

Page 16: Embracing Language Diversity

Using Linguistic Universals for Structure Analysis

Constrain unsupervised grammar induction using language-independent syntactic rules

Root → Auxiliary

Root → Verb

Verb → Noun

Verb → Pronoun

Verb → Adverb

Verb → Verb

Auxiliary → Verb

Noun → Adjective

Noun → Article

Noun → Noun

Noun → Numeral

Preposition → Noun

Adjective → Adverb

(Naseem et al., EMNLP 2010)
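
To make the rule list concrete, here is a minimal Python sketch that stores the rules as (head, dependent) pairs and counts how many edges of a dependency parse match one of them. It assumes each rule on the slide pairs a head category with a dependent category. The names `UNIVERSAL_RULES` and `count_rule_edges` and the example parse are illustrative assumptions.

```python
# Rules as (head category, dependent category) pairs, read off the slide.
UNIVERSAL_RULES = {
    ("Root", "Auxiliary"), ("Root", "Verb"),
    ("Verb", "Noun"), ("Verb", "Pronoun"), ("Verb", "Adverb"), ("Verb", "Verb"),
    ("Auxiliary", "Verb"),
    ("Noun", "Adjective"), ("Noun", "Article"), ("Noun", "Noun"), ("Noun", "Numeral"),
    ("Preposition", "Noun"),
    ("Adjective", "Adverb"),
}


def count_rule_edges(edges):
    """Count the dependency edges whose (head, dependent) pair is a rule."""
    return sum(1 for edge in edges if edge in UNIVERSAL_RULES)


# Hypothetical parse of "Kids eat apples.": the verb heads both nouns.
kids_eat_apples = [("Root", "Verb"), ("Verb", "Noun"), ("Verb", "Noun")]
matched = count_rule_edges(kids_eat_apples)
fraction = matched / len(kids_eat_apples)
```

In this parse all three edges match a rule. A parse that attached the verb under a noun would match fewer.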

Page 17: Embracing Language Diversity

Using Universals for Structure Analysis

[Figure: results (y-axis 20 to 80) for English, Danish, Slovene, Spanish, Swedish, and Portuguese, comparing "No rules" with "Universal Rules"]

Page 18: Embracing Language Diversity

Adding the Universal Rules

[Figure: model posterior, with posterior probability plotted over parses of the data; example parses shown for "Kids eat apples."]

Page 19: Embracing Language Diversity

Adding the Universal Rules

[Figure: model posterior, with posterior probability plotted over parses of the data; example parses shown for "Kids eat apples."]

Count(edges ∈ rules): ... 1 … … … 3 … …

Posterior probability: × 0.005 (parse with count 1), × 0.01 (parse with count 3)

Page 20: Embracing Language Diversity

Adding the Universal Rules

[Figure: model posterior, with posterior probability plotted over parses of the data; example parses shown for "Kids eat apples."]

Count(edges ∈ rules): ... 1 … … … 3 … …

Posterior probability: × 0.005 (parse with count 1), × 0.01 (parse with count 3)

E[edges ∈ rules] = (… + 0.005 + … + … + 0.03 + …) = 2.79

Page 21: Embracing Language Diversity

Adding the Universal Rules

[Figure: model posterior, with posterior probability plotted over parses of the data; example parses shown for "Kids eat apples."]

Count(edges ∈ rules): ... 1 … … … 3 … …

Posterior probability: × 0.005 (parse with count 1), × 0.01 (parse with count 3)

E[edges ∈ rules] = (… + 0.005 + … + … + 0.03 + …) = 2.79

E[edges ∈ rules] ≥ 0.8 × total edges (0.8 is a pre-specified threshold)
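
The expectation and threshold check on this slide can be written out as a short Python sketch. Only the two terms 1 × 0.005 and 3 × 0.01 and the threshold 0.8 come from the slide. The function names and the three-parse toy posterior are assumptions made for illustration, and the sketch only evaluates the constraint; it does not show how the model is trained to satisfy it.

```python
def expected_rule_edges(parses):
    """Expected number of rule-matching edges under the posterior.

    `parses` is a list of (posterior_probability, rule_edge_count) pairs,
    one per candidate parse.
    """
    return sum(prob * count for prob, count in parses)


def satisfies_constraint(expected, total_edges, threshold=0.8):
    """Check E[edges in rules] >= threshold * total edges."""
    return expected >= threshold * total_edges


# The two terms visible on the slide: 1 * 0.005 and 3 * 0.01.
slide_terms = expected_rule_edges([(0.005, 1), (0.01, 3)])

# Hypothetical complete posterior over three parses of a 3-edge sentence.
toy_posterior = [(0.2, 1), (0.5, 3), (0.3, 2)]
toy_expected = expected_rule_edges(toy_posterior)
toy_ok = satisfies_constraint(toy_expected, total_edges=3)
```

With the toy posterior the expectation falls just short of 0.8 × 3, so the constraint would push probability mass towards parses with more rule-matching edges.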

Page 22: Embracing Language Diversity

The Gap Remains

[Figure: bar chart, y-axis 60 to 95]

Unsupervised, Headden III et al. (2009): 68.8

Universal rules, Naseem et al. (2010): 71.9

Supervised, McDonald et al. (2006): 91.5

Page 23: Embracing Language Diversity

Leverage Language Diversity in Language Analysis

• Typological Analysis: compare languages based on structural patterns (aka typological parameters)

• Parameters encode dimensions of language variance

Subject Verb Object Positioning

Number of Genders

Definite Article

Page 24: Embracing Language Diversity

Feature | English | Russian | Hebrew

Exponence of Selected Inflectional Formatives | No case | Case + number | No case

Definite Articles | Definite word distinct from demonstrative | No definite or indefinite article | Definite affix

Systems of Gender Assignment | Semantic | Semantic and formal | Semantic and formal

Order of Adjective and Noun | Adjective-Noun | Adjective-Noun | Noun-Adjective

Hand and Arm | Different | Identical | Different

The World Atlas of Language Structures Online: 2,650 languages, 142 features

Page 25: Embracing Language Diversity

From Typological Tables to Rule Distributions

[Figure: distributions P(. | Verb) for English and Portuguese, y-axis 0 to 0.4]

Page 26: Embracing Language Diversity

From Typological Tables to Rule Distributions

[Figure: distributions P(. | Noun) for English and Portuguese, y-axis 0 to 0.5]

Page 27: Embracing Language Diversity

Proposed Approach: Bilingual Scenario

[Diagram components: Low Density Language (unsupervised); Resource Rich Language (supervised); Typology Reference; Model for Low Density Language]

KL divergence between the two p(. | NP) distributions in the diagram
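
As a reminder of the quantity named on this slide, here is a minimal Python sketch of the KL divergence between two discrete distributions p(. | NP). It assumes the two distributions belong to the low-density and the resource-rich language and range over the same set of expansions. The function name, the smoothing constant `eps`, and the example probabilities are all illustrative assumptions, not values from the slides.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as dicts.

    Outcomes with zero probability under p contribute nothing; `eps`
    guards against division by zero when q lacks an outcome.
    """
    total = 0.0
    for outcome, p_val in p.items():
        if p_val > 0.0:
            q_val = max(q.get(outcome, 0.0), eps)
            total += p_val * math.log(p_val / q_val)
    return total


# Hypothetical expansion distributions for NP in two languages.
p_np_rich = {"Article Noun": 0.5, "Noun": 0.5}
p_np_low = {"Article Noun": 0.25, "Noun": 0.75}

divergence = kl_divergence(p_np_rich, p_np_low)
```

The divergence is zero when the two distributions agree and grows as they diverge, which is what makes it usable as a penalty tying the two languages' models together.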

Page 28: Embracing Language Diversity

Proposed Approach: Multilingual Scenario

[Diagram components: Low Density Language (unsupervised); p(. | NP); Arabic; English; Chinese; Typology Reference; Model for Low Density Language]

Page 29: Embracing Language Diversity

Semantic Analysis for Low-density Languages

Goal: Construct language-neutral abstract representation

Page 30: Embracing Language Diversity

Semantic Ambiguity

He smells flowers

smells (x1,x2): pos verb, num singular, transitive yes, time present

smells (x1): pos verb, num singular, transitive no, time present

smells: pos noun, num plural, count yes

Page 31: Embracing Language Diversity

He smells flowers

smells (x1,x2): pos verb, num singular, transitive yes, time present

smells (x1): pos verb, num singular, transitive no, time present

smells: pos noun, num plural, count yes

smells/سونگھتا ہے flowers/پھول he/وہ

سونگھتا ہے بدبو آتی ہےبدبوئیں

ריחסמ מריח תרחו

פרחים /flowersהוא /he מריח /smells

Page 32: Embracing Language Diversity

Construct a Language Neutral Semantic Representation

• Align trees of multi-parallel corpus

He smells flowers

הוא מריח פרחים

وہ پھول سونگھتا ہے

Page 33: Embracing Language Diversity

Construct a Language Neutral Semantic Representation

• Align trees of multi-parallel corpus

• Extract minimal set of frequently occurring fragments

Model with Dirichlet processes (adaptor grammar induction)

He smells flowers

הוא מריח פרחים

وہ پھول سونگھتا ہے

he / הוא / وہ: num singular, person 3rd, animacy yes

smells / מריח / سونگھتا ہے: num singular, transitive yes, time present

flowers / پھول / פרחים: num plural, animacy no

Page 34: Embracing Language Diversity

Construct a Language Neutral Semantic Representation

• Align trees of multi-parallel corpus

• Extract minimal set of frequently occurring fragments

• Learn semantic parsing in a monolingual setting

he / הוא / وہ: num singular, person 3rd, animacy yes

smells / מריח / سونگھتا ہے: num singular, transitive yes, time present

flowers / پھول / פרחים: num plural, animacy no

Page 35: Embracing Language Diversity

Construct a Language Neutral Semantic Representation

• Align trees of multi-parallel corpus

• Extract minimal set of frequently occurring fragments

• Learn semantic parsing in a monolingual setting

• Project representation into low density language via bilingual corpus

he / הוא / وہ: num singular, person 3rd, animacy yes

smells / מריח / سونگھتا ہے: num singular, transitive yes, time present

flowers / پھول / פרחים: num plural, animacy no
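
The last step, projecting the representation into a low-density language through a bilingual corpus, might look roughly like the Python sketch below. It simply copies each source word's feature bundle to the target word it is aligned with. The function name `project_features`, the index-based word alignment, and the assumption that alignments are given are illustrative; the slides do not specify the projection procedure.

```python
def project_features(source_features, alignments):
    """Copy feature bundles from source words to aligned target words.

    `source_features` maps a source word index to a dict of features;
    `alignments` is a list of (source_index, target_index) pairs.
    Target words without an aligned, annotated source word get nothing.
    """
    projected = {}
    for src_idx, tgt_idx in alignments:
        if src_idx in source_features:
            # Copy so later edits to one side do not affect the other.
            projected[tgt_idx] = dict(source_features[src_idx])
    return projected


# Source: "He smells flowers" (indices 0, 1, 2), features from the slide.
source_features = {
    0: {"num": "singular", "animacy": "yes"},
    1: {"num": "singular", "transitive": "yes", "time": "present"},
    2: {"num": "plural", "animacy": "no"},
}
# Assumed alignment to a target sentence ordered subject, object, verb.
alignments = [(0, 0), (1, 2), (2, 1)]
projected = project_features(source_features, alignments)
```

After projection, target position 2 (the verb) carries the verb's features and target position 1 carries those of "flowers", regardless of the difference in word order.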

Page 36: Embracing Language Diversity

Benefits of Multilingual Semantic Representation

• Developing tools with scarce target language annotations

– Reduces the need for training data by abstracting over alternative surface realizations

• Developing tools with no target language annotations

– Supports cross-lingual transfer due to language-neutral features derived from the representation