Transcript
Page 1: Moore_slides.ppt

Two Paradigms for Natural-Language Processing

Robert C. Moore, Senior Researcher, Microsoft Research

Page 2: Moore_slides.ppt

Why is Microsoft interested in natural-language processing?

Make computers/software easier to use.

Long-term goal: just talk to your computer (Star Trek scenario).

Page 3: Moore_slides.ppt

Some of Microsoft’s near(er) term goals in NLP

Better search: help find things on your computer; help find information on the Internet.

Document summarization: help deal with information overload.

Machine translation

Page 4: Moore_slides.ppt

Why is Microsoft interested in machine translation?

Internal: Microsoft is the world’s largest user of translation services. MT can help Microsoft:

Translate documents that would otherwise not be translated – e.g., the PSS knowledge base (http://support.microsoft.com/default.aspx?scid=fh;ES-ES;faqtraduccion).

Save money on human translation by providing machine translations as a starting point.

External: Sell similar software/services to other large companies.

Page 5: Moore_slides.ppt

Knowledge engineering vs. machine learning in NLP

The biggest debate over the last 15 years in NLP has been knowledge engineering vs. machine learning.

KE approach to NLP usually involves hand-coding of grammars and lexicons by linguistic experts.

ML approach to NLP usually involves training statistical models on large amounts of annotated or un-annotated text.

Page 6: Moore_slides.ppt

Central problems in KE-based NLP

Parsing – determining the syntactic structure of a sentence.

Interpretation – deriving a formal representation of the meaning of a sentence.

Generation – deriving a sentence that expresses a given meaning representation.

Page 7: Moore_slides.ppt

Simple examples of KE-based NLP notations

Phrase-structure grammar: S → Np Vp, Np → Sue, Np → Mary, Vp → V Np, V → sees

Syntactic structure: [[Sue]Np [[sees]V [Mary]Np]Vp]S

Meaning representation: [see(E), agt(E,sue), pat(E,mary)]
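To make these notations concrete, here is a minimal editorial sketch (not part of the original slides, and unrelated to the UG system described later) that encodes the same toy grammar as Python data and recovers the bracketed syntactic structure of "Sue sees Mary" with a naive top-down parser.

# Toy phrase-structure grammar from the slide, as nonterminal -> right-hand sides.
GRAMMAR = {
    "S":  [["Np", "Vp"]],
    "Np": [["Sue"], ["Mary"]],
    "Vp": [["V", "Np"]],
    "V":  [["sees"]],
}

def parse(symbol, words, start):
    """Try to derive `symbol` from words[start:]; return (tree, next_index) or None."""
    # Terminal case: the symbol is a word that matches the next input word.
    if start < len(words) and symbol == words[start]:
        return symbol, start + 1
    # Nonterminal case: try each right-hand side in turn.
    for rhs in GRAMMAR.get(symbol, []):
        children, i = [], start
        for child in rhs:
            result = parse(child, words, i)
            if result is None:
                break
            subtree, i = result
            children.append(subtree)
        else:
            return (symbol, children), i
    return None

tree, end = parse("S", ["Sue", "sees", "Mary"], 0)
print(tree)
# ('S', [('Np', ['Sue']), ('Vp', [('V', ['sees']), ('Np', ['Mary'])])])
# i.e., the structure [[Sue]Np [[sees]V [Mary]Np]Vp]S from the slide.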

Page 8: Moore_slides.ppt

Unification Grammar: the pinnacle of the NLP KE paradigm

Provides a uniform declarative formalism.

Can be used to specify both syntactic and semantic analyses.

A single grammar can be used for both parsing and generation.

Supports a variety of efficient parsing and generation algorithms.

Page 9: Moore_slides.ppt

Background: Question formation in English

To construct a yes/no question: place the tensed auxiliary verb from the corresponding statement at the front of the clause.

John can see Mary. Can John see Mary?

If there is no tensed auxiliary, add the appropriate form of the semantically empty auxiliary do.

John sees Mary. John does see Mary. Does John see Mary?

Page 10: Moore_slides.ppt

Question formation in English (continued)

To construct a who/what question:

For a non-subject who/what question, form a corresponding yes/no question: Does John see Mary?

Then replace the noun phrase in the position being questioned with a question noun phrase and move it to the front of the clause: Who does John see __?

For a subject who/what question, simply replace the subject with a question noun phrase: Who sees Mary?

Page 11: Moore_slides.ppt

Example of a UG grammar rule involved in who/what questions

S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL, whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[], whgap_out=[]),
    S2::(cat=s, stype=ynq, whgap_in=NP/NP_sem, whgap_out=[], vgap=[]).

Page 12: Moore_slides.ppt

Context-free backbone of rule

S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL, whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[], whgap_out=[]),
    S2::(cat=s, stype=ynq, whgap_in=NP/NP_sem, whgap_out=[], vgap=[]).

Page 13: Moore_slides.ppt

Category subtype features

S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL, whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[], whgap_out=[]),
    S2::(cat=s, stype=ynq, whgap_in=NP/NP_sem, whgap_out=[], vgap=[]).

Page 14: Moore_slides.ppt

Features for tracking long distance dependencies

S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL, whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[], whgap_out=[]),
    S2::(cat=s, stype=ynq, whgap_in=NP/NP_sem, whgap_out=[], vgap=[]).

Page 15: Moore_slides.ppt

Semantic features

S1/S_sem ---> [NP/NP_sem, S2/S_sem] :-
    S1::(cat=s, stype=whq, whgap_in=SL, whgap_out=SL, vgap=[]),
    NP::(cat=np, wh=y, whgap_in=[], whgap_out=[]),
    S2::(cat=s, stype=ynq, whgap_in=NP/NP_sem, whgap_out=[], vgap=[]).

Page 16: Moore_slides.ppt

Parsing algorithms for UG

Virtually any CFG parsing algorithm can be applied to UG by replacing identity tests on nonterminals with unification of nonterminals.

UG grammars are Turing complete, so grammars have to be written appropriately for parsing to terminate.

“Reasonable” grammars generally can be parsed in polynomial time, often n³.
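As an editorial sketch of that idea (simplified well beyond the grammar formalism shown above): where a CFG parser tests two nonterminal symbols for equality, a UG parser instead unifies their feature structures, and the combination fails when any feature clashes. Real unification also handles nested structures and shared variables; this toy version unifies only flat feature sets.

def unify(fs1, fs2):
    """Unify two flat feature structures (dicts); return the merge, or None on a clash."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None          # conflicting values: unification fails
        result[feature] = value
    return result

# Compatible descriptions merge into a more specific one...
print(unify({"cat": "np", "wh": "y"}, {"cat": "np", "num": "sg"}))
# {'cat': 'np', 'wh': 'y', 'num': 'sg'}

# ...while a clash (here on stype) makes unification fail.
print(unify({"cat": "s", "stype": "whq"}, {"cat": "s", "stype": "ynq"}))
# None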

Page 17: Moore_slides.ppt

Generation algorithms for UG

Since the grammar is purely declarative, generation can be done by “running the parser backwards.”

Efficient generation algorithms are more complicated than that, but still polynomial for “reasonable” grammars and “exact generation.”

Generation taking into account semantic equivalence is worst-case NP-hard, but still can be efficient in practice.

Page 18: Moore_slides.ppt

A Prolog-based UG system to play with

Go to http://www.research.microsoft.com/research/downloads/

Download “Unification Grammar Sentence Realization Algorithms,” which includes:

A simple bottom-up parser

Two sophisticated generation algorithms

A small sample grammar and lexicon

A paraphrase demo that:

Parses sentences covered by the grammar into a semantic representation.

Generates all sentences that have that semantic representation according to the grammar.

Page 19: Moore_slides.ppt

A paraphrase example

?- paraphrase(s(_,'CAT'([]),'CAT'([]),'CAT'([])), [what,direction,was,the,cat,chased,by,the,dog,in]).

in what direction did the dog __ chase the cat __

in what direction was the cat __ chased __ by the dog

in what direction was the cat __ chased by the dog __

what direction did the dog __ chase the cat in __

what direction was the cat __ chased in __ by the dog

what direction was the cat __ chased by the dog in __

generation_elapsed_seconds(0.0625)

Page 20: Moore_slides.ppt

Whatever happened to UG-based NLP?

UG-based NLP is elegant, but lacks robustness for broad-coverage tasks.

Hard for human experts to incorporate enough details for broad coverage, unless grammar/lexicon are very permissive.

Too many possible ambiguities arise as coverage increases.

Page 21: Moore_slides.ppt

How machine-learning-based NLP addresses these problems

Details are learned by processing very large corpora.

Ambiguities are resolved by choosing most likely answer according to a statistical model.

Page 22: Moore_slides.ppt

Increase in stat/ML papers at ACL conferences over 15 years

[Chart: Percent of stat/ML papers (y-axis, 0–100) by year (x-axis, 1985–2005), with data points for 1988, 1993, 1998, and 2003.]

Page 23: Moore_slides.ppt

Characteristics of ML approach to NLP compared to KE approach

Model-driven rather than theory-driven.

Uses shallower analyses and representations.

More opportunistic and more diverse in range of problems addressed.

Often driven by availability of training data.

Page 24: Moore_slides.ppt

Differences in approaches to stat/ML NLP

Type of training data: annotated (supervised training) vs. un-annotated (unsupervised training).

Type of model: joint model (e.g., generative probabilistic) vs. conditional model (e.g., conditional maximum entropy).

Type of training: joint (maximum likelihood training) vs. conditional (discriminative training).

Page 25: Moore_slides.ppt

Statistical parsing models

Most are: generative probabilistic models, trained on annotated data (e.g., the Penn Treebank), using maximum likelihood training.

The simplest such model would be a probabilistic context-free grammar.

Page 26: Moore_slides.ppt

Probabilistic context-free grammars (PCFGs)

A PCFG is a CFG that assigns to each production a conditional probability of the right-hand side given the left-hand side.

The probability of a derivation is simply the product of the conditional probabilities of all the productions used in the derivation.

PCFG-based parsing chooses, as the parse of a sentence, the derivation of the sentence having the highest probability.
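As an editorial illustration (the rule probabilities are made up), here is the earlier toy grammar turned into a PCFG, with the probability of the derivation of "Sue sees Mary" computed as the product of the conditional probabilities of its productions.

from math import prod

# P(rhs | lhs); the probabilities for each left-hand side must sum to 1.
PCFG = {
    ("S",  ("Np", "Vp")): 1.0,
    ("Np", ("Sue",)):     0.5,
    ("Np", ("Mary",)):    0.5,
    ("Vp", ("V", "Np")):  1.0,
    ("V",  ("sees",)):    1.0,
}

# Productions used in the derivation of "Sue sees Mary".
derivation = [
    ("S",  ("Np", "Vp")),
    ("Np", ("Sue",)),
    ("Vp", ("V", "Np")),
    ("V",  ("sees",)),
    ("Np", ("Mary",)),
]

p = prod(PCFG[rule] for rule in derivation)
print(p)   # 0.25 = 1.0 * 0.5 * 1.0 * 1.0 * 0.5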

Page 27: Moore_slides.ppt

Problems with simple generative probabilistic models

Incorporating more features into the model splits data, resulting in sparse data problems.

Joint maximum likelihood training “wastes” probability mass predicting the given part of the input data.

Page 28: Moore_slides.ppt

A currently popular technique: conditional maximum entropy models

Basic models are of the form:

p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \lambda_i f_i(x, y) \Big)

Advantages:

Using more features does not require splitting data.

Training maximizes conditional probability rather than joint probability.
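A minimal editorial sketch of evaluating a model of this form (the feature functions, weights, and the tagging decision are invented for illustration); Z(x) is obtained by summing the unnormalized scores over the candidate labels.

import math

def features(x, y):
    """Feature functions f_i(x, y) for a toy decision: should word x get tag y?"""
    return {
        "word=" + x + "&tag=" + y: 1.0,
        "tag=" + y: 1.0,
    }

weights = {   # the lambda_i, normally learned by maximizing conditional likelihood
    "word=can&tag=AUX":  1.5,
    "word=can&tag=NOUN": 0.2,
    "tag=AUX":          -0.5,
    "tag=NOUN":          0.3,
}

def p(y, x, labels):
    def score(label):   # exp(sum_i lambda_i * f_i(x, label))
        return math.exp(sum(weights.get(name, 0.0) * value
                            for name, value in features(x, label).items()))
    return score(y) / sum(score(label) for label in labels)   # divide by Z(x)

print(p("AUX", "can", ["AUX", "NOUN"]))   # ~0.62: "can" is probably an auxiliary here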

Page 29: Moore_slides.ppt

Unsupervised learning in NLP

Tries to infer unknown parameters and alignments of data to “hidden” states that best explain (i.e., assign highest probability to) un-annotated NL data.

Most common training method is Expectation Maximization (EM):

Assume initial distributions for joint probability of alignments of hidden states to observable data.

Compute joint probabilities for observed training data and all possible alignments.

Re-estimate probability distributions based on probabilistically weighted counts from previous step.

Iterate last two steps until desired convergence is reached.

Page 30: Moore_slides.ppt

Statistical machine translation

A leading example of unsupervised learning in NLP.

Models are trained from parallel bilingual, but otherwise un-annotated, corpora.

Models usually assume a sequence of words in one language is produced by a generative probabilistic process from a sequence of words in another language.

Page 31: Moore_slides.ppt

Structure of stat MT models

Often a noisy-channel framework is assumed:

p(e \mid f) \propto p(e) \, p(f \mid e)

In basic models, each target word is assumed to be generated by one source word.

Page 32: Moore_slides.ppt

A simple model: IBM model 1

A sentence e produces a sentence f assuming:

The length m of f is independent of the length l of e.

Each word of f is generated by one word of e (including an empty word e0).

Each word in e is equally likely to generate the word at any position in f, independently of how any other words are generated.

Mathematically:

p(f \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)
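Below is an editorial sketch of estimating the word-translation table t(f|e) for a model of this kind with the EM recipe from the earlier slide. The three-sentence English/Italian corpus is invented, and the empty word and the length term are omitted to keep the example short, so this gives only the flavor of real Model 1 training.

from collections import defaultdict

corpus = [   # (English sentence, Italian sentence) toy pairs
    (["the", "house"], ["la", "casa"]),
    (["the", "book"],  ["il", "libro"]),
    (["a", "book"],    ["un", "libro"]),
]

# Start from a uniform translation table t(f | e).
f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):
    count = defaultdict(float)   # expected counts c(f, e)
    total = defaultdict(float)   # expected counts c(e)
    # E-step: fractional counts, weighting each possible word alignment by its probability.
    for es, fs in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm
                count[(f, e)] += delta
                total[e] += delta
    # M-step: re-estimate t(f | e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# t for ('libro', 'book') rises well above the 'il' and 'un' alternatives.
print(round(t[("libro", "book")], 3))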

Page 33: Moore_slides.ppt

More advanced models

Most approaches:

Model how words are ordered (but crudely).

Model how many words a given word is likely to translate into.

The best-performing approaches model word-sequence-to-word-sequence translations.

Some initial work has been done on incorporating syntactic structure into models.

Page 34: Moore_slides.ppt

Examples of machine learned English/Italian word translations

PROCESSOR → PROCESSORE, APPLICATIONS → APPLICAZIONI, SPECIFY → SPECIFICARE, NODE → NODO, DATA → DATI, SERVICE → SERVIZIO, THREE → TRE, IF → SE, SITES → SITI, TARGET → DESTINAZIONE, RESTORATION → RIPRISTINO, ATTENDANT → SUPERVISORE, GROUPS → GRUPPI, MESSAGING → MESSAGGISTICA, MONITORING → MONITORAGGIO

THAT → CHE, FUNCTIONALITY → FUNZIONALITÀ, PHASE → FASE, SEGMENT → SEGMENTO, CUBES → CUBI, VERIFICATION → VERIFICA, ALLOWS → CONSENTE, TABLE → TABELLA, BETWEEN → TRA, DOMAINS → DOMINI, MULTIPLE → PIÙ, NETWORKS → RETI, A → UN, PHYSICALLY → FISICAMENTE, FUNCTIONS → FUNZIONI

Page 35: Moore_slides.ppt

How do KE and ML approaches to NLP compare today?

ML has become the dominant paradigm in NLP. (“Today’s students know everything about maxent modeling, but not what a noun phrase is.”)

ML results are easier to transfer than KE results.

We probably now have enough computer power and data to learn more by ML than a linguistic expert could encode in a lifetime.

In almost every independent evaluation, ML methods outperform KE methods in practice.

Page 36: Moore_slides.ppt

Do we still need linguistics in computational linguistics?

There are still many things we are not good at modeling statistically.

For example, stat MT models based on single words or strings are good at getting the right words, but poor at getting them in the right order.

Consider: La profesora le gusta a tu hermano.

Your brother likes the teacher. (the correct reading)

The teacher likes your brother. (what the surface word order alone suggests)

Page 37: Moore_slides.ppt

Concluding thoughts

If forced to choose between a pure ML approach and a pure KE approach, ML almost always wins.

Statistical models still seem to need a lot more linguistic features for really high performance.

A lot of KE is actually hidden in ML approaches, in the form of annotated data, which is usually expensive to obtain.

The way forward may be to find methods for experts to give advice to otherwise unsupervised ML methods, which may be cheaper than annotating enough data to learn the content of the advice.