600.465 Connecting the dots - I (NLP in Practice) Delip Rao delip@jhu.edu

Dec 23, 2015


Cordelia James
Page 1: 600.465 Connecting the dots - I (NLP in Practice) Delip Rao delip@jhu.edu.

600.465 Connecting the dots - I (NLP in Practice)

Delip Rao delip@jhu.edu

Page 3:

What is “Text”?

Page 4:

What is “Text”?

Page 5:

What is “Text”?

Page 6:

“Real” World

• Tons of data on the web
• A lot of it is text
• In many languages
• In many genres

Language by itself is complex. The Web further complicates language.

Page 7:

But we have 600.465

Adapted from: Jason Eisner

We can study anything about language ...

1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data

feature functions!
f(w_i = off, w_{i+1} = the)
f(w_i = obama, y_i = NP)

Forward-Backward, Gradient Descent, L-BFGS, Simulated Annealing, Contrastive Estimation, …
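The two feature functions above can be read as binary indicators over one position of a labeled sentence. The following is a minimal sketch under that reading; the helper name `make_feature`, the lowercase matching, and the toy sentence with its labels are assumptions made for illustration and do not come from the slides.

```python
from typing import Callable, List, Optional

# A feature looks at the words, the labels, and a position i, and returns 0 or 1.
Feature = Callable[[List[str], List[str], int], int]


def make_feature(word: Optional[str] = None,
                 next_word: Optional[str] = None,
                 label: Optional[str] = None) -> Feature:
    """Build an indicator that fires (returns 1) only when every condition
    that was specified holds at position i. Hypothetical helper."""
    def feature(words: List[str], labels: List[str], i: int) -> int:
        if word is not None and words[i].lower() != word:
            return 0
        if next_word is not None:
            if i + 1 >= len(words) or words[i + 1].lower() != next_word:
                return 0
        if label is not None and labels[i] != label:
            return 0
        return 1
    return feature


# f(w_i = off, w_{i+1} = the)
f_off_the = make_feature(word="off", next_word="the")
# f(w_i = obama, y_i = NP)
f_obama_np = make_feature(word="obama", label="NP")

# Toy sentence and labels, invented for this example.
words = ["Obama", "got", "off", "the", "plane"]
labels = ["NP", "VP", "PP", "NP", "NP"]

print([f_off_the(words, labels, i) for i in range(len(words))])   # [0, 0, 1, 0, 0]
print([f_obama_np(words, labels, i) for i in range(len(words))])  # [1, 0, 0, 0, 0]
```

A model would typically weight many such indicators and sum them; the weights are what the listed algorithms estimate.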

Page 8:

NLP for fun and profit

• Making NLP more accessible
  – Provide APIs for common NLP tasks
      var text = document.get(…);
      var entities = agent.markNE(text);
• Big $$$$
• Backend to intelligent processing of text

Page 9:

Desideratum: Multilinguality

• Except for feature extraction, systems should be language agnostic

Page 10:

In this lecture

• Understand how to solve and ace NLP tasks
• Learn general methodology or approaches
• End-to-end development using an example task
• Overview of (un)common NLP tasks

Page 11:

Case study: Named Entity Recognition

Page 12:

Case study: Named Entity Recognition

• Demo: http://viewer.opencalais.com

• How do we build something like this?
• How do we find out how well we are doing?
• How can we improve?

Page 13:

Case study: Named Entity Recognition

• Define the problem
  – Say, PERSON, LOCATION, ORGANIZATION

The UN secretary general met president Obama at Hague.

The [ORG UN] secretary general met president [PER Obama] at [LOC Hague].

Page 14:

Case study: Named Entity Recognition

• Collect data to learn from
  – Sentences with words marked as PER, ORG, LOC, NONE
• How do we get this data?

Page 15:

Pay the experts

Page 16:

Wisdom of the crowds

Page 17:

Getting the data: Annotation

• Time consuming
• Costs $$$
• Need for quality control
  – Inter-annotator agreement
  – Kappa score (Krippendorff, 1980)
• Smarter ways to annotate
  – Get fewer annotations: Active Learning
  – Rationales (Zaidan, Eisner & Piatko, 2007)
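The inter-annotator agreement check listed above can be sketched in a few lines. This assumes the two-annotator Cohen's kappa, since the slide does not say which kappa variant is meant; the function name and the toy annotations are invented for the example.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's label
    distribution.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy annotations over six tokens (assumed data, not from the lecture).
annotator_1 = ["PER", "LOC", "NONE", "NONE", "ORG", "NONE"]
annotator_2 = ["PER", "LOC", "NONE", "ORG", "ORG", "NONE"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.769
```

The annotators agree on 5 of 6 tokens (0.833), but kappa is lower because some of that agreement would be expected by chance.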

Page 18:

Input x:

Only France and Great Britain backed Fischler ‘s proposal .

Labels y:

Only O
France B-LOC
and O
Great B-LOC
Britain I-LOC
backed O
Fischler B-PER
‘s O
proposal O
. O

y = g(x)
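The B-/I-/O labels above encode entity spans token by token. As an illustration of how such labels map back to entities, here is a small sketch; the function name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new span is a design choice of this sketch, not something the slides specify.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Convert a BIO label sequence into a list of entity spans."""
    spans: List[Span] = []
    start, etype = None, None
    # The extra "O" is a sentinel that closes a span ending at the last token.
    for i, label in enumerate(labels + ["O"]):
        continues = label.startswith("I-") and label[2:] == etype
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, etype = i, label[2:]
    return spans


tokens = ["Only", "France", "and", "Great", "Britain",
          "backed", "Fischler", "'s", "proposal", "."]
labels = ["O", "B-LOC", "O", "B-LOC", "I-LOC",
          "O", "B-PER", "O", "O", "O"]

for etype, start, end in bio_to_spans(labels):
    print(etype, " ".join(tokens[start:end]))
# LOC France
# LOC Great Britain
# PER Fischler
```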

Page 19:

1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data

Our recipe …

Page 20:

NER: Designing features

• Not as trivial as you think
• Original text itself might be in ugly HTML
  – Cleaneval!
• Need to segment sentences
• Tokenize the sentences

Only
France
and
Great
Britain
backed
Fischler
‘s
proposal
.

Preprocessing
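The preprocessing steps above (clean the HTML, segment sentences, tokenize) can be sketched with the standard library alone. This is a deliberately naive illustration: the regular expressions and function names are assumptions, the tag stripping is crude, and the sentence splitter will break on abbreviations such as "Dr. Smith".

```python
import html
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")
# Split after ., ! or ? when followed by whitespace and an uppercase letter.
SENT_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
# A token is a possessive 's, a (possibly hyphenated) word, or one punctuation mark.
TOKEN_RE = re.compile(r"'s|\w+(?:-\w+)*|[^\w\s]")


def strip_html(raw: str) -> str:
    """Drop tags, decode entities such as &amp;, and collapse whitespace."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    return [s for s in SENT_BOUNDARY_RE.split(text) if s]


def tokenize(sentence: str) -> List[str]:
    return TOKEN_RE.findall(sentence)


raw = "<p>Only France and Great Britain backed <b>Fischler's</b> proposal. Others objected.</p>"
for sentence in split_sentences(strip_html(raw)):
    print(tokenize(sentence))
```

On the sample input the first sentence comes out as the ten tokens shown on the slide, with the possessive split off as its own token.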

Page 21:

NER: Designing features

Only IS_CAPITALIZED
France IS_CAPITALIZED
and
Great IS_CAPITALIZED
Britain IS_CAPITALIZED
backed
Fischler IS_CAPITALIZED
‘s
proposal
.

Page 22:

NER: Designing features

Only IS_CAPITALIZED IS_SENT_START
France IS_CAPITALIZED
and
Great IS_CAPITALIZED
Britain IS_CAPITALIZED
backed
Fischler IS_CAPITALIZED
‘s
proposal
.

Page 24:

NER: Designing features

Only IS_CAPITALIZED IS_SENT_START
France IS_CAPITALIZED IN_LEXICON_LOC
and
Great IS_CAPITALIZED
Britain IS_CAPITALIZED IN_LEXICON_LOC
backed
Fischler IS_CAPITALIZED
‘s
proposal
.

Page 25:

NER: Designing features

Only POS=RB IS_CAPITALIZED IS_SENT_START
France POS=NNP IS_CAPITALIZED IN_LEXICON_LOC
and POS=CC
Great POS=NNP IS_CAPITALIZED
Britain POS=NNP IS_CAPITALIZED IN_LEXICON_LOC
backed POS=VBD
Fischler POS=NNP IS_CAPITALIZED
‘s POS=XX
proposal POS=NN
. POS=.

These are extracted during preprocessing!

Page 26:

NER: Designing features

Only POS=RB IS_CAPITALIZED … PREV_WORD=_NONE_
France POS=NNP IS_CAPITALIZED … PREV_WORD=only
and POS=CC … PREV_WORD=france
Great POS=NNP IS_CAPITALIZED … PREV_WORD=and
Britain POS=NNP IS_CAPITALIZED … PREV_WORD=great
backed POS=VBD … PREV_WORD=britain
Fischler POS=NNP IS_CAPITALIZED … PREV_WORD=backed
‘s POS=XX … PREV_WORD=fischler
proposal POS=NN … PREV_WORD=‘s
. POS=. … PREV_WORD=proposal

Page 27:

NER: Designing features

Only POS=RB IS_CAPITALIZED … PREV_WORD=_NONE_ …
France POS=NNP IS_CAPITALIZED … PREV_WORD=only …
and POS=CC … PREV_WORD=france …
Great POS=NNP IS_CAPITALIZED … PREV_WORD=and …
Britain POS=NNP IS_CAPITALIZED … PREV_WORD=great …
backed POS=VBD … PREV_WORD=britain …
Fischler POS=NNP IS_CAPITALIZED … PREV_WORD=backed …
‘s POS=XX … PREV_WORD=fischler …
proposal POS=NN … PREV_WORD=‘s …
. POS=. … PREV_WORD=proposal …
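The feature rows built up over the previous slides can be produced by a short extraction function. The sketch below is an illustration only: the function name, the two-word location lexicon, and the decision to pass POS tags in as a precomputed list are assumptions; the POS tags are copied from the slide.

```python
from typing import List, Set

# Toy gazetteer of location names, assumed for this example.
LOC_LEXICON: Set[str] = {"france", "britain"}


def extract_features(tokens: List[str], pos_tags: List[str]) -> List[List[str]]:
    """Return one list of feature strings per token."""
    rows = []
    for i, (token, pos) in enumerate(zip(tokens, pos_tags)):
        feats = [f"POS={pos}"]
        if token[:1].isupper():
            feats.append("IS_CAPITALIZED")
        if i == 0:
            feats.append("IS_SENT_START")
        if token.lower() in LOC_LEXICON:
            feats.append("IN_LEXICON_LOC")
        prev_word = tokens[i - 1].lower() if i > 0 else "_NONE_"
        feats.append(f"PREV_WORD={prev_word}")
        rows.append(feats)
    return rows


tokens = ["Only", "France", "and", "Great", "Britain",
          "backed", "Fischler", "'s", "proposal", "."]
pos_tags = ["RB", "NNP", "CC", "NNP", "NNP", "VBD", "NNP", "XX", "NN", "."]

for token, feats in zip(tokens, extract_features(tokens, pos_tags)):
    print(token, " ".join(feats))
```

Note that IS_SENT_START lets a model discount the capitalization of "Only", which is capitalized for positional reasons rather than because it is a name.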

Page 28:

NER: Designing features

• Can you think of other features?

HAS_DIGITS
IS_HYPHENATED
IS_ALLCAPS
FREQ_WORD
RARE_WORD
USEFUL_UNIGRAM_PER
USEFUL_BIGRAM_PER
USEFUL_UNIGRAM_LOC
USEFUL_BIGRAM_LOC
USEFUL_UNIGRAM_ORG
USEFUL_BIGRAM_ORG
USEFUL_SUFFIX_PER
USEFUL_SUFFIX_LOC
USEFUL_SUFFIX_ORG

WORD
PREV_WORD
NEXT_WORD
PREV_BIGRAM
NEXT_BIGRAM
POS
PREV_POS
NEXT_POS
PREV_POS_BIGRAM
NEXT_POS_BIGRAM
IN_LEXICON_PER
IN_LEXICON_LOC
IN_LEXICON_ORG
IS_CAPITALIZED
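A few of the surface features in this list depend only on the token string and on corpus counts, so they are easy to sketch. The function name, the count thresholds, and the toy counts below are assumptions; the slides do not define what makes a word frequent or rare.

```python
from collections import Counter
from typing import List


def surface_features(token: str, counts: Counter,
                     rare_max: int = 1, freq_min: int = 100) -> List[str]:
    """Token-level features from the string itself and from corpus counts.

    rare_max and freq_min are illustrative thresholds, not values from the lecture.
    """
    feats = []
    if any(ch.isdigit() for ch in token):
        feats.append("HAS_DIGITS")
    if "-" in token.strip("-"):  # a hyphen between other characters
        feats.append("IS_HYPHENATED")
    if token.isupper():
        feats.append("IS_ALLCAPS")
    if counts[token] >= freq_min:
        feats.append("FREQ_WORD")
    if counts[token] <= rare_max:
        feats.append("RARE_WORD")
    return feats


# Toy corpus counts, invented for the example.
counts = Counter({"the": 500, "UN": 12, "F-16": 1})

print(surface_features("F-16", counts))  # ['HAS_DIGITS', 'IS_HYPHENATED', 'IS_ALLCAPS', 'RARE_WORD']
print(surface_features("UN", counts))    # ['IS_ALLCAPS']
print(surface_features("the", counts))   # ['FREQ_WORD']
```

The USEFUL_* features are different in kind: they would need statistics gathered from labeled data for each entity type, which this sketch does not attempt.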

Page 29:

Case: Named Entity Recognition

• Evaluation Metrics
  – Token accuracy: What percent of the tokens got labeled correctly
  – Problem with accuracy
  – Precision-Recall-F

Model   F-Score
HMM     74.6

president O
Barack B-PER
Obama O
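The "president / Barack / Obama" example shows why token accuracy can mislead: most tokens are labeled correctly, yet the entity itself is wrong. The sketch below compares token accuracy with entity-level precision, recall and F1 computed over exact span matches. The gold labels are an assumption (the slide shows only the predicted labels), and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def spans(labels: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO label sequence."""
    found: Set[Span] = set()
    start, etype = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a final span
        if start is not None and not (label.startswith("I-") and label[2:] == etype):
            found.add((etype, start, i))
            start, etype = None, None
        if label.startswith("B-") or (label.startswith("I-") and start is None):
            start, etype = i, label[2:]
    return found


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level scores: a predicted span counts only if type and boundaries match."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Tokens: president Barack Obama
gold = ["O", "B-PER", "I-PER"]   # assumed reference annotation
pred = ["O", "B-PER", "O"]       # labels shown on the slide

print(round(token_accuracy(gold, pred), 3))  # 0.667
print(precision_recall_f1(gold, pred))       # (0.0, 0.0, 0.0)
```

Two of three tokens are right, but the only predicted entity ("Barack") does not match the gold entity ("Barack Obama"), so all three entity-level scores are zero.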

Page 30:

NER: How can we improve?

• Engineer better features
• Design better models
• Conditional Random Fields

[Figure: graphical model over inputs x1 … x4 and labels Y1 … Y4]

Model   F-Score
HMM     74.6
TBL     81.2
Maxent  85.6
CRF     91.7
…       …
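One reason a sequence model such as a CRF can beat a per-token classifier is that it scores whole label sequences, so label-to-label transitions matter. The sketch below shows only the decoding step (Viterbi) for a linear-chain model, assuming additive per-position and transition scores. It does not train anything, and all scores are made-up numbers chosen for illustration.

```python
from typing import Dict, List, Tuple


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][y] scores label y at position t; transitions[(y_prev, y)]
    scores moving from y_prev to y. Missing entries count as 0.0.
    """
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transitions.get((prev, y), 0.0)
                         + emissions[t].get(y, 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-LOC", "I-LOC"]
# Per-token scores for "Great Britain backed" (invented numbers).
emissions = [
    {"O": 1.0, "B-LOC": 0.8, "I-LOC": 0.2},   # Great
    {"O": 0.1, "B-LOC": 1.0, "I-LOC": 1.2},   # Britain
    {"O": 2.0, "B-LOC": 0.0, "I-LOC": 0.0},   # backed
]
transitions = {
    ("O", "I-LOC"): -5.0,      # I-LOC should not follow O
    ("B-LOC", "I-LOC"): 1.0,   # I-LOC naturally follows B-LOC
    ("B-LOC", "B-LOC"): -1.0,
}

print(viterbi(emissions, transitions, labels))  # ['B-LOC', 'I-LOC', 'O']
```

Picking the best label for each token independently would give O, I-LOC, O here, which is not a valid BIO sequence; the transition scores steer decoding to B-LOC, I-LOC, O.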

Page 31:

NER: How else can we improve?

• Unlabeled data!

example from Jerry Zhu

Page 32:

NER: Challenges

• Domain transfer
  – WSJ → NYT
  – WSJ → Blogs ??
  – WSJ → Twitter ??!?
• Tough nut: Organizations
• Non-textual data?

Entity Extraction is a Boring Solved Problem – or is it? (Vilain, Su and Lubar, 2007)

Page 33:

NER: Related application

• Extracting real estate information from Craigslist ads

Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.

Page 34:

NER: Related Application

• BioNLP: Annotation of chemical entities

Corbett, Batchelor & Teufel, 2007

Page 35:

Shared Tasks: NLP in practice

• Shared Task
  – Everybody works on a (mostly) common dataset
  – Evaluation measures are defined
  – Participants get ranked on the evaluation measures
  – Advance the state of the art
  – Set benchmarks
• Tasks involve common hard problems or new interesting problems