Dec 23, 2015
“Real” World
• Tons of data on the web
• A lot of it is text
• In many languages
• In many genres
Language by itself is complex. The Web further complicates language.
But we have 600.465
Adapted from: Jason Eisner
We can study anything about language ...
1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data
feature functions!
f(w_i = off, w_{i+1} = the)
f(w_i = obama, y_i = NP)
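A feature function of this kind is just an indicator over a position in the sentence. The sketch below is one possible reading of the two examples; the function names and the toy sentence are assumptions made for illustration, not part of the course material.

```python
def f_bigram_off_the(words, i):
    """f(w_i = off, w_{i+1} = the): 1 if position i starts the pair 'off the'."""
    return int(i + 1 < len(words) and words[i] == "off" and words[i + 1] == "the")


def f_obama_np(words, labels, i):
    """f(w_i = obama, y_i = NP): 1 if the word 'obama' carries the label NP."""
    return int(words[i] == "obama" and labels[i] == "NP")


# Toy sentence (hypothetical): the bigram feature fires only at position 2.
words = ["he", "took", "off", "the", "hat"]
print([f_bigram_off_the(words, i) for i in range(len(words))])
```

A model then learns one weight per feature function; the features themselves stay this simple.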
Forward-Backward, Gradient Descent, L-BFGS, Simulated Annealing, Contrastive Estimation, …
NLP for fun and profit
• Making NLP more accessible
– Provide APIs for common NLP tasks
var text = document.get(…);
var entities = agent.markNE(text);
• Big $$$$
• Backend to intelligent processing of text
In this lecture
• Understand how to solve and ace NLP tasks
• Learn general methodology or approaches
• End-to-end development using an example task
• Overview of (un)common NLP tasks
Case study: Named Entity Recognition
• Demo: http://viewer.opencalais.com
• How do we build something like this?
• How do we find out how well we are doing?
• How can we improve?
Case study: Named Entity Recognition
• Define the problem
– Say, PERSON, LOCATION, ORGANIZATION
The UN secretary general met president Obama at Hague.
The [UN]ORG secretary general met president [Obama]PER at [Hague]LOC.
Case study: Named Entity Recognition
• Collect data to learn from
– Sentences with words marked as PER, ORG, LOC, NONE
• How do we get this data?
Getting the data: Annotation
• Time consuming
• Costs $$$
• Need for quality control
– Inter-annotator agreement
– Kappa score (Krippendorff, 1980)
• Smarter ways to annotate
– Get fewer annotations: Active Learning
– Rationales (Zaidan, Eisner & Piatko, 2007)
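Inter-annotator agreement can be made concrete with a chance-corrected score. The sketch below computes Cohen's kappa for two annotators, which is one common kappa statistic and may not be the exact variant the slide has in mind; the label sequences are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with
    # their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four tokens by two annotators.
annotator_1 = ["PER", "O", "LOC", "O"]
annotator_2 = ["PER", "O", "O", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Raw agreement here is 0.75, but kappa is lower because both annotators use the label O often, so some agreement is expected by chance.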
Only France and Great Britain backed Fischler ‘s proposal .
Only O
France B-LOC
and O
Great B-LOC
Britain I-LOC
backed O
Fischler B-PER
‘s O
proposal O
. O
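The B-/I-/O tags above encode entity spans: B- opens an entity, I- continues it, O is outside any entity. A minimal sketch of reading the spans back out of a tagged sentence follows; the function name is an assumption, and a stray I- tag with no preceding B- is leniently treated as opening a new span.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) pairs from parallel token and BIO tag lists."""
    spans = []
    start, kind = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.append((kind, " ".join(tokens[start:i])))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


tokens = ["Only", "France", "and", "Great", "Britain", "backed",
          "Fischler", "'s", "proposal", "."]
tags = ["O", "B-LOC", "O", "B-LOC", "I-LOC", "O", "B-PER", "O", "O", "O"]
print(bio_to_spans(tokens, tags))
```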
Input x: Only France and Great Britain backed Fischler ‘s proposal .
Labels y: O B-LOC O B-LOC I-LOC O B-PER O O O
y = g(x)
Our recipe …
1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data
NER: Designing features
• Not as trivial as you think
• Original text itself might be in ugly HTML
• Cleaneval!
• Need to segment sentences
• Tokenize the sentences
Only
France
and
Great
Britain
backed
Fischler
‘s
proposal
.
Preprocessing
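Sentence segmentation and tokenization can be sketched with two regular expressions. This is a deliberately naive illustration, assuming plain English text that has already been stripped of HTML; it will mis-split abbreviations such as "Dr.", and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split into words, the possessive clitic 's, and single punctuation marks."""
    return re.findall(r"\w+|['’]s\b|[^\w\s]", sentence)


text = "Only France and Great Britain backed Fischler's proposal. Others objected."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Separating 's from "Fischler" matters later: the PER label should attach to the name only, as in the tagged example.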
NER: Designing features
Only IS_CAPITALIZED
France IS_CAPITALIZED
and
Great IS_CAPITALIZED
Britain IS_CAPITALIZED
backed
Fischler IS_CAPITALIZED
‘s
proposal
.
NER: Designing features
Only IS_CAPITALIZED IS_SENT_START
France IS_CAPITALIZED
and
Great IS_CAPITALIZED
Britain IS_CAPITALIZED
backed
Fischler IS_CAPITALIZED
‘s
proposal
.
NER: Designing features
Only IS_CAPITALIZED IS_SENT_START
France IS_CAPITALIZED IN_LEXICON_LOC
and
Great IS_CAPITALIZED
Britain IS_CAPITALIZED IN_LEXICON_LOC
backed
Fischler IS_CAPITALIZED
‘s
proposal
.
NER: Designing features
Only POS=RB IS_CAPITALIZED IS_SENT_START
France POS=NNP IS_CAPITALIZED IN_LEXICON_LOC
and POS=CC
Great POS=NNP IS_CAPITALIZED
Britain POS=NNP IS_CAPITALIZED IN_LEXICON_LOC
backed POS=VBD
Fischler POS=NNP IS_CAPITALIZED
‘s POS=XX
proposal POS=NN
. POS=.
These are extracted during preprocessing!
NER: Designing features
Only POS=RB IS_CAPITALIZED … PREV_WORD=_NONE_
France POS=NNP IS_CAPITALIZED … PREV_WORD=only
and POS=CC … PREV_WORD=france
Great POS=NNP IS_CAPITALIZED … PREV_WORD=and
Britain POS=NNP IS_CAPITALIZED … PREV_WORD=great
backed POS=VBD … PREV_WORD=britain
Fischler POS=NNP IS_CAPITALIZED … PREV_WORD=backed
‘s POS=XX … PREV_WORD=fischler
proposal POS=NN … PREV_WORD=‘s
. POS=. … PREV_WORD=proposal
NER: Designing features
Only POS=RB IS_CAPITALIZED … PREV_WORD=_NONE_ …
France POS=NNP IS_CAPITALIZED … PREV_WORD=only …
and POS=CC … PREV_WORD=france …
Great POS=NNP IS_CAPITALIZED … PREV_WORD=and …
Britain POS=NNP IS_CAPITALIZED … PREV_WORD=great …
backed POS=VBD … PREV_WORD=britain …
Fischler POS=NNP IS_CAPITALIZED … PREV_WORD=backed …
‘s POS=XX … PREV_WORD=fischler …
proposal POS=NN … PREV_WORD=‘s …
. POS=. … PREV_WORD=proposal …
NER: Designing features
• Can you think of other features?
WORD, PREV_WORD, NEXT_WORD, PREV_BIGRAM, NEXT_BIGRAM
POS, PREV_POS, NEXT_POS, PREV_POS_BIGRAM, NEXT_POS_BIGRAM
IN_LEXICON_PER, IN_LEXICON_LOC, IN_LEXICON_ORG
IS_CAPITALIZED, HAS_DIGITS, IS_HYPHENATED, IS_ALLCAPS
FREQ_WORD, RARE_WORD
USEFUL_UNIGRAM_PER, USEFUL_BIGRAM_PER
USEFUL_UNIGRAM_LOC, USEFUL_BIGRAM_LOC
USEFUL_UNIGRAM_ORG, USEFUL_BIGRAM_ORG
USEFUL_SUFFIX_PER, USEFUL_SUFFIX_LOC, USEFUL_SUFFIX_ORG
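Several of these features can be computed directly from the token sequence. The sketch below covers only the surface and context features; POS features are left out because they need a tagger, and the tiny location lexicon is a made-up stand-in for a real gazetteer.

```python
# Toy location lexicon (hypothetical); a real system would load a gazetteer.
LOC_LEXICON = {"france", "britain"}


def extract_features(tokens, i, loc_lexicon=LOC_LEXICON):
    """Return the list of feature strings that fire for the token at position i."""
    word = tokens[i]
    feats = ["WORD=" + word.lower()]
    if word[:1].isupper():
        feats.append("IS_CAPITALIZED")
    if i == 0:
        feats.append("IS_SENT_START")
    if word.lower() in loc_lexicon:
        feats.append("IN_LEXICON_LOC")
    if any(ch.isdigit() for ch in word):
        feats.append("HAS_DIGITS")
    if "-" in word:
        feats.append("IS_HYPHENATED")
    if len(word) > 1 and word.isupper():
        feats.append("IS_ALLCAPS")
    # Context features use a placeholder at the sentence boundaries.
    prev_word = tokens[i - 1].lower() if i > 0 else "_NONE_"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "_NONE_"
    feats.append("PREV_WORD=" + prev_word)
    feats.append("NEXT_WORD=" + next_word)
    return feats


tokens = ["Only", "France", "and", "Great", "Britain"]
for i, token in enumerate(tokens):
    print(token, " ".join(extract_features(tokens, i)))
```

Note that IS_CAPITALIZED fires for "Only" purely because it starts the sentence, which is why IS_SENT_START is a useful companion feature.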
Case study: Named Entity Recognition
• Evaluation Metrics
– Token accuracy: What percent of the tokens got labeled correctly
– Problem with accuracy
– Precision-Recall-F
Model F-Score
HMM 74.6

president O
Barack B-PER
Obama O
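The "president Barack Obama" example shows why token accuracy is misleading. The sketch below assumes the gold labels are O, B-PER, I-PER (the slide shows only the predicted tags) and scores precision, recall and F at the level of whole entities with exact span matching, which is one common convention.

```python
def entity_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    found, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            found.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return found


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def precision_recall_f1(gold, pred):
    """Entity-level scores: a predicted span counts only if it matches exactly."""
    gold_spans, pred_spans = entity_spans(gold), entity_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["O", "B-PER", "I-PER"]  # assumed gold tags for: president Barack Obama
pred = ["O", "B-PER", "O"]      # predicted tags from the slide
print(token_accuracy(gold, pred))
print(precision_recall_f1(gold, pred))
```

Two of three tokens are right, yet the only entity in the sentence is wrong, so the entity-level scores are all zero.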
NER: How can we improve?
• Engineer better features
• Design better models
• Conditional Random Fields

Model F-Score
HMM 74.6
TBL 81.2
Maxent 85.6
[Figure: graphical model over inputs x1 … x4 and labels Y1 … Y4]
Model F-Score
HMM 74.6
TBL 81.2
Maxent 85.6
CRF 91.7
… …
NER : Challenges
• Domain transfer
– WSJ → NYT
– WSJ → Blogs ??
– WSJ → Twitter ??!?
• Tough nut: Organizations
• Non-textual data?
Entity Extraction is a Boring Solved Problem – or is it? (Vilain, Su and Lubar, 2007)
NER: Related application
• Extracting real estate information from Craigslist Ads
Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.