Natural Language Processing

A Broad Topic

• Named Entity Recognition (NER)
• Speech Recognition
• Sentiment Analysis
• Translation
• Optical Character Recognition (OCR)

The Goals of NER

1. Detecting mentions of specific entities
2. Classifying those entities (person, place, …)
3. Grouping different references to the same entity

The Challenges of NER

1. Named entities are often compound noun phrases
   》e.g., “United Airlines”
2. Capitals aren’t a foolproof indicator
3. A single named entity can be referred to variously
   》e.g., “United,” “the airline,” even just “they”
4. Rule-based approaches just don’t work
   》“Every time I fire a linguist, the performance of the speech recognizer goes up.” (Frederick Jelinek)

Aside: Prescriptivism vs. Descriptivism

• Descriptivism holds that:
   》grammar is a function of empirical usage
   》“correctness” is context-dependent
   》languages have fuzzy edges, like species
• Prescriptivism holds that “correctness” is absolute
• Not a real debate! All serious linguists are descriptivists
• But style guides are still useful

The Process

1. Sentence segmentation
2. Tokenization
3. Part-of-speech (POS) tagging
4. Parsing
5. Co-reference resolution

Aside: n-grams

• n-grams are word sequences of a fixed length
• they become useful when we count their occurrences within a corpus
• we can characterize a corpus by using these counts to determine transition probabilities
   》e.g., P(yellow|the sun is)
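The counting idea above can be sketched in a few lines of Python. This is a minimal illustration and not a production language model: the toy corpus and the function names `ngram_counts` and `transition_probability` are assumptions made for the example, and no smoothing is applied.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def transition_probability(tokens, context, word):
    """Estimate P(word | context) from raw n-gram counts (no smoothing)."""
    context = tuple(context)
    counts = ngram_counts(tokens, len(context) + 1)
    # Total number of times the context is followed by any word at all.
    context_total = sum(c for gram, c in counts.items() if gram[:-1] == context)
    if context_total == 0:
        return 0.0
    return counts[context + (word,)] / context_total

# Hypothetical toy corpus, already tokenized on whitespace.
corpus = "the sun is yellow . the sun is hot . the sun is yellow today".split()
p_yellow = transition_probability(corpus, ["the", "sun", "is"], "yellow")
print(p_yellow)
```

In this toy corpus "the sun is" occurs three times and is followed by "yellow" twice, so the estimate is 2/3. A real corpus would need smoothing for n-grams that were never observed.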

Sentence Segmentation

Usually ML-based, since it’s hard to devise rules:

1. Assemble an annotated training set
2. Create a set of features, e.g.:
   》Previous word, next word, prefix, etc.
   》Probabilities, like P(sentence start|previous word)
3. Train a classifier and use it to predict whether each punctuation mark ends a sentence
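The three steps above might look roughly like this in Python. Everything here is an assumption chosen for illustration: the tiny hand-labelled training text, the three features, and the choice of a naive Bayes classifier with crude add-one smoothing. The slides do not name a specific classifier.

```python
import math
from collections import Counter, defaultdict

def features(tokens, i):
    """Features for the candidate punctuation mark at tokens[i]."""
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "prev": prev_word.lower(),
        "next_is_capitalized": next_word[:1].isupper(),
        "prev_is_short": len(prev_word) <= 3,
    }

class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # label -> (name, value) -> count

    def train(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            for item in feats.items():
                self.feature_counts[label][item] += 1

    def predict(self, feats):
        total = sum(self.label_counts.values())

        def score(label):
            n = self.label_counts[label]
            s = math.log(n / total)
            for item in feats.items():
                # Add-one smoothing; every feature is treated as binary for simplicity.
                s += math.log((self.feature_counts[label][item] + 1) / (n + 2))
            return s

        return max(self.label_counts, key=score)

# Step 1: a hypothetical annotated training set (True = the period ends a sentence).
train_tokens = "Dr . Smith arrived . He met Mr . Jones . They left .".split()
labels = {1: False, 4: True, 8: False, 10: True, 13: True}

# Steps 2 and 3: extract features, train, then predict on unseen punctuation.
clf = NaiveBayes()
clf.train([(features(train_tokens, i), label) for i, label in labels.items()])

test_tokens = "Mr . Brown smiled .".split()
predictions = [clf.predict(features(test_tokens, i))
               for i, tok in enumerate(test_tokens) if tok == "."]
print(predictions)
```

With this training set the period after "Mr" is classified as not ending a sentence and the final period as ending one. A realistic system would use far more data and richer features.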

Tokenization: Basically the Same Damn Thing


POS Tagging

Dominated by Hidden Markov Models (HMMs).

1. Assemble an annotated training set
2. Determine emission probabilities
   》e.g., P(dog|noun)
3. Determine transition probabilities
   》e.g., P(adjective|noun,verb)
4. For a given word sequence, determine the most likely sequence of underlying POS tags using the HMM
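Step 4 is usually done with the Viterbi algorithm. The sketch below is a simplified illustration: it assumes a first-order HMM (each tag depends only on the previous tag, not on the previous two as in the P(adjective|noun,verb) example), and all of the probabilities are made-up toy values rather than estimates from a real training set.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Hypothetical toy model.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {  # trans_p[previous][current]
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {  # emit_p[tag][word], e.g. P(dog|NOUN)
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}

tagged = viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p)
print(tagged)
```

For this toy model the best path tags the sentence as determiner, noun, verb. Real taggers work with log probabilities to avoid underflow on long sentences.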

Parsing

In a sense, similar to programming-language (PL) parsing.

1. Select a formal context-free grammar for the language being parsed
   》in practice, usually augmented with probabilities for the replacement rules
2. Parse according to taste
3. Select the most likely parse by multiplying the probabilities of each replacement rule
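Step 3 can be illustrated by scoring two competing parses of an ambiguous sentence. The grammar, its rule probabilities, and the nested-tuple tree encoding below are all assumptions invented for this sketch. The parses themselves are written out by hand, so the code shows only the scoring step and not a parser.

```python
# Hypothetical probabilistic context-free grammar: (left-hand side, right-hand side) -> probability.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("she",)): 0.3,
    ("NP", ("stars",)): 0.3,
    ("NP", ("telescopes",)): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("saw",)): 1.0,
    ("P", ("with",)): 1.0,
}

def tree_probability(tree):
    """Multiply the probabilities of every replacement rule used in the tree."""
    label, *children = tree
    # A child is either a word (str) or a subtree whose first element is its label.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p

# "she saw stars with telescopes": the PP can attach to the verb phrase or the noun phrase.
pp_on_verb = ("S", ("NP", "she"),
              ("VP", ("VP", ("V", "saw"), ("NP", "stars")),
                     ("PP", ("P", "with"), ("NP", "telescopes"))))
pp_on_noun = ("S", ("NP", "she"),
              ("VP", ("V", "saw"),
                     ("NP", ("NP", "stars"),
                            ("PP", ("P", "with"), ("NP", "telescopes")))))

best_parse = max([pp_on_verb, pp_on_noun], key=tree_probability)
print(tree_probability(pp_on_verb), tree_probability(pp_on_noun))
```

Under these toy probabilities the verb-attachment parse scores 0.00378 against 0.00252 for the noun attachment, so it is selected.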

Coreference Resolution

Using the Hobbs algorithm:

1. Create parse tree from input text
2. With text and parse tree as input, traverse successive parent trees of pronouns and proper nouns, then rank the remaining candidates according to several factors:
   》words in common
   》proximity
   》etc.
3. Pick the most likely candidate and label
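The ranking in steps 2 and 3 can be sketched as follows. This is not the Hobbs algorithm itself, since it skips the parse-tree traversal entirely and assumes a list of candidate antecedents is already available. The scoring weights, the token positions, and the function name `rank_antecedents` are hypothetical.

```python
def rank_antecedents(mention, mention_pos, candidates):
    """Rank candidate antecedents for a mention, best first.

    candidates is a list of (text, token_position) pairs that occur before the mention.
    """
    mention_words = set(mention.lower().split())

    def score(candidate):
        text, pos = candidate
        shared = len(mention_words & set(text.lower().split()))  # words in common
        distance = mention_pos - pos                              # proximity
        return shared - 0.01 * distance  # hypothetical weights

    return sorted(candidates, key=score, reverse=True)

# Hypothetical example: which earlier phrase does the later mention "United" refer to?
candidates = [("United Airlines", 2), ("Chicago", 15), ("the gate agent", 18)]
ranked = rank_antecedents("United", 20, candidates)
print(ranked[0][0])
```

Here the shared word outweighs the greater distance, so "United Airlines" ranks first. Pronouns such as "they" share no words with their antecedents, which is why practical systems also rely on agreement and syntactic position.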

Other Cool Stuff: Morphology

uygarlaştıramadıklarımızdanmışsınızcasına
uygar +laş +tır +ama +dık +lar +ımız +dan +mış +sınız +casına
civilized +BEC +CAUS +NABL +PART +PL +P1PL +ABL +PAST +2PL +AsIf
“(behaving) as if you are among those whom we could not civilize”

+BEC    “become”
+CAUS   the causative verb marker (‘cause to X’)
+NABL   “not able”
+PART   past participle form
+P1PL   1st person pl possessive agreement
+2PL    2nd person pl
+ABL    ablative (from/among) case marker
+AsIf   derivationally forms an adverb from a finite verb

Other Cool Stuff: Language Synthesis with RNNs

http://www.cs.toronto.edu/~graves/handwriting.html


The Future

• NLP is “AI-complete,” that is, we expect that solving it is tantamount to solving hard AI

• In the meantime, it all comes down to more data

References

Jurafsky, Dan, and James H. Martin. Speech and Language Processing. 2nd ed., Pearson New International Edition. Upper Saddle River, NJ: Prentice Hall, Pearson Education International, 2014. Print.

Graves, Alex. "Recurrent Neural Network Handwriting Generation Demo." Department of Computer Science, University of Toronto. N.p., n.d. Web. 13 Aug. 2014. <http://www.cs.toronto.edu/~graves/handwriting.html>.


The End
