Natural Language Processing
Jun 20, 2015
A Broad Topic
• Named Entity Recognition (NER)
• Speech Recognition
• Sentiment Analysis
• Translation
• Optical Character Recognition (OCR)
The Goals of NER
1. Detecting mentions of specific entities
2. Classifying those entities (person, place, …)
3. Grouping different references to the same entity
The Challenges of NER
1. Named entities are often compound noun phrases
》e.g., “United Airlines”
2. Capitals aren’t a foolproof indicator
3. A single named entity can be referred to variously
》e.g., “United,” “the airline,” even just “they”
4. Rule-based approaches just don’t work
》“Every time I fire a linguist, the performance of the speech recognizer goes up.” (Frederick Jelinek)
Aside: Prescriptivism vs. Descriptivism
• Descriptivism holds that:
》grammar is a function of empirical usage
》“correctness” is context-dependent
》languages have fuzzy edges, like species
• Prescriptivism holds that “correctness” is absolute
• Not a real debate! All serious linguists are descriptivists
• But style guides are still useful
The Process
1. Sentence segmentation
2. Tokenization
3. Part-of-speech (POS) tagging
4. Parsing
5. Coreference resolution
Aside: n-grams
• n-grams are word sequences of a fixed length
• they become useful when we count their occurrences within a corpus
• we can characterize a corpus by using these counts to determine transition probabilities
》e.g., P(yellow|the sun is)
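To make the counting concrete, here is a minimal sketch of estimating a transition probability such as P(yellow|the sun is) from raw n-gram counts. The toy corpus and the function names `ngram_counts` and `transition_probability` are assumptions made for illustration, and the estimate is a plain maximum-likelihood ratio with no smoothing.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def transition_probability(tokens, context, word):
    """Maximum-likelihood estimate of P(word | context) from raw counts."""
    context = tuple(context)
    full = ngram_counts(tokens, len(context) + 1)
    # Count the context only where some word follows it, so the
    # probabilities over all possible next words sum to 1.
    context_total = sum(c for gram, c in full.items() if gram[:-1] == context)
    if context_total == 0:
        return 0.0  # context never seen; a real system would smooth instead
    return full[context + (word,)] / context_total

# Toy corpus (hypothetical), split on whitespace for simplicity.
corpus = "the sun is yellow . the sun is hot . the sun is yellow today .".split()
p_yellow = transition_probability(corpus, ["the", "sun", "is"], "yellow")
```

In this toy corpus "the sun is" is followed by "yellow" twice and "hot" once, so `p_yellow` comes out to 2/3.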
Sentence Segmentation
Usually ML-based, since it’s hard to devise rules:
1. Assemble an annotated training set
2. Create a set of features, e.g.:
》Previous word, next word, prefix, etc.
》Probabilities, like P(sentence start|previous word)
3. Train a classifier and predict, at each punctuation mark, whether it ends a sentence
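Steps 1 and 2 above can be sketched as follows. This covers only the feature extraction; the classifier in step 3 is left out, and any off-the-shelf model that accepts feature dictionaries could be trained on the output. The training data, the whitespace tokenization, and the names `is_candidate`, `boundary_probabilities`, and `extract_features` are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical annotated training set:
# (tokens, set of indices of the tokens that really end a sentence).
TRAINING = [
    ("Dr. Smith arrived . He sat down .".split(), {3, 7}),
    ("We met Dr. Jones . She left .".split(), {4, 7}),
]

def is_candidate(token):
    """A token is a candidate boundary if it ends in a period."""
    return token.endswith(".")

def boundary_probabilities(training):
    """Estimate P(sentence start follows | token) from the annotations."""
    seen, ended = Counter(), Counter()
    for tokens, boundaries in training:
        for i, tok in enumerate(tokens):
            if is_candidate(tok):
                seen[tok] += 1
                if i in boundaries:
                    ended[tok] += 1
    return {tok: ended[tok] / seen[tok] for tok in seen}

def extract_features(tokens, i, probs):
    """Feature dictionary for the candidate boundary at position i."""
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "token": tokens[i],
        "prev_word": prev_word,
        "next_word": next_word,
        "next_is_capitalized": next_word[:1].isupper(),
        "prefix": tokens[i][:2],
        # Falls back to 0.5 for tokens never seen in training (an assumption).
        "p_sentence_start": probs.get(tokens[i], 0.5),
    }

probs = boundary_probabilities(TRAINING)
abbrev = extract_features(TRAINING[0][0], 0, probs)   # the "Dr." in "Dr. Smith"
period = extract_features(TRAINING[0][0], 3, probs)   # the "." after "arrived"
```

Note that both candidates are followed by a capitalized word, which is why capitalization alone is not enough and the probability feature earns its place: it is 0.0 for "Dr." and 1.0 for "." in this toy data.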
Tokenization: Basically the Same Damn Thing
NICE!!!!!!!!!!
POS Tagging
Dominated by Hidden Markov Models (HMMs).
1. Assemble an annotated training set
2. Determine emission probabilities
》e.g., P(dog|noun)
3. Determine transition probabilities
》e.g., P(adjective|noun,verb)
4. For a given word sequence, determine the most likely sequence of underlying POS tags using the HMM
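Step 4 is usually done with the Viterbi algorithm. Below is a minimal sketch under simplifying assumptions: the tag set and all probabilities are hand-set toy values rather than estimates from an annotated corpus, and each transition conditions on one previous tag (a bigram HMM) instead of the two previous tags in the example above.

```python
import math

# Hypothetical toy model; every number here is made up for illustration.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}        # P(tag at position 0)
TRANS = {                                              # P(next tag | tag)
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {                                               # P(word | tag)
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"dog": 0.05, "barks": 0.5},
}
UNSEEN = 1e-6  # small floor for word/tag pairs missing from EMIT

def viterbi(words):
    """Return the most likely tag sequence for words under the toy HMM."""
    if not words:
        return []
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{
        tag: (math.log(START[tag]) + math.log(EMIT[tag].get(words[0], UNSEEN)), None)
        for tag in TAGS
    }]
    for i in range(1, len(words)):
        column = {}
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[i], UNSEEN))
            column[tag] = max(
                (best[i - 1][prev][0] + math.log(TRANS[prev][tag]) + emit, prev)
                for prev in TAGS
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

tags = viterbi(["the", "dog", "barks"])
```

Working in log space avoids numerical underflow on long sentences; with these toy numbers the result for "the dog barks" is DET, NOUN, VERB.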
Parsing
In a sense, similar to programming-language (PL) parsing.
1. Select a formal context-free grammar for the language being parsed
》in practice, usually augmented with probabilities for the replacement rules
2. Parse according to taste
3. Select the most likely parse by multiplying the probabilities of each replacement rule
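Step 3 can be sketched as below: given two candidate parse trees for an ambiguous sentence, multiply the probabilities of the rules each one uses and keep the larger product. The grammar, its probabilities, the nested-tuple tree encoding, and the names `rules_used` and `parse_probability` are assumptions for illustration; producing the candidate trees (step 2) is not shown.

```python
from math import prod

# Hypothetical probabilistic grammar: (lhs, rhs) -> P(rhs | lhs).
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("VP", "PP")): 0.4,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("she",)): 0.3,
    ("NP", ("stars",)): 0.3,
    ("NP", ("telescopes",)): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("saw",)): 1.0,
    ("P", ("with",)): 1.0,
}

def rules_used(tree):
    """Yield every (lhs, rhs) rule in a tree encoded as (label, child, ...)."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):  # strings are words, i.e. leaves
            yield from rules_used(child)

def parse_probability(tree):
    """Probability of a parse: the product of its rule probabilities."""
    return prod(RULES[rule] for rule in rules_used(tree))

# Two parses of "she saw stars with telescopes".
pp = ("PP", ("P", "with"), ("NP", "telescopes"))
verb_attach = ("S", ("NP", "she"),
               ("VP", ("VP", ("V", "saw"), ("NP", "stars")), pp))
noun_attach = ("S", ("NP", "she"),
               ("VP", ("V", "saw"), ("NP", ("NP", "stars"), pp)))

best_parse = max([verb_attach, noun_attach], key=parse_probability)
```

With these made-up numbers the verb-attachment reading (she used telescopes to see) scores 0.00432 against 0.00216 for the noun-attachment reading, so it is selected.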
Coreference Resolution
Using the Hobbs algorithm:
1. Create parse tree from input text
2. With text and parse tree as input, traverse successive parent trees of pronouns and proper nouns to collect candidate antecedents, then rank the remaining candidates according to several factors:
》words in common
》proximity
》etc.
3. Pick the most likely candidate and label
Other Cool Stuff: Morphology
uygarlaştıramadıklarımızdanmışsınızcasına
uygar +laş +tır +ama +dık +lar +ımız +dan +mış +sınız +casına
civilized +BEC +CAUS +NABL +PART +PL +P1PL +ABL +PAST +2PL +AsIf
“(behaving) as if you are among those whom we could not civilize”
+BEC “become”
+CAUS the causative verb marker (‘cause to X’)
+NABL “not able”
+PART past participle form
+P1PL 1st person pl possessive agreement
+2PL 2nd person pl
+ABL ablative (from/among) case marker
+AsIf derivationally forms an adverb from a finite verb
Other Cool Stuff: Language Synthesis with RNNs
http://www.cs.toronto.edu/~graves/handwriting.html
The Future
• NLP is “AI-complete”: we expect that solving it is tantamount to solving strong AI
• In the meantime, it all comes down to more data
References
Jurafsky, Dan, and James H. Martin. Speech and Language Processing. 2nd ed., Pearson New International Edition. Upper Saddle River, NJ: Prentice Hall, Pearson Education International, 2014. Print.
Graves, Alex. "Recurrent Neural Network Handwriting Generation Demo." Department of Computer Science, University of Toronto. N.p., n.d. Web. 13 Aug. 2014. <http://www.cs.toronto.edu/~graves/handwriting.html>.
The End