Knowledge Representation in Digital Humanities Antonio Jiménez Mavillard Department of Modern Languages and Literatures Western University

Lecture09

May 09, 2015

Page 1: Lecture09

Knowledge Representation in Digital Humanities

Antonio Jiménez Mavillard

Department of Modern Languages and Literatures, Western University

Page 2: Lecture09

Lecture 9

* Contents:
  1. Why this lecture?
  2. Discussion
  3. Chapter 9
  4. Assignment
  5. Bibliography

Page 3: Lecture09

Why this lecture?

* This lecture...
  · teaches some NLP techniques that can be applied to real problems
  · presents another example of how DH brings together various disciplines (Linguistics, Artificial Intelligence, Information Science, Statistics...) to solve problems

Page 4: Lecture09

Last assignment discussion

* Time to...
  · consolidate the ideas and concepts dealt with in the readings

Page 5: Lecture09

Chapter 9

Natural Language Processing in Python

1. Preliminary theory
2. Word tagging and categorization
3. Text classification
4. Text information extraction

Page 6: Lecture09

Chapter 9

1 Preliminary theory
  1.1 Linguistics
  1.2 Statistics
  1.3 Artificial Intelligence

Page 7: Lecture09

Chapter 9

2 Word tagging and categorization
  2.1 Tagger
  2.2 Automatic tagging
  2.3 n-gram tagging

Page 8: Lecture09

Chapter 9

3 Text classification
  3.1 Supervised classification
  3.2 Document classification

Page 9: Lecture09

Chapter 9

4 Text information extraction
  4.1 Information extraction
  4.2 Entity recognition
  4.3 Relation extraction

Page 10: Lecture09

Preliminary theory


Page 11: Lecture09

Linguistics

* Lexical categories
  · nouns: people, places, things, concepts
  · verbs: actions
  · adjectives: describe nouns
  · adverbs: modify adjectives and verbs
  · ...

Page 12: Lecture09

Linguistics

* Lexical categories


Page 13: Lecture09

Linguistics

* These word classes are also known as parts of speech
* They arise from simple analysis of the distribution of words in text

Page 14: Lecture09

Statistics

* Frequency distribution · Arrangement of the values that one or more variables take in a sample


Page 15: Lecture09

Statistics

* Frequency distribution
  · Example: vocabulary in a text
    + how many times does each word appear in the text?
    + it is a “distribution” since it tells us how the total number of word tokens in the text is distributed across the vocabulary items
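The idea of a frequency distribution over a vocabulary can be sketched with the standard library alone. This is a minimal illustration, not the NLTK implementation; the sample text is a made-up assumption.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()

# Frequency distribution: vocabulary item -> number of occurrences.
fdist = Counter(tokens)

# Every token is counted exactly once, so the counts add up to the
# total number of tokens in the text.
total_tokens = sum(fdist.values())
most_frequent = fdist.most_common(2)
```

Here `fdist` plays the role of the distribution: its keys are the vocabulary items and its values show how the tokens are spread across them.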


Page 16: Lecture09

Statistics

* Frequency distribution · Example: vocabulary in a text


Page 17: Lecture09

Statistics

* Conditional frequency distribution · A collection of frequency distributions, each one for a different condition


Page 18: Lecture09

Statistics

* Conditional frequency distribution
  · Example: vocabulary in a text
    + when the texts of a corpus are divided into several categories, we can maintain a separate frequency distribution for each category
    + the condition will often be the category of the text
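A conditional frequency distribution can be pictured as one counter per condition. The sketch below uses only the standard library; the (category, word) pairs are hypothetical and stand in for a categorized corpus.

```python
from collections import Counter, defaultdict

# Hypothetical (category, word) pairs; in a real corpus the category
# would come from the corpus metadata.
pairs = [
    ("news", "the"), ("news", "election"), ("news", "the"),
    ("romance", "the"), ("romance", "love"), ("romance", "love"),
]

# One frequency distribution per condition (here: per category).
cfdist = defaultdict(Counter)
for category, word in pairs:
    cfdist[category][word] += 1
```

Looking up `cfdist["news"]` gives the frequency distribution for that condition alone, which is the same access pattern the NLTK examples later in the lecture rely on.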


Page 19: Lecture09

Statistics

* Conditional frequency distribution · Example: vocabulary in a text


Page 20: Lecture09

Artificial Intelligence

* Supervised vs unsupervised learning
  · Supervised learning:
    + Possible results are known
    + Data is labeled
  · Unsupervised learning:
    + Results are unknown
    + Data is clustered

Page 21: Lecture09

Artificial Intelligence

* Decision trees
  · Flowchart that selects labels for input values
  · Formed by decision nodes and leaf nodes
  · Decision nodes: check feature values
  · Leaf nodes: assign labels

Page 22: Lecture09

Artificial Intelligence

* Decision trees · Example: “Going out?”
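As a rough sketch of how such a tree selects a label, the decision below is written by hand as nested dictionaries. The features (`weather`, `have_money`) and the labels are assumptions made for illustration; they are not taken from the slide's figure.

```python
# A hand-written decision tree for a hypothetical "Going out?" decision.
tree = {
    "feature": "weather",
    "branches": {
        "rainy": "stay in",                 # leaf node: assigns a label
        "sunny": {                          # decision node: checks a feature
            "feature": "have_money",
            "branches": {True: "go out", False: "stay in"},
        },
    },
}

def classify(node, features):
    """Walk from the root to a leaf, following the feature values."""
    while isinstance(node, dict):
        node = node["branches"][features[node["feature"]]]
    return node
```

A learned decision tree has the same shape; the difference is that the training algorithm, rather than a person, chooses which feature each decision node checks.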


Page 23: Lecture09

Artificial Intelligence

* Naive Bayes classifiers
  1. The classifier begins by calculating the prior probability of each label, determined by checking the frequency of each label in the training set

Page 24: Lecture09

Artificial Intelligence

* Naive Bayes classifiers
  2. The contribution from each feature is combined with this prior probability to arrive at a likelihood estimate for each label
  3. The label with the highest likelihood estimate is then assigned to the input value
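The three steps can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the training set is made up, and add-one smoothing is used to avoid zero probabilities, which is not necessarily the estimator NLTK uses.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: (features, label) pairs.
train = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "e"}, "female"),
    ({"last_letter": "k"}, "male"),
    ({"last_letter": "o"}, "male"),
]

label_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)  # label -> (name, value) -> count
values = defaultdict(set)              # feature name -> values seen
for features, label in train:
    for name, value in features.items():
        feature_counts[label][(name, value)] += 1
        values[name].add(value)

def classify(features):
    scores = {}
    for label, count in label_counts.items():
        # Step 1: prior probability from the label frequency.
        score = math.log(count / len(train))
        # Step 2: combine the contribution of each feature.
        for name, value in features.items():
            seen = feature_counts[label][(name, value)]
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((seen + 1) / (count + len(values[name])))
        scores[label] = score
    # Step 3: pick the label with the highest estimate.
    return max(scores, key=scores.get)
```

Working with logarithms turns the product of probabilities into a sum, which avoids numerical underflow when there are many features.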


Page 25: Lecture09

Artificial Intelligence

* Naive Bayes classifiers
  · Example: document classification
    + Prior probability: close to “Automotive”

Page 26: Lecture09

References

“Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.

Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.

Mitchell, Tom M. “Chapter 6: Bayesian Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.

“Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.

Steven Bird, Ewan Klein, and Edward Loper. “Conditional Frequency Distributions.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.

Steven Bird, Ewan Klein, and Edward Loper. “Frequency Distributions.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.

Page 27: Lecture09

Word tagging and categorization

Page 28: Lecture09

Tagger

* Processes a sequence of words and attaches a part-of-speech tag to each word
* Procedure:
  1. Tokenization
  2. Tagging

Page 29: Lecture09

Tagger

* Example 1:

In [1]: import nltk

In [2]: text = 'And now for something completely different'

In [3]: tokens = nltk.word_tokenize(text)

In [4]: nltk.pos_tag(tokens)
Out[4]: [('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

Page 30: Lecture09

Tagger

* Example 2:

In [1]: text = 'They refuse to permit us to obtain the refuse permit'

In [2]: tokens = nltk.word_tokenize(text)

In [3]: nltk.pos_tag(tokens)
Out[3]: [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Page 31: Lecture09

Automatic tagging

* The tag of a word depends on the word itself and on its context within a sentence
* For this reason, we work with data at the level of tagged sentences rather than tagged words

Page 32: Lecture09

Automatic tagging

* Loading data · Example: tagged and non-tagged sentences of “news” category

In [1]: import nltk
   ...: from nltk.corpus import brown

In [2]: brown_tagged_sents = brown.tagged_sents(categories='news')

In [3]: brown_sents = brown.sents(categories='news')

Page 33: Lecture09

Automatic tagging

* Default tagger
  · Choose the most likely tag

In [4]: tags = [tag for (word, tag) in brown.tagged_words(categories='news')]

In [4]: nltk.FreqDist(tags).max()
Out[4]: 'NN'

Page 34: Lecture09

Automatic tagging

* Default tagger · Assign the most likely tag to each token


In [5]: text = 'I do not like green eggs and ham, I do not like them Sam I am!'

In [6]: tokens = nltk.word_tokenize(text)

In [7]: default_tagger = nltk.DefaultTagger('NN')

Page 35: Lecture09

Automatic tagging

* Default tagger

In [8]: default_tagger.tag(tokens)
Out[8]: [('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'),

Page 36: Lecture09

Automatic tagging

* Default tagger


... ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

Page 37: Lecture09

Automatic tagging

* Default tagger · This method performs rather poorly

· Unknown words will be nouns (as it happens, most new words are nouns)

In [9]: default_tagger.evaluate(brown_tagged_sents)
Out[9]: 0.13089484257215028

Page 38: Lecture09

Automatic tagging

* Regular expression tagger
  · Assigns tags to tokens on the basis of matching patterns

In [10]: patterns = [
    ...:     (r'.*ing$', 'VBG'),               # gerunds
    ...:     (r'.*ed$', 'VBD'),                # simple past
    ...:     (r'.*es$', 'VBZ'),                # 3rd sing present
    ...:     (r'.*ould$', 'MD'),               # modals
    ...:     (r'.*\'s$', 'NN$'),               # possessive nouns
    ...:     (r'.*s$', 'NNS'),                 # plural nouns
    ...:     (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    ...:     (r'.*', 'NN'),                    # nouns (default)
    ...: ]

In [11]: regexp_tagger = nltk.RegexpTagger(patterns)

Page 39: Lecture09

Automatic tagging

* Regular expression tagger

In [12]: regexp_tagger.tag(brown_sents[3])
Out[12]: [('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ...]

Page 40: Lecture09

Automatic tagging

* Regular expression tagger
  · This method is correct about a fifth of the time

· The final regular expression «.*» is a catch-all that tags everything as a noun

In [13]: regexp_tagger.evaluate(brown_tagged_sents)
Out[13]: 0.20326391789486245
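To make the matching order explicit, a regular expression tagger can be approximated in a few lines of plain Python. This is only a sketch of the idea, not NLTK's `RegexpTagger`; the pattern list is a shortened version of the one used for the regular expression tagger, and the function name `regexp_tag` is hypothetical.

```python
import re

# Patterns are tried in order; the first one that matches wins.
patterns = [
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r".*s$", "NNS"),
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),
    (r".*", "NN"),  # catch-all: everything else is tagged as a noun
]
compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]

def regexp_tag(tokens):
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged
```

The sketch also shows why accuracy stays low: a word such as "was" ends in "s" and is therefore tagged as a plural noun, just as in the tagger output shown earlier.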

Page 41: Lecture09

Automatic tagging

* Lookup tagger · Problem: a lot of high-frequency words do not have the NN tag


Page 42: Lecture09

Automatic tagging

* Lookup tagger
  · Solution:
    + Find the hundred most frequent words and store their most likely tag
    + Use this information as the model for a lookup tagger (NLTK UnigramTagger)
    + Tag everything else as a noun

Page 43: Lecture09

Automatic tagging

* Lookup tagger

In [14]: fd = nltk.FreqDist(brown.words(categories='news'))

In [15]: # counts how many times each word occurs with each tag
    ...: cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

In [16]: most_freq_words = fd.keys()[:100]  # NLTK 2; in NLTK 3 use [w for (w, _) in fd.most_common(100)]

In [17]: # for each word, take its most frequent tag
    ...: likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)

In [18]: baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))

In [19]: baseline_tagger.evaluate(brown_tagged_sents)
Out[19]: 0.5817769556656125

Page 44: Lecture09

Automatic tagging

* Lookup tagger · The tagger accuracy increases as the model size grows
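The effect of model size can be illustrated on a toy scale. The sketch below builds lookup models of increasing size and measures their accuracy; the tiny tagged corpus and the helper names `build_model` and `accuracy` are assumptions for illustration, and the accuracy is measured on the same data the model was built from.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus: a list of (word, tag) pairs.
tagged_words = [
    ("the", "AT"), ("dog", "NN"), ("saw", "VBD"), ("the", "AT"),
    ("cat", "NN"), ("the", "AT"), ("saw", "NN"), ("saw", "VBD"),
]

word_freq = Counter(word for word, _ in tagged_words)
tags_per_word = defaultdict(Counter)
for word, tag in tagged_words:
    tags_per_word[word][tag] += 1

def build_model(n):
    """Map each of the n most frequent words to its most likely tag."""
    return {word: tags_per_word[word].most_common(1)[0][0]
            for word, _ in word_freq.most_common(n)}

def accuracy(model, default="NN"):
    """Fraction of tokens whose looked-up tag matches the gold tag."""
    correct = sum(model.get(word, default) == tag
                  for word, tag in tagged_words)
    return correct / len(tagged_words)

sizes = [0, 1, 2]
accuracies = [accuracy(build_model(n)) for n in sizes]
```

With a model of size 0 every word falls back to the default tag; each frequent word added to the model then covers a larger share of the tokens.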


Page 45: Lecture09

n-gram tagging

* Unigram tagger
  · Like the lookup tagger, it assigns the most likely tag to each token
  · Unlike the default tagger, it must be trained before it can be used
  · Training: initialize the tagger with tagged sentence data as a parameter

Page 46: Lecture09

n-gram tagging

* Unigram tagger
  · Split the data into:
    + Training data (90%)
    + Testing data (10%)

Page 47: Lecture09

n-gram tagging

* Unigram tagger

In [20]: size = int(len(brown_tagged_sents) * 0.9)

In [21]: train_sents = brown_tagged_sents[:size]

In [22]: test_sents = brown_tagged_sents[size:]

In [23]: unigram_tagger = nltk.UnigramTagger(train_sents)

Page 48: Lecture09

n-gram tagging

* Unigram tagger

In [24]: unigram_tagger.tag(brown_sents[2007])
Out[24]: [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','),

...

Page 49: Lecture09

n-gram tagging

* Unigram tagger

 ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]

In [21]: unigram_tagger.evaluate(test_sents)
Out[21]: 0.8110236220472441

Page 50: Lecture09

n-gram tagging

* An n-gram tagger picks the tag that is most likely in the given context
* Unigram (1-gram) tagger
  · Context:
    + current token in isolation

Page 51: Lecture09

n-gram tagging

* Bigram (2-gram) tagger
  · Context:
    + current token
    + POS tag of the 1 preceding token
* Trigram (3-gram) tagger
  · Context:
    + current token
    + POS tags of the 2 preceding tokens

Page 52: Lecture09

n-gram tagging

* n-gram tagger
  · Context:
    + current token
    + POS tags of the n-1 preceding tokens

Page 53: Lecture09

n-gram tagging

* n-gram tagger · Example: bigram

In [22]: bigram_tagger = nltk.BigramTagger(train_sents)

In [23]: bigram_tagger.evaluate(train_sents)
Out[23]: 0.7853094861965731

In [24]: bigram_tagger.evaluate(test_sents)
Out[24]: 0.10216286255357321

Page 54: Lecture09

n-gram tagging

* n-gram tagger
  · Example: bigram
    + Problem: it manages to tag the words in sentences from the training data, but
      - it is unable to tag a word it has never seen (it assigns None)

Page 55: Lecture09

n-gram tagging

* n-gram tagger
  · Example: bigram
    + Problem: it manages to tag the words in sentences from the training data, but
      - it also cannot tag the word that follows (even if that word is not new), because during training it never saw that word preceded by a None tag

Page 56: Lecture09

n-gram tagging

* n-gram tagger
  · Example: bigram
    + Name: sparse data problem
    + Reason: the contexts are very specific and there is no default tagger to fall back on
    + Solution: trade-off between accuracy and coverage

Page 57: Lecture09

n-gram tagging

* Combining taggers · Trade-off between accuracy and coverage


Page 58: Lecture09

n-gram tagging

* Combining taggers
  1. Try tagging with the n-gram tagger
  2. If unable, try the (n-1)-gram tagger
  3. If unable, try the (n-2)-gram tagger
  ...

Page 59: Lecture09

n-gram tagging

* Combining taggers
  ...
  n-2. If unable, try the trigram tagger
  n-1. If unable, try the bigram tagger
  n. If unable, try the unigram tagger
  n+1. If unable, use the default tagger
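The backoff chain can be sketched without NLTK as a bigram table that falls back to a unigram table and then to a default tag. The training sentences and the function names are hypothetical; the sketch only shows the control flow, not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical training data: one tagged sentence per list.
train_sents = [
    [("they", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("go", "VB")],
    [("the", "DT"), ("refuse", "NN"), ("is", "VBZ"), ("here", "RB")],
    [("they", "PRP"), ("refuse", "VBP"), ("it", "PRP")],
]

unigram = defaultdict(Counter)  # word -> tag counts
bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
for sent in train_sents:
    prev = None                 # None marks the start of a sentence
    for word, tag in sent:
        unigram[word][tag] += 1
        bigram[(prev, word)][tag] += 1
        prev = tag

def best(counter):
    return counter.most_common(1)[0][0]

def tag(tokens, default="NN"):
    tagged, prev = [], None
    for word in tokens:
        if (prev, word) in bigram:   # 1. most specific context
            t = best(bigram[(prev, word)])
        elif word in unigram:        # 2. back off to the word alone
            t = best(unigram[word])
        else:                        # 3. back off to the default tag
            t = default
        tagged.append((word, t))
        prev = t
    return tagged
```

For the input `["we", "refuse"]` the unknown word "we" receives the default tag, and "refuse" is then tagged by the unigram table because the bigram context was never seen in training.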


Page 60: Lecture09

n-gram tagging

* Combining taggers · Example:

In [25]: t0 = nltk.DefaultTagger('NN')

In [26]: t1 = nltk.UnigramTagger(train_sents, backoff=t0)

In [27]: t2 = nltk.BigramTagger(train_sents, backoff=t1)

In [28]: t2.evaluate(test_sents)
Out[28]: 0.8447124489185687

Page 61: Lecture09

n-gram tagging

* Exercise 1
  · Build a tagger by combining a trigram, a bigram, a unigram and a regular expression tagger (the last one as the default case)
  · Use it to tag a sentence
  · Evaluate its performance

Page 62: Lecture09

n-gram tagging

* Exercise 1 (solution)

import nltk
import re
from nltk.corpus import brown

Page 63: Lecture09

n-gram tagging

* Exercise 1 (solution)

patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*ould$', 'MD'),
    (r".*'s$", 'NN$'),
    (r'.*s$', 'NNS'),
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*', 'NN')
]

Page 64: Lecture09

n-gram tagging

* Exercise 1 (solution)

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

Page 65: Lecture09

n-gram tagging

* Exercise 1 (solution)

t0 = nltk.RegexpTagger(patterns)
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)  # back off to the bigram tagger

Page 66: Lecture09

n-gram tagging

* Exercise 1 (solution)

brown_sents = brown.sents(categories='news')
sent = brown_sents[2007]
t3.tag(sent)
t3.evaluate(test_sents)  # evaluate on the held-out 10%, not on the training data

Page 67: Lecture09

References

Steven Bird, Ewan Klein, and Edward Loper. “Chapter 5: Categorizing and Tagging Words.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.

Page 68: Lecture09

Text classification


Page 69: Lecture09

Supervised classification

* Idea


Page 70: Lecture09

Supervised classification

* Process 1. Features 2. Encode 3. Feature extractor


Page 71: Lecture09

Supervised classification

* The process involves important skills: · Abstraction · Modelling · Programming


Page 72: Lecture09

Supervised classification

* Features
  · Abstraction: decide which information in the data set is relevant
* Encode
  · Modelling: choose a sound representation (data structure)

Page 73: Lecture09

Supervised classification

* Feature extractor · Programming: program a function that extracts the features in the chosen representation


Page 74: Lecture09

Supervised classification

* Applications:
  · Deciding the lexical category of words: POS tagging
  · Deciding the topic of a document from a list of topics (“sports”, “technology”, etc.): document classification

Page 75: Lecture09

Document classification

* Example 1: gender identification (solved by a naive Bayes classifier)
  · Evidence
    + Names ending in a, e, i => female
    + Names ending in k, o, r, s, t => male
  · Features: last letter
  · Encode: dictionary
  · Feature extractor: “name => {last letter}”

Page 76: Lecture09

Document classification

* Example 1: gender identification · Data

In [1]: import nltk
   ...: from nltk.corpus import names

In [2]: import random

In [3]: all_names = [(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')]

In [4]: random.shuffle(all_names)

Page 77: Lecture09

Document classification

* Example 1: gender identification · Feature extractor

In [5]: def gender_features(word):
   ...:     return {'last_letter': word[-1]}

# Example
In [6]: gender_features('Shrek')
Out[6]: {'last_letter': 'k'}

Page 78: Lecture09

Document classification

* Example 1: gender identification · Classification

In [7]: featuresets = [(gender_features(n), g) for (n, g) in all_names]

In [8]: train_set = featuresets[500:]

In [9]: test_set = featuresets[:500]

In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)

In [11]: nltk.classify.accuracy(classifier, test_set)
Out[11]: 0.778

Page 79: Lecture09

Document classification

* Example 2: POS tagging (solved by Decision Tree Classifier) · Results: POS tag · Features: Suffixes


Page 80: Lecture09

Document classification

* Example 2: POS tagging · Data

In [1]: import nltk
   ...: from nltk.corpus import brown

In [2]: suffix_fdist = nltk.FreqDist()

In [3]: for word in brown.words():
   ...:     word = word.lower()
   ...:     suffix_fdist.inc(word[-1:])  # NLTK 2; in NLTK 3 use suffix_fdist[word[-1:]] += 1
   ...:     suffix_fdist.inc(word[-2:])
   ...:     suffix_fdist.inc(word[-3:])

In [4]: common_suffixes = suffix_fdist.keys()[:100]  # NLTK 2; in NLTK 3 use [s for (s, _) in suffix_fdist.most_common(100)]

Page 81: Lecture09

Document classification

* Example 2: POS tagging · Feature extractor

In [5]: def pos_features(word):
   ...:     features = {}
   ...:     for suffix in common_suffixes:
   ...:         features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
   ...:     return features

Page 82: Lecture09

Document classification

* Example 2: POS tagging · Classification

In [6]: tagged_words = brown.tagged_words(categories='news')

In [7]: featuresets = [(pos_features(n), g) for (n, g) in tagged_words]

In [8]: size = int(len(featuresets) * 0.1)

In [9]: train_set, test_set = featuresets[size:], featuresets[:size]

Page 83: Lecture09

Document classification

* Example 2: POS tagging · Classification

In [10]: classifier = nltk.DecisionTreeClassifier.train(train_set)

In [11]: classifier.classify(pos_features('cats'))
Out[11]: 'NNS'

In [12]: nltk.classify.accuracy(classifier, test_set)
Out[12]: 0.62705121829935351

Page 84: Lecture09

Document classification

* Example 3: document classification (solved by a naive Bayes classifier)
  · Corpus: Movie Reviews Corpus
  · Results: positive or negative review
  · Features: whether or not each of the 2000 most frequent words is present in each review

Page 85: Lecture09

Document classification

* Example 3: document classification · Data

In [1]: import nltk
   ...: from nltk.corpus import movie_reviews

In [2]: import random

In [3]: documents = [(list(movie_reviews.words(fileid)), category)
   ...:              for category in movie_reviews.categories()
   ...:              for fileid in movie_reviews.fileids(category)]

In [4]: random.shuffle(documents)

Page 86: Lecture09

Document classification

* Example 3: document classification · Feature extractor

In [5]: all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

In [6]: word_features = all_words.keys()[:2000]  # NLTK 2; in NLTK 3 use [w for (w, _) in all_words.most_common(2000)]

In [7]: def document_features(document):
   ...:     document_words = set(document)
   ...:     features = {}
   ...:     for word in word_features:
   ...:         features['contains(%s)' % word] = (word in document_words)
   ...:     return features

Page 87: Lecture09

Document classification

* Example 3: document classification · Classification

In [7]: featuresets = [(document_features(d), c) for (d, c) in documents]

In [8]: train_set = featuresets[100:]

In [9]: test_set = featuresets[:100]

In [10]: classifier = nltk.NaiveBayesClassifier.train(train_set)

In [11]: nltk.classify.accuracy(classifier, test_set)
Out[11]: 0.84

Page 88: Lecture09

Document classification

* Example 3: document classification · 5 most informative features

In [12]: classifier.show_most_informative_features(5)
Most Informative Features
   contains(outstanding) = True    pos : neg = 10.7 : 1.0
         contains(mulan) = True    pos : neg =  9.0 : 1.0
        contains(seagal) = True    neg : pos =  8.2 : 1.0
   contains(wonderfully) = True    pos : neg =  6.4 : 1.0
         contains(damon) = True    pos : neg =  6.4 : 1.0

Page 89: Lecture09

Document classification

* Exercise 2 · “Reuters-21578 benchmark corpus / ApteMod version” is a collection of 10,788 documents from the Reuters financial newswire service


Page 90: Lecture09

Document classification

* Exercise 2
  · Train a naive Bayes classifier with the ApteMod corpus
  · Use it to classify a document
  · Evaluate its performance

Page 91: Lecture09

Document classification

* Exercise 2 (solution)

import nltk
import random
from nltk.corpus import reuters

Page 92: Lecture09

Document classification

* Exercise 2 (solution)

documents = [(list(reuters.words(fileid)), category)
             for category in reuters.categories()
             for fileid in reuters.fileids(category)]
random.shuffle(documents)

Page 93: Lecture09

Document classification

* Exercise 2 (solution)

all_words = nltk.FreqDist(w.lower() for w in reuters.words())
word_features = all_words.keys()[:2000]  # NLTK 2; in NLTK 3 use [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

Page 94: Lecture09

Document classification

* Exercise 2 (solution)

featuresets = [(document_features(d), c) for (d, c) in documents]
size = int(len(featuresets) * 0.9)
train_set = featuresets[:size]  # 90% for training
test_set = featuresets[size:]   # 10% for testing
classifier = nltk.NaiveBayesClassifier.train(train_set)

Page 95: Lecture09

Document classification

* Exercise 2 (solution)

document = reuters.words('test/14826')
classifier.classify(document_features(document))
nltk.classify.accuracy(classifier, test_set)

Page 96: Lecture09

References

Steven Bird, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.

Page 97: Lecture09

Text information extraction


Page 98: Lecture09

Information extraction

* Definition:
  · Convert the unstructured data of natural language text into structured data, such as tables
  · Get information from the tabulated data

Page 99: Lecture09

Information extraction

* Architecture:

Page 100: Lecture09

Entity recognition

* Chunking
  · Segments and labels multi-token sequences
  · Selects a subset of the tokens (chunks)
  · Chunks do not overlap in the source text

Page 101: Lecture09

Entity recognition

* Chunking
  · Entities are mostly nouns
  · Let us search for the noun phrase chunks (NP-chunks)
  · Grammar: set of rules that indicate how sentences should be chunked
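To show what a single chunking rule does, the sketch below hand-codes one rule (optional determiner, any number of adjectives, then a noun) over a list of tagged tokens. The function `np_chunks` is hypothetical and much simpler than a real chunk grammar; it only illustrates how chunks are selected without overlapping.

```python
def np_chunks(tagged):
    """Return non-overlapping NP chunks matching: DT? JJ* NN."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DT":     # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":  # any adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":     # obligatory noun
            chunks.append(tagged[i:j + 1])
            i = j + 1                                    # chunks never overlap
        else:
            i += 1
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
chunks = np_chunks(sentence)
```

Tokens that belong to no chunk, such as the verb and the preposition here, are simply left out of the result.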


Page 102: Lecture09

Entity recognition

* NP-chunker

In [1]: import nltk, re, pprint

In [2]: grammar = r"""
   ...: # chunk optional determiner/possessive, adjectives and nouns
   ...: NP: {<DT|PP\$>?<JJ>*<NN>}
   ...: # chunk sequences of proper nouns
   ...: {<NNP>+}
   ...: """

In [3]: cp = nltk.RegexpParser(grammar)

Page 103: Lecture09

Entity recognition

* NP-chunker


In [4]: sentence1 = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

In [5]: sentence2 = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

In [6]: result1 = cp.parse(sentence1)

In [7]: result2 = cp.parse(sentence2)

Page 104: Lecture09

Entity recognition

* NP-chunker

In [8]: print result1
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

In [9]: print result2
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

Page 105: Lecture09

Entity recognition

* NP-chunker


In [10]: result1.draw()

Page 106: Lecture09

Entity recognition* Chunking text corpora


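In [11]: from nltk.corpus import brown
         nps = []  # collects every NP chunk found in the corpus
         for sent in brown.tagged_sents():
             tree = cp.parse(sent)
             for subtree in tree.subtrees():
                 if subtree.node == 'NP':  # NLTK 3 renamed .node to .label()
                     nps.append(subtree)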

In [12]: for np in nps[:10]:
             print np
(NP investigation/NN)
(NP widespread/JJ interest/NN)
(NP this/DT city/NN)
(NP new/JJ multi-million-dollar/JJ airport/NN)
(NP his/PP$ wife/NN)
(NP His/PP$ political/JJ career/NN)
...

Page 107: Lecture09

Entity recognition

* Named entities · Are definite noun phrases · Refer to specific types of individuals, such as persons, organizations, and locations


Page 108: Lecture09

Entity recognition

* Named entity recognition · A task well suited to the classifier-based approach used for noun phrase chunking


Page 109: Lecture09

Entity recognition

* Named entity recognition · Example:


In [1]: sent = nltk.corpus.treebank.tagged_sents()[22]

In [2]: print nltk.ne_chunk(sent)
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)

Page 110: Lecture09

Relation extraction

* Extraction of the relations that exist between the recognized named entities
* Approach: initially look for all triples of the form (X, α, Y)

· X and Y are named entities of specific types · α is the string of words between X and Y that expresses the relation
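The triple search can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function `extract_triples` and the input format (plain strings for ordinary words, `(type, text)` tuples for entities that were already recognized) are hypothetical, and the toy input is made up for the example.

```python
import re

def extract_triples(items, subj_type, obj_type, pattern):
    """Return (X, alpha, Y) triples from a sequence of words and entities.

    Hypothetical input format: plain strings are ordinary words and
    (type, text) tuples are named entities that were already recognized.
    """
    # Positions of the named entities in the sequence.
    ents = [i for i, it in enumerate(items) if isinstance(it, tuple)]
    triples = []
    # Pair each entity with the next one; the words in between form alpha.
    for a, b in zip(ents, ents[1:]):
        (x_type, x), (y_type, y) = items[a], items[b]
        alpha = " ".join(items[a + 1:b])
        # Keep the triple only if both types and the relation text fit.
        if x_type == subj_type and y_type == obj_type and pattern.match(alpha):
            triples.append((x, alpha, y))
    return triples

# Toy input, invented for illustration.
items = [("ORG", "WHYY"), "in", ("LOC", "Philadelphia"), "and",
         ("ORG", "Open Text"), ",", "based", "in", ("LOC", "Waterloo")]
IN = re.compile(r".*\bin\b")
print(extract_triples(items, "ORG", "LOC", IN))
```

Filtering on the entity types first and on the text of α second keeps the pattern simple: the regular expression only has to describe the words between the two entities.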


Page 111: Lecture09

Relation extraction

* Example:


In [1]: import nltk

In [2]: import re

In [3]: IN = re.compile(r'.*\bin\b(?!\b.+ing)')

In [4]: for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
            for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
                print nltk.sem.relextract.show_raw_rtuple(rel)
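The `IN` pattern is matched against the words that separate the two entities. Its negative lookahead `(?!\b.+ing)` rejects text in which "in" is later followed by a word ending in "-ing". The short check below uses only the standard `re` module; the first two strings appear as relation text in the output of this example, while the third is an illustrative string chosen to trigger the lookahead.

```python
import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    "in",
    ", based in",
    "success in supervising the transition of",  # illustrative, has an -ing word after "in"
]
for filler in fillers:
    # match() anchors at the start; ".*" lets "in" appear anywhere in the text.
    print(repr(filler), bool(IN.match(filler)))
```

The first two strings match and the third does not, because "supervising" satisfies `.+ing` inside the lookahead.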

Page 112: Lecture09

Relation extraction

* Example:


[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']

Page 113: Lecture09

Relation extraction

* Exercise 3 · From the corpus ieer, extract all the relations of type “people were born in a location”




Page 115: Lecture09

Relation extraction

* Exercise 3 (solution)


import nltk
import os
import re

# Match any relation text that contains the word "born"
BORN = re.compile(r'.*\bborn\b')
# Every file in the ieer corpus directory except the README
files = filter(lambda x: x != 'README', os.listdir('nltk_data/corpora/ieer'))
for f in files:
    for doc in nltk.corpus.ieer.parsed_docs(f):
        for rel in nltk.sem.extract_rels('PER', 'LOC', doc, corpus='ieer', pattern=BORN):
            print nltk.sem.relextract.show_raw_rtuple(rel)

Page 116: Lecture09

References

Bird, Steven, Ewan Klein, and Edward Loper. “Chapter 7: Extracting Information from Text.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.


Page 117: Lecture09

Assignment

* Assignment 9 · Readings + Supervised classification (Natural Language Processing with Python) + Decision Tree Learning (Machine Learning)


Page 118: Lecture09

References

Mitchell, Tom M. “Chapter 3: Decision Tree Learning.” Machine Learning. New York: McGraw-Hill, 1997. Print.

Bird, Steven, Ewan Klein, and Edward Loper. “Chapter 6: Learning to Classify Text - Supervised Classification.” Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.


Page 119: Lecture09

Bibliography

“Frequency Distribution.” Wikipedia, the free encyclopedia 7 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.

Mitchell, Tom M. Machine Learning. New York: McGraw-Hill, 1997. Print.

“Part of Speech.” Wikipedia, the free encyclopedia 5 Apr. 2014. Wikipedia. Web. 8 Apr. 2014.

Bird, Steven, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009. 504. shop.oreilly.com. Web. 8 Mar. 2014.
