KiwiPyCon 2014 - NLP with Python tutorial

Understanding human language with Python

Alyona Medelyan

Who am I?

Alyona Medelyan

▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the state-of-the-art keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development

aka @zelandiya

Pre-tutorial survey results

[Charts: self-rated Programming and Python experience, from Beginners to Experts]

85% no experience with NLP, general interest

Agenda

▪ State of NLP. Recap on fiction vs reality: Are we there yet?
▪ NLP Complexities. Why is understanding language so complex?
▪ NLP using Python. Learning the basics, applying them, expanding into further topics
▪ Other NLP areas. And what’s coming next

State of NLP

Fiction versus Reality

He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” - Wikipedia

Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”

“by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)

Word Lens: “augmented reality translation”

The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (From Memory Alpha Wiki)

Let’s try out Google

It doesn’t always work… (the person searched for “Steve Jobs”)

“Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”

Siri doesn’t seem to be as “available”

NLP Complexities

Why is understanding language so complex?

Last week's GDP figures, which were 0.8% for the March quarter (average forecast was 0.4%) and included a revision of the December quarter figures from 0.2% to 0.5%... That takes away the rationale for the OCR to remain at stimulatory levels. It is currently at 2.5%.

Also, in fighting inflation, Dr. Bollard has one rather tricky ally - the exchange rate, which hit a record 85USc last week in N.Z. Running at that level, the currency is keeping imported inflation at low levels.

Sentence detection complexities
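Why the paragraph above is hard to split can be shown with a small sketch (Python 3, standard library only). The TITLES set and both function names are made up for this illustration; a real sentence detector needs far more than a list of titles.

```python
import re

# Second paragraph from the slide above.
TEXT = ("Also, in fighting inflation, Dr. Bollard has one rather tricky ally - "
        "the exchange rate, which hit a record 85USc last week in N.Z. "
        "Running at that level, the currency is keeping imported inflation "
        "at low levels.")

# Hypothetical, deliberately tiny list of titles that never end a sentence.
TITLES = {"Dr.", "Mr.", "Mrs.", "Prof."}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def merge_titles(pieces):
    """Re-attach pieces that were cut off right after a known title."""
    sentences = []
    for piece in pieces:
        if sentences and sentences[-1].split()[-1] in TITLES:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences

naive = naive_split(TEXT)   # wrongly breaks after "Dr."
fixed = merge_titles(naive)
print(len(naive), len(fixed))
```

Note that "N.Z." is both an abbreviation and the end of a sentence here, so a plain abbreviation list cannot settle every case.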

Word segmentation complexities

▪ 广大发展中国家一致支持这个目标，并提出了各自的期望细节。
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
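A minimal sketch of greedy longest-match segmentation (Python 3), one simple way to split text written without spaces. The vocabulary is invented for the example, and max_match is an illustrative name, not a library function.

```python
def max_match(text, vocabulary):
    """Greedy left-to-right segmentation: always take the longest known word."""
    longest = max(len(word) for word in vocabulary)
    words, i = [], 0
    while i < len(text):
        # Try the longest possible chunk first, then shrink it.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            # Unknown single characters are emitted as they are.
            if chunk in vocabulary or size == 1:
                words.append(chunk)
                i += size
                break
    return words

# Hypothetical vocabulary that lists both "hot" + "dogs" and "hotdogs".
VOCAB = {"the", "first", "hot", "dog", "dogs", "hotdogs", "were", "sold"}
segmented = max_match("thefirsthotdogsweresold", VOCAB)
print(segmented)
```

The greedy strategy picks "hotdogs" over "hot dogs"; whether that is right depends on the context, which is exactly the complexity.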

Disambiguation complexities

Flying planes can be dangerous

Sentiment complexities

from: http://www.sentic.net/tutorial/

NLP using Python

Learning the basics, applying them, expanding into further topics

import sys
import pocketsphinx

if __name__ == "__main__":
    # acoustic model, language model and pronunciation dictionary
    hmdir = "/usr/share/pocketsphinx/model/hmm/wsj1"
    lmdir = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.lm.DMP"
    dictd = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic"
    wavfile = sys.argv[1]
    speechRec = pocketsphinx.Decoder(hmm = hmdir, lm = lmdir, dict = dictd)
    wavFile = file(wavfile,'rb')
    speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()
    print result

Speech recognition with Python

Using CMU Sphinx

http://www.confusedcoders.com/random/speech-recognition-in-python-with-cmu-pocketsphinx

What can we do with text?

[Diagram: text goes in; out come sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …]

[Same diagram repeated, marking the practical part of this tutorial]

Introducing NLTK – Python platform for NLP

Setting up

Clone or Download ZIP: https://github.com/zelandiya/KiwiPyCon-NLP-tutorial

Working with corpora in NLTK

>>> from nltk.corpus import movie_reviews

>>> print len(movie_reviews.fileids())

>>> print movie_reviews.categories()

>>> print movie_reviews.fileids('neg')[:10]

>>> print movie_reviews.fileids('pos')[:10]

>>> print movie_reviews.words('pos/cv000_29590.txt')

>>> print movie_reviews.raw('pos/cv000_29590.txt')

>>> print movie_reviews.sents('pos/cv000_29590.txt')

NLTK Corpus – basic functionality

Getting to know text: Word frequencies

from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

words = movie_reviews.words('pos/cv000_29590.txt')
freqs = FreqDist(words)

# In NLTK 2.x, items() is sorted by decreasing frequency;
# in NLTK 3 use freqs.most_common(20) instead.
print 'Most frequent words in review', freqs.items()[:20]

for category in movie_reviews.categories():
    print 'Category', category
    all_words = movie_reviews.words(categories=category)
    all_words_by_frequency = FreqDist(all_words)
    print all_words_by_frequency.items()[:20]

Output of “frequent words”

Most frequent words in review [('the', 46), (',', 43), ("'", 25), ('.', 23), ('and', 21), ...
Category neg
[(',', 35269), ('the', 35058), ('.', 32162), ('a', 17910), ...
Category pos
[(',', 42448), ('the', 41471), ('.', 33714), ('a', 20196), ...

How to get to the core words?

even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance

i think that from hell has a pretty solid acting, especially with the dreamy depp turning in a strong performance as he usually does

* “from hell” is the title of the movie; removing stopwords alone will not be enough to process this example correctly

Remove Stopwords!

Stopword removal with NLTK

from nltk.corpus import movie_reviews
from nltk.corpus import stopwords

stop = stopwords.words('english')

words = movie_reviews.words('pos/cv000_29590.txt')
no_stops = [word for word in words if word not in stop]

NLTK Stopwords: Before & After

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',']

['films', 'adapted', 'comic', 'books', 'plenty', 'success', ',', 'whether', "'", 're', 'superheroes', '(', 'batman', ',']
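The “from hell” caveat noted earlier can be reproduced in a few lines (Python 3, no NLTK needed). The STOP set is a hypothetical mini stopword list standing in for NLTK's much longer English list.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer.
STOP = {"the", "in", "from", "is", "with", "a", ","}

sentence = ("even the acting in from hell is solid , with the dreamy depp "
            "turning in a typically strong performance")

# Same filter as on the slide: keep every token that is not a stopword.
kept = [word for word in sentence.split() if word not in STOP]
print(kept)
```

The movie title loses its first word: only "hell" survives, so the title can no longer be recognized as a phrase.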

Part of speech tagging & filtering

import nltk
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

words = movie_reviews.words('pos/cv000_29590.txt')
pos = nltk.pos_tag(words)

# keep only singular nouns (NN) and adjectives (JJ)
filtered_words = [x[0] for x in pos if x[1] in ('NN', 'JJ')]

print FreqDist(filtered_words).items()[:20]

POS tagging & filtering results

[('films', 'NNS'), ('adapted', 'VBD'), ('from', 'IN'), ('comic', 'JJ'), ('books', 'NNS'), ('have', 'VBP'), ('had', 'VBN'), ('plenty', 'NN'), ('of', 'IN'), ('success', 'NN')

[('t', 9), ('comic', 5), ('film', 5), ('hell', 5), ('book', 3), ('campbell', 3), ('don', 3), ('ripper', 3), ('abberline', 2), ('accent', 2), ('depp', 2), ('end', 2),

From Single to Multi-Word Phrases

NEJM usually has the highest impact factor of the journals of clinical medicine.

highest, highest impact, highest impact factor

ignore stopwords

Option 1. Ngrams

Option 2. Chunking / POS patterns

from http://www.nltk.org/book/ch07.html#chap-chunk
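Option 1 can be sketched without NLTK (Python 3). The rule assumed here is that a phrase may contain stopwords but must not start or end with one; the STOP set and the function name are made up for the example.

```python
STOP = {"usually", "has", "the", "of"}  # hypothetical mini stopword list

def candidate_ngrams(words, max_n=3):
    """Return all 1..max_n-grams that neither start nor end with a stopword."""
    found = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in STOP and gram[-1] not in STOP:
                found.append(" ".join(gram))
    return found

words = "nejm usually has the highest impact factor of the journals".split()
grams = candidate_ngrams(words)
print(grams)
```

On this sentence the sketch keeps "highest", "highest impact" and "highest impact factor", and drops n-grams such as "the highest".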

Ngram extraction with NLTK

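from nltk.util import ngrams

# acceptable() and has_no_boundaries() are helper functions not shown on the slide
my_ngrams = []
for n in range(2, 5):
    for gram in ngrams(words, n):
        if acceptable(gram[0]) \
                and acceptable(gram[-1]) \
                and has_no_boundaries(gram):
            phrase = ' '.join(gram)
            my_ngrams.append(phrase)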

[("' s", 11), ("' t", 10), (', but', 6), ("don '", 5), ("don ' t", 5), ('from hell', 5)

[('comic book', 2), ('jack the ripper', 2), ('moore and campbell', 2), ('say moore', 2),

Corpus statistics: TFxIDF
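A worked TFxIDF computation on a toy corpus (Python 3, standard library only). It assumes idf = log2(N / df), which is believed to be Gensim's default weighting; the documents and both function names are invented for the illustration.

```python
import math

def idf(term, documents):
    """Inverse document frequency with a base-2 logarithm."""
    df = sum(1 for doc in documents if term in doc)
    return math.log2(len(documents) / df) if df else 0.0

def tfidf(term, doc, documents):
    """Term frequency in `doc` (relative to its length) times idf."""
    tf = doc.count(term) / len(doc)
    return tf * idf(term, documents)

# Hypothetical corpus of four tiny "reviews".
DOCS = [
    ["film", "about", "a", "film"],
    ["film", "comedy"],
    ["film", "violence"],
    ["film", "jolie", "comedy"],
]

for term in ["film", "comedy", "jolie"]:
    print(term, idf(term, DOCS))
```

A word that occurs in every document ("film") gets an idf of zero, while a rare word ("jolie") gets the highest value.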

TFxIDF with Gensim

from nltk.corpus import movie_reviews
from gensim import corpora, models

texts = []
for fileid in movie_reviews.fileids():
    words = movie_reviews.words(fileid)
    texts.append(words)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    id = dictionary.token2id.get(word)
    print word, id, tfidf.idfs[id]

TFxIDF with Gensim: Results

film 124 0.190174003903
movie 207 0.364013496254
comedy 653 1.98564470702
violence 1382 3.2108967825
jolie 9418 6.96578428466

How a keyword extraction algorithm works

Document → Candidates → Properties → Scoring → Keywords

Candidates:
▪ Slide window
▪ Break at stopwords & punctuation
▪ Normalize
▪ Map to vocabulary (optional)
▪ Disambiguate (optional)

Properties. Calculate:
▪ Frequency of occurrences
▪ Position in the document
▪ Phrase length
▪ Similarity to other candidates
▪ Prominence in this particular text
▪ Part of speech pattern
▪ Is it a popular keyword?

Scoring:
▪ Heuristic formula that combines most powerful properties
OR
▪ Supervised machine learning that learns the importance of properties from manually assigned keywords
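The first two stages can be sketched in plain Python 3. This is only one possible reading of "slide window, break at stopwords & punctuation", with two of the listed properties; the STOP set, function names and sample text are assumptions for the illustration.

```python
import re
from collections import Counter

STOP = {"the", "of", "and", "a", "in", "is", "was", "but", "with"}  # hypothetical

def candidates(text, max_len=3):
    """Slide a window over runs of words between stopwords and punctuation."""
    found = []
    # Punctuation ends a run; stopwords end a run as well.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        run = []
        for word in segment.split() + [None]:   # None flushes the last run
            if word is None or word in STOP:
                for n in range(1, max_len + 1):
                    for i in range(len(run) - n + 1):
                        found.append(" ".join(run[i:i + n]))
                run = []
            else:
                run.append(word)
    return found

def properties(found):
    """Per candidate: frequency and relative index of its first occurrence
    in the candidate list (a rough proxy for position in the document)."""
    counts = Counter(found)
    first = {}
    for index, phrase in enumerate(found):
        first.setdefault(phrase, index / len(found))
    return {phrase: (counts[phrase], first[phrase]) for phrase in counts}

TEXT = ("The bumbling bellboy ruins every joke. "
        "Jennifer Beals was better, but the bellboy is bumbling.")
found = candidates(TEXT)
props = properties(found)
print(props["bellboy"], props["jennifer beals"])
```

A scoring step would then combine such properties, either with a hand-written formula or with a trained model, as described above.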

Candidates extraction in Python

def get_candidates(words, stop):
    filtered_words = [word for word in words
                      if word not in stop and word[0].isalpha()]
    text_ngrams = get_ngrams(words, stop)
    return filtered_words + text_ngrams

Candidate scoring in Python

import operator
from nltk.probability import FreqDist

def score_candidates(candidates, dictionary, tfidf):
    scores = {}
    freqs = FreqDist(candidates)
    for word in set(candidates):
        # term frequency, normalized by the number of distinct candidates
        tf = float(freqs[word]) / len(freqs)
        id = dictionary.token2id.get(word)
        # compare with None: 0 is a valid token id
        if id is not None:
            idf = tfidf.idfs[id]
        else:
            idf = 0
        scores[word] = tf*idf

    return sorted(scores.iteritems(), key=operator.itemgetter(1), reverse = True)

Test keywords extractor

bellboy
jennifer beals
four rooms
beals
rooms
tarantino
madonna
antonio banderas
valeria golino

…four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy , and he ruins every joke in the film …

Analysis of the results

• Remove sub-phrases in favour of higher ranked ones
• Score adjectives & adverbs higher using part-of-speech tagging
• Add stemming
• …
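The first improvement could look roughly like this (Python 3). It is a sketch, not the tutorial's code; drop_subphrases is a made-up name and the input is assumed to be a list of phrases sorted best first.

```python
def drop_subphrases(ranked):
    """Keep a keyword unless it is contained in a higher-ranked keyword.

    `ranked` is a list of phrases, best first. Containment is checked on
    whole words, so 'club' is inside 'fight club' but not inside 'clubs'.
    """
    kept = []
    for phrase in ranked:
        padded = " %s " % phrase
        if not any(padded in " %s " % better for better in kept):
            kept.append(phrase)
    return kept

ranked = ["fight club", "club", "fight", "office politics", "politics", "tyler"]
cleaned = drop_subphrases(ranked)
print(cleaned)
```

Checking only against the kept phrases is enough, because anything contained in a dropped phrase is also contained in the phrase that caused the drop.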

neg/cv480_21195.txt fight club, club, fight, se7en and the game, inter - office, inter - office politics, tyler, office politics, politics, woven, inter, befuddled

neg/cv235_10704.txt babysitter, goal of the babysitter, thug, boyfriend, goal, fails, fantasizes, dream sequences, silverstone, dream

neg/cv248_15672.txt vampires, vampire, rude, suggestion, regressive movie

neg/cv136_12384.txt lost in space, robinson, robinsons, story changes, cartoony

Getting insights from text!

Which actors, directors, movie plots and film qualities make a successful movie?

1. Apply candidate extraction on each review (to initialize TFxIDF scorer)
2. Extract common keywords from positive and negative reviews

Insights – Step 1

from nltk.corpus import movie_reviews
from nltk.probability import FreqDist
from basics_applied import keyword_extractor

candidate_extractor = keyword_extractor.CandidateExtractor()

texts = []
texts_ids = {}
count = 0
for fileid in movie_reviews.fileids():
    words = candidate_extractor.run(movie_reviews.words(fileid))
    texts.append(words)
    texts_ids[fileid] = count
    count += 1

Insights – Step 2


# candidate_scorer: the tutorial's TFxIDF scorer (its construction is not shown on the slide)
for category in movie_reviews.categories():
    print 'Category', category
    category_keywords = []
    for fileid in movie_reviews.fileids(categories=category):
        count = texts_ids[fileid]
        candidates = texts[count]
        keywords = candidate_scorer.run(candidates)[:20]
        for keyword in keywords:
            category_keywords.append(keyword[0])
            # multi-word phrases are added a second time, which doubles their count
            if ' ' in keyword[0]:
                category_keywords.append(keyword[0])
    cat_keywords_by_frequency = FreqDist(category_keywords)
    print cat_keywords_by_frequency.items()[:50]

Our insights

Negative:
van damme 16, zeta - jones 16, smith 15, batman 14, de palma 14, eddie murphy 14, killer 14, tommy lee jones 14, wild west 14, mars 13, murphy 13, ship 13, space 13, brothers 12, de bont 12, ...

Positive:
star wars 26, disney 23, war 23, de niro 22, jackie 21, alien 20, jackie chan 20, private ryan 20, truman show 20, ben stiller 18, cameron 18, science fiction 18, cameron diaz 16, fiction 16, jack 16, ...

Text Categorization

textanddatamining.blogspot.co.nz/2011/07/svm-classification-intuitive.html

Entertainment

Politics

TVNZ: “Obama and Hangover star trade insults in interview”

Categorization vs Keyword Extraction

[Diagram: tasks arranged by source of terminology (vocabulary, document, any) and number of topics (very few, main topics only, domain-relevant, all possible). Tasks shown: text categorization, term assignment, keyword assignment, keyword extraction, tagging, terminology extraction, topic modeling, full-text indexing.]

Text Classification with Python

import random
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# NLTK 2.x: keys() is sorted by decreasing frequency
word_features = all_words.keys()[:2000]

def document_features(doc):
    doc_words = set(doc)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in doc_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[1000:], featuresets[:1000]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))
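What a Naive Bayes classifier does internally can be sketched in plain Python 3. This is a simplified multinomial variant with add-one smoothing, not NLTK's implementation (which works on the boolean "contains" features above); the training sentences and function names are invented.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per category; labelled_docs is a list of (words, label)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for words, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)
    return word_counts, doc_counts

def classify(words, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts = model
    vocabulary = set(w for counts in word_counts.values() for w in counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)   # prior
        total_words = sum(word_counts[label].values())
        for word in words:
            if word in vocabulary:   # ignore words never seen in training
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
TRAIN = [
    ("a solid and strong performance".split(), "pos"),
    ("dreamy and entertaining film".split(), "pos"),
    ("bad acting and a boring plot".split(), "neg"),
    ("boring film that makes no sense".split(), "neg"),
]
model = train(TRAIN)
print(classify("strong and entertaining".split(), model))
print(classify("boring plot".split(), model))
```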

Classify new reviews using NLTK

from nltk import word_tokenize

# from http://www.imdb.com/title/tt2209764/reviews?ref_=tt_urv
transcendence = ['../data/transcendence_1star.txt',
                 '../data/transcendence_5star.txt',
                 '../data/transcendence_8star.txt',
                 '../data/transcendence_great.txt']

# this time train on all labelled reviews
classifier = nltk.NaiveBayesClassifier.train(featuresets)

for review in transcendence:
    f = open(review)
    raw = f.read()
    document = word_tokenize(raw)
    features = document_features(document)
    print review, classifier.classify(features)

Sentiment analysis with TextBlob

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
print blob.sentiment
# Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)

blob = TextBlob("I love this library")
print blob.sentiment
# Sentiment(polarity=0.5, subjectivity=0.6)

Sentiment Categorization with TextBlob

for review in transcendence:
    f = open(review)
    raw = f.read()
    blob = TextBlob(raw)
    sentiment = blob.sentiment
    if sentiment.polarity > 0.20:
        print review, 'pos', round(sentiment.polarity, 3), \
            round(sentiment.subjectivity, 3)
    else:
        print review, 'neg', round(sentiment.polarity, 3), \
            round(sentiment.subjectivity, 3)

../data/transcendence_1star.txt neg 0.017 0.502

../data/transcendence_5star.txt neg 0.087 0.51

../data/transcendence_8star.txt pos 0.257 0.494

../data/transcendence_great.txt pos 0.304 0.528
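Why a cut-off such as 0.20 is used rather than 0 can be illustrated with a toy lexicon-based scorer (Python 3). This is not TextBlob's algorithm; the LEXICON values and function names are invented for the sketch.

```python
# Hypothetical polarity lexicon with scores in [-1, 1].
LEXICON = {"love": 0.5, "great": 0.8, "good": 0.7, "boring": -1.0, "bad": -0.7}

def polarity(text):
    """Average the scores of the words found in the lexicon (0.0 if none)."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def label(text, threshold=0.20):
    """Same cut-off as on the slide: above the threshold is 'pos'."""
    return "pos" if polarity(text) > threshold else "neg"

for text in ["I love this library", "good idea but boring and bad", "a film"]:
    print(text, round(polarity(text), 3), label(text))
```

Text without any opinion words scores 0.0, so with a threshold of 0 mildly positive noise would already count as 'pos'. In the results above, the 1-star and 5-star reviews score slightly above zero and still end up as 'neg'.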

Sentiment analysis: Aspects

http://www.sentic.net/tutorial/

Topic modeling

http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

Insights through Topic Modeling with GenSim

import basics_applied.keyword_extractor
from gensim import corpora, models
from nltk.corpus import movie_reviews

candidate_extractor = basics_applied.keyword_extractor.CandidateExtractor()

for category in movie_reviews.categories():
    texts = []
    for fileid in movie_reviews.fileids(category):
        words = movie_reviews.words(fileid)
        texts.append(candidate_extractor.run(words, 2))

    dictionary = corpora.Dictionary(texts)
    # drop phrases that occur in fewer than 10 reviews or in more than 10% of them
    dictionary.filter_extremes(no_below=10, no_above=0.1, keep_n=10000)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print 'Category', category
    print 'LDA'
    lda = models.ldamodel.LdaModel(corpus, id2word=dictionary)

    print 'HDP'
    model = models.hdpmodel.HdpModel(corpus, id2word=dictionary)

Insights

Negative

topic 0: acting ability + battle scenes + pretty much + mission to mars + natasha henstridge + live action + ve never + freddie prinze jr
topic 1: bad acting + naked gun + lead role + close - ups + antonio banderas + johnny depp + nothing else + kind of movie + wild wild west
topic 2: salma hayek + woody allen + pulp fiction + next time + make sense + make a movie + target audience + opening sequence
topic 3: subject matter + horror movie + first one + anyone else + throughout the movie + granger movie + end credits + never seen
topic 4: million dollars + ll see + deep impact + de palma + watching the film + granger movie gauge + didn ' t like + makes no sense

Positive

topic 0: martin scorsese + soap opera + fbi agent + old man + first thing + doesn ' t make + entertaining film + first - time + doesn ' t know
topic 1: stanley kubrick + matt dillon + film i ' ve + time period + film like + last two + computer animation + men and women + whole film
topic 2: action film + good and evil + star trek + usual suspects + soon becomes + written and directed + time period + new york + first movie
topic 3: julianne moore + feature film + tom cruise + doesn ' t want + real people + much better + action sequences + see the movie
topic 4: re looking + soap opera + austin powers + edward norton + entertaining film + well enough + old - fashioned + animated feature

LDA: Practical application

Sweaty Horse Blanket: Processing the Natural Language of Beer
by Ben Fields

https://speakerdeck.com/bfields

1. Keyword extraction
2. TFxIDF scoring
3. LDA

Other NLP areas

What’s coming next?

From Strings to Concepts

Precc is a new compiler-compiler that is much more versatile than yacc.

[Figure: candidate concepts for a term in this sentence, marked most likely, less likely or unlikely; the "tool" concept is ticked ✓]

From Concepts to Facts

Applying the Semantic Web technology

▪ Show all politicians, their birth date and gender, mentioned in the document collection and in which documents they appear

Name              Birth date   Gender
Al Gore           31-03-1948   male
Al Green          01-09-1947   male
Alan Hunt         09-10-1927   male
Alberto Fujimori  28-07-1938   male
Barack Obama      04-08-1961   male
Benazir Bhutto    21-06-1953   female
…

Semantic SPARQL Query

select distinct ?name ?birth ?gender where { graph <http://some.url/> …

Parsing

/m/0d3k14

/m/044sb

/m/0d3k14

… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …

Sentiment: 0% Positive, 30% Neutral, 70% Negative

Freebase

What’s next?

Vs.

Conclusions: Understanding human language with Python

▪ State of NLP. Recap on fiction vs reality: Are we there yet?
▪ NLP Complexities. Why is understanding language so complex?
▪ NLP using Python. NLTK, Gensim & TextBlob
▪ Other NLP areas. And what’s coming next

Try also:

▪ Pattern: clips.ua.ac.be/pages/pattern
▪ scikit-learn: scikit-learn.org/stable/
▪ PyNLPl: github.com/proycon/pynlpl
