Understanding human language with Python Alyona Medelyan
KiwiPyCon 2014 - NLP with Python tutorial

Jan 14, 2017

Page 1: KiwiPyCon 2014 - NLP with Python tutorial

Understanding human language with Python
Alyona Medelyan

Page 2: KiwiPyCon 2014 - NLP with Python tutorial

Who am I?

Alyona Medelyan

▪ In Natural Language Processing since 2000
▪ PhD in NLP & Machine Learning from Waikato
▪ Author of the state-of-the-art keyword extraction algorithm Maui
▪ Author of the most-cited 2009 journal survey “Mining Meaning with Wikipedia”
▪ Past: Chief Research Officer at Pingar
▪ Now: Founder of Entopix, NLP consultancy & software development

aka @zelandiya

Page 3: KiwiPyCon 2014 - NLP with Python tutorial

Pre-tutorial survey results

Programming / Python experience (scale: Beginners to Experts)

85% no experience with NLP, general interest

Page 4: KiwiPyCon 2014 - NLP with Python tutorial

Agenda

State of NLP
Recap on fiction vs reality: Are we there yet?

NLP Complexities
Why is understanding language so complex?

NLP using Python
Learning the basics, applying them, expanding into further topics

Other NLP areas
And what’s coming next

Page 5: KiwiPyCon 2014 - NLP with Python tutorial

State of NLP

Fiction versus Reality

Page 6: KiwiPyCon 2014 - NLP with Python tutorial

He (KITT) “always had an ego that was easy to bruise and displayed a very sensitive, but kind and dryly humorous personality.” - Wikipedia

Page 7: KiwiPyCon 2014 - NLP with Python tutorial

Android Auto: “hands-free operation through voice commands will be emphasized to ensure safe driving”

Page 8: KiwiPyCon 2014 - NLP with Python tutorial

“by putting this into one's ear one can instantly understand anything said in any language” (Hitchhiker Wiki)

Page 9: KiwiPyCon 2014 - NLP with Python tutorial

Word Lens: “augmented reality translation”

Page 10: KiwiPyCon 2014 - NLP with Python tutorial

The LCARS (or simply library computer) … used sophisticated artificial intelligence routines to understand and execute vocal natural language commands (From Memory Alpha Wiki)

Page 11: KiwiPyCon 2014 - NLP with Python tutorial

Let’s try out Google

Page 12: KiwiPyCon 2014 - NLP with Python tutorial

It doesn’t always work… (the person searched for “Steve Jobs”)

Page 13: KiwiPyCon 2014 - NLP with Python tutorial

“Samantha [the OS] proves to be constantly available, always curious and interested, supportive and undemanding”

Page 14: KiwiPyCon 2014 - NLP with Python tutorial

Siri doesn’t seem to be as “available”

Page 15: KiwiPyCon 2014 - NLP with Python tutorial

NLP Complexities

Why is understanding language so complex?

Page 16: KiwiPyCon 2014 - NLP with Python tutorial
Page 17: KiwiPyCon 2014 - NLP with Python tutorial

Last week's GDP figures, which were 0.8% for the March quarter (average forecast was 0.4%) and included a revision of the December quarter figures from 0.2% to 0.5%... That takes away the rationale for the OCR to remain at stimulatory levels. It is currently at 2.5%.

Also, in fighting inflation, Dr. Bollard has one rather tricky ally - the exchange rate, which hit a record 85USc last week in N.Z. Running at that level, the currency is keeping imported inflation at low levels.

Sentence detection complexities
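
Why the passage above is hard can be shown with a small experiment. The sketch below (Python 3, standard library only) is not part of the tutorial code: the abbreviation list is a hypothetical stand-in, and real sentence detectors are considerably more sophisticated. It only illustrates that splitting at every full stop breaks on "Dr." while "N.Z." happens to end a sentence here.

```python
import re

TEXT = ("Also, in fighting inflation, Dr. Bollard has one rather tricky ally - "
        "the exchange rate, which hit a record 85USc last week in N.Z. "
        "Running at that level, the currency is keeping imported inflation "
        "at low levels.")

def naive_split(text):
    """Split after every full stop that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs."}

def abbreviation_aware_split(text):
    """Split at tokens ending in a full stop unless they are known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

naive = naive_split(TEXT)
aware = abbreviation_aware_split(TEXT)
print(len(naive), len(aware))
```

The naive splitter cuts the first sentence in two after "Dr.", while the abbreviation-aware one returns the two intended sentences. Adding "N.Z." to the abbreviation list would fix other texts but break this one, which is exactly the ambiguity the slide points at.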

Page 18: KiwiPyCon 2014 - NLP with Python tutorial

Word segmentation complexities

▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。
▪ 广大发展中国家一致支持这个目标,并提出了各自的期望细节。
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
▪ The first hot dogs were sold by Charles Feltman on Coney Island in 1870.
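
One classic baseline for segmenting text without spaces is dictionary-based maximum matching: at each position, take the longest word found in a vocabulary. The sketch below (Python 3) is an illustration rather than the method shown on the slide; the vocabulary is a hypothetical toy list, and the English example has its spaces removed to mimic unsegmented text.

```python
# Hypothetical toy vocabulary; a real segmenter would use a large dictionary.
VOCABULARY = {"the", "first", "hot", "dog", "dogs", "were", "sold", "so", "we"}

def max_match(text, vocabulary):
    """Greedy forward maximum matching: always take the longest known word."""
    max_len = max(len(word) for word in vocabulary)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocabulary or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens

segmented = max_match("thefirsthotdogsweresold", VOCABULARY)
print(segmented)
```

Whether "hot dogs" should then be treated as one unit or two is a separate decision that a greedy matcher cannot make on its own.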

Page 19: KiwiPyCon 2014 - NLP with Python tutorial

Disambiguation complexities

Flying planes can be dangerous

Page 20: KiwiPyCon 2014 - NLP with Python tutorial

Sentiment complexities

from: http://www.sentic.net/tutorial/

Page 21: KiwiPyCon 2014 - NLP with Python tutorial

NLP using Python

Learning the basics, applying them, expanding into further topics

Page 22: KiwiPyCon 2014 - NLP with Python tutorial

import sys
import pocketsphinx

if __name__ == "__main__":
    hmdir = "/usr/share/pocketsphinx/model/hmm/wsj1"
    lmdir = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.lm.DMP"
    dictd = "/usr/share/pocketsphinx/model/lm/wsj/wlist5o.dic"
    wavfile = sys.argv[1]

    speechRec = pocketsphinx.Decoder(hmm = hmdir, lm = lmdir, dict = dictd)
    wavFile = file(wavfile,'rb')
    speechRec.decode_raw(wavFile)
    result = speechRec.get_hyp()

    print result

Speech recognition with Python
Using CMU Sphinx

http://www.confusedcoders.com/random/speech-recognition-in-python-with-cmu-pocketsphinx

Page 23: KiwiPyCon 2014 - NLP with Python tutorial

What can we do with text?

[Diagram: text documents and what can be extracted from them: sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …]

Page 24: KiwiPyCon 2014 - NLP with Python tutorial

What can we do with text?

[Diagram: text documents and what can be extracted from them: sentiment, keywords, tags, genre, categories, taxonomy terms, entities, names, patterns, biochemical entities, …; part of the diagram is marked as the practical part of this tutorial]

Page 25: KiwiPyCon 2014 - NLP with Python tutorial

Introducing NLTK – Python platform for NLP

Page 26: KiwiPyCon 2014 - NLP with Python tutorial

Setting up

Clone or Download ZIP:
https://github.com/zelandiya/KiwiPyCon-NLP-tutorial

Page 27: KiwiPyCon 2014 - NLP with Python tutorial

Working with corpora in NLTK

>>> from nltk.corpus import movie_reviews

>>> print len(movie_reviews.fileids())

>>> print movie_reviews.categories()

>>> print movie_reviews.fileids('neg')[:10]
>>> print movie_reviews.fileids('pos')[:10]

>>> print movie_reviews.words('pos/cv000_29590.txt')

>>> print movie_reviews.raw('pos/cv000_29590.txt')

>>> print movie_reviews.sents('pos/cv000_29590.txt')

Page 28: KiwiPyCon 2014 - NLP with Python tutorial

NLTK Corpus – basic functionality

Page 29: KiwiPyCon 2014 - NLP with Python tutorial

Getting to know text: Word frequencies

from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

words = movie_reviews.words('pos/cv000_29590.txt')
freqs = FreqDist(words)

print 'Most frequent words in review', freqs.items()[:20]

for category in movie_reviews.categories():
    print 'Category', category
    all_words = movie_reviews.words(categories=category)
    all_words_by_frequency = FreqDist(all_words)
    print all_words_by_frequency.items()[:20]

Page 30: KiwiPyCon 2014 - NLP with Python tutorial

Output of “frequent words”

Most frequent words in review
[('the', 46), (',', 43), ("'", 25), ('.', 23), ('and', 21), ...

Category neg
[(',', 35269), ('the', 35058), ('.', 32162), ('a', 17910), ...

Category pos
[(',', 42448), ('the', 41471), ('.', 33714), ('a', 20196), ...
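
The counting behind this output can be sketched with the standard library alone. The example below (Python 3) is an illustration, not the tutorial's code: it uses `collections.Counter` in place of NLTK's `FreqDist`, and the token list is a made-up mini-review.

```python
from collections import Counter

# Hypothetical mini-review, already lower-cased and tokenised.
tokens = "the film is good , the cast is good , the plot is thin .".split()

freqs = Counter(tokens)
top = freqs.most_common(2)  # the two most frequent tokens with their counts
print(top)
```

Even in this tiny sample the top of the list is taken by function words, with punctuation close behind, which is the same effect as in the corpus-wide counts above.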

Page 31: KiwiPyCon 2014 - NLP with Python tutorial

How to get to the core words?

even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance

i think that from hell has a pretty solid acting, especially with the dreamy depp turning in a strong performance as he usually does

* “from hell” is the title of the movie; removing stopwords alone will not be sufficient to process this example correctly

Remove Stopwords!

Page 32: KiwiPyCon 2014 - NLP with Python tutorial

Stopword removal with NLTK

from nltk.corpus import movie_reviews
from nltk.corpus import stopwords

stop = stopwords.words('english')

words = movie_reviews.words('pos/cv000_29590.txt')
no_stops = [word for word in words if word not in stop]

Page 33: KiwiPyCon 2014 - NLP with Python tutorial

NLTK Stopwords: Before & After

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',']

['films', 'adapted', 'comic', 'books', 'plenty', 'success', ',', 'whether', "'", 're', 'superheroes', '(', 'batman', ',']
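
The before/after lists can be reproduced without NLTK. In the sketch below (Python 3), `STOP` is a hypothetical six-word stand-in for NLTK's much longer English stopword list, and the extra punctuation filter is an addition that the slide does not show.

```python
# Tokens from the "before" list above.
tokens = ['films', 'adapted', 'from', 'comic', 'books', 'have', 'had',
          'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're',
          'about', 'superheroes', '(', 'batman', ',']

# Hypothetical mini stopword list; NLTK's English list is much longer.
STOP = {'from', 'have', 'had', 'of', 'they', 'about'}

def remove_stopwords(words, stop):
    """Keep every token that is not in the stopword set."""
    return [w for w in words if w not in stop]

def remove_punctuation(words):
    """Keep purely alphabetic tokens only."""
    return [w for w in words if w.isalpha()]

no_stops = remove_stopwords(tokens, STOP)
content_words = remove_punctuation(no_stops)
print(content_words)
```

Using a set for the stopword list keeps each membership test constant-time, which matters once the filter runs over a whole corpus.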

Page 34: KiwiPyCon 2014 - NLP with Python tutorial

Part of speech tagging & filtering

import nltk
from nltk.corpus import movie_reviews
from nltk.probability import FreqDist

words = movie_reviews.words('pos/cv000_29590.txt')
pos = nltk.pos_tag(words)

filtered_words = [x[0] for x in pos if x[1] in ('NN', 'JJ')]

print FreqDist(filtered_words).items()[:20]

Page 35: KiwiPyCon 2014 - NLP with Python tutorial

POS tagging & filtering results

[('films', 'NNS'), ('adapted', 'VBD'), ('from', 'IN'), ('comic', 'JJ'), ('books', 'NNS'), ('have', 'VBP'), ('had', 'VBN'), ('plenty', 'NN'), ('of', 'IN'), ('success', 'NN')

[('t', 9), ('comic', 5), ('film', 5), ('hell', 5), ('book', 3), ('campbell', 3), ('don', 3), ('ripper', 3), ('abberline', 2), ('accent', 2), ('depp', 2), ('end', 2),

Page 36: KiwiPyCon 2014 - NLP with Python tutorial

From Single to Multi-Word Phrases

NEJM usually has the highest impact factor of the journals of clinical medicine.

highest, highest impact, highest impact factor

ignore stopwords

Option 1. Ngrams

Option 2. Chunking / POS patterns

from http://www.nltk.org/book/ch07.html#chap-chunk

Page 37: KiwiPyCon 2014 - NLP with Python tutorial

Ngram extraction with NLTK

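from nltk.util import ngrams

# acceptable() and has_no_boundaries() are helper functions not shown on this slide
my_ngrams = []
for n in range(2, 5):
    for gram in ngrams(words, n):
        if acceptable(gram[0]) \
                and acceptable(gram[-1]) \
                and has_no_boundaries(gram):
            phrase = ' '.join(gram)
            my_ngrams.append(phrase)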

[("' s", 11), ("' t", 10), (', but', 6), ("don '", 5), ("don ' t", 5), ('from hell', 5)

[('comic book', 2), ('jack the ripper', 2), ('moore and campbell', 2), ('say moore', 2),
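
A self-contained version of the filtered n-gram extraction might look as follows. This is a sketch in Python 3 under stated assumptions: `acceptable` and `has_no_boundaries` are guessed implementations of the helpers named on the slide (the tutorial's own versions may differ), `STOP` is a hypothetical stopword list, and `ngrams` is a plain-Python stand-in for the NLTK function.

```python
STOP = {"usually", "has", "the", "of"}  # hypothetical stopword list

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))

def acceptable(word):
    """Assumed rule: a phrase may start or end only with an alphabetic non-stopword."""
    return word.isalpha() and word not in STOP

def has_no_boundaries(gram):
    """Assumed rule: a phrase must not span punctuation."""
    return all(token.isalpha() for token in gram)

def extract_phrases(tokens, min_n=2, max_n=4):
    phrases = []
    for n in range(min_n, max_n + 1):
        for gram in ngrams(tokens, n):
            if acceptable(gram[0]) and acceptable(gram[-1]) and has_no_boundaries(gram):
                phrases.append(" ".join(gram))
    return phrases

tokens = ("nejm usually has the highest impact factor of the journals "
          "of clinical medicine .").split()
phrases = extract_phrases(tokens)
print(phrases)
```

Stopwords are still allowed inside a phrase, which is how a title such as "jack the ripper" survives the filter in the output above.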

Page 38: KiwiPyCon 2014 - NLP with Python tutorial

Corpus statistics: TFxIDF

Page 39: KiwiPyCon 2014 - NLP with Python tutorial

TFxIDF with Gensim

from nltk.corpus import movie_reviews
from gensim import corpora, models

texts = []
for fileid in movie_reviews.fileids():
    words = movie_reviews.words(fileid)
    texts.append(words)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)

for word in ['film', 'movie', 'comedy', 'violence', 'jolie']:
    id = dictionary.token2id.get(word)
    print word, id, tfidf.idfs[id]

Page 40: KiwiPyCon 2014 - NLP with Python tutorial

TFxIDF with Gensim: Results

film 124 0.190174003903
movie 207 0.364013496254
comedy 653 1.98564470702
violence 1382 3.2108967825
jolie 9418 6.96578428466
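
The third column is the inverse document frequency: the rarer a word is across reviews, the higher its value. The sketch below (Python 3, standard library only) computes it as log2(N / df), which is assumed here to match Gensim's default weighting; the four mini-documents are made up for illustration.

```python
import math

# Hypothetical mini-corpus: each document is a list of tokens.
docs = [
    ["film", "comedy", "jolie"],
    ["film", "violence"],
    ["film", "comedy"],
    ["film", "movie"],
]

def inverse_document_frequencies(documents):
    """Return log2(N / df) for each word, where df counts documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log2(n_docs / df) for word, df in doc_freq.items()}

idfs = inverse_document_frequencies(docs)
print(idfs)
```

A word that occurs in every document gets an IDF of zero, which explains why "film" scores so low in the movie review corpus while a name like "jolie" scores high.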

Page 41: KiwiPyCon 2014 - NLP with Python tutorial

NLP using Python

Learning the basics, applying them, expanding into further topics

Page 42: KiwiPyCon 2014 - NLP with Python tutorial

How a keyword extraction algorithm works

Document → Candidates → Properties → Scoring → Keywords

Candidates:
▪ Slide window
▪ Break at stopwords & punctuation
▪ Normalize
▪ Map to vocabulary (optional)
▪ Disambiguate (optional)

Properties (calculate):
▪ Frequency of occurrences
▪ Position in the document
▪ Phrase length
▪ Similarity to other candidates
▪ Prominence in this particular text
▪ Part of speech pattern
▪ Is it a popular keyword?

Scoring:
▪ Heuristic formula that combines most powerful properties
OR
▪ Supervised machine learning that learns the importance of properties from manually assigned keywords

Page 43: KiwiPyCon 2014 - NLP with Python tutorial

Candidates extraction in Python

def get_candidates(words, stop):
    filtered_words = [word for word in words
                      if word not in stop and word[0].isalpha()]
    text_ngrams = get_ngrams(words, stop)
    return filtered_words + text_ngrams

Page 44: KiwiPyCon 2014 - NLP with Python tutorial

Candidate scoring in Python

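import operator
from nltk.probability import FreqDist

def score_candidates(candidates, dictionary, tfidf):
    scores = {}
    freqs = FreqDist(candidates)
    for word in set(candidates):
        # term frequency, normalised by the number of distinct candidates
        tf = float(freqs[word]) / len(freqs)
        id = dictionary.token2id.get(word)
        # token id 0 is valid, so compare against None explicitly
        if id is not None:
            idf = tfidf.idfs[id]
        else:
            idf = 0
        scores[word] = tf*idf

    return sorted(scores.iteritems(), key=operator.itemgetter(1), reverse = True)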

Page 45: KiwiPyCon 2014 - NLP with Python tutorial

Test keywords extractor

bellboy
jennifer beals
four rooms
beals
rooms
tarantino
madonna
antonio banderas
valeria golino

…four of the biggest directors in hollywood : quentin tarantino , robert rodriguez , … were all directing one big film with a big and popular cast ...the second room ( jennifer beals ) was better , but lacking in plot ... the bumbling and mumbling bellboy , and he ruins every joke in the film …

Page 46: KiwiPyCon 2014 - NLP with Python tutorial

Analysis of the results

• Remove sub-phrases in favour of higher ranked ones
• Score adjectives & adverbs higher using Part of Speech tagging
• Add stemming
• …

neg/cv480_21195.txt fight club, club, fight, se7en and the game, inter - office, inter - office politics, tyler, office politics, politics, woven, inter, befuddled

neg/cv235_10704.txt babysitter, goal of the babysitter, thug, boyfriend, goal, fails, fantasizes, dream sequences, silverstone, dream

neg/cv248_15672.txt vampires, vampire, rude, suggestion, regressive movie

neg/cv136_12384.txt lost in space, robinson, robinsons, story changes, cartoony
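
The first suggested improvement, removing sub-phrases in favour of higher ranked ones, could be sketched as below. This is one possible reading of that bullet in Python 3, not code from the tutorial repository; the function names are hypothetical and the input is the ranked keyword list of neg/cv480_21195.txt shown above.

```python
def is_subphrase(short, long):
    """True if the tokens of `short` occur contiguously inside the longer `long`."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(
        l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def remove_subphrases(ranked_keywords):
    """Walk down the ranking and drop keywords contained in a kept, higher-ranked one."""
    kept = []
    for keyword in ranked_keywords:
        if not any(is_subphrase(keyword, better) for better in kept):
            kept.append(keyword)
    return kept

ranked = ['fight club', 'club', 'fight', 'se7en and the game',
          'inter - office', 'inter - office politics', 'tyler',
          'office politics', 'politics', 'woven', 'inter', 'befuddled']
cleaned = remove_subphrases(ranked)
print(cleaned)
```

Note that 'inter - office' survives next to 'inter - office politics' because the shorter phrase is ranked higher; handling that case would need a second rule, for example preferring the longer phrase when scores are close.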

Page 47: KiwiPyCon 2014 - NLP with Python tutorial

Getting insights from text!

Which actors, directors, movie plots and film qualities make a successful movie?

1. Apply candidate extraction on each review (to initialize TFxIDF scorer)
2. Extract common keywords from positive and negative reviews

Page 48: KiwiPyCon 2014 - NLP with Python tutorial

Insights – Step 1

from nltk.corpus import movie_reviews
from nltk.probability import FreqDist
from basics_applied import keyword_extractor

candidate_extractor = keyword_extractor.CandidateExtractor()

texts = []
texts_ids = {}
count = 0
for fileid in movie_reviews.fileids():
    words = candidate_extractor.run(movie_reviews.words(fileid))
    texts.append(words)
    texts_ids[fileid] = count
    count += 1

Page 49: KiwiPyCon 2014 - NLP with Python tutorial

Insights – Step 2

for category in movie_reviews.categories():
    print 'Category', category
    category_keywords = []
    for fileid in movie_reviews.fileids(categories=category):
        count = texts_ids[fileid]
        candidates = texts[count]
        keywords = candidate_scorer.run(candidates)[:20]
        for keyword in keywords:
            category_keywords.append(keyword[0])
            if ' ' in keyword[0]:
                category_keywords.append(keyword[0])
    cat_keywords_by_frequency = FreqDist(category_keywords)
    print cat_keywords_by_frequency.items()[:50]

Page 50: KiwiPyCon 2014 - NLP with Python tutorial

Our insights

Negative

van damme 16
zeta - jones 16
smith 15
batman 14
de palma 14
eddie murphy 14
killer 14
tommy lee jones 14
wild west 14
mars 13
murphy 13
ship 13
space 13
brothers 12
de bont 12
...

Positive

star wars 26
disney 23
war 23
de niro 22
jackie 21
alien 20
jackie chan 20
private ryan 20
truman show 20
ben stiller 18
cameron 18
science fiction 18
cameron diaz 16
fiction 16
jack 16
...

Page 51: KiwiPyCon 2014 - NLP with Python tutorial

NLP using Python

Learning the basics, applying them, expanding into further topics

Page 52: KiwiPyCon 2014 - NLP with Python tutorial

Text Categorization

textanddatamining.blogspot.co.nz/2011/07/svm-classification-intuitive.html

Entertainment

Politics

TVNZ: “Obama and Hangover star trade insults in interview”

Page 53: KiwiPyCon 2014 - NLP with Python tutorial

Categorization vs Keyword Extraction

[Chart: tasks arranged along two axes]

Source of terminology: any, document, vocabulary

Number of topics: very few, main topics only, domain-relevant, all possible

Tasks shown: text categorization, term assignment, keyword assignment, tagging, keyword extraction, terminology extraction, topic modeling, full-text indexing

Page 54: KiwiPyCon 2014 - NLP with Python tutorial

Text Classification with Python

import random
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

# document_features:
#     for word in word_features:
#         features['contains(%s)' % word] = (word in doc_words)

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[1000:], featuresets[:1000]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))
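
The `document_features` function is only hinted at in the comments above. A minimal standalone sketch of it in Python 3 could look like this; passing `word_features` as a parameter is a choice made here to keep the example self-contained (the slide's version takes only the document), and the three-word feature list is hypothetical.

```python
def document_features(document, word_features):
    """Map a tokenised document to boolean 'contains(word)' features."""
    doc_words = set(document)  # set lookup is faster than scanning the list
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in doc_words)
    return features

# Hypothetical feature words and review, for illustration only.
word_features = ['good', 'bad', 'plot']
review = ['a', 'good', 'film', 'with', 'a', 'thin', 'plot']

features = document_features(review, word_features)
print(features)
```

Each review thus becomes a fixed-size dictionary of present/absent flags, which is the input format the Naive Bayes classifier is trained on.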

Page 55: KiwiPyCon 2014 - NLP with Python tutorial

Classify new reviews using NLTK

# from http://www.imdb.com/title/tt2209764/reviews?ref_=tt_urv
transcendence = ['../data/transcendence_1star.txt',
                 '../data/transcendence_5star.txt',
                 '../data/transcendence_8star.txt',
                 '../data/transcendence_great.txt']

classifier = nltk.NaiveBayesClassifier.train(featuresets)

for review in transcendence:
    f = open(review)
    raw = f.read()
    document = word_tokenize(raw)
    features = document_features(document)
    print review, classifier.classify(features)

Page 56: KiwiPyCon 2014 - NLP with Python tutorial

Sentiment analysis with TextBlob

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
print blob.sentiment
# Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)

blob = TextBlob("I love this library")
print blob.sentiment
# Sentiment(polarity=0.5, subjectivity=0.6)

Page 57: KiwiPyCon 2014 - NLP with Python tutorial

Sentiment Categorization with TextBlob

for review in transcendence:
    f = open(review)
    raw = f.read()
    blob = TextBlob(raw)
    sentiment = blob.sentiment
    if sentiment.polarity > 0.20:
        print review, 'pos', round(sentiment.polarity, 3), round(sentiment.subjectivity, 3)
    else:
        print review, 'neg', round(sentiment.polarity, 3), round(sentiment.subjectivity, 3)

../data/transcendence_1star.txt neg 0.017 0.502

../data/transcendence_5star.txt neg 0.087 0.51

../data/transcendence_8star.txt pos 0.257 0.494

../data/transcendence_great.txt pos 0.304 0.528
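
The role of the 0.20 cut-off can be checked against the polarity values reported above. The sketch below (Python 3) does not call TextBlob; it only replays the thresholding step on those four numbers, and the `label` helper is a hypothetical name.

```python
def label(polarity, threshold=0.20):
    """Call a review positive only if its polarity exceeds the threshold."""
    return 'pos' if polarity > threshold else 'neg'

# Polarity values as printed in the results above.
reported = {'1star': 0.017, '5star': 0.087, '8star': 0.257, 'great': 0.304}

labels = {name: label(polarity) for name, polarity in reported.items()}
print(labels)
```

All four polarities are above zero, so a threshold of 0.0 would label even the 1-star review as positive; the cut-off has to be tuned to the data.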

Page 58: KiwiPyCon 2014 - NLP with Python tutorial

Sentiment analysis: Aspects

http://www.sentic.net/tutorial/

Page 59: KiwiPyCon 2014 - NLP with Python tutorial

Topic modeling

http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf

Page 60: KiwiPyCon 2014 - NLP with Python tutorial

Insights through Topic Modeling with Gensim

candidate_extractor = basics_applied.keyword_extractor.CandidateExtractor()

for category in movie_reviews.categories():
    texts = []
    for fileid in movie_reviews.fileids(category):
        words = movie_reviews.words(fileid)
        # list.append returns None, so its result is not assigned
        texts.append(candidate_extractor.run(words, 2))

    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=10, no_above=0.1, keep_n=10000)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print 'Category', category
    print 'LDA'
    lda = models.ldamodel.LdaModel(corpus, id2word=dictionary)

    print 'HDP'
    model = models.hdpmodel.HdpModel(corpus, id2word=dictionary)

Page 61: KiwiPyCon 2014 - NLP with Python tutorial

Insights

Negative

topic 0: acting ability + battle scenes + pretty much + mission to mars + natasha henstridge + live action + ve never + freddie prinze jr
topic 1: bad acting + naked gun + lead role + close - ups + antonio banderas + johnny depp + nothing else + kind of movie + wild wild west
topic 2: salma hayek + woody allen + pulp fiction + next time + make sense + make a movie + target audience + opening sequence
topic 3: subject matter + horror movie + first one + anyone else + throughout the movie + granger movie + end credits + never seen
topic 4: million dollars + ll see + deep impact + de palma + watching the film + granger movie gauge + didn ' t like + makes no sense

Positive

topic 0: martin scorsese + soap opera + fbi agent + old man + first thing + doesn ' t make + entertaining film + first - time + doesn ' t know
topic 1: stanley kubrick + matt dillon + film i ' ve + time period + film like + last two + computer animation + men and women + whole film
topic 2: action film + good and evil + star trek + usual suspects + soon becomes + written and directed + time period + new york + first movie
topic 3: julianne moore + feature film + tom cruise + doesn ' t want + real people + much better + action sequences + see the movie
topic 4: re looking + soap opera + austin powers + edward norton + entertaining film + well enough + old - fashioned + animated feature

Page 62: KiwiPyCon 2014 - NLP with Python tutorial

LDA: Practical application

Sweaty Horse Blanket: Processing the Natural Language of Beer
by Ben Fields

Page 63: KiwiPyCon 2014 - NLP with Python tutorial

1. Keyword extraction
2. TFxIDF scoring
3. LDA

Page 64: KiwiPyCon 2014 - NLP with Python tutorial

Other NLP areas

What’s coming next?

Page 65: KiwiPyCon 2014 - NLP with Python tutorial

From Strings to Concepts

most likely
less likely
unlikely

tool

Precc is a new compiler-compiler that is much more versatile than yacc.

Page 66: KiwiPyCon 2014 - NLP with Python tutorial

From Concepts to Facts

Page 67: KiwiPyCon 2014 - NLP with Python tutorial

Applying the Semantic Web technology

▪ Show all politicians, their birth date and gender, mentioned in the document collection and in which documents they appear

Al Gore, 31-03-1948, male
Al Green, 01-09-1947, male
Alan Hunt, 09-10-1927, male
Alberto Fujimori, 28-07-1938, male
Barack Obama, 04-08-1961, male
Benazir Bhutto, 21-06-1953, female

Semantic SPARQL Query

select distinct ?name ?birth ?gender where { graph <http://some.url/> …

Page 68: KiwiPyCon 2014 - NLP with Python tutorial

Parsing

/m/0d3k14

/m/044sb

/m/0d3k14

… Jack Ruby, who killed J.F.Kennedy's assassin Lee Harvey Oswald. …

Sentiment: 0% Positive, 30% Neutral, 70% Negative

Freebase

Page 69: KiwiPyCon 2014 - NLP with Python tutorial

What’s next?

Vs.

Page 70: KiwiPyCon 2014 - NLP with Python tutorial

Conclusions: Understanding human language with Python

State of NLP
Recap on fiction vs reality: Are we there yet?

NLP Complexities
Why is understanding language so complex?

NLP using Python
NLTK, Gensim & TextBlob

Other NLP areas
And what’s coming next

Try also:

Pattern: clips.ua.ac.be/pages/pattern
scikit-learn: scikit-learn.org/stable/
PyNLPl: github.com/proycon/pynlpl