Text Analysis with Python Vijay Ramachandran
Page 1: Text analysis using python

Text Analysis with Python

Vijay Ramachandran

Page 2: Text analysis using python

Fools Rush In?

"The fact is, that to do anything in the world worth doing, we must not stand back shivering and thinking of the cold and danger, but jump in and scramble through as well as we can." - Robert Cushing

Page 3: Text analysis using python

Motivation

Machine Learning everywhere

Users' expectations of a "standard experience"

Many Resources!

Page 4: Text analysis using python

Text Mining

Extract high quality information from text

Typically, trends and patterns are analysed using statistical methods – Machine Learning

Common Tasks – entity recognition, sentiment analysis, categorization, clustering

Page 5: Text analysis using python

Why Python?

Short, concise text processing

NLTK

SciPy, NumPy, scikit-learn

Integration with other languages!

Because when you start your company, YOU get to decide!

Page 6: Text analysis using python

Pre-processing

Lower casing, stripping extra characters

"Realyyyyyyyyyy!!!!!"

Tokenisation (sentences, words)

>>> import re, nltk

>>> pktst = nltk.data.load('tokenizers/punkt/english.pickle')

>>> sentences = pktst.tokenize(tweet)

>>> words = [w for sent in sentences for w in nltk.word_tokenize(sent)]

Handling Entities

>>> re.sub(r'(^| )@[^ ]+', '', tweet).strip()

Removing stopwords

>>> stopwords = set(["a", "an", "the", "by"])

>>> ' '.join([w for w in words if w not in stopwords])
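The pre-processing steps on this slide can be strung together without any downloaded tokenizer models. The following is only a rough sketch: the function name `preprocess`, the rule for squeezing repeated characters, and the regex-based tokenisation are assumptions made for illustration, not the author's code, and a real pipeline would likely use the NLTK tokenizers shown above.

```python
import re

STOPWORDS = {"a", "an", "the", "by"}  # the slide's toy stopword list

def preprocess(tweet):
    """Lower-case, strip @mentions, squeeze repeated characters, drop stopwords."""
    text = tweet.lower()
    # Remove @mentions, using the same regex as the slide.
    text = re.sub(r'(^| )@[^ ]+', '', text).strip()
    # Assumption: collapse runs of 3+ identical characters to a single one,
    # so "realyyyyyy!!!!!" becomes "realy!".
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    # Crude tokenisation: runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text)
    return [w for w in words if w not in STOPWORDS]

tokens = preprocess("@bob Realyyyyyyyyyy!!!!! love the camera by Canon")
```

Here `tokens` ends up as `['realy', 'love', 'camera', 'canon']`. Squeezing to a single character is lossy ("coool" and "col" become the same token), so squeezing to two characters is a common alternative.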

Page 7: Text analysis using python

Leveraging a Corpus

Simple techniques to analyze a domain

Term Frequency to find important entities

"low light photography", "travel photography"

tf/idf to find representative terms across domains

"gaming" for TVs

GND to find aliases

e.g., "e700" and "Samsung e700" vs "e310" and "Samsung e310"

Yahoo BOSS is great!
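To make the tf/idf idea concrete, here is a small pure-Python sketch. The two "domains" and their token lists are invented for illustration, and the weighting (relative term frequency times log of inverse document frequency) is one common variant rather than necessarily the one the author used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs maps a domain name to a non-empty token list; returns per-domain term scores."""
    n_docs = len(docs)
    # Document frequency: in how many domains does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

# Hypothetical toy corpus, one token list per product domain.
docs = {
    "tv": "gaming screen hdmi gaming screen".split(),
    "camera": "lens screen photography lens".split(),
}
scores = tf_idf(docs)
top_tv = max(scores["tv"], key=scores["tv"].get)
```

In this toy corpus "screen" is frequent for TVs but appears in both domains, so its score is zero, while "gaming" comes out as the most representative TV term.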

Page 8: Text analysis using python

Supervised Classification

Page 9: Text analysis using python

Bayes Theorem

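Conditional Probability

Bayesian classifiers - given features, find Probability of Class

$$P(C \mid F_1, \ldots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C)$$

(assuming the features are conditionally independent given the class; the omitted normalising constant is the same for every class)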
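A worked example may help with the product rule on this slide. The sketch below multiplies a class prior by per-feature likelihoods and then normalises so the results sum to one. All probabilities here are made-up numbers, and `naive_bayes_posterior` is a hypothetical helper, not part of NLTK.

```python
from math import prod  # Python 3.8+

def naive_bayes_posterior(priors, likelihoods, features):
    """Return P(class | features) under the naive independence assumption."""
    # Unnormalised score per class: P(C) * product over i of P(F_i | C)
    scores = {
        c: priors[c] * prod(likelihoods[c][f] for f in features)
        for c in priors
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical numbers for a two-class sentiment problem.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"love": 0.8, "poor": 0.1},
    "neg": {"love": 0.2, "poor": 0.6},
}
posterior = naive_bayes_posterior(priors, likelihoods, ["love"])
```

Seeing only the feature "love", the unnormalised scores are 0.4 for "pos" and 0.1 for "neg", which normalise to roughly 0.8 and 0.2.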

Page 10: Text analysis using python

Features

Characteristics of to-be-classified object

For text, typically unigrams, n-grams, POS, CHUNK, presence in gazetteer

Can be numerical or boolean

Crucial to performance of classifier!

In NLTK, a dict of feature name to value

>>> {"word1" : 2, "word2" : 1, "word1-word2" : 1,

"word2-word3" : 3, "word1-in-monuments" : False}
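A feature dict like the one on this slide could be produced by a small extractor along the following lines. This is a sketch only: the function name `extract_features`, the key naming scheme, and the gazetteer contents are assumptions chosen to mirror the slide's example.

```python
from collections import Counter

def extract_features(words, gazetteer=frozenset()):
    """Build an NLTK-style feature dict: unigram counts, bigram counts, gazetteer flags."""
    # Numerical features: how often each word occurs.
    features = dict(Counter(words))
    # Numerical features: how often each adjacent word pair occurs.
    features.update(Counter(f"{a}-{b}" for a, b in zip(words, words[1:])))
    # Boolean features: is the word listed in the (hypothetical) gazetteer?
    for w in set(words):
        features[f"{w}-in-gazetteer"] = w in gazetteer
    return features

monuments = {"taj"}  # toy gazetteer, invented for this example
feats = extract_features(["taj", "mahal", "taj"], monuments)
```

The result mixes counts (`"taj": 2`, `"taj-mahal": 1`) with booleans (`"taj-in-gazetteer": True`), which is the shape NLTK classifiers accept.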

Page 11: Text analysis using python

Training

"Teach" the classifier how to classify, using training data

>>> import random, nltk

>>> from nltk import NaiveBayesClassifier as nbc

>>> trg = generate_features(training_samples)

>>> random.shuffle(trg)

>>> cut = int(0.9 * len(trg))

>>> train, test = trg[:cut], trg[cut:]

>>> clf = nbc.train(train)

>>> nltk.classify.accuracy(clf, test)

Page 12: Text analysis using python

Training, part 2

Measure, tune, iterate

Cross Validation

Aim is to find a balance between Precision and Recall
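The measuring step can be sketched in plain Python as follows. Both helpers are hypothetical names written for this example. The fold assignment (every k-th sample) is one simple choice among several, and it assumes the samples were shuffled beforehand.

```python
def k_fold(samples, k):
    """Yield (train, test) pairs; each sample lands in exactly one test fold."""
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        yield train, test

def precision_recall(gold, predicted, positive):
    """Precision and recall for one class, given parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: "q" = question, "n" = not a question.
gold = ["q", "q", "q", "q", "n"]
pred = ["q", "q", "n", "n", "n"]
p, r = precision_recall(gold, pred, positive="q")
folds = list(k_fold(list(range(6)), k=3))
```

In the toy labels the classifier never raises a false alarm (precision 1.0) but misses half of the questions (recall 0.5), which is the kind of imbalance that tuning tries to even out.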

Page 13: Text analysis using python

A Gender Classifier

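import random

import nltk

from nltk.corpus import names

def gender_features(word):
    # The last letter of a name is a surprisingly strong gender cue.
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# Shuffle so the held-out set is not made up of male names only.
random.shuffle(labeled_names)

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]

train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, test_set)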

Page 14: Text analysis using python

Advice on Training

It's TEDIOUS!

Cut the cognitive load

e.g.: ”I love this camera!”

BAD:

0 I PRP B-NP -

1 love VBP B-VP -

2 this DT B-NP -

3 camera NN I-NP 1

Good: (I, love) – NO, (love, this camera) – YES

Dealing with ambiguity

Mechanical Turk, jsonwidget

Page 15: Text analysis using python

Other Classifiers

Maximum Entropy

Support Vector Machines

Conditional Random Fields

All follow the same workflow!

Page 16: Text analysis using python

More Examples

Find questions in Tweets

unigrams, bigrams, trigrams, parse distance

Recognize "contextual questions" in discussions

"reco" words, "thanks" words

"Use-type" recognizer

POS, CHUNK, special words, verbs within 3 words of phrase

e.g., "I want to compose the perfect landscape shot"

Page 17: Text analysis using python

Eats, shoots and leaves

Common basic step!

POS, chunk, parse tree

Hairy theory

FSA, Morphology, Phonology

n-Grams, Probabilistic models, CFGs

Don't Worry, Be Happy!

Page 18: Text analysis using python

No License Required!

What to use? NLTK, or ?

Docking with the Evil MotherShip

Jepp, JPype? ► use RPC

>>> from json import loads

>>> from .stanford_corenlp import jsonrpc

>>> server = jsonrpc.TransportTcpIp(...)

>>> result = loads(server.parse(paragraph))

POS, Parse tree, and more (but no chunks?)

Page 19: Text analysis using python

Wordnet®

"a large lexical database of English"

synsets, hypernyms, hyponyms, gloss

synonyms, antonyms

Sentiwordnet

Start with candidate "good" and "bad" words

Expand by recursively following edges

Classify using definition

e.g., "good" → "better", "good" → "impressive"
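The expansion step described above can be illustrated on a hand-made graph. The sketch below does not touch WordNet itself: `RELATED` is a tiny invented stand-in for WordNet relations, and `expand` is a hypothetical helper that follows edges breadth-first up to a depth limit (an iterative equivalent of the recursive expansion the slide mentions).

```python
from collections import deque

# Hypothetical toy graph standing in for WordNet relations.
RELATED = {
    "good": ["better", "impressive"],
    "better": ["superior"],
    "bad": ["poor"],
    "poor": ["inferior"],
}

def expand(seeds, graph, max_depth=2):
    """Breadth-first expansion of seed words along graph edges, up to max_depth hops."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow edges beyond the depth limit
        for neighbour in graph.get(word, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen

positive = expand({"good"}, RELATED)
negative = expand({"bad"}, RELATED)
```

A depth limit matters in practice because following edges too far tends to drift away from the seed's polarity. The limit of 2 used here is arbitrary.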

Page 20: Text analysis using python

Love or Hate?

"I love the screen, but the battery life is poor"

Shallow?

3 class classifier

Or Deep?

Relationship classifier

Extracting candidate subjects

Lots of unsolved problems – co-reference, multiple subjects, negation, etc.

Summary ratings for "Executive Summaries"

Page 21: Text analysis using python

Gender, revisited

solr - search semi-structured text

Out of the box text processing utilities – stemming, tokenising

Highly configurable relevancy fields, weights

Sorting! "sort=strdist(term, field, edit) desc" for Levenshtein edit distance

Page 22: Text analysis using python

Gender, revisited

Schema

<field name="name" type="string" />

<field name="name_phoneme" type="phonetic" />

Search: add "sort=strdist(unknown_name, name, edit) desc"

Python:

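male_score = female_score = 0.0

# Each close match votes for its gender, weighted by how well it matched.
for namerec in results:
    if namerec.gender == 'Male':
        male_score += namerec.match_score
    else:
        female_score += namerec.match_score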

Correctly guesses ”Sheena” and ”Ashish”!!

80/80 precision/recall
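The same voting idea can be tried without a Solr instance. The sketch below swaps Solr's `strdist` for a pure-Python Levenshtein distance and votes among the closest known names. The name list, the `top_n` cut-off and the similarity normalisation are assumptions made for illustration, so its behaviour will not match the author's setup exactly.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def guess_gender(unknown, known_names, top_n=3):
    """Vote among the top_n closest known names, weighted by similarity."""
    def similarity(name):
        longest = max(len(unknown), len(name)) or 1
        return 1.0 - levenshtein(unknown.lower(), name.lower()) / longest

    ranked = sorted(known_names, key=lambda rec: similarity(rec[0]), reverse=True)
    scores = {"Male": 0.0, "Female": 0.0}
    for name, gender in ranked[:top_n]:
        scores[gender] += similarity(name)
    return max(scores, key=scores.get)

# Hypothetical index of known names; a real system would query Solr instead.
known = [("Sheela", "Female"), ("Reena", "Female"),
         ("Ashok", "Male"), ("Manish", "Male")]
```

With this toy index, `guess_gender("Sheena", known)` returns `"Female"` and `guess_gender("Ashish", known)` returns `"Male"`, because the nearest neighbours by edit distance share the gender of the unknown name.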

Page 24: Text analysis using python

Thank You!!

Vijay Ramachandran: [email protected]