Top Banner
Fun With Text Hacking Text Analytics Cohan Sujay Carlos Aiaioo Labs Bangalore
95
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Fun with Text - Hacking Text Analytics

Fun With TextHacking Text Analytics

Cohan Sujay CarlosAiaioo LabsBangalore

Page 2: Fun with Text - Hacking Text Analytics

I shall chuck three buzz-words at you:

• One type of big data is unstructured data• One type of unstructured data is text• The analysis of text is text analysis

Is text analytics important?

Page 3: Fun with Text - Hacking Text Analytics

You didn’t answer my question.Is text analytics important?

Hell, I don’t know!But what are you doing reading this if it isn’t?

Page 4: Fun with Text - Hacking Text Analytics

• One Machine Learning Tool• How to reduce Text Analytics Tasks to

steps which you can solve with merely this one Machine Learning Tool

What we’re going to cover

Page 5: Fun with Text - Hacking Text Analytics

The One ML Tool

Get ready for the one ML tool you’ll need to hack text analytics

DRUM ROLL !!!!!!!

Page 6: Fun with Text - Hacking Text Analytics

The One

Get ready for the one ML tool you’ll need to hack text analytics

The Classifier

Page 7: Fun with Text - Hacking Text Analytics

That ML Tool is the Classifier

What is a Classifier?

Something that performs classification.

Classification = categorizingClassification = decidingClassification = labelling

Classification = Deciding = Labelling

Page 8: Fun with Text - Hacking Text Analytics

Classification = Deciding = Labelling

5’11”5’ 8”

Classify these door heights as: Short or Tall ?

5’8”5’11”

6’2”

6’6”5’ 2”

6’8”

6’9”

6’10”

Page 9: Fun with Text - Hacking Text Analytics

Classification = Deciding = Labelling

5’11” Short5’ 8” Short

For classification, you always start withsome labelled data points.

5’8”5’11”

6’2” Tall

6’6” Tall5’ 2” Short

6’8”

6’9”

6’10” Tall

Page 10: Fun with Text - Hacking Text Analytics

Classification = Deciding = Labelling

5’11” Short5’ 8” Short

5’8” Short5’11” Short

6’2” Tall

6’6” Tall5’ 2” Short

6’8” Tall

6’9” Tall

6’10” Tall

If you were doing analysis, you’d ask a human to come up with a rule like:A door below 6’ is Short else it’s Tall

Page 11: Fun with Text - Hacking Text Analytics

Classification = Deciding = Labelling

5’11” Short5’ 8” Short

In ML, you just provide some examples.The computer discovers the rule.

5’8”5’11”

6’2” Tall

6’6” Tall5’ 2” Short

6’8”

6’9”

6’10” Tall

Page 12: Fun with Text - Hacking Text Analytics

Classification = Deciding = Labelling

5’11” Short5’ 8” Short

5’8”5’11”

6’2” Tall

6’6” Tall5’ 2” Short

6’8”

6’9”

6’10” Tall

The ML algorithm learns something like:A door below 6’ is Short else it’s Tall

Page 13: Fun with Text - Hacking Text Analytics

Classification = Deciding = Labelling

5’11” Short5’ 8” Short

5’8” Short5’11” Short

6’2” Tall

6’6” Tall5’ 2” Short

6’8” Tall

6’9” Tall

6’10” Tall

You will learn to create an ML algorithm that learns something like this and that works with text.

Page 14: Fun with Text - Hacking Text Analytics

Topic Classification

Can you tell which is about Politics and which is about Sports?

The United Nations Security Council today

Manchester United beat Barca to reach

Page 15: Fun with Text - Hacking Text Analytics

Topic Classification

Can you train an ML algorithm to tell which is about Politics and which is about Sports?

Manchester United beat Barca to reach

Politics Sports

The United Nations Security Council today

Page 16: Fun with Text - Hacking Text Analytics

Topic Classification

Start by taking some samples of documents on politics and some samples of sports documents.We’re using really short documents so you can do all the calculations manually & see that this ML algorithm really works!

Politics Sports

The United Nations

The United States and

Manchester United

Manchester and Barca

Page 17: Fun with Text - Hacking Text Analytics

Step 1: Learn Multinomial Probabilities

The United States and Politics

Manchester United

Manchester and BarcaSports

P(United/Politics) = 2/7

The United Nations

P(Nations/Politics) = 1/7P(States/Politics) = 1/7

P(Manchester/Sports) = 2/5P(United/Sports) = 1/5P(Barca/Sports) = 1/5

P(Politics) = 7/12 P(Sports) = 5/12

P(The/Politics) = 2/7 P(and/Sports) = 1/5P(and/Politics) = 1/7

Page 18: Fun with Text - Hacking Text Analytics

One ML Algorithm

Step 1 was easy!!!

Are you ready for Step 2 ?

Page 19: Fun with Text - Hacking Text Analytics

Step 2: There’s no step 2

P(United/Politics) = 2/7P(Nations/Politics) = 1/7P(States/Politics) = 1/7

P(Manchester/Sports) = 2/5P(United/Sports) = 1/5P(Barca/Sports) = 1/5

P(Politics) = 7/12 P(Sports) = 5/12

P(The/Politics) = 2/7 P(and/Sports) = 1/5P(and/Politics) = 1/7

This is a Naïve Bayesian Classifier!!!

Page 20: Fun with Text - Hacking Text Analytics

One ML Algorithm

Let’s put the classifier to work:

Let’s see if it can classify the following documents :

1. United Nations2. Manchester United

We are using deliberately short documents!

Page 21: Fun with Text - Hacking Text Analytics

Running the Topic Classifier

United Nations

P(Politics|United Nations)= P(United|P)*P(Nations|Politics)*P(Politics)= (2/7)*(1/7)*(7/12) = 1/(7*6)

P(Sports|United Nations)= P(United|S)*P(Nations|Sports)*P(Sports)= (1/5)*(0)*(5/12) = 0

Page 22: Fun with Text - Hacking Text Analytics

Running the Topic Classifier

United Nations

P(Politics|United Nations)>

P(Sports|United Nations)

So, the classifier has returned the category POLITICS

Page 23: Fun with Text - Hacking Text Analytics

Running the Topic Classifier

Manchester United

P(Politics|Manchester United)= P(Manchester|P)*P(United|P)*P(Politics)= (0)*(2/7)*(7/12) = 0

P(Sports|Manchester United)= P(Manchester|S)*P(United|S)*P(Sports)= (2/5)*(1/5)*(5/12) = 2/(5*12)

Page 24: Fun with Text - Hacking Text Analytics

Running the Topic Classifier

P(Sports|Manchester United)>

P(Politics|Manchester United)

So, the classifier has returned the category SPORTS

Manchester United

Page 25: Fun with Text - Hacking Text Analytics

Topic Classification

Manchester United

Politics

United Nations

Sports

We have successfully used an ML algorithm to tell us which document is about Politics and which is about Sports !!!

Page 26: Fun with Text - Hacking Text Analytics

Can We Solve Other Problems?

Now, we have an ML tool in our toolkit – a Naïve Bayesian classifier.

Page 27: Fun with Text - Hacking Text Analytics

Can We Solve Other Problems?

So, if we can represent a text analysis problemas a classification task, then we can solve itusing ML.

Now, we have an ML tool in our toolkit – a Naïve Bayesian classifier.

Page 28: Fun with Text - Hacking Text Analytics

Can We Solve Other Problems?

So, if we can represent a text analysis problemas a classification task, then we can solve itusing ML.

Now, we have an ML tool in our toolkit – a Naïve Bayesian classifier.

So, let us learn how to represent text analysisproblems as classification tasks.

Page 29: Fun with Text - Hacking Text Analytics

Problem 1: Sentence Segmentation

The problem of identifying the end-points of sentences is called sentence segmentation. It can be reduced to a classification problem.

Page 30: Fun with Text - Hacking Text Analytics

Sentence Segmentation

Yes, Mr. Anurag. You need a D.L. to drive.

Sentence segmentation

Classify each ‘.’, ‘!’ and ‘?’ into:1) sentence terminator and2) not a sentence terminator.

Page 31: Fun with Text - Hacking Text Analytics

Sentence Segmentation

Hurray! We have turned the problem of sentence segmentation into a classification problem.

Once you have modeled a text analytics problem as an ML problem, there is one more step you need to perform to get a working solution.

Page 32: Fun with Text - Hacking Text Analytics

Sentence Segmentation

FIND THE FEATURES!

Page 33: Fun with Text - Hacking Text Analytics

Features

Which country’s flag is this?

Features are the clues you need for good decision making.

Page 34: Fun with Text - Hacking Text Analytics

Features

Which country’s flag is this?

One important feature for solving this decision problem is colour.

Page 35: Fun with Text - Hacking Text Analytics

Features for Topic Classification

P(United/Politics) = 2/7P(Nations/Politics) = 1/7P(States/Politics) = 1/7

P(Manchester/Sports) = 2/5P(United/Sports) = 1/5P(Barca/Sports) = 1/5

P(Politics) = 7/12 P(Sports) = 5/12

P(The/Politics) = 2/7 P(and/Sports) = 1/5P(and/Politics) = 1/7

For topic classification, the features are the individual words in the text.

Page 36: Fun with Text - Hacking Text Analytics

Sentence Segmentation

What are the features you would use?

Yes, Mr. Anurag. You need a D.L. to drive.

That’s when you start reading the research papers!!!

Page 37: Fun with Text - Hacking Text Analytics

Sentence Segmentation

What are the features you would use?Is_This_Character_a_DotIs_This_Within_QuotesIs_Next_Letter_CapitalizedIs_Prev_Letter_CapitalizedNumber_of_Words_in_Sentence_so_FarIs_Next_Word_a_Name

Yes, Mr. Anurag. You need a D.L. to drive.

Page 38: Fun with Text - Hacking Text Analytics

Sentence Segmentation

Train an NB classifier on some text whose characters are marked as sentence terminators or not.

It will learn to assign characters to the categories word terminator and not a word terminator.

Page 39: Fun with Text - Hacking Text Analytics

Problem 2: Tokenization

Now, get a D.L., Mr. Anurag.

For almost all analyses, you have to identify the individual words in the text.

How can you break a sentence into words?

Now , get a D.L. , Mr. Anurag .

Page 40: Fun with Text - Hacking Text Analytics

Tokenization

Yes, Mr. Anurag. You need a D.L. to drive.

Tokenization

Classify each ‘.’, ‘!’, ‘,’, ‘ ’, ‘?’ etc. into:1) word terminator and2) not a word terminator.

Page 41: Fun with Text - Hacking Text Analytics

Tokenization or Word Segmentation

We have turned the problem of word segmentation into a classification problem.

Train a classifier on text whose characters are marked as word terminators or not. It will learn to assign characters to the categories word terminator and not a word terminator.

Page 42: Fun with Text - Hacking Text Analytics

Problem 3: Part of Speech (POS) Tagging

Anurag needs a D.L.

Anurag/NNP needs/VBZ a/DT D.L./NN

How do you tag the words in a sentence?

Page 43: Fun with Text - Hacking Text Analytics

Problem 3: Part of Speech (POS) Tagging

Anurag needs a D.L.

Anurag/NNP needs/VBZ a/DT D.L./NN

POS Tagging

Run from the first to the last word in the sentence, classifying each word in the sentence into a part-of-speech category.

Page 44: Fun with Text - Hacking Text Analytics

We have turned the problem of POS Tagging into a Classification problem where you label each word as:

1. Noun2. Verb3. Adjective4. Adverb5. Interjection6. Conjunction (and, or, if, neither … nor)7. Pronoun (I, it, you, them)8. Preposition (in, of, out)

Problem 3: Part of Speech (POS) Tagging

Page 45: Fun with Text - Hacking Text Analytics

Problem 4: Named Entity Recognition

Anurag is looking for a Hyundai car in Bangalore

Anurag = PersonHyundai car = VehicleBangalore = Place

How do you extract the named entities in a sentence using a classifier?

Page 46: Fun with Text - Hacking Text Analytics

Problem 4: Named Entity Recognition

Anurag is looking for a Hyundai car in Bangalore

Anurag = PersonHyundai car = VehicleBangalore = Place

Anurag/Person is/Other looking/Other for/Other a/Other Hyundai/Vehicle car/Vehicle in/Other Bangalore/Place

Page 47: Fun with Text - Hacking Text Analytics

We have turned the problem of Named Entity Recognition (NER) into a Tagging problem where you label each word as:

1. Person2. Vehicle3. Place

Problem 4: Named Entity Recognition

Page 48: Fun with Text - Hacking Text Analytics

But we know that we can turn the problem of Tagging into a Classification problem over the same labels:

1. Person2. Vehicle3. Place

Problem 4: Named Entity Recognition

Page 49: Fun with Text - Hacking Text Analytics

Problem 4: Named Entity Recognition

Anurag is looking for a Hyundai car in Bangalore

Anurag/Person is/Other looking/Other for/Other a/Other Hyundai/Vehicle car/Vehicle in/Other Bangalore/Place

Named Entity RecognitionRun from the first to the last word in the sentence, classifying each word in the sentence into a Named Entity Recognition category.

Page 50: Fun with Text - Hacking Text Analytics

Hey We Have One More ML Tool

We just built one more very useful ML tool!

The Extractor

Page 51: Fun with Text - Hacking Text Analytics

Extraction

What is an Extractor?

Something that performs extraction.

Extraction = recognizingExtraction = findingExtraction = locating

Extraction = Finding = Locating

Page 52: Fun with Text - Hacking Text Analytics

Problem 5: Relation Extraction

Tim Cook is the new CEO of Apple Computers

Relation: CEO_of Tim Cook (Person) Apple Computers (Org)

How do you identify relations between entities in a sentence using a classifier?

Page 53: Fun with Text - Hacking Text Analytics

Extracting Meaning

Text: Tim Cook is the new CEO of Apple Computers

Analysis: Tim/Person Cook/Person is the new CEOof Apple/Org Computers/Org

Relation Extraction

Step 1:

Page 54: Fun with Text - Hacking Text Analytics

Extracting Meaning

Text: Tim Cook is the new CEO of Apple Computers

Analysis: Tim/Person Cook/Person is the new CEOof Apple/Org Computers/Org

Relation Extraction

Step 1:

Step 2:

Relation Extraction{Tim/Person Cook/Person, Apple/Org Computers/Org} => CEO_of

Page 55: Fun with Text - Hacking Text Analytics

Extracting Meaning

Text: Tim Cook is the new CEO of Apple Computers

Analysis: Tim/Person Cook/Person is the new CEOof Apple/Org Computers/Org

Relation Extraction

Step 1:

Step 2:

Relation ExtractionRun through all pairs of named entities, classifying each pair into CEO_of or Other.

Page 56: Fun with Text - Hacking Text Analytics

But look at what we’re extracting!

Do you realize what we’re extracting?

Meaning !!!

Page 57: Fun with Text - Hacking Text Analytics

Extracting Meaning

Text: I am looking for a Hyundai car in B’lore

Syntactic Analysis: I/Pronoun am/BE looking/V for a/DT Hyundai/NP car/NN in/PP Bangalore/NP

Semantic Analysis: I/Person am looking for aHyundai/Vehicle car/Vehicle in Bangalore/Place

Page 58: Fun with Text - Hacking Text Analytics

But is this comprehensive?

Are you telling me that this is all I have to do to extract every possible sort of simple meaning?

Yeah!!!

All utterances fall into two main categories …

Page 59: Fun with Text - Hacking Text Analytics

Intentional: I want to buy a computer.

Information: There was heavy snowfall in Sikkim.

Two Kinds of Utterances

360 Degree Text Analysis

Page 60: Fun with Text - Hacking Text Analytics

I want to buy a computer.

How do you deal with intentional

utterances

360 Degree Text Analysis

Page 61: Fun with Text - Hacking Text Analytics

Intentional Utterance

Raw Text: Are you sad that Steve Jobs died?

Analysis: This person is inquiring aboutsomeone’s emotions concerning Steve Jobs

Intention Analysis

Intention Holder: IIntention: inquire

Page 62: Fun with Text - Hacking Text Analytics

Subjective Utterance

Raw Text: I am sad that Steve Jobs died

Analysis: This person holds a positive opinionon Steve Jobs

Sentiment Analysis

Sentiment Holder: IObject of Sentiment: Steve JobsPolarity of Sentiment: positive

Page 63: Fun with Text - Hacking Text Analytics

How do you deal with informational

utterances

360 Degree Text Analysis

There was heavy snowfall in Sikkim.

Page 64: Fun with Text - Hacking Text Analytics

Approaches to Extracting Meaning

Raw Text: There is heavy snowfall in Sikkim.

Analysis: Snowfall event

Event Analysis

Event: snowfall

Page 65: Fun with Text - Hacking Text Analytics

Approaches to Extracting Meaning

Raw Text: Bangalore is the capital of K’taka

Analysis: capital_of relation exists

Fact Analysis

Entity: Bangalore/PlaceKarnataka/Place

Relation: Bangalore capital_of K’taka

Page 66: Fun with Text - Hacking Text Analytics

Extracting Meaning

Text: I am looking for a Hyundai car in Bangalore

Semantic Analysis: I/Person am looking for aHyundai/Thing car/Thing in Bangalore/Place

Entity Extraction

Entities:I PersonHyundai car ThingBangalore Place

Fact Analysis

Page 67: Fun with Text - Hacking Text Analytics

Extracting Meaning

Text: Tim Cook is the new CEO of Apple Computers

Analysis: Tim/Person Cook/Person is the new CEOof Apple/Org Computers/Org

Relation Extraction

Relation: CEO_of Tim Cook (Person) Apple Computers(Org)

Fact Analysis

Page 68: Fun with Text - Hacking Text Analytics

So, I can do anything (badly?)

I got it! I got it! I can do anything.

But how do I know how well I am doing it ?

Page 69: Fun with Text - Hacking Text Analytics

Measurement

Measurement:

This measurement thing is very important in any design process, because …

Page 70: Fun with Text - Hacking Text Analytics

Measurement

How do you measure the performance of a classifier?

Measurement:… it lets you compare two designs and decide which is better.

Page 71: Fun with Text - Hacking Text Analytics

Before we go on …

… promise me …

… that you will never forget what I am about to tell you …

Page 72: Fun with Text - Hacking Text Analytics

Measurement

Break up the data points into training and test parts (usually an 80:20 split) and never test on your training data.

Page 73: Fun with Text - Hacking Text Analytics

Measurement

Why not test on the training data?

Break up the data points into training and test – usually an 80:20 split.

Train on 80%

Test on the remaining 20%

Page 74: Fun with Text - Hacking Text Analytics

Measurement

Why not develop on the test data?

Or if you are still developing features,break up the data points into training, development and test – usually a 70:10:20 split.

Train on 70%Develop on 10%Test on the remaining 20%

Page 75: Fun with Text - Hacking Text Analytics

That applies to college as well …

… you won’t get accurate measurements …

… if you test students using questions that appeared in the question bank!

… and you will be encouraging students to learn things by rote!

Page 76: Fun with Text - Hacking Text Analytics

Classification Quality Metric - Accuracy

Correct Answers

Total Number of Questions

Page 77: Fun with Text - Hacking Text Analytics

Classification Quality Metric - Accuracy

Politics Documents classified as Politics+ Sports Documents classified as Sports

Total Number of Documents

If your categories are Politics and Sports

Page 78: Fun with Text - Hacking Text Analytics

Classification Quality Metric - Accuracy

Gold - Politics Gold – Sports

Observed - Politics TN (True Negative) FN (False Positive)

Observed - Sports FP (False Negative) TP (True Positive)

Point of View = Sports = (+ve)

Page 79: Fun with Text - Hacking Text Analytics

Accuracy

Gold – Sports (1000) Gold – Politics (1000)

Observed – Sports TN = 990 FN = 100

Observed – Politics FP = 10 TP = 900

Point of View - Politics

= ?

Page 80: Fun with Text - Hacking Text Analytics

Accuracy

Gold – Sports (1000) Gold – Politics (1000)

Observed – Sports TN = 990 FN = 100

Observed – Politics FP = 10 TP = 900

Point of View - Politics

= 94.5%

Page 81: Fun with Text - Hacking Text Analytics

Classification Quality Metric - Recall

How many Politics document did it find

Total number of Politics documents in the test data

Page 82: Fun with Text - Hacking Text Analytics

Recall

Gold – Sports (1000) Gold – Politics (1000)

Observed – Sports TN = 990 FN = 100

Observed – Politics FP = 10 TP = 900

Point of View - Politics

= ?

Page 83: Fun with Text - Hacking Text Analytics

Recall

Gold – Sports (1000) Gold – Politics (1000)

Observed – Sports TN = 990 FN = 100

Observed – Politics FP = 10 TP = 900

Point of View - Politics

= 90%

Page 84: Fun with Text - Hacking Text Analytics

Recall

Gold – Sports (1000) Gold – Politics (1000)

Observed – Sports TP = 990 FP = 100

Observed – Politics FN = 10 TN = 900

Point of View - Sports

= ?

Page 85: Fun with Text - Hacking Text Analytics

Recall

Gold – Sports (1000) Gold – Politics (1000)

Observed – Sports TP = 990 FP = 100

Observed – Politics FN = 10 TN = 900

Point of View - Sports

= 99%

Page 86: Fun with Text - Hacking Text Analytics

Classification Quality Metric - Precision

How many were really Politics documents

Total number of documents the classifier identified as Politics

Page 87: Fun with Text - Hacking Text Analytics

Precision

Point of View - Politics

= ?

Gold – Sports (1040) Gold – Politics (960)

Observed – Sports TN = 990 FN = 10

Observed – Politics FP = 50 TP = 950

Page 88: Fun with Text - Hacking Text Analytics

Precision

Point of View - Politics

= 95%

Gold – Sports (1040) Gold – Politics (960)

Observed – Sports TN = 990 FN = 10

Observed – Politics FP = 50 TP = 950

Page 89: Fun with Text - Hacking Text Analytics

Precision

Point of View - Sports

= ?

Gold – Sports (1040) Gold – Politics (960)

Observed – Sports TP = 990 FP = 10

Observed – Politics FN = 50 TN = 950

Page 90: Fun with Text - Hacking Text Analytics

Precision

Point of View - Sports

= 99%

Gold – Sports (1040) Gold – Politics (960)

Observed – Sports TP = 990 FP = 10

Observed – Politics FN = 50 TN = 950

Page 91: Fun with Text - Hacking Text Analytics

Are we done?

… kinda …

The more training you have, the better you will get at text analysis … so keep learning!

Page 92: Fun with Text - Hacking Text Analytics

How to Learn More?

Grab the UC Berkeley Natural Language Processing Course’s slides. The course’s name is CS 294.

Start reading the ACL conference’s research papers (ACL = Association of Computational Linguistics)

Page 93: Fun with Text - Hacking Text Analytics

Now, are we done?

… kinda …

Have Fun!

… and I have an exercise for you …

Page 94: Fun with Text - Hacking Text Analytics

Did you know

… that you can use a classifier …

… to do clustering?

The exercise is … think about it …

Page 95: Fun with Text - Hacking Text Analytics

THE END

Fun With TextHacking Text Analytics

Cohan Sujay CarlosAiaioo LabsBangalore

The bottom