Introduction to Text Mining

Mandar Mitra

Indian Statistical Institute

M. Mitra (ISI) Text Mining 1 / 29


Outline

1 Preliminaries

2 Preprocessing

3 Mining word associations

4 Opinion mining


What is Text Mining?

Strict definition:

  The nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data

OR

Loose definition:

  The science of extracting useful information from large [textual] datasets

Old wine in a new bottle?

  Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition)


Why is it interesting?

Growth of Web / electronic information sources

Multidisciplinary nature

E-commerce potential

“Electronic commerce is emerging as the killer domain for data-mining technology” (Ronny Kohavi)


Data sources

World Wide Web

  unstructured and semi-structured text

  “deep” web: pages that do not exist until they are created dynamically as the result of a specific search

  social networks

Intranet

  internal correspondence, memos, presentations

  white papers, technical reports

  customer email, customer forums, product reviews

News wires

. . .

No structure / general schema / tabular form that fits text


2 Preprocessing

Indexing

Any text item (“document”) is represented as a list of terms and associated weights

$D = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$

Term = keywords or content descriptors

Weight = measure of the importance of a term in representing the information contained in the document


Indexing

Tokenization: identify individual words

Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.

Sachin / Tendulkar / made / a / tearful / but / . . .
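A minimal Python sketch of the tokenization step, assuming that a token is a run of letters or digits and that internal hyphens are kept (so `24-year` stays a single token). The function name `tokenize` and the regular expression are illustrative choices, not something the slides prescribe.

```python
import re

def tokenize(text):
    """Split text into word tokens; hyphenated words such as '24-year' stay whole."""
    # One or more alphanumeric runs, optionally joined by single hyphens.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

sentence = ("Sachin Tendulkar made a tearful but self-effacing farewell as his "
            "glittering 24-year career came to an end on Saturday at his home "
            "ground of Wankhede Stadium.")
tokens = tokenize(sentence)
print(tokens[:6])
```

Real tokenizers also have to decide how to treat apostrophes, abbreviations, and non-Latin scripts; this sketch ignores those cases.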


Indexing

Stopword removal: eliminate common words, e.g. and, of, the, etc.

Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
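Stopword removal can be sketched as a simple filter over the token list. The stopword set below is a tiny hypothetical list chosen to cover the example sentence; practical systems use much longer lists.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "and", "as", "at", "but", "his", "of", "on", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ("Sachin Tendulkar made a tearful but self-effacing farewell as his "
          "glittering 24-year career came to an end on Saturday at his home "
          "ground of Wankhede Stadium").split()
content_tokens = remove_stopwords(tokens)
print(content_tokens)
```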


Indexing

Stemming: reduce words to a common root

  e.g. resignation, resigned, resigns → resign

  analysis, analyze, analyzing → analy

  use standard algorithms (Porter)


Indexing

Thesaurus: find synonyms for words in the document

Phrases: find multi-word terms, e.g. computer science, data mining

  use syntax/linguistic methods or “statistical” methods

Named entities: identify names of people, organizations, places; dates; monetary or other amounts, etc.

Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.


Indexing: Term Weights

Term frequency (tf): repeated words are strongly related to content

Inverse document frequency (idf): an uncommon term is more important

Normalization by document length

  long docs. contain many distinct words

  long docs. contain the same word many times

  term weights for long documents should be reduced

  use # bytes, # distinct words, Euclidean length, etc.

Weight = tf × idf / normalization
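The weighting recipe above can be sketched as follows. This is one possible instantiation, assuming raw term counts for tf, `log(N / df)` for idf, and Euclidean length as the normalization factor; the function name `tfidf_vectors` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        # Euclidean length normalization (fall back to 1.0 for an all-zero vector).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = [
    ["cricket", "career", "cricket", "farewell"],
    ["cricket", "stadium"],
    ["election", "career"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first document, `farewell` occurs once but in no other document, so it ends up with a higher weight than `cricket`, which occurs twice but is also found elsewhere.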


Commonly used weighting schemes

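Pivoted normalization [Singhal et al., SIGIR 96]

$$\frac{\dfrac{1 + \log(tf)}{1 + \log(\text{average } tf)} \times \log\left(\dfrac{N}{df}\right)}{(1.0 - slope) \times pivot + slope \times \#\text{ unique terms}}$$

BM25 (probabilistic model) [Robertson and Zaragoza, FTIR 2009]

$$\frac{tf \times \log\left(\dfrac{N - df + 0.5}{df + 0.5}\right)}{k_1\left((1 - b) + b\,\dfrac{dl}{avdl}\right) + tf}$$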
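Both formulas translate directly into code. The sketch below assumes natural logarithms; the default values `k1=1.2`, `b=0.75`, and `slope=0.2` are commonly quoted settings used here only as illustrative assumptions, and the function names are hypothetical.

```python
import math

def pivoted_weight(tf, avg_tf, df, n_docs, n_unique, pivot, slope=0.2):
    """Pivoted-normalization weight of one term in one document."""
    numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf)) * math.log(n_docs / df)
    return numerator / ((1.0 - slope) * pivot + slope * n_unique)

def bm25_weight(tf, df, n_docs, dl, avdl, k1=1.2, b=0.75):
    """BM25 weight of one term in a document of length dl (average length avdl)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf * idf / (k1 * ((1 - b) + b * dl / avdl) + tf)

w_once = bm25_weight(tf=1, df=10, n_docs=1000, dl=100, avdl=100)
w_thrice = bm25_weight(tf=3, df=10, n_docs=1000, dl=100, avdl=100)
w_long_doc = bm25_weight(tf=1, df=10, n_docs=1000, dl=200, avdl=100)
print(w_once, w_thrice, w_long_doc)
```

The three calls show the intended behaviour: the weight grows with tf but less than linearly, and the same tf counts for less in a longer document.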


Searching

Measure vocabulary overlap between user query and documents.

Terms: $t_1, \ldots, t_n$

$Q = (q_1, \ldots, q_n)$

$D = (d_1, \ldots, d_n)$

$Sim(Q, D) = \vec{Q} \cdot \vec{D} = \sum_i q_i \times d_i$

Use inverted list (index).

$Term_i \rightarrow (D_{i_1}, w_{i_1}), \ldots, (D_{i_k}, w_{i_k})$
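A small sketch of how the inverted list supports the dot-product similarity: only the postings of the query terms are visited, and each posting contributes `q_i × d_i` to its document's score. The document ids, weights, and function names are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to its postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(query_vector, index):
    """Accumulate Sim(Q, D) = sum_i q_i * d_i using only the query terms' postings."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

doc_vectors = {
    "D1": {"cricket": 0.8, "farewell": 0.6},
    "D2": {"cricket": 0.3, "election": 0.9},
    "D3": {"stadium": 1.0},
}
index = build_inverted_index(doc_vectors)
ranking = search({"cricket": 1.0, "farewell": 0.5}, index)
print(ranking)
```

Documents that share no term with the query (here `D3`) are never touched, which is the main reason for using an inverted index.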


3 Mining word associations

Stemming

YASS [Majumder et al., ACM TOIS 25(4), 2007]

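Stemming ≡ grouping morphologically related words together

  e.g. { analysis, analyze, analyzing }

Try clustering

  distance measure: edit distance, or

  $$D(X, Y) = \frac{n - m + 1}{m} \times \sum_{i=m}^{n} \frac{1}{2^{i-m}} \quad \text{if } m > 0, \quad \infty \text{ otherwise}$$

  (m = position of the first mismatch between X and Y, n = last position of the longer string, both counted from 0)

  clustering algorithm: hierarchical agglomerative (single link / complete link / average link)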


Stemming

0 1 2 3 4 5 6 7 8 9 10 11 12 13

a s t r o n o m i c a l l y

a s t r o n o m e r x x x x

Edit distance = 6

$D = \frac{6}{8} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{13-8}}\right) = 1.4766$

0 1 2 3 4 5 6 7 8 9

a s t o n i s h x x

a s t r o n o m e r

Edit distance = 5

$D = \frac{7}{3} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{9-3}}\right) = 4.6302$
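The two worked examples can be reproduced with a short script. It assumes, as the examples suggest, that the shorter word is padded to the length of the longer one, that m is the position of the first mismatch, and that n is the last position of the longer word (both counted from 0). The function names are illustrative; `edit_distance` is the standard Levenshtein distance, included only for comparison.

```python
def yass_distance(x, y):
    """D(X, Y) = (n - m + 1) / m * sum_{i=m}^{n} 1 / 2^(i - m); infinite if m == 0."""
    n = max(len(x), len(y)) - 1          # last position of the longer word
    m = 0                                # position of the first mismatch
    while m < min(len(x), len(y)) and x[m] == y[m]:
        m += 1
    if m == 0:
        return float("inf")
    return (n - m + 1) / m * sum(1 / 2 ** (i - m) for i in range(m, n + 1))

def edit_distance(x, y):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cx != cy)))   # substitution or match
        prev = curr
    return prev[-1]

for a, b in [("astronomically", "astronomer"), ("astonish", "astronomer")]:
    print(a, b, edit_distance(a, b), round(yass_distance(a, b), 4))
```

Edit distance ranks astonish (5) as closer to astronomer than astronomically (6), whereas D rewards the long shared prefix and ranks astronomically (1.4766) far closer than astonish (4.6302).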


Stemming

Clustering:

[Courtesy: http://espin086.files.wordpress.com/2011/02/2-variable-clustering.png]


Word Relations

Motivation:

Manual thesauri are:

  general purpose (Roget’s Thesaurus, WordNet) – difficult to use for document retrieval

  retrieval-oriented (INSPEC, MeSH) – expensive to build and maintain

Construct an automatic thesaurus (based on information about co-occurrence of words in a collection)


Word Relations

Association: if two terms co-occur within the same paragraph, they constitute an association

$\langle term_1, term_2, \text{assoc. frequency} \rangle$

Gather data about term associations over a large amount of text

Refine associations:

  Discard associations with frequency 1

  Discard terms that are associated with too many other terms (people, state, company, etc.)
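The association-gathering and refinement steps can be sketched as below. Paragraphs are assumed to be already tokenized; the cut-off `max_partners` for "too many other terms" is a hypothetical parameter, since the slides do not give a threshold.

```python
from collections import Counter
from itertools import combinations

def build_associations(paragraphs, max_partners=3):
    """Return {(term1, term2): frequency} after the two refinement steps."""
    pair_freq = Counter()
    for tokens in paragraphs:
        # Every unordered pair of distinct terms in a paragraph is one association.
        for pair in combinations(sorted(set(tokens)), 2):
            pair_freq[pair] += 1
    # Refinement 1: discard associations with frequency 1.
    pair_freq = {p: f for p, f in pair_freq.items() if f > 1}
    # Refinement 2: discard terms associated with too many other terms.
    partners = Counter()
    for t1, t2 in pair_freq:
        partners[t1] += 1
        partners[t2] += 1
    too_general = {t for t, k in partners.items() if k > max_partners}
    return {p: f for p, f in pair_freq.items() if not (set(p) & too_general)}

paragraphs = [
    ["immigration", "amnesty", "people"],
    ["immigration", "amnesty", "people"],
    ["company", "profit", "people"],
    ["company", "profit", "people"],
    ["immigration", "law"],
]
associations = build_associations(paragraphs)
print(associations)
```

In this toy collection `people` is associated with four other terms and is dropped as too general, and the pair (immigration, law) is dropped because it occurs only once.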


Word Relations

Each term is represented by a vector of associated terms

$T = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle)$

⇒ term = pseudo document

Compare query to the term vectors (instead of document vectors)

$Sim(Q, T) = \sum_i wt(q_i) \times wt(t_i)$

Most “similar” terms are added to the query

Example: 1986 US Immigration Law

  similar terms: illegal immigration, amnesty program, simpson-mazzoli
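A sketch of this query-expansion step, treating each candidate term as a pseudo document whose vector holds its associated terms. The term vectors and weights below are invented for illustration (they are not the experimental data), and `expand_query` with its parameter `k` is a hypothetical helper.

```python
def expand_query(query_weights, term_vectors, k=2):
    """Return up to k terms whose association vectors are most similar to the query."""
    scores = {}
    for term, vector in term_vectors.items():
        if term in query_weights:
            continue  # do not re-add terms that are already in the query
        # Sim(Q, T) = sum_i wt(q_i) * wt(t_i)
        scores[term] = sum(w * vector.get(q, 0.0) for q, w in query_weights.items())
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [t for t in ranked[:k] if scores[t] > 0]

# Hypothetical term vectors: term -> {associated term: weight}.
term_vectors = {
    "amnesty": {"immigration": 0.9, "law": 0.4},
    "simpson-mazzoli": {"immigration": 0.7, "law": 0.5},
    "cricket": {"stadium": 0.8},
}
query = {"immigration": 1.0, "law": 1.0}
expansion_terms = expand_query(query, term_vectors)
print(expansion_terms)
```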


Word Relations

Experimental results:

  Data: 500,000 documents (news, computer abstracts, govt. documents); 50 queries

  Baseline average precision: 37%

  Improves by 6 - 30% by using thesaurus

  2 weeks to generate association data!

  Processing time can be reduced without major loss in performance by using a subset of the document collection


4 Opinion mining

Challenges

Does a document contain an opinion? In which portion?

  sites with a review component: easy (e.g. CNET, Amazon, Epinions)

  blogs: harder

Sentiment classification

  overall (polarity) / specific

  free form / grades or stars

  quotations

Presentation

  highlighting

  aggregation

  community identification

  estimating reliability

Query classification: is the user looking for an opinion?


Opinion Mining

Feature-based opinion summarization

  Identify the features of the product that customers have expressed opinions on (called opinion features)

  For each feature, identify how many customer reviews are positive / negative

Examples:

The pictures are very clear.

Overall a fantastic, very compact, camera.

While light, it will not easily fit in pockets. (HARD!)


Opinion Mining

Feature identification

  1 POS tagging + chunking: identify nouns, verbs, adjectives, simple noun groups, verb groups

  2 Transaction creation for each sentence: item ≡ normalized nouns / noun phrases

  3 Association rule mining: all itemsets with > 1% support are candidate frequent features

  4 Feature pruning:

    keep features that have some compact occurrences

    keep singleton itemsets only if they occur enough times in isolation, e.g. manual vs. manual mode, manual setting
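Steps 2 and 3 can be sketched as follows, assuming the nouns of each sentence have already been extracted and normalized (step 1 is not shown). The code counts itemsets by brute force instead of running a full association-rule miner such as Apriori, and the pruning of step 4 is omitted. The function name, the `max_size` limit, and the toy transactions are assumptions; the example uses a 30% support threshold only because the data set is tiny.

```python
from collections import Counter
from itertools import combinations

def frequent_features(transactions, min_support=0.01, max_size=2):
    """Return {itemset: count} for itemsets whose support exceeds min_support."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    threshold = min_support * len(transactions)
    return {s: c for s, c in counts.items() if c > threshold}

# One transaction (set of normalized nouns) per review sentence.
transactions = [
    {"picture", "quality"},
    {"picture", "quality"},
    {"battery", "life"},
    {"picture"},
    {"size"},
]
candidates = frequent_features(transactions, min_support=0.3)
print(candidates)
```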


Opinion Mining

Sentiment / orientation identification

  1 Examine each sentence in the review database

  2 If it contains a frequent feature, extract all the adjective words as opinion words

  3 For each feature in the sentence, the nearby adjective is recorded as its effective opinion

  4 Look up adjective in a list of adjectives with known orientation, or consult WordNet (discard unknowns)

    adjectives arranged in bipolar structures
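The four steps can be sketched for a single sentence as below. The input is assumed to be already POS-tagged, with `"ADJ"` marking adjectives; the orientation lexicon is a tiny hypothetical seed list standing in for the adjective list or WordNet lookup, and "nearby" is interpreted as the adjective closest in token position.

```python
# Hypothetical seed lexicon: +1 = positive orientation, -1 = negative.
ORIENTATION = {"clear": +1, "fantastic": +1, "compact": +1, "blurry": -1, "heavy": -1}

def feature_opinions(tagged_sentence, features, lexicon=ORIENTATION):
    """tagged_sentence: list of (word, tag) pairs. Returns {feature: orientation}."""
    adjectives = [(i, w.lower()) for i, (w, tag) in enumerate(tagged_sentence)
                  if tag == "ADJ"]
    result = {}
    for i, (word, _) in enumerate(tagged_sentence):
        w = word.lower()
        if w in features and adjectives:
            # The adjective nearest to the feature is taken as its effective opinion.
            _, nearest = min(adjectives, key=lambda pair: abs(pair[0] - i))
            if nearest in lexicon:        # adjectives of unknown orientation are discarded
                result[w] = lexicon[nearest]
    return result

sentence = [("The", "DET"), ("pictures", "NOUN"), ("are", "VERB"),
            ("very", "ADV"), ("clear", "ADJ")]
opinions = feature_opinions(sentence, features={"pictures"})
print(opinions)
```

A sentence such as "While light, it will not easily fit in pockets" defeats this approach: the feature (size) is implicit and the only adjective is positive, which is why the slides mark it as hard.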


Datasets

Blog06 (25GB): University of Glasgow

  http://ir.dcs.gla.ac.uk/test_collections/access_to_data.htm

Congressional floor-debate transcripts

  http://www.cs.cornell.edu/home/llee/data/convote.html

Cornell movie-review datasets

  http://www.cs.cornell.edu/people/pabo/movie-review-data/


References

Untangling Text Data Mining. M. Hearst. Proceedings of ACL’99. www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html

An Introduction to Information Retrieval. Manning, Raghavan, Schütze. www-csli.stanford.edu/~schuetze/information-retrieval-book.html

Tutorial on Web Content Mining. Bing Liu. WWW 2005. www.cs.uic.edu/~liub

Web Data Mining. Bing Liu. Springer, 2006.

Opinion Mining and Sentiment Analysis. B. Pang and L. Lee. Foundations and Trends in Information Retrieval, 2(1-2), 2008.

Sentiment Analysis and Opinion Mining. Bing Liu. Morgan & Claypool, 2012. www.morganclaypool.com/doi/abs/10.2200/S00416ED1V01Y201204HLT016?journalCode=hlt