Text Mining IS698 Min Song
Text Mining IS698 Min Song. The Needs: -Find people as well as documents that can address my information need. -Promote collaboration and knowledge.

Dec 21, 2015

Page 1: Text Mining

IS698
Min Song

Page 2: Example 1: KM People Finder

The Needs:
- Find people as well as documents that can address my information need.
- Promote collaboration and knowledge sharing.
- Leverage the existing information access system.

The Information Sources:
- Email, groupware, online reports, …

Page 3: Example 1: Simple KM People Finder

[Diagram: Query → Search or Navigation System → Relevant Docs → Name Extractor (with Authority List) → Ranked People Names]

Page 4: Example 1: KM People Finder

Page 5: Text Mining Definition

Many definitions exist in the literature:
- An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.
- “The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.”

Page 6: Text Mining Definition

What is “previously unknown” information?

Strict definition:
- Information that not even the writer knows.
- e.g., discovering a new method for hair growth that is described as a side effect of a different procedure.

Lenient definition:
- Rediscover the information that the author encoded in the text.
- e.g., automatically extracting a product’s name from a web page.

Page 7: Outline

- Text mining applications
- Text characteristics
- Text mining process
- Learning methods

Page 8: Text Mining Applications

- Marketing: Discover distinct groups of potential buyers according to a user's text-based profile, e.g., Amazon.
- Industry: Identify groups of competitors' web pages, e.g., competing products and their prices.
- Job seeking: Identify parameters in searching for jobs, e.g., www.flipdog.com

Page 9: Text Mining Methods

- Information Retrieval: Indexing and retrieval of textual documents.
- Information Extraction: Extraction of partial knowledge in the text.
- Web Mining: Indexing and retrieval of textual documents and extraction of partial knowledge using the web.
- Clustering: Generating collections of similar text documents.

Page 10: Information Retrieval

Given:
- A source of textual documents
- A user query (text based)

• Find:
• A (ranked) set of documents that are relevant to the query

[Diagram: Query (e.g., Spam / Text) and Documents source → IR System → Ranked Documents]
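As a rough illustration of the Given/Find formulation on this slide, the sketch below ranks a tiny document collection against a text query by counting occurrences of the query terms. The sample documents, the `rank_documents` name, and the term-count score are assumptions made for illustration only; real IR systems use more refined weighting.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def rank_documents(query, documents):
    """Return (score, doc_id) pairs sorted from most to least relevant.

    The score is the number of query-term occurrences in the document;
    documents with a zero score are left out of the ranking.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical document source.
documents = {
    "doc1": "text mining discovers knowledge in text",
    "doc2": "spam filtering is a text classification task",
    "doc3": "football results from the weekend",
}

ranking = rank_documents("spam text", documents)
print(ranking)
```

Here `doc3` shares no term with the query and is dropped, while the other two documents are returned in ranked order.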

Page 11: Intelligent Information Retrieval

- Meaning of words
  - Synonyms: “buy” / “purchase”
  - Ambiguity: “bat” (baseball vs. mammal)
- Order of words in the query
  - hot dog stand in the amusement park
  - hot amusement stand in the dog park
- User dependency for the data
  - direct feedback
  - indirect feedback
- Authority of the source
  - IBM is more likely to be an authorized source than my distant second cousin
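The word-order point can be made concrete with a small sketch. It assumes a plain bag-of-words representation (word counts with order discarded) and shows that the two example queries become indistinguishable, while adjacent word pairs (bigrams) still tell them apart. The variable names are illustrative.

```python
from collections import Counter

query_a = "hot dog stand in the amusement park"
query_b = "hot amusement stand in the dog park"

words_a = query_a.split()
words_b = query_b.split()

# Bag of words: counts only, order discarded.
same_bag = Counter(words_a) == Counter(words_b)

# Bigrams keep some of the order information.
bigrams_a = set(zip(words_a, words_a[1:]))
bigrams_b = set(zip(words_b, words_b[1:]))
shared_bigrams = bigrams_a & bigrams_b

print(same_bag)                # the two bags are identical
print(sorted(shared_bigrams))  # only the pairs around "stand in the" overlap
```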

Page 12: What is Information Extraction?

Given:
- A source of textual documents
- A well-defined, limited query (text based)

Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output it in a predetermined format

Page 13: Information Extraction: Example

Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.

Incident Date: 19 Apr 89
Incident Type: Bombing
Perpetrator Individual ID: “urban guerillas”
Human Target Name: “Roberto Garcia Alvarado”
...
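To give a feel for how a slot in such a template might be filled, here is a minimal sketch that applies one hand-written regular expression to a sentence from the example. The pattern and the slot names (`target`, `age`, `weapon`, `perpetrator`) are assumptions invented for this illustration; they are not the method used to produce the template above, and a real extraction system would need far more robust patterns or a learned model.

```python
import re

sentence = (
    "Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas "
    "on his vehicle exploded as it came to a halt at an intersection in "
    "downtown San Salvador."
)

# Hypothetical pattern: "<Capitalized Name>, <age>, was killed when a <weapon>
# placed by <perpetrator> on ..."
pattern = re.compile(
    r"(?P<target>[A-Z][a-z]+(?: [A-Z][a-z]+)*), (?P<age>\d+), was killed "
    r"when a (?P<weapon>\w+) placed by (?P<perpetrator>[a-z ]+?) on "
)

match = pattern.search(sentence)
# Relevant pieces are kept; everything else in the sentence is ignored.
template = match.groupdict() if match else {}

for slot, value in template.items():
    print(f"{slot}: {value}")
```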

Page 14: What is Information Extraction?

[Diagram: Documents source, Query 1 (e.g., job title), Query 2 (e.g., salary) → Extraction System → Relevant Info 1, Relevant Info 2, Relevant Info 3 → Combine Query Results → Ranked Documents]

Page 15: Why Mine the Web?

- Enormous wealth of textual information on the Web.
  - Book/CD/Video stores (e.g., Amazon)
  - Restaurant information (e.g., Zagats)
  - Car prices (e.g., Carpoint)
- Lots of data on user access patterns
  - Web logs contain sequences of URLs accessed by users
- Possible to retrieve “previously unknown” information
  - People who ski also frequently break their leg.
  - Restaurants that serve seafood in California are likely to be outside San Francisco.

Page 16: Mining the Web

[Diagram: Web Spider → Documents source; Query → IR / IE System → Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, . .)]

Page 17: Unique Features of the Web

- The Web is a huge collection of documents, many of which contain:
  - Hyper-link information
  - Access and usage information
- The Web is very dynamic
  - Web pages are constantly being generated (and removed)

Challenge: Develop new Web mining algorithms to . . .
• Exploit hyper-links and access patterns.
• Be adaptable to its documents source.

Page 18: Intelligent Web Search

Combine the intelligent IR tools:
- meaning of words
- order of words in the query
- user dependency for the data
- authority of the source

With the unique web features:
- retrieve Hyper-link information
- utilize Hyper-links as input

Page 19: What is Clustering?

Given:
- A source of textual documents
- Similarity measure, e.g., how many words are common in these documents

• Find:
• Several clusters of documents that are relevant to each other

[Diagram: Documents source and Similarity measure → Clustering System → groups of Docs]
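The similarity measure mentioned on this slide, the number of words two documents have in common, can be sketched in a few lines. The function name and the sample documents below are illustrative assumptions.

```python
def common_word_count(doc_a, doc_b):
    """Similarity = number of distinct words that appear in both documents."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    return len(words_a & words_b)


# Hypothetical documents.
docs = [
    "english football fans travel to italy",
    "italian football fans were cheering",
    "an average salesman earns a salary",
]

# Pairwise similarities: the two football documents share "football" and "fans".
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, common_word_count(docs[i], docs[j]))
```

A clustering system would group the first two documents together before either is grouped with the third, since their similarity is higher.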

Page 20: Outline

- Text mining applications
- Text characteristics
- Text mining process
- Learning methods

Page 21: Text characteristics: Outline

- Large textual data base
- High dimensionality
- Several input modes
- Dependency
- Ambiguity
- Noisy data
- Not well structured text

Page 22: Text characteristics

- Large textual data base
  - Efficiency consideration
  - over 2,000,000,000 web pages
  - almost all publications are also in electronic form
- High dimensionality (Sparse input)
  - Consider each word/phrase as a dimension
- Several input modes
  - e.g., Web mining: information about the user is generated by semantics, browse pattern and outside knowledge base.
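To make "each word as a dimension" and "sparse input" concrete, the sketch below builds bag-of-words vectors for two short documents. Every distinct word in the collection becomes one dimension, and each document fills only a fraction of them. The sample sentences and variable names are assumptions for illustration.

```python
from collections import Counter

docs = [
    "the game in london was horrific",
    "rome is taking the lead in the football league",
]

# One dimension per distinct word in the whole collection.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Sparse form: store only the words that actually occur in a document.
sparse = [Counter(doc.split()) for doc in docs]

# Dense form: one count per vocabulary word, mostly zeros on large collections.
dense = [[counts[word] for word in vocabulary] for counts in sparse]

nonzero = [sum(1 for value in row if value) for row in dense]
print(len(vocabulary), nonzero)
```

With only two sentences the vectors are already partly empty; on a real collection the vocabulary grows to many thousands of dimensions while each document still uses only a handful.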

Page 23: Text characteristics

- Dependency
  - relevant information is a complex conjunction of words/phrases
  - e.g., Document categorization. Pronoun disambiguation.
- Ambiguity
  - Word ambiguity
    - Pronouns (he, she …)
    - “buy”, “purchase”
  - Semantic ambiguity
    - The king saw the rabbit with his glasses.

Page 24: Text characteristics

- Noisy data
  - Example: Spelling mistakes
- Not well structured text
  - Chat rooms
    - “r u available ?”
    - “Hey whazzzzzz up”
  - Speech

Page 25: Outline

- Text mining applications
- Text characteristics
- Text mining process
- Learning methods

Page 26: Text mining process

Page 27: Text mining process

- Text preprocessing
  - Syntactic/Semantic text analysis
- Features Generation
  - Bag of words
- Features Selection
  - Simple counting
  - Statistics
- Text/Data Mining
  - Classification - Supervised learning
  - Clustering - Unsupervised learning
- Analyzing results

Page 28: Syntactic / Semantic text analysis

- Part Of Speech (POS) tagging
  - Find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
  - ~98% accurate.
- Word sense disambiguation
  - Context based or proximity based
  - Very accurate
- Parsing
  - Generates a parse tree (graph) for each sentence
  - Each sentence is a stand-alone graph

Page 29: Text Mining: Classification definition

- Given: a collection of labeled records (training set)
  - Each record contains a set of features (attributes), and the true class (label)
- Find: a model for the class as a function of the values of the features
- Goal: previously unseen records should be assigned a class as accurately as possible
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Page 30: Text Mining: Clustering definition

- Given: a set of documents and a similarity measure among documents
- Find: clusters such that:
  - Documents in one cluster are more similar to one another
  - Documents in separate clusters are less similar to one another
- Goal:
  - Finding a correct set of documents

Similarity Measures:
• Euclidean Distance if attributes are continuous
• Other Problem-specific Measures
  • e.g., how many words are common in these documents
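For continuous attributes, the Euclidean distance named on this slide can be computed as below. This is a minimal sketch; the function name is an assumption, and a smaller distance is read as a higher similarity.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of attributes."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 3-4-5 right triangle
```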

Page 31: Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Page 32: Evaluation: What Is Good Classification?

- Correct classification: The known label of a test sample is identical to the class result from the classification model
- Accuracy ratio: the percentage of test set samples that are correctly classified by the model
- A distance measure between classes can be used
  - e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”.
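Both evaluation ideas on this slide can be sketched on a toy test set: the accuracy ratio counts exact matches, while a class distance penalizes "far" mistakes more than "near" ones. The labels, predictions, and distance values below are invented for illustration.

```python
# Hypothetical known labels and model outputs for four test documents.
true_labels = ["football", "football", "basketball", "crime"]
predicted = ["football", "basketball", "basketball", "football"]

# Accuracy ratio: fraction of test samples classified correctly.
correct = sum(t == p for t, p in zip(true_labels, predicted))
accuracy = correct / len(true_labels)

# Assumed distances between classes: the two sports are close, crime is far.
class_distance = {
    frozenset(("football", "basketball")): 1,
    frozenset(("football", "crime")): 3,
    frozenset(("basketball", "crime")): 3,
}


def error_cost(true_label, predicted_label):
    """Zero for a correct prediction, otherwise the distance between the classes."""
    if true_label == predicted_label:
        return 0
    return class_distance[frozenset((true_label, predicted_label))]


mean_cost = sum(error_cost(t, p) for t, p in zip(true_labels, predicted)) / len(true_labels)
print(accuracy, mean_cost)
```

Accuracy treats both mistakes alike, whereas the distance-based cost charges the crime/football confusion three times as much as the football/basketball one.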

Page 33: Evaluation: What Is Good Clustering?

- Good clustering method: produces high quality clusters with . . .
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
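One way to check the intra-class versus inter-class criterion is to average a pairwise similarity within clusters and across clusters. The sketch below assumes Jaccard similarity over word sets and a hand-made clustering; both are illustrative choices, not part of the slide.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Shared words divided by all words appearing in either document."""
    return len(a & b) / len(a | b)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical clustering result: each document is a set of words.
clusters = {
    "sports": [{"football", "fan", "game"}, {"football", "game", "league"}],
    "finance": [{"salary", "income", "tax"}, {"income", "tax", "bank"}],
}

# Intra-class: pairs of documents inside the same cluster.
intra = mean(
    jaccard(a, b)
    for docs in clusters.values()
    for a, b in combinations(docs, 2)
)

# Inter-class: pairs of documents taken from two different clusters.
inter = mean(
    jaccard(a, b)
    for (_, docs_1), (_, docs_2) in combinations(clusters.items(), 2)
    for a, b in product(docs_1, docs_2)
)

print(intra, inter)
```

A good clustering shows a high first number and a low second one, as this toy example does.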

Page 34: Outline

- Text mining applications
- Text characteristics
- Text mining process
- Learning methods
  - Classification
  - Clustering

Page 35: Classification: An Example

Training Set (Country: categorical; Marital Status: categorical; Income: continuous; Hooligan: class)

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married                 Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England                  20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

Test Set

Country  Marital Status  Income  Hooligan
England  Single          75K     ?
Turkey   Married         50K     ?
England  Married         150K    ?
         Divorced        90K     ?
         Single          40K     ?
Itlay    Married         80K     ?

[Diagram: Training Set → Learn Classifier → Model; the Model is applied to the Test Set]

Page 36: Text Classification: An Example

Training Set (Text: text; Hooligan: class)

Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester City is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Test Set

Text                                              Hooligan
A Danish football fan                             ?
Turkey is playing vs. France. The Turkish fans …  ?

[Diagram: Training Set → Learn Classifier → Model; the Model is applied to the Test Set]

Page 37: Classification Techniques

- Instance-Based Methods
- Decision trees
- Neural networks
- Bayesian classification

Page 38: Instance-based Methods

- Instance-based (memory based) learning
  - Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
- k-nearest neighbor approach
  - Instances (Examples) are represented as points in a Euclidean space

Page 39: Text Examples in Euclidean Space

[Two plots, each with axes “football” and “Italian”, one per example text:]
- The English football fan is a hooligan. . .
- Similar to his English equivalent, the Italian football fan is a hooligan. . .

Page 40: K-Nearest Neighbor Algorithm

- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance

• The k-NN returns the most common value among the k nearest training examples
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: two scatter plots of training examples labeled + and _; the first includes a query point marked ?]
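The k-NN rule described on this slide can be sketched as follows: sort the stored training examples by Euclidean distance to the query and return the most common label among the k closest. The training points and the `knn_classify` name are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(query, training, k):
    """Return the most common label among the k training points nearest to query.

    training is a list of (point, label) pairs; nothing is learned in advance,
    all work happens when a new instance must be classified ("lazy evaluation").
    """
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training examples with "+" and "-" labels.
training = [
    ((1, 1), "+"), ((2, 1), "+"), ((1, 2), "+"),
    ((6, 6), "-"), ((7, 6), "-"), ((6, 7), "-"),
]

print(knn_classify((2, 2), training, k=3))
print(knn_classify((6, 5), training, k=3))
```

With k = 1 the same function reproduces the 1-NN rule whose decision surface is the Voronoi diagram mentioned above.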

Page 41: Classification Techniques

- Instance-Based Methods
- Decision trees
- Neural networks
- Bayesian classification

Page 42: Decision Tree: An Example

Training Set (Country: categorical; Marital Status: categorical; Income: continuous; Hooligan: class)

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married         100K    Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England  Divorced        20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

[Diagram: decision tree with splitting attributes English (branches Yes / No), MarSt (branches Married / Single, Divorced) and Income (branches > 80K / < 80K); leaves are labeled Yes, NO, YES and NO]

The splitting attribute at a node is determined based on a specific attribute selection algorithm.

Page 43: Decision Tree: A Text Example

Training Set (Text: text; Hooligan: class)

Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester City is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

[Diagram: decision tree with splitting attributes English (branches Yes / No), MarSt (branches Married / Single, Divorced) and Income (branches > 80K / < 80K); leaves are labeled Yes, NO, YES and NO]

The splitting attribute at a node is determined based on a specific attribute selection algorithm.

Page 44: Classification by DT Induction

- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases:
  - Tree construction
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of decision tree: Classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
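Classifying an unknown sample amounts to walking from the root to a leaf, following the branch that matches each test outcome. The sketch below encodes a tree as nested dictionaries. The tree itself is only one possible reading of the English / MarSt / Income diagram from the earlier decision tree example (its layout was lost in this transcript), and the attribute encoding in `to_sample` is a hypothetical helper added for illustration.

```python
# Internal nodes are dicts holding an attribute test and its branches;
# leaves are plain class labels. The structure is an assumed reading of the slide.
tree = {
    "attribute": "english",
    "branches": {
        "yes": "Yes",
        "no": {
            "attribute": "marital_status",
            "branches": {
                "married": "No",
                "single_or_divorced": {
                    "attribute": "income_over_80k",
                    "branches": {"yes": "Yes", "no": "No"},
                },
            },
        },
    },
}


def classify(sample, node):
    """Follow test outcomes from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node


def to_sample(country, marital_status, income_k):
    """Hypothetical helper: turn a raw record into the tree's test outcomes."""
    return {
        "english": "yes" if country == "England" else "no",
        "marital_status": "married" if marital_status == "Married" else "single_or_divorced",
        "income_over_80k": "yes" if income_k > 80 else "no",
    }


print(classify(to_sample("Italy", "Single", 85), tree))
print(classify(to_sample("Denmark", "Single", 50), tree))
```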

Page 45: Clustering Techniques

- Partitioning Methods
- Hierarchical Methods

Page 46: Partitioning Algorithms

- Partitioning method: Construct a partition of n documents into a set of k clusters
- Given: a set of documents and the number k
- Find: a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimal: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means: Each cluster is represented by the center of the cluster

Page 47: The K-means Clustering Method

The k-means algorithm is implemented in 4 steps:

1. Partition objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
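The four steps can be followed almost literally in code. The sketch below is a minimal version under a few assumptions that the slide does not specify: the initial partition is a round-robin split, distances are Euclidean, a cluster that becomes empty keeps its previous centroid, and an iteration cap guards against endless loops. The sample points are made up.

```python
import math


def k_means(points, k, max_iterations=100):
    """Return (assignment, centroids) for a list of equal-length coordinate tuples."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    # Step 1: partition objects into k nonempty subsets (round-robin, an arbitrary choice).
    assignment = [index % k for index in range(len(points))]
    centroids = [None] * k

    for _ in range(max_iterations):
        # Step 2: compute the centroid (mean point) of each current cluster.
        for cluster in range(k):
            members = [p for p, a in zip(points, assignment) if a == cluster]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[cluster] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )

        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]

        # Step 4: stop when no assignment changed, otherwise repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment

    return assignment, centroids


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

On this toy data the poor initial split is corrected after one reassignment, and the loop stops on the next pass when nothing changes.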

Page 48: The K-means Clustering: Example

[Figure: a series of scatter plots, each with x and y axes running from 0 to 10]

Page 49: Clustering Techniques

- Partitioning Methods
- Hierarchical Methods

Page 50: Hierarchical Clustering

- Agglomerative:
  - Start with each document being a single cluster.
  - Eventually all documents belong to the same cluster.
- Divisive:
  - Start with all documents belonging to the same cluster.
  - Eventually each node forms a cluster on its own.
- Does not require the number of clusters k in advance
- Needs a termination condition
  - The final mode in both Agglomerative and Divisive is of no use.

Page 51: Hierarchical Clustering: Example

[Diagram: five objects a, b, c, d, e.
Agglomerative (Step 0 → Step 4): Step 0 has a, b, c, d, e as single clusters; Step 1 forms a b; Step 2 forms d e; Step 3 forms c d e; Step 4 forms a b c d e.
Divisive runs the same sequence in the opposite direction (Step 4 → Step 0).]
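The agglomerative direction of this example can be reproduced with a short sketch. It assumes single-link distance (the distance between the closest members of two clusters) and invents 1-D coordinates for a to e so that the merges happen in the order shown in the diagram; neither choice comes from the slide.

```python
def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge order.

    points maps an object name to a 1-D coordinate.
    """
    def single_link(cluster_1, cluster_2):
        # Distance between the closest pair of members (single-link assumption).
        return min(abs(points[a] - points[b]) for a in cluster_1 for b in cluster_2)

    # Step 0: every object is a cluster of its own.
    clusters = [frozenset([name]) for name in sorted(points)]
    merges = []

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        _, i, j = min(
            (single_link(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merged = clusters[i] | clusters[j]
        clusters = [c for index, c in enumerate(clusters) if index not in (i, j)]
        clusters.append(merged)
        merges.append("".join(sorted(merged)))

    return merges


# Hypothetical coordinates chosen to match the merge order in the diagram.
points = {"a": 1.0, "b": 1.5, "c": 6.0, "d": 8.0, "e": 9.0}
merges = agglomerative(points)
print(merges)
```

Stopping the loop early, for example once a target number of clusters is reached, is one form of the termination condition that hierarchical methods need.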

Page 52: A Dendrogram: Hierarchical Clustering

• Dendrogram: Decomposes data objects into several levels of nested partitioning (tree of clusters).
• Clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Page 53: Demo

Page 54: Commercial Tools

- IBM Intelligent Miner for Text
- Semio Map
- InXight LinguistX / ThingFinder
- LexiQuest
- ClearForest
- Teragram
- SRA NetOwl Extractor
- Autonomy

Page 55: Summary

- Text is tricky to process, but “ok” results are easily achieved
- There exist several text mining systems
  - e.g., D2K - Data to Knowledge
  - http://www.ncsa.uiuc.edu/Divisions/DMV/ALG/
- Additional intelligence can be integrated with text mining
  - One may play with any phase of the text mining process

Page 56: Summary

- There are many other scientific and statistical text mining methods developed but not covered in this talk.
  - http://www.cs.utexas.edu/users/pebronia/text-mining/
  - http://filebox.vt.edu/users/wfan/text_mining.html
- Also, it is important to study the theoretical foundations of data mining.
  - Data Mining Concepts and Techniques / J.Han & M.Kamber
  - Machine Learning / T.Mitchell