Top Banner
Text Mining Patrick Cash
29
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Text Mining Patrick Cash Outline

Text Mining

Patrick Cash

Page 2: Text Mining Patrick Cash Outline

Outline• Introduction• Data Mining• Text Mining

– Text Mining Process• Text Mining Applications • Challenges in Text Mining• Conclusion

Page 3: Text Mining Patrick Cash Outline

Introduction• Why Text Mining?

– Massive amount of new information being created• World’s data doubles every 18 months (Jacques Vallee

Ph.D)– 80-90% of all data is held in various unstructured

formats– Useful information can be derived from this

unstructured data

Page 4: Text Mining Patrick Cash Outline

Introduction• Intelligence in text mining is based on NLP

techniques• Can be used as a preprocessing technique to

harvest data and get an initial understanding of the patterns that exist in the data

• Often seen as a special case of data mining but there is an important difference

Page 5: Text Mining Patrick Cash Outline

Data Mining• Extraction of interesting (non-trivial, implicit,

previously unknown and potentially useful) information (or patterns) from data

• Data Mining: a misnomer?– Knowledge discovery, knowledge extraction,

data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Page 6: Text Mining Patrick Cash Outline

Data Mining• Descriptive: understanding underlying processes

or behavior – Patterns and trends– Clustering

• Predictive: predict an unseen or unmeasured value – Future projections and missing values– Classification

Page 7: Text Mining Patrick Cash Outline

SEMMA• Search

– Input data source, data sampling, partitioning• Explore

– Patterns, trends, outliers, visualization• Modify

– Clustering, feature reduction• Model

– Regression, tree, network• Assess

– Report, pass to next step in analysis

Page 8: Text Mining Patrick Cash Outline

Search vs. Discover

Data Mining

Text Mining

DataRetrieval

InformationRetrieval

Search(goal-oriented)

Discover(opportunistic)

StructuredData

UnstructuredData (Text)

Page 9: Text Mining Patrick Cash Outline

Text Mining• Many different by similar definitions• Text Mining = Statistical NLP + Data Mining• Text Mining is a process that employs

– Statistical NLP: a set of algorithms for converting unstructured text into structured data objects

– Data Mining: the quantitative methods that analyze these data objects to discover knowledge

Page 10: Text Mining Patrick Cash Outline

Text Mining• Descriptive

– Pattern and trend analysis– Knowledge base creation– Summarization– Visualization

• Predictive– Classification – Question answering– Pattern and trend forecasting

Page 11: Text Mining Patrick Cash Outline

Text Mining Techniques• Information Retrieval

– Indexing and retrieval of textual documents• Information Extraction

– Extraction of partial knowledge in the text• Web Mining

– Indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building)

• Clustering– Generating collections of similar text documents

Page 12: Text Mining Patrick Cash Outline

Text Mining Process

Page 13: Text Mining Patrick Cash Outline

Text Mining Process• Text Preprocessing

– Syntactic/Semantic text analysis • Features Generation

– Bag of words • Features Selection

– Simple counting– Statistics

• Text/Data Mining– Classification (Supervised) / Clustering (Unsupervised)

• Analyzing results

Page 14: Text Mining Patrick Cash Outline

Text Mining Process• Text preprocessing

– Part Of Speech (POS) tagging• Find the corresponding POS for each word.

– Word sense disambiguation• Context based or proximity based

– Parsing• Generates a parse tree (graph) for each sentence• Each sentence is a stand alone graph

Page 15: Text Mining Patrick Cash Outline

Text Mining Process• Feature Generation

– Text document is represented by the words it contains (and their occurrences)

• Order of words is not that important for certain applications (Bag of words)

– Stemming: identifies a word by its root• Reduce dimensionality

– Stop words: The common words unlikely to help text mining

Page 16: Text Mining Patrick Cash Outline

Text Mining Process• Feature Selection

– Reduce dimensionality• Learners have difficulty addressing tasks with high

dimensionality• Only interested in the information relevant to what is being

analyzed – Irrelevant features

• Not all features help

Page 17: Text Mining Patrick Cash Outline

Text Mining Process• Text Mining: Classification definition

– Given: a collection of labeled records (training set)• Each record contains a set of features (attributes), and

the true class (label)– Find: a model for the class as a function of the

values of the features– Goal: previously unseen records should be assigned

a class as accurately as possible

Page 18: Text Mining Patrick Cash Outline

Text Mining Process• Text Mining: Clustering definition

– Given: a set of documents and a similarity measure among documents

– Find: clusters such that:• Documents in one cluster are more similar to one another• Documents in separate clusters are less similar to one

another– Goal:

• Finding a correct set of documents clusters

Page 19: Text Mining Patrick Cash Outline

Text Mining Process• Supervised learning (classification)

– The training data is labeled indicating the class– New data is classified based on the training set– Correct classification: The known label of test sample is

identical with the class result from the classification model• Unsupervised learning (clustering)

– The class labels of training data are unknown– Establish the existence of classes or clusters in the data– Good clustering method: high intra-cluster similarity

Page 20: Text Mining Patrick Cash Outline

Text Mining Process• Analyzing the results

– Are the results satisfactory?– Does more mining need to be done?– Does a different technique need to be used?– Does another iteration of one or more steps in the

process need to be done?

Page 21: Text Mining Patrick Cash Outline

Text Mining Applications• Bioinformatics

– Genomics research (DNA sequencing)• Medical

– Mining medical records to improve care• Business intelligence

– Risk analysis• Research

– Analyzing research publications• Basically anywhere there is large amount of

unstructured text data

Page 22: Text Mining Patrick Cash Outline

Text Mining Application• Classification (Categorization)

– Spam detection, Document organization• Clustering

– Trend analysis, Topic identification• Web Mining

– Trend analysis, Opinion mining, Ontology creation• Classical NLP

– Text summarization, Question answering, Information extraction

Page 23: Text Mining Patrick Cash Outline

Text Mining Application• Smaller scale applications

– Relationship Analysis• If A is related to B, and B is related to C, there is

potentially a relationship between A and C.– Trend analysis

• Occurrences of A peak in October.– Mixed applications

• Co-occurrence of A together with B peak in November. (Shopping Cart Analysis)

Page 24: Text Mining Patrick Cash Outline

Challenges in Text Mining• Remember

– Text Mining = Statistical NLP + Data Mining• Text mining suffers from the same challenges as

Statistical NLP and Data Mining• Add in the additional difficulties associated with

the data not being structured

Page 25: Text Mining Patrick Cash Outline

Challenges in Text Mining• Statistical NLP

– Ambiguity– Context– Tokenization \ Sentence Detection– Stemming– POS Tagging– Coreference Resolution

Page 26: Text Mining Patrick Cash Outline

Challenges in Text Mining• Data Mining

– Data preprocessing• Ability to process the data• Massive amounts of data• Determining and extracting information of interest

– Availability of NLP tools to work with data mining– Discovery process

• No training data available

Page 27: Text Mining Patrick Cash Outline

Conclusion• Text Mining = Statistical NLP + Data Mining

– Culmination of all the NLP techniques covered in this course

• Growing research area that will be important as information growth (and need to extract knowledge from that information) increases

Page 28: Text Mining Patrick Cash Outline

References• Even-Zohar, Y. Introduction to Text Mining.

Supercomputing, 2002. http://alg.ncsa.uiuc.edu/do/documents/presentations

• Treloar, N AvaQuest Inc. www.knowledgetechnologies.net/proceedings/presentations/treloar/nathantreloar.ppt

• Witte, R. Faculty of Informatics Institute for Program Structures and Data Organization (IPD) http://www.edbt2006.de/edbt-share/IntroductionToTextMining.pdf

Page 29: Text Mining Patrick Cash Outline

Questions ?