Text Mining Patrick Cash Outline

Text Mining

Patrick Cash

Outline• Introduction• Data Mining• Text Mining

– Text Mining Process• Text Mining Applications • Challenges in Text Mining• Conclusion

Introduction• Why Text Mining?

– Massive amount of new information being created• World’s data doubles every 18 months (Jacques Vallee

Ph.D)– 80-90% of all data is held in various unstructured

formats– Useful information can be derived from this

unstructured data

Introduction• Intelligence in text mining is based on NLP

techniques• Can be used as a preprocessing technique to

harvest data and get an initial understanding of the patterns that exist in the data

• Often seen as a special case of data mining but there is an important difference

Data Mining• Extraction of interesting (non-trivial, implicit,

previously unknown and potentially useful) information (or patterns) from data

• Data Mining: a misnomer?– Knowledge discovery, knowledge extraction,

data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Data Mining• Descriptive: understanding underlying processes

or behavior – Patterns and trends– Clustering

• Predictive: predict an unseen or unmeasured value – Future projections and missing values– Classification

SEMMA• Search

– Input data source, data sampling, partitioning• Explore

– Patterns, trends, outliers, visualization• Modify

– Clustering, feature reduction• Model

– Regression, tree, network• Assess

– Report, pass to next step in analysis

Search vs. Discover

Data Mining

Text Mining

DataRetrieval

InformationRetrieval

Search(goal-oriented)

Discover(opportunistic)

StructuredData

UnstructuredData (Text)

Text Mining• Many different by similar definitions• Text Mining = Statistical NLP + Data Mining• Text Mining is a process that employs

– Statistical NLP: a set of algorithms for converting unstructured text into structured data objects

– Data Mining: the quantitative methods that analyze these data objects to discover knowledge

Text Mining• Descriptive

– Pattern and trend analysis– Knowledge base creation– Summarization– Visualization

• Predictive– Classification – Question answering– Pattern and trend forecasting

Text Mining Techniques• Information Retrieval

– Indexing and retrieval of textual documents• Information Extraction

– Extraction of partial knowledge in the text• Web Mining

– Indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building)

• Clustering– Generating collections of similar text documents

Text Mining Process

Text Mining Process• Text Preprocessing

– Syntactic/Semantic text analysis • Features Generation

– Bag of words • Features Selection

– Simple counting– Statistics

• Text/Data Mining– Classification (Supervised) / Clustering (Unsupervised)

• Analyzing results

Text Mining Process• Text preprocessing

– Part Of Speech (POS) tagging• Find the corresponding POS for each word.

– Word sense disambiguation• Context based or proximity based

– Parsing• Generates a parse tree (graph) for each sentence• Each sentence is a stand alone graph

Text Mining Process• Feature Generation

– Text document is represented by the words it contains (and their occurrences)

• Order of words is not that important for certain applications (Bag of words)

– Stemming: identifies a word by its root• Reduce dimensionality

– Stop words: The common words unlikely to help text mining

Text Mining Process• Feature Selection

– Reduce dimensionality• Learners have difficulty addressing tasks with high

dimensionality• Only interested in the information relevant to what is being

analyzed – Irrelevant features

• Not all features help

Text Mining Process• Text Mining: Classification definition

– Given: a collection of labeled records (training set)• Each record contains a set of features (attributes), and

the true class (label)– Find: a model for the class as a function of the

values of the features– Goal: previously unseen records should be assigned

a class as accurately as possible

Text Mining Process• Text Mining: Clustering definition

– Given: a set of documents and a similarity measure among documents

– Find: clusters such that:• Documents in one cluster are more similar to one another• Documents in separate clusters are less similar to one

another– Goal:

• Finding a correct set of documents clusters

Text Mining Process• Supervised learning (classification)

– The training data is labeled indicating the class– New data is classified based on the training set– Correct classification: The known label of test sample is

identical with the class result from the classification model• Unsupervised learning (clustering)

– The class labels of training data are unknown– Establish the existence of classes or clusters in the data– Good clustering method: high intra-cluster similarity

Text Mining Process• Analyzing the results

– Are the results satisfactory?– Does more mining need to be done?– Does a different technique need to be used?– Does another iteration of one or more steps in the

process need to be done?

Text Mining Applications• Bioinformatics

– Genomics research (DNA sequencing)• Medical

– Mining medical records to improve care• Business intelligence

– Risk analysis• Research

– Analyzing research publications• Basically anywhere there is large amount of

unstructured text data

Text Mining Application• Classification (Categorization)

– Spam detection, Document organization• Clustering

– Trend analysis, Topic identification• Web Mining

– Trend analysis, Opinion mining, Ontology creation• Classical NLP

– Text summarization, Question answering, Information extraction

Text Mining Application• Smaller scale applications

– Relationship Analysis• If A is related to B, and B is related to C, there is

potentially a relationship between A and C.– Trend analysis

• Occurrences of A peak in October.– Mixed applications

• Co-occurrence of A together with B peak in November. (Shopping Cart Analysis)

Challenges in Text Mining• Remember

– Text Mining = Statistical NLP + Data Mining• Text mining suffers from the same challenges as

Statistical NLP and Data Mining• Add in the additional difficulties associated with

the data not being structured

Challenges in Text Mining• Statistical NLP

– Ambiguity– Context– Tokenization \ Sentence Detection– Stemming– POS Tagging– Coreference Resolution

Challenges in Text Mining• Data Mining

– Data preprocessing• Ability to process the data• Massive amounts of data• Determining and extracting information of interest

– Availability of NLP tools to work with data mining– Discovery process

• No training data available

Conclusion• Text Mining = Statistical NLP + Data Mining

– Culmination of all the NLP techniques covered in this course

• Growing research area that will be important as information growth (and need to extract knowledge from that information) increases

References• Even-Zohar, Y. Introduction to Text Mining.

Supercomputing, 2002. http://alg.ncsa.uiuc.edu/do/documents/presentations

• Treloar, N AvaQuest Inc. www.knowledgetechnologies.net/proceedings/presentations/treloar/nathantreloar.ppt

• Witte, R. Faculty of Informatics Institute for Program Structures and Data Organization (IPD) http://www.edbt2006.de/edbt-share/IntroductionToTextMining.pdf

Questions ?