Text Mining
Fernando Gama
Information Systems undergraduate student - UFPA
Undergraduate research (IC) scholarship holder (Instituto Tecnológico Vale - ITV)
What will we cover?
- Introduction
- Approach KDT
- Categorization
- Classification
- Mining Task
- Experiments
- A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
- Discovery Co-relations on Research Topics and Authors from the PubMed Database
- Conclusion
- Credits
Introduction
Concept-Based Knowledge Discovery in Texts Extracted from the Web
- Statistical techniques are applied to concepts.
- The aim is to find patterns in concepts.
- Concepts are identified in texts by applying a categorization algorithm.
- The classification task is paired with the categorization algorithm to produce the concept definitions.
Introduction
The Web scenario:
- a growing collection of texts;
- people want useful information and want to extract it quickly and at low cost;
- problem: information overload!
KDT (Knowledge Discovery in Texts): keywords must be assigned to the texts beforehand,
- manually by humans, or
- by software tools.
Keywords are used to categorize documents and to find associations.
The most frequent terms become keywords (attributes).
Working directly with terms leads to the vocabulary problem.
Introduction
Goals:
- to minimize the vocabulary problem;
- to minimize the effort necessary to extract useful information;
- to make the discovery process work over concepts extracted from texts;
- to combine the categorization task and the mining task.
Categorization: identifies the concepts present in texts.
Mining: discovers patterns.
Approach for KDT
What is a concept?
- Dictionary: idea, opinion, thought.
- Information Retrieval (IR): concepts are used to index and to retrieve documents.
“Concepts expressed by a language are determined by environment, culture of the people who speak the language”.
Approach for KDT
The goal is:
- to build a simple structure that can represent real-world objects, events, thoughts, opinions and ideas easily and with a certain degree of quality for the discovery process.
Representing concepts internally:
- A concept is stored as a set or vector of terms.
- The vector is non-ordered, to simplify the classification and categorization tasks.
Approach for KDT
The KDT approach mapped onto the KDD phases:
1. Understanding the application domain and the goals of the data mining process.
2. Selecting a target data set: texts must be gathered (with tools or manually).
3. Integrating and checking the data set: texts must be saved as plain text (*.txt).
4. Data cleaning, preprocessing and transformation: concepts must be described, and texts need to be analyzed and stored in the internal format.
5. Model development and hypothesis building: identifying concepts in the collection.
6. Choosing suitable data mining algorithms.
7. Result interpretation and visualization: humans must interpret the conclusions.
8. Result testing and verification.
9. Using and maintaining the discovered knowledge: done by humans.
Categorization
The goal: identify the concepts present in texts.
The approach: based on a simple technique.
[Diagram: terms, concepts and texts, related through fuzzy reasoning; IE]
Categorization
Rocchio Algorithm Categorization
Build a prototype-like vector to represent each class/category (concept).
Advantages:
- simple;
- easy to implement;
- relatively efficient.
Main disadvantage:
- the context of words doesn't influence the categorization.
Categorization
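Rocchio Algorithm Categorization
Operation:
1. Prerequisites: the concepts have been defined and the texts are represented in the internal format.
2. Compare every text against each concept (fuzzy comparison).
3. For each term present in both, multiply its weight in the text by its weight in the concept.
4. Add up the products; the overall sum is limited to 1.
Result: a degree between 0 and 1 relating each concept to each text.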
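The comparison steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that texts and concepts are both stored as dictionaries mapping terms to weights, and the concept names, terms and weights in the example are hypothetical.

```python
def concept_degree(text_weights, concept_weights):
    """Return the degree (0..1) to which a concept is present in a text.

    Both arguments are assumed to be dicts mapping term -> weight.
    """
    # Only terms present in both the text and the concept contribute.
    common_terms = text_weights.keys() & concept_weights.keys()
    total = sum(text_weights[t] * concept_weights[t] for t in common_terms)
    # The overall sum is limited to 1.
    return min(total, 1.0)


def compare_text_to_concepts(text_weights, concepts):
    """Compare one text against every concept definition."""
    return {name: concept_degree(text_weights, weights)
            for name, weights in concepts.items()}


# Hypothetical concept definitions and text representation.
concepts = {
    "loans": {"loan": 0.9, "bank": 0.5, "debt": 0.7},
    "elections": {"vote": 0.9, "campaign": 0.6},
}
text = {"loan": 0.5, "bank": 0.4, "mayor": 0.8}
degrees = compare_text_to_concepts(text, concepts)
```

Because the term vectors are unordered, a dictionary lookup per common term is enough; no positional alignment between text and concept is needed.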
Categorization
Rocchio Algorithm Categorization
In this approach (TERM → CONCEPT): the strength of the relationship is analyzed according to indicators.
Abductive reasoning:
“A → B”: if “A is true”, then we infer that “B is true”.
Conclusion: if the words that describe a concept appear in a text, there is a high probability that the concept is present in that text.
Categorization
“A → B”: if “A is true”, then we infer that “B is true”.
[Diagram: conclusion relating the set of words found in a text to a concept]
Categorization
How do we decide whether a concept is present or not?
Categorization
Decision: it depends on a threshold plus context analysis.
Classification
The goal: generate concept definitions (the choice and description of each concept).
It is possible to:
- use an existing controlled vocabulary (dictionaries, thesauri, ontologies), or
- generate one automatically.
Problems:
1. Thesauri: very domain-dependent, and they lack sufficient vocabulary coverage.
2. Ontologies: fail to include proper nouns.
3. Dictionaries: sometimes don't include important semantic relations.
Preexisting vocabularies may not be appropriate to the user's needs.
Classification
Automatic Generation of a controlled vocabulary:
Learning Process
Supervised process
Unsupervised process
Classification
Automatic Generation of a controlled vocabulary:
Supervised process: takes a set of input data (training data). The data set is analyzed and validation is applied to the algorithms.
Unsupervised process: no prior knowledge is available. Suggests the clustering technique. Learning through observation.
Classification
Problem (supervised process): a high-quality sample of data must be available.
Problem (unsupervised process): classes are identified and created independently of the user's interest.
Classification
In this approach:
The goal is to use a method that is efficient and low-cost in terms of time and effort.
Manual process + dictionaries and software tools.
Technical dictionary + thesaurus.
Examine a sample of the words in the collection:
- frequency
- context
Mining Task
The goal:
- Analyze concept distributions to discover interesting patterns.
- Probabilistic and statistical paradigm: the distribution of variables in the collection.
- Assumption: it only matters whether a concept is present in a text or not.
Mining Task
The first technique:
- Key-concept listing: analyzes concept distributions over the collection.
Goal: find which dominant themes exist in a collection or in a single text.
- Single text: the degree of each concept associated with the text (Concept 1, Concept 2, Concept 3).
- Collection: the number of texts to which each concept is assigned.
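As a rough sketch of key-concept listing over a collection, the snippet below counts the number of texts to which each concept is assigned. It assumes each text has already been categorized into a set of concept names; the function name and the sample data are hypothetical.

```python
from collections import Counter


def key_concept_listing(texts_concepts):
    """List concepts by the number of texts they are assigned to.

    texts_concepts: list with one set of concept names per text.
    Returns (concept, text_count, share_of_collection) tuples,
    most frequent concept first.
    """
    total = len(texts_concepts)
    if total == 0:
        return []
    counts = Counter()
    for concepts in texts_concepts:
        # A concept counts once per text: only presence matters.
        counts.update(set(concepts))
    return [(c, n, n / total) for c, n in counts.most_common()]


# Hypothetical categorized collection.
collection = [
    {"politicians", "crimes"},
    {"politicians"},
    {"elections", "politicians"},
    {"crimes"},
]
listing = key_concept_listing(collection)
```

The first entry of `listing` is the dominant theme of the collection under this counting scheme.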
Mining Task
The second technique:
- Association: discovers associations between concepts and expresses these findings as rules (x → y).
1. Support: proportion of texts that have both x and y in relation to all texts in the collection.
2. Confidence: proportion of texts that have both x and y in relation to the number of texts that have x.
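The two measures can be computed directly from a categorized collection. The sketch below assumes each text is represented as a set of concept names; the sample collection is hypothetical.

```python
def support(texts, x, y):
    """Proportion of all texts that contain both concepts x and y."""
    if not texts:
        return 0.0
    both = sum(1 for concepts in texts if x in concepts and y in concepts)
    return both / len(texts)


def confidence(texts, x, y):
    """Proportion of the texts containing x that also contain y."""
    with_x = [concepts for concepts in texts if x in concepts]
    if not with_x:
        return 0.0
    both = sum(1 for concepts in with_x if y in concepts)
    return both / len(with_x)


# Hypothetical categorized collection.
texts = [
    {"loans", "politicians"},
    {"loans"},
    {"politicians"},
    {"loans", "politicians", "education"},
]
rule_support = support(texts, "loans", "politicians")
rule_confidence = confidence(texts, "loans", "politicians")
```

Note that confidence is directional: the value for loans → politicians generally differs from the value for politicians → loans, while support is the same for both.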
Mining Task
Experiments
Two experiments:
- political analysis;
- competitive intelligence (business intelligence).
:: The classification task was different in each experiment.
:: These experiments are complementary.
Experiments
Political experiment:
- Exhaustive analysis of words.
- Each word examined could be classified into an existing concept or generate a new one.
- Stopwords and general terms were eliminated beforehand.
Competitive intelligence experiment:
- Interesting concepts were selected first.
- Each concept was defined and refined by examining the words present in the collection.
Experiments
In these experiments:
- support equal to 60%;
- confidence threshold of 80%.
Experiments
Political Experiment
- Goal of this experiment: extract knowledge about what the press is or was saying about the mayor of a big city in Brazil.
Collection: newspaper texts in Portuguese.
Sub-collections: 1997 and 1999 (180 texts and 178 texts).
Experiments
Association rules (association technique)
a) drug traffic → politicians (confidence = 93.3%, support = 14 documents)
b) loans → politicians (confidence = 82.1%, support = 14 documents)
- discovered in the 1997 sub-collection.
[Diagram: political sphere; importance degree]
Experiments
Association rules (association technique)
c) combination of 2 patterns:
“education” and “loans” may be connected.
(1) loans → politicians (confidence = 82.1%, support = 23 documents)
(2) education → politicians (confidence = 64.2%, support = 27 documents)
(3) education → loans (confidence = 4.7%, support = 2 documents)
(4) loans → education (confidence = 7.1%, support = 2 documents)
Experiments
However...
Experiments
When analyzing these two concepts together:
(5) loans AND education → politicians (confidence = 83.3%, support = 5 documents)
(6) loans AND politicians → education (confidence = 17.2%)
Experiments
Concept distributions (key-concept listing technique)
Whole collection = 358 texts.
Comparing the distributions of concepts (1997 and 1999):
1. In 1997 the dominant focus was the presence of politicians associated with the mayor, while in 1999 the themes had a balanced distribution.
2. The weight of the “elections” concept: 25% in 1997 and 33.7% in 1999.
3. The “debts” concept reduced its share from 1997 to 1999.
And so on...
Concept      Texts  Share
Politicians  140    39.1%
Crimes       117    32.6%
Elections    105    29.3%
Experiments
Competitive Intelligence Experiment
- Goal of this experiment: compare Text Mining tools by examining the techniques used and the benefits cited by the vendors of these tools. In addition, relate techniques to benefits in order to discover which techniques to use when a certain benefit is needed.
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Retrieval
- It has been developing in parallel with database systems.
However:
- Database => processing of structured data.
- Information Retrieval (IR) => organization and retrieval of information; it handles different kinds of data.
- IR has found many applications.
- IR problem: locating relevant documents in a document collection based on a user's query.
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Retrieval (Measures for Retrieval)
- Quality of text retrieval:
[Venn diagram: within the set of all documents, the relevant documents and the retrieved documents overlap in the "relevant and retrieved" set]
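The sets in the diagram give the two standard retrieval quality measures: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch, with hypothetical document ids:

```python
def precision_recall(relevant, retrieved):
    """Compute precision and recall from two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # the "relevant and retrieved" region
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: documents 3 and 4 are the correct hits.
precision, recall = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
```

Retrieving more documents tends to raise recall and lower precision, which is why the two are usually reported together.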
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Retrieval (Measures for Retrieval)
- Measuring the quality of a ranked list of documents:
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Text Indexing Techniques
- Text retrieval indexing techniques.
Inverted index: an index structure that maintains two hash-indexed tables, a document table and a term table.
- Document table => consists of a set of document records (doc id + posting list).
- Term table => consists of a set of term records (term id + posting list).
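A minimal sketch of the two tables, assuming plain Python dictionaries as the hash indexes, whitespace tokenization, and terms used directly as their own ids. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Build the document table and the term table.

    documents: dict mapping doc id -> raw text.
    document_table: doc id -> list of terms occurring in the document.
    term_table: term -> list of ids of the documents containing it.
    """
    document_table = {}
    term_table = defaultdict(list)
    for doc_id, text in documents.items():
        terms = sorted(set(text.lower().split()))
        document_table[doc_id] = terms
        for term in terms:
            term_table[term].append(doc_id)
    return document_table, dict(term_table)


def search(term_table, *terms):
    """Return the ids of the documents that contain every query term."""
    postings = [set(term_table.get(t.lower(), [])) for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical three-document collection.
docs = {1: "text mining tools", 2: "data mining", 3: "text retrieval"}
document_table, term_table = build_inverted_index(docs)
result = search(term_table, "text", "mining")
```

A query is answered by intersecting posting lists, so only exact term matches are found; a query for a synonym of an indexed term returns nothing.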
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Text Indexing Techniques
Advantages:
- Widely used in industry.
- Easy to implement.
Disadvantages:
- Posting lists do not handle synonymy or polysemy.
- Large storage requirement.
Signature file:
- Stores a signature record for each document in the database.
- Holds the “signature” records for the documents stored in the main file.
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Text Indexing Techniques
Signature file:
Advantages:
- Little storage space.
Example (one bit per word; 1 = the word is present in the record):
       tall  cheerful  elegant  witty  smart  strong  graceful  engaging  bold
João   1     0         1        1      0      0       0         1         1
Maria  0     1         1        0      1      0       1         0         1
Pedro  0     1         0        1      1      1       0         1         0
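The bit rows above can be sketched as integer bitmasks, one bit per vocabulary word. This is a simplified illustration with a hypothetical fixed vocabulary; practical signature files usually hash words into a fixed-width signature, which saves space but can produce false matches that must be checked against the document itself.

```python
def make_signature(words, vocabulary):
    """Set bit i of the signature when vocabulary[i] occurs in words."""
    signature = 0
    for position, word in enumerate(vocabulary):
        if word in words:
            signature |= 1 << position
    return signature


def matches(doc_signature, query_signature):
    """A document matches when it has every bit set in the query."""
    return (doc_signature & query_signature) == query_signature


# Hypothetical vocabulary and documents.
vocabulary = ["text", "mining", "index", "retrieval"]
doc_signatures = {
    "doc1": make_signature({"text", "mining"}, vocabulary),
    "doc2": make_signature({"index", "retrieval", "text"}, vocabulary),
}
query = make_signature({"text", "retrieval"}, vocabulary)
hits = [doc for doc, sig in doc_signatures.items() if matches(sig, query)]
```

Matching is a single bitwise AND per document, which is what keeps both the storage and the comparison cheap.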
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Query Processing Techniques
DB NOSQL
Problems:
- Synonymy => different words with the same meaning, e.g. automobile and vehicle.
- Polysemy => the same keyword with different meanings.
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Extraction
Information Extraction (IE) is the process of extracting, from documents, facts about types of events, entities or relationships. These facts are then usually entered automatically into a database or spreadsheet, which may then be used to analyze the data for trends, to give a natural language summary, or for indexing purposes in Information Retrieval (IR) applications.
Information Retrieval: finds texts and presents them to the user.
Information Extraction: analyzes texts and presents only the specific information of interest.
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Extraction
Information Retrieval
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Extraction: Layer model of the Text Mining Application
Notice these aspects...
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Extraction - Stemming
Identifying the root of a certain word.
Derivational: creates a new word from an existing word.
Inflectional: normalization is limited to regularizing grammatical variants (singular/plural or past/present).
E.g. apply - applied - applies; print - printing - prints - printed
Porter stemming algorithm:
- minimizes the effects of inflection;
- handles morphological variations of words.
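To make inflectional normalization concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem; the rule list and the minimum stem length below are assumptions chosen to cover the examples on this slide.

```python
# (suffix, replacement) pairs, checked in order; longer suffixes first.
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def naive_stem(word, min_stem=3):
    """Strip one common inflectional suffix from an English word."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem:
            return word[:stem_length] + replacement
    return word


apply_forms = {naive_stem(w) for w in ["apply", "applied", "applies"]}
print_forms = {naive_stem(w) for w in ["print", "printing", "prints", "printed"]}
```

Each group of variants collapses to a single form, so they are counted as the same term. A rule list this short mishandles many words (for example, it turns "this" into "thi"), which is part of what a full stemmer such as Porter's addresses.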
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Extraction – Domain Dictionary
It is necessary to provide the system with a knowledge base.
The structure of the domain dictionary is a 3-level hierarchy:
Parent category + sub-category + word.
- Parent category: the main category; it is unique on its level.
- Sub-category: belongs to a certain parent category and has all of its words associated with it.
- Word: depends on the categories above it.
A survey of Text Mining: Retrieval, Extraction and Indexing Techniques
Information Extraction – Exclusion List
Many words in a text file can be treated as unwanted noise.
They need to be eliminated: a separate file lists all such words.
Words such as: the, a, an, if, off, on, in, etc.
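A minimal sketch of applying an exclusion list. It assumes the list is stored one word per line; here the lines are given in memory instead of being read from the separate file, and the tokenization rule and sample sentence are hypothetical.

```python
import re


def load_exclusion_list(lines):
    """Build the exclusion set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_excluded(text, exclusion):
    """Tokenize the text and drop every word found in the exclusion list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in exclusion]


# In practice the lines would come from the exclusion file, e.g. open(path).
exclusion = load_exclusion_list(["the", "a", "an", "if", "off", "on", "in"])
kept = remove_excluded("The parser runs on a file in the collection", exclusion)
```

Keeping the list in a set makes each membership check constant-time, so filtering stays linear in the length of the text.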
Discovery Co-relations on Research Topics and Authors from the PubMed Database
1. PubMed
A free search engine that provides very full coverage of the biomedical sciences and related fields, such as biochemistry and cell biology. It also offers access to the MEDLINE database, with citations and abstracts of biomedical research articles.
1.1. PubMed data structure
- more than 17 million citations with the same structure;
- files are intended for automatic processing (XML);
- 30,000 PubMed citations = one XML instance defined by a DTD.
Discovery Co-relations on Research Topics and Authors from the PubMed Database
Set of information such as:
PubMed identifier + publication year + MeSH terms + authors' names.
A parser has been developed:
1.2. Generating a keyword file
[Diagram: PubMed files 1 to 6 → Parser → new text files]
One citation entry (keywords + authors):
17191901, 2004, Erythrina, Plant Extracts, Plant Roots, chemistry, isolation purification, TANAKA_H, HIRATA_M, ETOH_H, SATO_M, MURATA_J, MURATA_H, DARNAEDI_D, FUKAI_T
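A parser of this kind could look roughly like the sketch below, which turns each citation into one line of identifier, year, keywords and authors. It is not the authors' parser. The embedded XML is a heavily reduced fragment built from the citation entry shown above, and the element names follow a simplified reading of the PubMed XML layout that should be checked against the actual DTD.

```python
import xml.etree.ElementTree as ET

# Reduced, illustrative fragment; real PubMed files carry many more fields.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>17191901</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2004</Year></PubDate></JournalIssue></Journal>
        <AuthorList>
          <Author><LastName>Tanaka</LastName><Initials>H</Initials></Author>
          <Author><LastName>Hirata</LastName><Initials>M</Initials></Author>
        </AuthorList>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Erythrina</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Plant Extracts</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def citation_to_line(article):
    """Flatten one citation element into 'id, year, keywords..., authors...'."""
    pmid = article.findtext(".//PMID", default="")
    year = article.findtext(".//PubDate/Year", default="")
    keywords = [d.text for d in article.iter("DescriptorName") if d.text]
    authors = [
        f"{a.findtext('LastName', '').upper()}_{a.findtext('Initials', '')}"
        for a in article.iter("Author")
    ]
    return ", ".join([pmid, year, *keywords, *authors])


def parse_citations(xml_text):
    """Return one keyword line per citation found in the XML text."""
    root = ET.fromstring(xml_text.strip())
    return [citation_to_line(article) for article in root.iter("PubmedArticle")]


lines = parse_citations(SAMPLE_XML)
```

Writing each citation as a flat line keeps the later mining steps independent of the XML structure.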
Discovery Co-relations on Research Topics and Authors from the PubMed Database
MeSH
1.2. Generating a keyword file
[Diagram: NLM as creator, maintainer and provider of the MeSH data]
Keyword analysis:
- Primary concepts and alternative descriptions.
- Frequency of types, occurrences and medical terms.
- MeSH terms.
Discovery Co-relations on Research Topics and Authors from the PubMed Database
2. Pre-Processing the Data
- Datasets obtained for the years 2003, 2004 and 2005.
- SQL Server 2005 database.
Operations:
1. removing noise from the data (irregular characters);
2. organizing the data for more efficient access.
Discovery Co-relations on Research Topics and Authors from the PubMed Database
2. Pre-Processing the Data
DsPM: input dataset obtained by parsing the PubMed XML files.
Top-5-KW and Top-1-A: indicate the most frequent keywords and authors.
Discovery Co-relations on Research Topics and Authors from the PubMed Database
3. Mining the PubMed data
Association Rule (AR): A → C
Support: number of database entries in which the rule appears.
Confidence: probability that a database entry that contains A will also contain C.
Dependency Networks (DN): graphical models that represent joint distributions for a set of variables. DNs are useful for learning and describing probabilistic relationships in data.
Discovery Co-relations on Research Topics and Authors from the PubMed Database
3. Mining the PubMed data
Discovering Dependency Networks(DN)
Discovery Co-relations on Research Topics and Authors from the PubMed Database
3. Mining the PubMed data
Discovering Dependency Networks(DN)
High confidence value: high probability of co-occurrence with the author in the consequent.
High lift value: when the antecedent occurs, there is a high probability of co-occurrence with the author in the consequent.
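Lift is commonly defined as the confidence of the rule divided by the overall frequency of the consequent, so a value above 1 means the consequent appears more often together with the antecedent than it does in general. A minimal sketch, assuming each database entry is a set of keywords and author names; the sample entries are hypothetical.

```python
def lift(entries, antecedent, consequent):
    """Lift of the rule antecedent -> consequent over a list of item sets."""
    total = len(entries)
    count_a = sum(1 for e in entries if antecedent in e)
    count_c = sum(1 for e in entries if consequent in e)
    count_both = sum(1 for e in entries if antecedent in e and consequent in e)
    if total == 0 or count_a == 0 or count_c == 0:
        return 0.0
    confidence = count_both / count_a      # P(consequent | antecedent)
    return confidence / (count_c / total)  # divided by P(consequent)


# Hypothetical keyword/author entries.
entries = [
    {"Erythrina", "TANAKA_H"},
    {"Erythrina", "TANAKA_H"},
    {"Erythrina"},
    {"Plant Roots"},
]
rule_lift = lift(entries, "Erythrina", "TANAKA_H")
```

Unlike confidence, lift corrects for consequents that are frequent everywhere, which would otherwise score high with any antecedent.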
Conclusion
We saw that:
- The chosen approach combines a manual task, automatic tools and existing vocabularies.
- Automatic methods can help the user find terms related to categories, lexical variations, local synonyms, frequencies, etc.
- Human intervention is important!
To evaluate the categorization method, formal experiments were carried out on texts gathered from the web:
5 tools, 13 concepts, 8 tasks, 5 methods
Conclusion
- Microaveraging precision = 0.59; macroaveraging precision = 0.54; microaveraging recall = 0.95; macroaveraging recall = 0.86; fallout = 0.62.
- Microaveraging precision = 0.65; macroaveraging precision = 0.69; microaveraging recall = 0.89; macroaveraging recall = 0.93; fallout = 0.28.
- Macroaveraging precision = 0.61; macroaveraging recall = 0.97.
Classification
Negative terms + ambiguous words
Credits
FERREIRA, P. G.; LIBRELOTTO, G.; ALVES, R. Discovering Co-Relations on Research Topics and Authors from the PubMed Database.
LOH, S.; WIVES, L. K.; OLIVEIRA, J. P. M. Concept-Based Knowledge Discovery in Texts Extracted from the Web.
SAGAYAM, R.; SRINIVASAN, S.; ROSHNI, S. A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques.
Physicsandcake. Teaching Artificial Intelligences using Quantum Computers. Available at: <http://dwave.wordpress.com/2011/05/27/teaching-artificial-intelligences-using-quantum-computers/>. Accessed 06/10/2013.
Cat Casey. Predictive Analytics and Artificial Intelligence... Science Fiction or E-Discovery Truth?. Available at: <http://hudsonlegalblog.com/e-discovery/predictive-analytics-artificial-intelligence-science-fiction-e-discovery-truth.html>. Accessed 07/10/2013.
Traina, Ribeiro, Cordeiro, Romani, Sousa, Avila, Zullo, Traina, Rodrigues. How to Find Relevant Patterns in Climate Data: an Efficient and Effective Framework to Mine Climate Time Series and Remote Sensing Images. Available at: <http://www.gbdi.icmc.usp.br/agrodatamine/node/33>. Accessed 07/10/2013.