Page 1

IndoUS DL 2003 6/23/03

Text Metadata Mining: Exploring its potential*

Padmini Srinivasan

School of Library & Information Science

The University of Iowa

Iowa City, IA

[email protected]

*Students:Aditya Sehgal, Xin Ying Qiu

Page 2

Outline

1. Text Mining

2. Metadata-based Topic profiles

3. Function: Exploring topic characteristics via profiles

Problem: Study disease research prevalence

4. Conclusions

Page 3

1. Text Mining: Novelty and Usefulness

Assist researchers with hypothesis generation, exploration, and testing.

Discover knowledge that is ‘novel’, at least relative to the text collection.

Discover knowledge that is potentially ‘useful’.

Extract patterns, explore relationships.

Propositions/hypotheses: need follow-up verification.

Page 4

Examples

Of all 45 studies in Medline on chemical X, 80% have been done in the context of disease L, 10% in the context of disease M, and the remainder in the context of disease N.

Gene A is known to be associated with disease X. The literature suggests that gene B shows some key ‘similarities’ to A, and therefore B may also be associated with X.

Page 5

Metadata in Digital Libraries

Support content organization and management

Provide access to content

Dublin Core Metadata Initiative

RDF: Resource Description Framework

Library of Congress Subject Headings (LCSH)

Medical Subject Headings (MeSH)

Question: Can we use metadata for text mining and knowledge discovery?

Given a topic, e.g. ‘Toxic waste’, and a collection of texts such as Medline..

Page 6

Metadata for Text Mining

Describe topics: topic profiles built from the text collection being mined ~ metadata profiles

- Compare topics via their profiles:
  a. topic similarity
  b. trends over specific features/characteristics

- Look for indirect links between topics

- Given a topic, look for related topics.

Page 7

Example MEDLINE Record

[Figure: example MEDLINE record, with callouts marking a MeSH Phrase and a MeSH Qualifier]

Page 8

[Diagram: MeSH Metadata (22,000) linked to Semantic Types (134). MeSH terms shown: Aldehydes, Formaldehyde, Protein Isoprenylation. Semantic types shown: Organic Chemical, Chemical, Genetic Function.]

Page 9

2. Topic Profiles

A set of terms that characterize the topic, with weights assigned to represent their relative importance.

{Medline: A vector of MeSH term vectors - one for each of the 134 semantic types.}

Page 10

Topic: “hip fractures in the elderly”

Search against PubMed: (geriatrics or elderly) AND hip fractures

Extract MeSH metadata terms from retrieved documents

Build weighted profile: vector of vectors; can be limited to MeSH terms of particular semantic types
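
The profile-building steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `build_profile`, the `semantic_type_of` lookup table, and the weighting scheme (fraction of retrieved documents indexed with a term) are all hypothetical, since the slides do not say how weights are computed.

```python
from collections import Counter, defaultdict


def build_profile(documents, semantic_type_of, keep_types=None):
    """Return a vector of vectors: {semantic type: {MeSH term: weight}}.

    documents        -- list of MeSH term lists, one list per retrieved record
    semantic_type_of -- hypothetical lookup {MeSH term: semantic type}
    keep_types       -- optional set of semantic types to restrict the profile to
    """
    if not documents:
        return {}
    counts = defaultdict(Counter)
    for mesh_terms in documents:
        for term in set(mesh_terms):  # count a term once per document
            sem_type = semantic_type_of.get(term)
            if sem_type is None:
                continue  # term missing from the lookup table
            if keep_types is not None and sem_type not in keep_types:
                continue
            counts[sem_type][term] += 1
    n_docs = len(documents)
    # Assumed weight: share of retrieved documents that carry the term.
    return {
        sem_type: {term: c / n_docs for term, c in term_counts.items()}
        for sem_type, term_counts in counts.items()
    }


# Toy data for illustration only; not taken from a real PubMed search.
docs = [["Hip Fractures", "Aged"], ["Hip Fractures", "Osteoporosis"]]
types = {
    "Hip Fractures": "Injury or Poisoning",
    "Aged": "Age Group",
    "Osteoporosis": "Disease or Syndrome",
}
profile = build_profile(docs, types)
age_only = build_profile(docs, types, keep_types={"Age Group"})
```

Passing `keep_types` corresponds to limiting the profile to MeSH terms of particular semantic types.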

Page 11

Example Profile: Raynaud's disease

Page 12

Comparing topics via their profiles

[Diagram: for each of Topic 1 and Topic 2, a PubMed search retrieves documents, from which a MeSH profile is built; the two MeSH profiles are compared by cosine similarity. Label on the diagram: 13,000 genes.]
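
As a rough sketch of the comparison step, cosine similarity between two profiles can be computed as below. It assumes each profile has been flattened to a single sparse vector `{MeSH term: weight}`; the function name and the toy weights are hypothetical.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    shared_terms = set(profile_a) & set(profile_b)
    dot = sum(profile_a[t] * profile_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy profiles for illustration only.
topic_1 = {"Aldehydes": 0.4, "Formaldehyde": 0.3}
topic_2 = {"Formaldehyde": 0.6, "Protein Isoprenylation": 0.8}
similarity = cosine_similarity(topic_1, topic_2)
```

Terms that appear in only one profile contribute to that profile's norm but not to the dot product, so profiles with no shared terms score 0.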

Page 13

Comparing topics - studying particular characteristics in their profiles

Problem: To study the prevalence of disease research.

‘geographical context’.

Page 14

Topic: “cholera”

Search against PubMed:

Extract MeSH metadata terms from retrieved documents

Build weighted profile vectors; can be limited to MeSH terms in ‘Geographical Area’

Cholera: {0.6 Nigeria, 0.1 Malaysia, ……}

Breast Cancer: {0.1 Poland, 0.8 Italy, ……}

Rank nations
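
The ranking step can be sketched as a sort over the geographic profile. The weights below are the illustrative values from this slide; the function name `rank_nations` and the alphabetical tie-break are assumptions.

```python
def rank_nations(geo_profile):
    """Order nations by descending profile weight (ties broken alphabetically)."""
    return sorted(geo_profile, key=lambda nation: (-geo_profile[nation], nation))


# Illustrative weights copied from the slide; the elided entries are omitted.
cholera = {"Nigeria": 0.6, "Malaysia": 0.1}
breast_cancer = {"Poland": 0.1, "Italy": 0.8}

cholera_ranking = rank_nations(cholera)              # ["Nigeria", "Malaysia"]
breast_cancer_ranking = rank_nations(breast_cancer)  # ["Italy", "Poland"]
```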

Page 15

Research Prevalence: Mental Disorders (1961-2000)

Ranking nations

Page 16

Research Prevalence: Cholera (middle & low income; 1991-2000)

Ranking nations

Page 17

Research prevalence versus disease prevalence

For each disease:

(a) Rank nations by Disease Prevalence (WHO epidemiological data), estimated by # of cases reported or # of deaths. Sources: Statistical Information System; weekly epidemiological records

(b) Rank nations by Research Prevalence

Compare rankings using Spearman's rank correlation coefficient.

Analysis limited to the decade of the 90s.

Question: So how does the prevalence of research compare with the prevalence of the disease?
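
One way to compare the two rankings is sketched below. It computes Spearman's coefficient as the Pearson correlation of the ranks, with tied values sharing their mean rank. The function names and the handling of ties are assumptions, as the slides do not describe those details.

```python
import math


def average_ranks(values):
    """Rank values with 1 = largest; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the block of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation of two equal-length numeric sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


def compare_prevalence(disease_prevalence, research_prevalence):
    """Correlate two {nation: value} mappings over the nations they share."""
    nations = sorted(set(disease_prevalence) & set(research_prevalence))
    return spearman(
        [disease_prevalence[n] for n in nations],
        [research_prevalence[n] for n in nations],
    )
```

In this sketch `disease_prevalence` would hold reported cases or deaths per nation and `research_prevalence` the profile weights per nation. Significance testing, as marked in the results table, is not covered here.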

Page 18

19 diseases

Breast cancer
Colorectal cancer
Hodgkin's disease
Meningitis
Dengue
Tuberculosis
Liver neoplasms
Prostate cancer
Ovarian cancer
Esophagus cancer
Cholera
AIDS
Stomach cancer
Melanoma
Leprosy
Malaria
Yellow fever
Trypanosomiasis
Dracunculiasis

Page 19

Disease         Income   N     CC
Breast Cancer   All      168   0.645*
Breast Cancer   High     35    0.856*
Breast Cancer   Medium   71    0.709*
Breast Cancer   Low      61    0.372*
Hodgkins        All      165   0.539*
Hodgkins        High     34    0.71*
Hodgkins        Medium   70    0.545*
Hodgkins        Low      61    0.386*

* significant at the 0.05 level

Page 20

Observations:

Diseases most prevalent in the high or middle income groups have a significant positive correlation (9/10 diseases).

Diseases most prevalent in the low income group: a significant positive correlation is less likely (4/9, 44%).

Page 21

Temporal analysis on disease research

Extract the top 3 ranked diseases studied in the context of each nation

Pool these together

How often does a disease rank in the top 3 positions?

Page 22

Topic: Each nation

Sweden: {0.6 Breast Cancer, 0.1 Malaria , ……}

Nigeria: {0.1 Breast Cancer, 0.8 Malaria, ……}

Rank diseases
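
The temporal analysis steps (take each nation's top 3 diseases, pool them, count how often each disease appears) might be sketched as follows. The function names are hypothetical. The Breast Cancer and Malaria weights come from the slide, while the Cholera and AIDS weights are invented so that each toy profile has more than three entries.

```python
from collections import Counter


def top_k_diseases(nation_profile, k=3):
    """Return the k highest-weighted diseases in one nation's profile."""
    ranked = sorted(nation_profile, key=lambda d: (-nation_profile[d], d))
    return ranked[:k]


def pool_top_diseases(nation_profiles, k=3):
    """Count how often each disease lands in a nation's top k."""
    pooled = Counter()
    for profile in nation_profiles.values():
        pooled.update(top_k_diseases(profile, k))
    return pooled


# Cholera and AIDS weights are made up for illustration.
nation_profiles = {
    "Sweden": {"Breast Cancer": 0.6, "Malaria": 0.1, "Cholera": 0.05, "AIDS": 0.2},
    "Nigeria": {"Breast Cancer": 0.1, "Malaria": 0.8, "Cholera": 0.3, "AIDS": 0.2},
}
pooled = pool_top_diseases(nation_profiles)
```

To follow the per-decade, per-income-group pooling, `pool_top_diseases` would be called once for each group of nations with profiles built from that decade's records.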

Page 23

Pooling: (for each decade & each income group)

Page 25

Observations from the study:

Collecting epidemiological data is extremely complicated.

Collect it at a fine-grained level, e.g. different forms of Leishmaniasis; Plague

Complement existing efforts at collecting epidemiological data.

Consider more complex phenomena such as the prevalence of Leishmania and HIV as co-infections.

Research based evidence to explore policy issues.

Page 26

Conclusions:

Metadata can be exploited for text mining

MeSH ~ rich metadata scheme

Importance of metadata for digital libraries

Other text mining applications built on DL?

Domain independent ~ accounting!

Thank you!