Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Frizo Janssens, Wolfgang Glänzel, and Bart De Moor

Presented by Cindy Burklow

CS 685: Special Topics in Data Mining
Professor Dr. Jinze Liu
University of Kentucky
April 17th, 2008

Outline
Introduction
Motivation
Related Work
Proposed Models
Proposed Algorithms
Results: Hybrid & Dynamic Clustering
Discussion of Pros and Cons
Questions
References

Introduction
Bioinformatics …
◦ Computer Science
◦ Information Technology
◦ Solves problems in Biomedicine

Goal of Paper: Investigate
◦ Cognitive structure
◦ Dynamics of bioinformatics core
◦ Sub-disciplines
◦ ISI Web of Science & MEDLINE
◦ Retrieval of core literature in bioinformatics

MeSH = Medical Subject Headings

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.

Motivation
Bioinformatics field …
◦ Dynamic
◦ Evolving discipline
◦ Fast growth rate

Monitor current trends
Predict future direction
Decision Making
◦ Grants
◦ Business Ventures
◦ Research Opportunities

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Related Work
Web mining
Bibliometrics
Text mining & citation analysis
◦ Mapping of knowledge
◦ Charting science & technology fields

Textual & graph-based approaches
◦ Different perceptions of similarity between documents or groups of documents

Related Work
Establishing the Data Set
Patra & Mishra – Bibliometric Study
◦ MeSH term based
◦ Liberal delineation strategy with maximal recall
◦ Broader interpretation of bioinformatics
◦ Less restricted search strategy
◦ Broader coverage of underlying database
◦ 14,563 journal papers

Related Work
Hybrid Clustering
◦ He – Unsupervised spectral clustering of web pages
◦ Wang & Kitsuregawa – Contents-link coupled clustering algorithm of web pages

Dynamic hybrid clustering
◦ Mei & Zhai – Temporal Text Mining: Kullback-Leibler divergence for coherent themes & Hidden Markov Models
◦ Griffiths & Steyvers – Latent Dirichlet Allocation with hot topics in PNAS abstracts

Models: Data Set
Bibliometric Retrieval Strategy
Novel subject delineation strategy
◦ Retrieve core literature
◦ Combines textual components & bibliometric, citation-based techniques
◦ Web of Science Edition of Thomson Scientific: 7401 bioinformatics-related papers, 1981 to 2004; titles, abstracts, author keywords, and MeSH terms

Models – Text Analysis
◦ All text was indexed with the Jakarta Lucene platform
◦ Encoded in the Vector Space Model using the TF-IDF weighting scheme
◦ Text-based similarities: cosine of the angle between the vector representations of two papers
◦ Stop words excluded during indexing
◦ Porter stemmer: applied to all remaining terms from titles and abstracts
◦ Bigrams: candidate list of MeSH descriptors, author keywords, and noun phrases
◦ Latent Semantic Indexing (LSI) – 10 terms
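The vector space step above can be sketched roughly as follows. This is a minimal illustration, not the Lucene pipeline from the paper: the function names `tfidf_vectors` and `cosine` are hypothetical, the weighting is the plain tf × log(N/df) form (Lucene's exact formula differs), and stemming, stop-word removal, bigram detection, and LSI are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors ({term: weight})."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy "papers" (already tokenized); the terms are made up for illustration.
papers = [
    ["protein", "structure", "prediction"],
    ["protein", "structure", "alignment"],
    ["microarray", "gene", "expression"],
]
vecs = tfidf_vectors(papers)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Papers sharing weighted terms get a similarity above zero, while papers with no terms in common get exactly zero.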

Models – Citation Analysis
Citation Graphs
Link-based algorithms
◦ HITS
◦ PageRank

[Diagram labels: Documents, Representative Publications, Quantify Similarities, Text-based, Citation-based, Boolean Input, Vectors, Cosine, Co-citation, Bibliographic coupling, Combine]

Image Reference: Google Logo from http://www.google.com
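A citation-based similarity of the kind named on this slide can be sketched as the cosine between two Boolean vectors. This is an illustrative assumption about how the Boolean input is compared, not code from the paper; the function name `boolean_cosine` and the reference identifiers are hypothetical. For bibliographic coupling the sets hold the references each paper cites; for co-citation they would hold the papers that cite each one.

```python
import math


def boolean_cosine(items_a, items_b):
    """Cosine of two Boolean vectors, represented as sets of their non-zero positions."""
    a, b = set(items_a), set(items_b)
    if not a or not b:
        return 0.0
    # For 0/1 vectors the dot product is the overlap size and each norm is sqrt(set size).
    return len(a & b) / math.sqrt(len(a) * len(b))


# Bibliographic coupling: two papers sharing two of their three cited references.
paper_x_refs = {"ref1", "ref2", "ref3"}
paper_y_refs = {"ref2", "ref3", "ref4"}
print(boolean_cosine(paper_x_refs, paper_y_refs))
```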

Models – Clustering
Agglomerative Hierarchical Clustering Algorithm with Ward’s Method

Hard Clustering Algorithm:
◦ Every publication is assigned to exactly 1 cluster.

Image Reference: Clustering Analysis - http://en.wikipedia.org/wiki/Data_clustering
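As a rough sketch of agglomerative clustering with Ward's criterion: at every step, merge the two clusters whose union increases the within-cluster sum of squares the least. The version below is a naive illustration on Euclidean points, with the hypothetical name `ward_cluster`; the paper clusters from combined text and citation distances, which this toy does not reproduce.

```python
from itertools import combinations


def ward_cluster(points, k):
    """Merge clusters bottom-up until k remain; returns one hard label per point."""
    # Each cluster is (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            (ma, ca), (mb, cb) = clusters[a], clusters[b]
            na, nb = len(ma), len(mb)
            sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
            # Ward's merge cost: increase in within-cluster sum of squares.
            cost = na * nb / (na + nb) * sq_dist
            if best is None or cost < best[0]:
                best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    labels = [0] * len(points)
    for label, (members, _) in enumerate(clusters):
        for i in members:
            labels[i] = label  # hard clustering: exactly one label per point
    return labels


print(ward_cluster([(0, 0), (0, 1), (10, 10), (10, 11)], k=2))
```

Stopping at `k` clusters corresponds to cutting the dendrogram at a chosen level.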

Models – Clustering
Optimal number of clusters
Combine Distance-based & Stability-based Methods
Strategy
◦ Dendrogram observation
◦ Silhouette Curves: Mean text and Citation-based
◦ Stability Diagram

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
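The silhouette curves mentioned above plot a mean silhouette value against the number of clusters. A minimal sketch of the standard silhouette coefficient, computed from a precomputed distance matrix, might look like this; the function names are hypothetical, and the paper's separate text-based and citation-based curves would come from running it on the corresponding distance matrices.

```python
def silhouette_values(dist, labels):
    """Silhouette value of every item, given a full distance matrix and hard labels."""
    n = len(labels)
    values = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            values.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)  # mean distance to the item's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scale = max(a, b)
        values.append((b - a) / scale if scale else 0.0)
    return values


def mean_silhouette(dist, labels):
    values = silhouette_values(dist, labels)
    return sum(values) / len(values)


# Four items on a line; the natural split is {0, 1} versus {10, 11}.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
print(mean_silhouette(dist, [0, 0, 1, 1]), mean_silhouette(dist, [0, 1, 0, 1]))
```

A candidate number of clusters is attractive when its mean silhouette is high compared with neighboring choices.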

Proposed Algorithm – Hybrid Clustering
Cluster Input: Distances
Combining text mining and bibliometrics
◦ Integrate text & citation info early in the mapping process, before applying the clustering algorithm

Weighted linear combination

Fisher’s inverse chi-square method

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.
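The two integration schemes named on this slide can be sketched as below. Both functions are illustrative assumptions: `weight` is a hypothetical mixing parameter, and the slides do not show how the paper turns text and citation distances into p-values, so the p-values here are simply given as inputs. `fisher_combined_p` is the textbook Fisher method: the statistic −2 Σ ln pᵢ is compared against a chi-square distribution with 2k degrees of freedom, whose tail probability has a closed form for even degrees of freedom.

```python
import math


def weighted_linear_combination(d_text, d_cite, weight=0.5):
    """Blend a text-based and a citation-based distance with a mixing weight in [0, 1]."""
    return weight * d_text + (1.0 - weight) * d_cite


def fisher_combined_p(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Chi-square survival function with 2k degrees of freedom (closed form).
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


print(weighted_linear_combination(0.2, 0.8, weight=0.25))
print(fisher_combined_p([0.05, 0.05]))
```

With a single p-value the combination returns that p-value unchanged, which is a quick sanity check on the formula.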

Proposed Algorithm – Dynamic Hybrid Clustering
Goal: Match & track clusters through time
Process:
◦ Separate hybrid clustering for each period
◦ Determine optimal number of clusters: dendrogram, silhouette curve, Ben-Hur stability plot
◦ Construct complete graph: all cluster centroids from each period as nodes; edge weights as mutual cosine similarities in LSS
◦ Form cluster chains: keep edges with weight > threshold T1; allow qualifying clusters to join when above threshold T2

Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
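The chain-forming step can be sketched as thresholding the centroid graph and taking connected components. This is a simplified illustration under stated assumptions: the function names, period labels, and centroid vectors are hypothetical; chains are taken to be connected components of the edges above T1; and the second threshold T2 is omitted because the slide does not spell out its joining rule.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_chains(centroids, t1):
    """centroids maps (period, cluster_id) -> centroid vector; returns sorted chains."""
    parent = {node: node for node in centroids}

    def find(node):
        # Union-find root lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    # Complete graph over all centroids; keep only edges heavier than T1.
    for a, b in combinations(centroids, 2):
        if cosine(centroids[a], centroids[b]) > t1:
            parent[find(a)] = find(b)

    chains = {}
    for node in centroids:
        chains.setdefault(find(node), []).append(node)
    return sorted(sorted(chain) for chain in chains.values())


centroids = {
    ("period1", "A"): [1.0, 0.0],
    ("period2", "B"): [0.9, 0.1],
    ("period3", "C"): [0.0, 1.0],
}
print(cluster_chains(centroids, t1=0.8))
```

Clusters A and B end up in one chain because their centroids are nearly parallel, while C stays on its own.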

Results – Hybrid Clustering
◦ Silhouette Curve
◦ Stability
◦ Dendrogram

Results – Hybrid Clustering
Cluster Characterization
◦ RNA structure prediction
◦ Protein structure prediction
◦ Systems biology & molecular networks
◦ Phylogeny & Evolution
◦ Genome sequencing & assembly
◦ Gene / promoter / motif prediction
◦ Molecular DBs & annotation platforms
◦ Multiple sequence alignment
◦ Microarray analysis
Cluster sizes legible in the figure: 1167, 749, 1091

Results – Dynamic Clustering
◦ Histogram
◦ Cluster Chains
◦ Yearly Publication Output among Cluster Chains
◦ Dynamic Term Network

Pros & Cons
Pros
◦ Offers fresh perspective on clustering
◦ Integrates various techniques
◦ Provides insight into bioinformatics

Cons
◦ Challenge of selecting the optimal number of clusters still exists
◦ Many steps are required to implement the approach

Questions

References
Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12-15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233
ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/
The Apache Jakarta Project: http://lucene.apache.org/java/1_4_3/
Fisher’s Method: http://en.wikipedia.org/wiki/Fisher%27s_method
“Data Mining - Concepts and techniques” by Han and Kamber, Morgan Kaufmann, 2006. (ISBN:1-55860-901-6)
