Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis. Frizo Janssens, Wolfgang Glänzel, and Bart De Moor. Presented by Cindy Burklow. CS 685: Special Topics in Data Mining, Professor Dr. Jinze Liu, University of Kentucky, April 17th, 2008.
Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis
Frizo Janssens, Wolfgang Glänzel, and Bart De Moor
Presented by Cindy Burklow
CS 685: Special Topics in Data Mining
Professor Dr. Jinze Liu
University of Kentucky
April 17th, 2008
Outline
Introduction
Motivation
Related Work
Proposed Models
Proposed Algorithms
Results: Hybrid & Dynamic Clustering
Discussion of Pros and Cons
Questions
References
Introduction
Bioinformatics …
◦ Computer Science
◦ Information Technology
◦ Solves problems in Biomedicine
Goal of Paper: Investigate
◦ Cognitive structure
◦ Dynamics of bioinformatics core
◦ Sub-disciplines
◦ ISI Web of Science & MEDLINE
◦ Retrieval of core literature in bioinformatics
MeSH = Medical Subject Headings
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 360, 368, KDD '07. ACM, San Jose, CA, August 2007.
Motivation
Bioinformatics field …
◦ Dynamic
◦ Evolving discipline
◦ Fast growth rate
Monitor current trends
Predict future direction
Decision Making
◦ Grants
◦ Business Ventures
◦ Research Opportunities
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.
Related Work
Web mining
Bibliometrics
Text mining & citation analysis
◦ Mapping of knowledge
◦ Charting science & technology fields
Textual & graph-based approaches
◦ Different perceptions of similarity between documents or groups of documents
Related Work
Establishing the Data Set
Patra & Mishra – Bibliometric Study
◦ MeSH term based
◦ Liberal delineation strategy with maximal recall
◦ Broader interpretation of bioinformatics
◦ Less restricted search strategy
◦ Broader coverage of underlying database
◦ 14,563 journal papers
Related Work
Hybrid Clustering
◦ He – Unsupervised spectral clustering of web pages
◦ Wang & Kitsuregawa – Contents-linked coupled clustering algorithm of web pages
Dynamic hybrid clustering
◦ Mei & Zhai – Temporal Text Mining
◦ Kullback-Leibler – Divergence for coherent
Models – Clustering
Optimal number of clusters
Combine Distance-based & Stability-based Methods Strategy
◦ Dendrogram observation
◦ Silhouette Curves: Mean text and Citation-based
◦ Stability Diagram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, 365, KDD '07. ACM, San Jose, CA, August 2007.
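The silhouette idea on this slide can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it assumes a precomputed distance matrix and a cluster label per document, and the names `silhouette_values` and `mean_silhouette` are hypothetical. Running it for several candidate cluster counts and comparing the mean values gives a silhouette curve.

```python
def silhouette_values(dist, labels):
    """Silhouette value s(i) = (b - a) / max(a, b) for every object.

    dist   : square matrix (list of lists) of pairwise distances
    labels : cluster label of each object; at least two clusters are assumed
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the object's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def mean_silhouette(dist, labels):
    """Average silhouette value of a clustering (higher is better)."""
    vals = silhouette_values(dist, labels)
    return sum(vals) / len(vals)


# Toy data (made up for illustration): four points on a line.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(dist, [0, 0, 1, 1])  # natural grouping
bad = mean_silhouette(dist, [0, 1, 0, 1])   # mixed grouping
print(good, bad)
```

The natural grouping scores close to 1 and the mixed grouping scores below 0, which is the behaviour a silhouette curve relies on when candidate solutions are compared.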
Proposed Algorithm – Hybrid Clustering
Cluster Input: Distances
Combining text mining and bibliometrics
◦ Integrate text & citation info early in the mapping process, before applying the clustering algorithm
Weighted linear combination
Fisher's inverse chi-square method
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 362, 363, KDD '07. ACM, San Jose, CA, August 2007.
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 363, KDD '07. ACM, San Jose, CA, August 2007.
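The two integration options named on the slide can be sketched as follows. This is a hedged illustration only: the function names and the `weight` parameter are hypothetical, the Fisher step is the textbook unweighted form (statistic X = -2 Σ ln p_i, chi-square with 2k degrees of freedom), and the p-values are taken as given because the transcript does not say how distances are converted to p-values. The paper's exact variant may differ.

```python
import math


def weighted_linear_combination(d_text, d_cite, weight=0.5):
    """Element-wise weight * d_text + (1 - weight) * d_cite.

    d_text, d_cite : distance matrices of the same shape (lists of lists)
    weight         : assumed mixing parameter in [0, 1], not from the slides
    """
    return [
        [weight * t + (1.0 - weight) * c for t, c in zip(row_t, row_c)]
        for row_t, row_c in zip(d_text, d_cite)
    ]


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's inverse chi-square method.

    Under the null hypothesis X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom. For even degrees of freedom the
    upper-tail probability has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    All p-values must be strictly positive.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Toy inputs (made up for illustration).
d_text = [[0.0, 0.2], [0.2, 0.0]]
d_cite = [[0.0, 0.8], [0.8, 0.0]]
combined = weighted_linear_combination(d_text, d_cite, weight=0.5)
p_both = fisher_combined_p([0.05, 0.05])
print(combined, p_both)
```

The linear combination needs a weight to be chosen by hand, whereas Fisher's method works on p-values and so does not depend on the two distance measures sharing a scale.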
Proposed Algorithm – Dynamic Hybrid Clustering
Goal: Match & track clusters through time
Process:
◦ Separate hybrid clustering for each period
◦ Determine optimal number of clusters
◦ Construct complete graph
  All cluster centroids from each period as nodes
  Edge weights as mutual cosine similarities in LSS
◦ Form Cluster Chains
  Keep edge weights > threshold, T1
  Allow qualifying clusters to join > threshold, T2
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
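The chain-forming process listed above can be sketched roughly as below. Several points are assumptions, not statements from the slides: centroids are plain vectors keyed by `(period, cluster_id)`, only clusters from different periods are linked, chains are the connected components of the edges above T1, and the T2 step is read as "a cluster left without a strong link joins the chain of its most similar linked cluster if that similarity exceeds T2". The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_chains(centroids, t1, t2):
    """Group per-period clusters into chains.

    centroids : dict mapping (period, cluster_id) -> centroid vector
    t1        : threshold for strong links between periods
    t2        : lower threshold for late joiners (assumed reading of T2)
    """
    nodes = sorted(centroids)
    # Complete graph: one similarity per unordered pair of centroids.
    sim = {
        (a, b): cosine(centroids[a], centroids[b])
        for a, b in combinations(nodes, 2)
    }
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Step 1: keep cross-period edges whose weight exceeds T1.
    strong = [(a, b) for (a, b), s in sim.items() if a[0] != b[0] and s > t1]
    for a, b in strong:
        parent[find(a)] = find(b)
    linked = sorted({n for edge in strong for n in edge})

    # Step 2: clusters without a strong link may join via T2.
    for n in nodes:
        if n in linked:
            continue
        best, best_sim = None, t2
        for m in linked:
            s = sim[(n, m)] if (n, m) in sim else sim[(m, n)]
            if m[0] != n[0] and s > best_sim:
                best, best_sim = m, s
        if best is not None:
            parent[find(n)] = find(best)

    chains = {}
    for n in nodes:
        chains.setdefault(find(n), []).append(n)
    return sorted(chains.values())


# Toy centroids (made up for illustration): two topics over three periods.
centroids = {
    (1, "a"): [1.0, 0.0], (2, "a"): [0.9, 0.1], (3, "a"): [0.6, 0.4],
    (1, "b"): [0.0, 1.0], (2, "b"): [0.1, 0.9],
}
chains = cluster_chains(centroids, t1=0.95, t2=0.8)
print(chains)
```

In the toy data the period-3 cluster has drifted too far for a T1 link but still clears T2, so it is appended to the first chain. Raising T2 above its best similarity would leave it as a chain of its own.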
Results – Hybrid Clustering
Silhouette Curve
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Silhouette Curve
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 364, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Stability
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Dendrogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Hybrid Clustering
Cluster Characterization
◦ RNA structure prediction: 205
◦ Protein structure prediction: 1167
◦ Systems biology & molecular networks: 694
◦ Phylogeny & Evolution: 749
◦ Genome sequencing & assembly: 640
◦ Gene / promoter / motif prediction: 995
◦ Molecular DBs & annotation platforms: 1091
◦ Multiple sequence alignment: 713
◦ Microarray analysis: 1147
Results – Dynamic Clustering
Histogram
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 365, KDD '07. ACM, San Jose, CA, August 2007.
Results – Dynamic Clustering
Cluster Chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 367, KDD '07. ACM, San Jose, CA, August 2007.
Yearly Publication Output among Cluster Chains
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Dynamic Term Network
Image Reference: Frizo Janssens, Wolfgang Glänzel, and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pg. 368, KDD '07. ACM, San Jose, CA, August 2007.
Pros & Cons
Pros
◦ Offers fresh perspective on clustering
◦ Integrates various techniques
◦ Provides insight into bioinformatics
Cons
◦ Challenge of selecting the optimal number of clusters still exists
◦ There are many steps required to implement their approach
Questions
References
Janssens, F., Glänzel, W., and De Moor, B. 2007. Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA, August 12-15, 2007). KDD '07. ACM, New York, NY, 360-369. DOI= http://doi.acm.org/10.1145/1281192.1281233
ISI Web of Science Image: http://apps.isiknowledge.com/WOS_GeneralSearch_input.do?highlighted_tab=WOS&product=WOS&last_prod=WOS&SID=3DamC8GFDKmpBLhFOIM&search_mode=GeneralSearch
PubMed Image: http://www.ncbi.nlm.nih.gov/pubmed/
The Apache Jakarta Project: