Understanding the Field of Bioinformatics through Text Mining Techniques (Detecting Bioinformatics by Text Mining Techniques)
Min Song, PhD
Associate Professor
Department of Library and Information Science
Yonsei University
Outline
• Introduction and Background
• Research Problem
• Methods
  • Data Processing
  • Topic Modeling
  • Citation Analysis
  • Identification of Important Articles by PageRank
  • Visualization
• Results & Discussion
• Summary & Future Work
Introduction
• Bioinformatics has grown into a cross-disciplinary field and proliferated into new areas of the life sciences
• 400,000 biological researchers worldwide
• The sequencing industry is projected to grow from $1.5B to $100B in 20 years (NextGen Informatics, 2011)
• An increasing number of biological databases, including PubMed and PubMed Central
• Understanding the trends in and the structure of Bioinformatics is increasingly important
• Bibliometric analysis has been applied to Bioinformatics for this purpose (Glänzel et al., 2009; Bansard et al., 2007; Huang et al., 2010)
Research Problem
• Bibliometric analysis uses quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh, 1996)
• Problems of current approaches
  • Current bibliometric analysis relies primarily on Thomson’s Web of Science product, which results in the following problems:
    • Citation data are processed manually
    • Coverage is incomplete
    • Only citation analysis is used
    • Big data cannot be handled
Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.
Goal
• Detect the trends in and the structure of the field of Bioinformatics
• We introduce novel techniques to detect the knowledge structure of and trends in Bioinformatics by text mining techniques and automated citation analysis
  • Mining PubMed Central full text with
    • topic modeling
    • word co-occurrence
    • named entity recognition
    • MeSH
  • Author co-citation analysis
  • Visualization
What is PubMed Central?
• PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of biomedical and life sciences journal literature
• Provides free and unrestricted access (XML format)
• Integrates journal literature with other valuable information resources in the NCBI database family (e.g., PubMed, Nucleotide, Protein)
• Launched in February 2000
• 383 journals, 1,512,652 articles, and 4.3 million unique visitors in April 2008
Citation Analysis
• Citation Graphs
• Link-based algorithms
  • PageRank
[Diagram labels: Documents; Quantify similarities; Text-based: term co-occurrence, topic modeling; Citation-based: co-citation, bibliographic coupling (BC); Combine; Representative publications]
Methods – Data Collections
1. Advanced Bioinformatics
2. Algorithms for Molecular Biology
3. Biochemistry
4. BioData Mining
5. Bioinformatics
6. Bioinformation
7. BMC Bioinformatics
8. BMC Genomics
9. BMC Systems Biology
10. Briefings in Functional Genomics & Proteomics
11. BMC Research Notes
12. Bulletin of Mathematical Biology
13. Cancer Informatics
14. Comparative and Functional Genomics
15. EURASIP Journal on Bioinformatics and Systems Biology
16. The EMBO Journal
17. Evolutionary Bioinformatics
18. Genome Biology
19. Genome Medicine
20. Genomics
21. Genome Integration
22. Journal of Biotechnology
23. Journal of Biomedical Semantics
24. Journal of Proteome Research
25. Journal of Proteomics
26. Journal of Computer-Aided Molecular Design
27. Journal of Computational Neuroscience
28. Journal of Molecular Biology
29. Journal of Molecular Modelling
30. Journal of Theoretical Biology
31. Mammalian Genome
32. Molecular & Cellular Proteomics
33. Molecular Systems Biology
34. Neuroinformatics
35. Pharmacogenetics and Genomics
36. Physiological Genomics
37. PLoS Computational Biology
38. PLoS Biology
39. PLoS Genetics
40. Protein Science
41. Proteomics
42. Source Code for Biology and Medicine
43. Statistical Methods in Medical Research
44. Theoretical Biology and Medical Modelling
45. Trends in Biochemical Sciences
46. Trends in Biotechnology
47. Trends in Genetics
Total: 20,869 articles from 47 journals
Overall Procedure of Our Approach
[Flow diagram]
• Parse PubMed Central
  • Citation relational DB
  • Text relational DB
• Text Analysis
  • Word co-occurrence
  • MeSH term frequency
  • Topic modeling with LDA
• Link Analysis
  • Ranking important articles by PageRank
• Detect organization and country with NER
• Author co-citation analysis
• Research productivity analysis
MeSH = Medical Subject Headings
Word co-occurrence analysis and MeSH term frequency
• Important concept identification by word co-occurrence
  • The most widely used measure of co-occurrence is mutual information (MI)
  • We use the log-likelihood ratio (LLR) because it is more appropriate than MI for a mixture of high-frequency and low-frequency bigrams
• Important concept identification by MeSH term
  • Count the MeSH terms assigned to each article
  • MeSH terms are not assigned to PubMed Central articles
  • Map each PubMed Central article to its PubMed record, then extract the MeSH terms
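The LLR scoring of bigrams can be sketched as follows. This is a minimal illustration assuming Dunning's G² statistic over a 2x2 contingency table of adjacent word pairs; the slides do not state the exact formulation or tokenization used, and the function names `llr` and `score_bigrams` are hypothetical.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (w1, w2); k12: (w1, not w2);
    k21: (not w1, w2);      k22: (not w1, not w2).
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def score_bigrams(tokens):
    """Return {(w1, w2): LLR score} for every adjacent word pair (assumed setup)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first, second = Counter(), Counter()
    for (a, b), c in bigrams.items():
        first[a] += c    # how often a opens a bigram
        second[b] += c   # how often b closes a bigram
    n = sum(bigrams.values())
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores
```

Calling `score_bigrams` on a tokenized corpus and sorting the result by score gives a ranked list of word pairs; unlike MI, the statistic does not blow up for pairs that are seen only once or twice.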
Database Schema for a PubMed Central Citation DB
Article: article_id (PK), journal_title, title, year, issue, pmid
Authorship: author_id (PK), first_name, last_name
RelationAuthorArticle: author_id (PK), article_id (PK)
Citation: citation_id (PK), id_citation_from, id_citation_to
Affiliation: affiliation_id (PK), person_id, article_id, country, organization
Fulltext: pmid (PK), title, journal_title, abstract, introduction, methodology
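One way to realize this schema is sketched below with Python's built-in `sqlite3` module. The table and column names come from the slide; the column types, the foreign-key references, the choice of SQLite, and the helper names `create_citation_db` and `citation_counts` are assumptions made for illustration only.

```python
import sqlite3

# Column types and REFERENCES clauses are assumed; the slide lists names and PKs only.
SCHEMA = """
CREATE TABLE Article (
    article_id INTEGER PRIMARY KEY,
    journal_title TEXT, title TEXT, year INTEGER, issue TEXT, pmid INTEGER
);
CREATE TABLE Authorship (
    author_id INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT
);
CREATE TABLE RelationAuthorArticle (
    author_id INTEGER REFERENCES Authorship(author_id),
    article_id INTEGER REFERENCES Article(article_id),
    PRIMARY KEY (author_id, article_id)
);
CREATE TABLE Citation (
    citation_id INTEGER PRIMARY KEY,
    id_citation_from INTEGER REFERENCES Article(article_id),
    id_citation_to INTEGER REFERENCES Article(article_id)
);
CREATE TABLE Affiliation (
    affiliation_id INTEGER PRIMARY KEY,
    person_id INTEGER, article_id INTEGER, country TEXT, organization TEXT
);
CREATE TABLE Fulltext (
    pmid INTEGER PRIMARY KEY,
    title TEXT, journal_title TEXT, abstract TEXT,
    introduction TEXT, methodology TEXT
);
"""


def create_citation_db(path=":memory:"):
    """Create the six tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def citation_counts(conn):
    """Map each cited article id to the number of citations it received."""
    rows = conn.execute(
        "SELECT id_citation_to, COUNT(*) FROM Citation GROUP BY id_citation_to"
    )
    return dict(rows.fetchall())
```

With citations stored as (from, to) pairs, in-degree counts and the edge list needed for link analysis are both a single query away.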
Topic Modeling
• Topic Modeling by LDA
  • We explore the salient topics in the core literature of Bioinformatics
  • We use Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), for topic model generation
  • LDA is a generative model that allows sets of observations to be explained by unobserved groups, which accounts for the similarity of documents in the collection
  • In LDA, each document is described as a random mixture over latent topics, where each topic is a discrete distribution over the vocabulary of the collection
D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
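To make the mixture-over-topics idea concrete, here is a small sketch that fits LDA with collapsed Gibbs sampling in plain Python. The slides do not say which inference method or toolkit was used (Blei et al., 2003 describe a variational approach), so the sampler, the hyperparameters `alpha` and `beta`, and the function names are all assumptions chosen for brevity.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words).

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                        # n(k)
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    topics = list(range(num_topics))
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=20):
    """List the n most frequent words assigned to each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]
```

Running `top_words` on the fitted counts yields one ranked word list per topic, which is the form in which topic-model results are usually reported. For a corpus of tens of thousands of full-text articles, an optimized library implementation would be the practical choice.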
NER-based Detection of Organization and Country
• We apply a Named Entity Recognition (NER) technique to identify the country and organization from the text
Citation Analysis
• Build a citation network from the datasets
  • 990,000 citation nodes from about 20,000 papers
• Apply the PageRank algorithm to the network to identify the important articles
Citation Network (Complexity and Social Networks, 2012)
PageRank - definition
• u: a web page
• F_u: the set of pages u points to
• B_u: the set of pages that point to u
• N_u = |F_u|: the number of links from u
• c: a factor used for normalization

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
• The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges.
• The definition corresponds to the probability distribution of a random walk on the web graph.
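The iteration described above can be sketched as a short power-iteration routine. Note that this is not the exact simplified formula: citation networks contain many papers with no outgoing links inside the dataset, so the sketch assumes the common damped variant (a `damping` factor of 0.85 and uniform redistribution of rank from such dangling nodes), neither of which is stated on the slides. The function name and input format are hypothetical.

```python
def pagerank(links, damping=0.85, max_iter=100, tol=1e-12):
    """Power iteration for PageRank.

    links maps each node u to the list of nodes it points to (F_u).
    Damping and dangling-node handling are assumptions; with damping=1.0
    and no dangling nodes each update is R(u) = sum of R(v) / N_v over
    the nodes v that point to u.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # any starting ranks would do
    for _ in range(max_iter):
        # Rank held by nodes with no out-links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += damping * rank[v] / len(targets)  # R(v) / N_v
        converged = max(abs(new_rank[u] - rank[u]) for u in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank
```

For a citation network, `links` would map each citing article to the articles it cites, and sorting the returned dictionary by value gives a ranking of articles. The ranks always sum to 1, which plays the role of the normalization factor c.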
Results and Discussion
• Term Co-occurrence Analysis
Keywords with Highly Ranked Word Co-occurrence
Keyword | Word co-occurrence and LLR score
Gene | gene expression - 36947.5, gene ontology - 4729.7, expressed genes - 4115.5, genes involved - 3423.9, gene regulation - 1314.1
Genome | genome wide - 15485.4, whole genome - 5401.7, human genome - 2950.3, genome sequence - 1821.2, functional genomics - 1805.4
Expression | expression patterns - 4231.7, expression profiles - 6517.0, expression data - 3546.1, expression levels - 3187.4
Data | data sets - 6593.5, microarray data - 6305.9, expression data - 3546.1
Protein | protein interaction - 4824.8, protein interactions - 3186.5, protein coding - 2841.8, protein protein - 2719.8
Algorithm | clustering algorithm - 676.0, clustering algorithms - 585.0, new algorithm - 502.0, proposed algorithm - 416.7, alignment algorithms - 266.3
Database | public databases - 1309.8, relational database - 1296.7, database search - 363.4
Computer | computer simulations - 538.7, computer program - 317.2, computer aided - 278.6, computer science - 223.1, computational model - 221.9
Results and Discussion (Cont’d)
gene - expression 36947.5
amino - acid 16483.9
genome - wide 15485.4
high - throughput 14185.2
large - scale 10554
binding - sites 9450.1
factor - transcription 8580.7
saccharomyces - cerevisiae 7867.8
E - coli 6849.4
expression - profiles 6517
microarray - data 6305.9
expression - patterns 4231.7
expression - levels 3187.4
Top Ranked Word Pairs by LLR
Results and Discussion (Cont’d)
• Out of 20,869 documents, 19,954 have a corresponding MEDLINE record (95.6% matching). Of these 19,954 documents, 8,412 have MeSH terms (42.2%)
MeSH Term | Frequency
Animals | 5178
Humans | 4883
Computational Biology | 3070
Algorithms | 2980
Gene Expression Profiling | 2702
Oligonucleotide Array Sequence Analysis | 2192
Software | 2154
Molecular Sequence Data | 1868
Models, Biological | 1579
Computer Simulation | 1568
Mice | 1511
Sequence Analysis, DNA | 1489
Base Sequence | 1374
Genomics | 1344
Evolution, Molecular | 1336
Databases, Genetic | 1325
Models, Genetic | 1289
Sequence Alignment | 1278
Proteins | 1135
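The matching percentages and the term frequencies are simple counts, which can be sketched as below. The input format (a dictionary from PMID to its list of MeSH terms) and the function names are hypothetical, and counting each term once per article is an assumption about how the frequencies were tallied.

```python
from collections import Counter


def mesh_term_frequency(article_mesh):
    """Count, for each MeSH term, the number of articles it is assigned to.

    article_mesh maps a PMID to that article's MeSH terms (empty if none).
    """
    freq = Counter()
    for terms in article_mesh.values():
        freq.update(set(terms))  # each term counted once per article
    return freq


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


# Reproduce the two coverage figures from the document counts reported above.
medline_match = percent(19954, 20869)   # articles with a MEDLINE record
mesh_coverage = percent(8412, 19954)    # matched articles that have MeSH terms
```

Sorting the counter with `most_common()` gives a frequency ranking in the same form as the table.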
Results and Discussion (Cont’d) - Topic Modeling
Top 20 words per topic, in rank order:
Topic 1: identification, signaling, using, cerevisiae, saccharomyces, genes, small, non, alignment, dna, cancer, yeast, network, genome, system, expression, genomic, activity, screening, specific
Topic 2: model, gene, mapping, protein, human, structural, between, computational, binding, based, genomes, structure, biology, tool, interactions, domains, role, cell, new, length
Topic 3: human, cells, detection, pathway, protein, from, dna, analysis, stem, elegans, recognition, structure, caenorhabditis, evolution, complex, gene, 1, induced, nuclear, strand
Topic 4: expression, profiling, regulatory, specific, mouse, transcriptional, molecular, dynamic, regulation, evolution, cancer, genes, comparative, sequence, early, support, discovery, proteins, machine, during
Topic 5: data, time, information, protein, classification, from, high, analysis, mass, microarray, throughput, based, algorithm, sequence, expressed, spectrometry, database, differentially, identifying, pcr
Topic 6: transcription, genomic, evolutionary, prediction, factor, sites, analysis, dna, coli, gene, genome, escherichia, acid, copy, number, binding, organization, evolution, estimation, arabidopsis
Topic 7: expression, gene, analysis, data, using, genes, microarray, control, from, human, cell, c, case, assessment, size, network, multiple, quality, transcriptional, cells
Topic 8: analysis, protein, networks, interaction, methods, based, genomics, web, genome, biology, genetic, biological, hiv, systems, sequence, data, bayesian, tool, structure, approach
Topic 9: gene, genetic, new, system, metabolism, chromosome, zebrafish, functional, annotation, open, associated, integrating, reveals, life, bacteria, transcriptome, among, loss, mammalian, microarrays
Topic 10: genome, using, gene, from, wide, sequences, data, method, large, whole, rna, disease, networks, pathways, short, drosophila, scale, alternative, regions, development
Results and Discussion (Cont’d) -Topic Modeling
Results and Discussion (Cont’d)
Relationship between a paper and its citation
Results and Discussion (Cont’d)
Publication productivity by year
Results and Discussion (Cont’d)
Relationship between an author and the number of citations received
Results and Discussion (Cont’d)
• Important Articles Identified by PageRank
Rank | Title | Journal
1 | Gapped BLAST and PSI-BLAST: A new generation of protein database search programs | Nucleic Acids Res
2 | Basic local alignment search tool | J Mol Biol
3 | Gene ontology: tool for the unification of biology. The Gene Ontology Consortium | Nat Genet
4 | CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | Nucleic Acids Res
5 | R: A language and environment for statistical computing | Book
6 | Initial sequencing and analysis of the human genome | Nature
7 | Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring | Science
8 | The Protein Data Bank | Nucleic Acids Res
9 | Bioconductor: open software development for computational biology and bioinformatics | Genome Biology
10 | Exploration, normalization, and summaries of high density oligonucleotide array probe level data | Biostatistics
Results and Discussion (Cont’d)
Research productivity by country
Results and Discussion (Cont’d)
Research Productivity by Institute
Institute | Frequency
University of California | 1678
Harvard Medical School | 811
Stanford | 768
National Institutes of Health | 430
University of Washington | 400
Yale University | 373
University College London | 329
Massachusetts Institute of Technology | 310
Washington University | 290
University of Toronto | 287
Wellcome Trust Genome Campus | 256
University of Illinois | 252
University of Oxford | 248
University of Michigan | 240
University of Cambridge | 236
University of North Carolina | 235
Princeton University | 234
Baylor College of Medicine | 230
Columbia University | 229
Cornell University | 227
Visualization of author co-citation analysis
All author-based co-citation analysis
First author-based co-citation analysis
Summary and Future Work
• We analyzed the field of Bioinformatics by mining the full-text articles available in PubMed Central with text mining techniques
• We found that Bioinformatics has grown very fast and that collaboration among authors spreads widely across disciplines
Future work
• Identify research trends over time
• Combine community detection and topic modeling
• Author co-citation analysis vs. author collaboration analysis
• All author-based vs. first author-based vs. important contributor-based
• Compare to Web of Science data
References
• Nagarajan, M., Mohamed Idhris, L., Chellappandi, P., Kumaravel, J.P.S., and Premalatha, V. Information Use by Scholars in Bioinformatics: A Bibliometric View. 2011 International Conference on Information Communication and Management, IPCSIT vol. 16 (2011).
• Church, K., and Hanks, P. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, 16(1): 22-29 (1991).
• Patra, S.K., and Mishra, S. (2006) Bibliometric study of bioinformatics literature. Scientometrics, 67: 477-489.
• Zhao, D. (2006) Towards All-Author Co-Citation Analysis. Information Processing and Management, 42: 1578-1591.
• Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007.
• Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data, arXiv:cs.IR/0504036 v1, 11 April 2005.
• Brusic, V. (2007) The growth of bioinformatics. Briefings in Bioinformatics, 8(2): 69-70.
References
• Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D, Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux J L: Medical informatics and bioinformatics: a bibliometric study. IEEE Transactions on Information Technology in Biomedicine: a publication of the IEEE Engineering in Medicine and Biology Society 2007, 11(3): 237-243.
• Perez-Iratxeta C, Andrade-Navarro M A, Wren J D: Evolving research trends in bioinformatics. Briefings in Bioinformatics 2007, 8(2): 88-95.
• Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79: 109-129.
• Blei D, Ng A, Jordan M: Latent Dirichlet allocation. Journal of Machine Learning Research, January 2003, 3: 993-1022.
• Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. Journal of the American Society of Information Science and Technology 2011, doi: 10.1002/asi.21707.
Questions?
Thank you!