Understanding the Field of Bioinformatics through Text Mining Techniques (Detecting Bioinformatics by Text Mining Techniques)

Min Song, PhD
Associate Professor
Department of Library and Information Science
Yonsei University

Outline
• Introduction and Background
• Research Problem
• Methods
  • Data Processing
  • Topic Modeling
  • Citation Analysis
  • Identification of Important Articles by PageRank
  • Visualization
• Results & Discussion
• Summary & Future Work

Introduction

• Bioinformatics has grown into a cross-disciplinary field and proliferated into new areas of the life sciences
• 400,000 biological researchers worldwide
• The sequencing industry is expected to grow from $1.5B to $100B in 20 years (NextGen Informatics, 2011)
• Increasing number of biological databases, including PubMed and PubMed Central
• Understanding the trends in and the structure of Bioinformatics is increasingly important
• Bibliometric analysis has been applied to Bioinformatics for this purpose (Glänzel et al., 2009; Bansard et al., 2007; Huang et al., 2010)

Research Problem
• Bibliometric analysis uses quantitative analysis and statistics to describe patterns of publication within a given field or body of literature (Osareh, 1996)
• Problems of Current Approaches
• Current bibliometric analysis relies primarily on Thomson's Web of Science product, which results in the following problems:
  • Citation data are processed manually
  • Incomplete coverage
  • Only citation analysis is used
  • Cannot handle big data

Image Reference: Wolfgang Glänzel and Bart De Moor, Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis, pp. 361, 368, KDD '07. ACM, San Jose, CA, August 2007.

Goal

• Detect the trends in and the structure of the field of Bioinformatics
• We introduce novel techniques to detect the knowledge structure of and trends in Bioinformatics using text mining techniques and automated citation analysis
• Mining PubMed Central full text with
  • topic modeling
  • word co-occurrence
  • named entity recognition
  • MeSH
• Author co-citation analysis
• Visualization

What is PubMed Central?
• PubMed Central (PMC) is the U.S. National Library of Medicine's digital archive of biomedical and life sciences journal literature
• Provides free and unrestricted access (XML format)
• Integrates journal literature with other valuable information resources in the NCBI database family (e.g., PubMed, Nucleotide, Protein)
• Launched in February 2000
• 383 journals, 1,512,652 articles, 4.3M unique visitors in April 2008

Citation Analysis

• Citation Graphs
• Link-based algorithms
  • PageRank

[Diagram labels: Documents; Quantify similarities; Text-based: term co-occurrence, topic modeling; Citation-based: co-citation, bibliographic coupling (BC); Combine; Representative publications]

Methods – Data Collections
1. Advanced Bioinformatics
2. Algorithms for Molecular Biology
3. Biochemistry
4. BioData Mining
5. Bioinformatics
6. Bioinformation
7. BMC Bioinformatics
8. BMC Genomics
9. BMC Systems Biology
10. Briefings in Functional Genomics & Proteomics
11. BMC Research Notes
12. Bulletin of Mathematical Biology
13. Cancer Informatics
14. Comparative and Functional Genomics
15. EURASIP Journal on Bioinformatics and Systems Biology
16. The EMBO Journal
17. Evolutionary Bioinformatics
18. Genome Biology
19. Genome Medicine
20. Genomics
21. Genome Integration
22. Journal of Biotechnology
23. Journal of Biomedical Semantics
24. Journal of Proteome Research
25. Journal of Proteomics
26. Journal of Computer-Aided Molecular Design
27. Journal of Computational Neuroscience
28. Journal of Molecular Biology
29. Journal of Molecular Modelling
30. Journal of Theoretical Biology
31. Mammalian Genome
32. Molecular & Cellular Proteomics
33. Molecular Systems Biology
34. Neuroinformatics
35. Pharmacogenetics and Genomics
36. Physiological Genomics
37. PLoS Computational Biology
38. PLoS Biology
39. PLoS Genetics
40. Protein Science
41. Proteomics
42. Source Code for Biology and Medicine
43. Statistical Methods in Medical Research
44. Theoretical Biology and Medical Modelling
45. Trends in Biochemical Sciences
46. Trends in Biotechnology
47. Trends in Genetics

Total: 20,869 articles from 47 journals

Overall Procedure of Our Approach

[Flow diagram labels, in slide order]
• Parse PubMed Central
• Citation Relational DB
• Text Relational DB
• Text Analysis
  • Word co-occurrence
  • MeSH term frequency
  • Topic Modeling with LDA
• Link Analysis
  • Ranking important articles by PageRank
• Detect Organization and Country with NER
• Author Co-citation Analysis
• Research Productivity Analysis

MeSH = Medical Subject Headings

Word co-occurrence analysis and MeSH term frequency
• Important concept identification by word co-occurrence
  • The most widely used measure of co-occurrence is mutual information (MI)
  • We use the log-likelihood ratio (LLR) because it is more appropriate than MI when the data mix high-frequency and low-frequency bigrams
• Important concept identification by MeSH terms
  • Counting the MeSH terms assigned to each article
  • MeSH terms are not assigned to PubMed Central articles
  • We map each PubMed Central article to its PubMed record and then extract the MeSH terms
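
As an illustration of the LLR scoring of bigrams, the sketch below computes the log-likelihood ratio (the G² statistic over a 2x2 contingency table) for adjacent word pairs. The slides do not show the implementation used in the study, so the function names `llr` and `score_bigrams`, the whitespace tokenization, and the toy sentence are all assumptions made for this example.

```python
import math
from collections import Counter


def _xlogx(x):
    """Return x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of (w1, w2); k12: w1 followed by another word;
    k21: another word followed by w2; k22: neither.
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(n) - rows - cols)


def score_bigrams(tokens):
    """Score every adjacent word pair in a token list by LLR."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter(tokens[:-1])   # how often each word starts a bigram
    second = Counter(tokens[1:])   # how often each word ends a bigram
    n = len(tokens) - 1            # total number of bigrams
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11
        k21 = second[w2] - k11
        k22 = n - k11 - k12 - k21
        scores[(w1, w2)] = llr(k11, k12, k21, k22)
    return scores


# Hypothetical toy input, not data from the study.
tokens = "gene expression data gene expression profiles gene ontology".split()
scores = score_bigrams(tokens)
best = max(scores, key=scores.get)
```

Note that G² is two-sided: a pair that co-occurs much less often than chance would also score high, so a real pipeline would typically also check that the observed count exceeds the expected count.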

Database Schema for a PubMed Central Citation DB

Article
  PK article_id
  journal_title, title, year, issue, pmid

Authorship
  PK author_id
  first_name, last_name

RelationAuthorArticle
  PK author_id
  PK article_id

Citation
  PK citation_id
  id_citation_from, id_citation_to

Affiliation
  PK affiliation_id
  person_id, article_id, country, organization

Fulltext
  PK pmid
  title, journal_title, abstract, introduction, methodology
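
To make the schema concrete, here is a minimal sketch that creates the six tables in SQLite and counts incoming citations per article. The table and column names come from the slide; the column types, the choice of SQLite, and the helper names `create_citation_db` and `citation_counts` are assumptions, since the slides do not name the database system or the data types.

```python
import sqlite3

# Column types are assumed; only the table and column names come from the slide.
SCHEMA = """
CREATE TABLE Article (
    article_id INTEGER PRIMARY KEY,
    journal_title TEXT, title TEXT, year INTEGER, issue TEXT, pmid INTEGER
);
CREATE TABLE Authorship (
    author_id INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT
);
CREATE TABLE RelationAuthorArticle (
    author_id INTEGER, article_id INTEGER,
    PRIMARY KEY (author_id, article_id)
);
CREATE TABLE Citation (
    citation_id INTEGER PRIMARY KEY,
    id_citation_from INTEGER, id_citation_to INTEGER
);
CREATE TABLE Affiliation (
    affiliation_id INTEGER PRIMARY KEY,
    person_id INTEGER, article_id INTEGER, country TEXT, organization TEXT
);
CREATE TABLE Fulltext (
    pmid INTEGER PRIMARY KEY,
    title TEXT, journal_title TEXT, abstract TEXT,
    introduction TEXT, methodology TEXT
);
"""


def create_citation_db(path=":memory:"):
    """Create the citation database and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def citation_counts(conn):
    """Return (cited article id, number of citations), most cited first."""
    query = (
        "SELECT id_citation_to, COUNT(*) FROM Citation "
        "GROUP BY id_citation_to "
        "ORDER BY COUNT(*) DESC, id_citation_to"
    )
    return conn.execute(query).fetchall()
```

With citations (1 → 3), (2 → 3), and (1 → 2) inserted into `Citation`, `citation_counts` lists article 3 with two citations ahead of article 2 with one.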

Topic Modeling
• Topic Modeling by LDA
  • We explore the salient topics in the core literature of Bioinformatics
  • We use Latent Dirichlet Allocation (LDA), proposed by Blei et al. (2003), for topic model generation
  • LDA is a generative model in which sets of observations are explained by unobserved groups, which accounts for the similarity of documents in the collection
  • In LDA, each document is described as a random mixture over latent topics, where each topic is a discrete distribution over the vocabulary of the collection

D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
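
The mixture structure described above can be sketched with a small collapsed Gibbs sampler. This is only an illustration: the slides do not say which inference method or toolkit was used, so the sampler, the hyperparameters `alpha` and `beta`, the iteration count, and the function names `lda_gibbs` and `top_words` are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (vocab, phi, theta): phi[k] is topic k's distribution over the
    vocabulary, theta[d] is document d's mixture over topics.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    num_words = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * num_words for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token, then resample its topic given all others.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + num_words * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    phi = [
        [(topic_word[k][v] + beta) / (topic_total[k] + num_words * beta)
         for v in range(num_words)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return vocab, phi, theta


def top_words(vocab, phi, n=5):
    """List the n most probable words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in phi
    ]
```

Calling `lda_gibbs` on a list of token lists and passing the result to `top_words` yields one ranked word list per topic, the same shape as the per-topic word lists reported in the results slides. For a corpus of about 20,000 full-text articles, an optimized library implementation would be the practical choice.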

NER-based Detection of Organization and Country
• We apply a Named Entity Recognition (NER) technique to identify the country and organization from the text

Citation Analysis
• Build a citation network from the datasets
  • 990,000 citation nodes from about 20,000 papers
• Apply the PageRank algorithm to the network to identify the important articles

Citation Network (Complexity and Social Networks, 2012)

PageRank - definition
• u: a web page
• F_u: set of pages u points to
• B_u: set of pages that point to u
• N_u = |F_u|: the number of links from u
• c: a factor used for normalization

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

• The equation is recursive, but it can be computed by starting with any set of ranks and iterating the computation until it converges.

• The definition corresponds to the probability distribution of a random walk on the web graph.
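
The iterative computation can be sketched as follows for a citation graph, where a link from u to v means that paper u cites paper v. The `damping` parameter and the uniform redistribution of rank from papers with no outgoing citations are assumptions that go beyond the simplified equation on the slide; they are commonly added so that the iteration converges on graphs with dead ends. With `damping=1.0` and no dead ends, each update is exactly the slide's formula with c = 1. The function name `pagerank` and the toy graph are hypothetical.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Iteratively compute PageRank for a directed graph.

    links maps each node u to the nodes it points to (F_u in the slide).
    Returns a dict of ranks that sum to 1.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    # De-duplicate outgoing links so that N_u = |F_u|.
    out = {u: list(dict.fromkeys(links.get(u, ()))) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread uniformly (assumption).
        dangling = sum(rank[u] for u in nodes if not out[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])  # R(v) / N_v
                for u in out[v]:
                    new_rank[u] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: papers A and B cite C, and C cites A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph, C is cited by two papers and ranks first; A is cited only by the highly ranked C and comes second; B is never cited and ranks last.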

Results and Discussion
• Term Co-occurrence Analysis

Keywords with High Ranked Word Co-occurrence

Keyword | Word co-occurrence (LLR score)
Gene | gene expression (36947.5), gene ontology (4729.7), expressed genes (4115.5), genes involved (3423.9), gene regulation (1314.1)
Genome | genome wide (15485.4), whole genome (5401.7), human genome (2950.3), genome sequence (1821.2), functional genomics (1805.4)
Expression | expression patterns (4231.7), expression profiles (6517.0), expression data (3546.1), expression levels (3187.4)
Data | data sets (6593.5), microarray data (6305.9), expression data (3546.1)
Protein | protein interaction (4824.8), protein interactions (3186.5), protein coding (2841.8), protein protein (2719.8)
Algorithm | clustering algorithm (676.0), clustering algorithms (585.0), new algorithm (502.0), proposed algorithm (416.7), alignment algorithms (266.3)
Database | public databases (1309.8), relational database (1296.7), database search (363.4)
Computer | computer simulations (538.7), computer program (317.2), computer aided (278.6), computer science (223.1), computational model (221.9)

Results and Discussion (Cont’d)

gene - expression 36947.5

amino - acid 16483.9

genome - wide 15485.4

high - throughput 14185.2

large - scale 10554

binding - sites 9450.1

factor - transcription 8580.7

saccharomyces - cerevisiae 7867.8

E - coli 6849.4

expression - profiles 6517

microarray - data 6305.9

expression - patterns 4231.7

expression - levels 3187.4

Top Ranked Word Pairs by LLR

Results and Discussion (Cont’d)
• Of the 20,869 documents, 19,954 have corresponding MEDLINE records (95.6% matching). Of these 19,954 documents, 8,412 have MeSH terms (42.2%)

MeSH Term | Frequency
Animals | 5178
Humans | 4883
Computational Biology | 3070
Algorithms | 2980
Gene Expression Profiling | 2702
Oligonucleotide Array Sequence Analysis | 2192
Software | 2154
Molecular Sequence Data | 1868
Models, Biological | 1579
Computer Simulation | 1568
Mice | 1511
Sequence Analysis, DNA | 1489
Base Sequence | 1374
Genomics | 1344
Evolution, Molecular | 1336
Databases, Genetic | 1325
Models, Genetic | 1289
Sequence Alignment | 1278
Proteins | 1135

Results and Discussion (Cont’d) - Topic Modeling

Top 20 words per topic:

Topic 1: identification, signaling, using, cerevisiae, saccharomyces, genes, small, non, alignment, dna, cancer, yeast, network, genome, system, expression, genomic, activity, screening, specific
Topic 2: model, gene, mapping, protein, human, structural, between, computational, binding, based, genomes, structure, biology, tool, interactions, domains, role, cell, new, length
Topic 3: human, cells, detection, pathway, protein, from, dna, analysis, stem, elegans, recognition, structure, caenorhabditis, evolution, complex, gene, 1, induced, nuclear, strand
Topic 4: expression, profiling, regulatory, specific, mouse, transcriptional, molecular, dynamic, regulation, evolution, cancer, genes, comparative, sequence, early, support, discovery, proteins, machine, during
Topic 5: data, time, information, protein, classification, from, high, analysis, mass, microarray, throughput, based, algorithm, sequence, expressed, spectrometry, database, differentially, identifying, pcr
Topic 6: transcription, genomic, evolutionary, prediction, factor, sites, analysis, dna, coli, gene, genome, escherichia, acid, copy, number, binding, organization, evolution, estimation, arabidopsis
Topic 7: expression, gene, analysis, data, using, genes, microarray, control, from, human, cell, c, case, assessment, size, network, multiple, quality, transcriptional, cells
Topic 8: analysis, protein, networks, interaction, methods, based, genomics, web, genome, biology, genetic, biological, hiv, systems, sequence, data, bayesian, tool, structure, approach
Topic 9: gene, genetic, new, system, metabolism, chromosome, zebrafish, functional, annotation, open, associated, integrating, reveals, life, bacteria, transcriptome, among, loss, mammalian, microarrays
Topic 10: genome, using, gene, from, wide, sequences, data, method, large, whole, rna, disease, networks, pathways, short, drosophila, scale, alternative, regions, development

Results and Discussion (Cont’d) - Topic Modeling


Relationship between a paper and its citation


Publication productivity by year


Relationship between an author and the number of citations received

Results and Discussion (Cont’d)
• Important Articles Identified by PageRank

Rank | Title | Journal Title
1 | Gapped BLAST and PSI-BLAST: A new generation of protein database search programs | Nucleic Acids Res
2 | Basic local alignment search tool | J Mol Biol
3 | Gene ontology: tool for the unification of biology. The Gene Ontology Consortium | Nat Genet
4 | CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice | Nucleic Acids Res
5 | R: A language and environment for statistical computing | Book
6 | Initial sequencing and analysis of the human genome | Nature
7 | Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring | Science
8 | The Protein Data Bank | Nucleic Acids Res
9 | Bioconductor: open software development for computational biology and bioinformatics | Genome Biology
10 | Exploration, normalization, and summaries of high density oligonucleotide array probe level data | Biostatistics


Research productivity by country


Research Productivity by Institute

Institute | Frequency
University of California | 1678
Harvard Medical School | 811
Stanford | 768
National Institutes of Health | 430
University of Washington | 400
Yale University | 373
University College London | 329
Massachusetts Institute of Technology | 310
Washington University | 290
University of Toronto | 287
Wellcome Trust Genome Campus | 256
University of Illinois | 252
University of Oxford | 248
University of Michigan | 240
University of Cambridge | 236
University of North Carolina | 235
Princeton University | 234
Baylor College of Medicine | 230
Columbia University | 229
Cornell University | 227

Visualization of author co-citation analysis

All author-based co-citation analysis

First author-based co-citation analysis

Summary and Future Work
• We analyzed the field of Bioinformatics by mining the full-text articles available in PubMed Central with text mining techniques
• We found that Bioinformatics has grown very rapidly and that collaboration among authors spreads widely across disciplines

Future work
• Identify research trends over time
• Combine community detection and topic modeling
• Author co-citation analysis vs. author collaboration analysis
  • All author-based vs. first author-based vs. important contributor-based
• Compare to Web of Science data

References
• Nagarajan M., Mohamed Idhris L., Chellappandi P., Kumaravel J.P.S. and Premalatha V. Information Use by Scholars in Bioinformatics: A Bibliometric View. 2011 International Conference on Information Communication and Management, IPCSIT vol. 16 (2011).
• Church, K., and Hanks, P. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, Vol 16:1, pp. 22-29 (1991).
• Patra, S. K., Mishra, S. (2006) Bibliometric study of bioinformatics literature. Scientometrics, 67: 477–489.
• Zhao, D. (2006) Towards All-Author Co-Citation Analysis. Information Processing and Management, 42: 1578-1591.
• Butler, L. (2006) RQF Pilot Study Project – History and Political Science Methodology for Citation Analysis, November 2006, accessed from: http://www.chass.org.au/papers/bibliometrics/CHASS_Methodology.pdf, 15 Jan 2007.
• Belew, R.K. (2005) Scientific impact quantity and quality: Analysis of two sources of bibliographic data. arXiv:cs.IR/0504036 v1, 11 April 2005.
• Brusic, V. (2007) The growth of bioinformatics. Briefings in Bioinformatics, 8(2): 69-70.
• Bansard Y, Rebholz-Schuhmann D, Cameron G, Clark D, van Mulligen E, Beltrame E, Barbolla E, Hoyo D., Martin-Sanchez H, Milanesi L, Tollis I, van der Lei J, Coatrieux J L: Medical informatics and bioinformatics: a bibliometric study. IEEE Transactions on Information Technology in Biomedicine 2007, 11(3): 237-243.
• Perez-Iratxeta C, Andrade-Navarro M A, Wren J D: Evolving research trends in bioinformatics. Briefings in Bioinformatics 2007, 8(2): 88-95.
• Glänzel W, Janssens F, Thijs B: A comparative analysis of publication activity and citation impact based on the core literature in bioinformatics. Scientometrics 2009, 79: 109-129.
• Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993–1022, January 2003.
• Huang H, Andrews J, Tang J: Citation characterization and impact normalization in bioinformatics journals. Journal of the American Society for Information Science and Technology 2011, doi: 10.1002/asi.21707

Questions?
• Thank you!