Applications of Text and Data Mining of biomedical databases
Miguel Andrade, Computational Biology & Data Mining group, Max Delbrück Center for Molecular Medicine
Gene structures
Gene expression
Protein sequences
Protein databases
Literature databases
Biological predictions
Human disease
001001001000100101011110110110
AGCTGGTACGAAGATGTCTCGCA
MLVPIEKAEVPRYILKTEFRKAILTS
In a phosphorylation dependent ma
Molecular Biology databases
Protein and nucleotide sequences (UniProt, Entrez)
Protein domains (PFAM, SMART)
Structures (PDB)
Diseases (OMIM)
Gene expression (GEO)
Bibliography: records (MEDLINE), full text (PubMed Central)
PubMed, compressed XML: 17GB, 23M items (exhaustive back to 1966, oldest from 1809)
PubMed Central open access subset: 26GB of raw XML files (text only), 8GB compressed, 2.6M items
1 Human Genome: 320GB
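Files of this size are usually processed as a stream rather than loaded whole. The following is a minimal sketch, assuming the records are stored as `PubmedArticle` elements in gzip-compressed XML; the function names and the file path passed to them are hypothetical, and only the Python standard library is used.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def count_articles(fileobj):
    """Count PubmedArticle elements in an XML stream without keeping them in memory."""
    count = 0
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "PubmedArticle":
            count += 1
            elem.clear()  # drop the parsed record's content to keep memory use low
    return count


def count_articles_in_gzip(path):
    """Hypothetical helper: open a gzip-compressed XML file and count its records."""
    with gzip.open(path, "rb") as handle:
        return count_articles(handle)
```

The same loop could extract titles or abstracts instead of counting, by reading child elements before calling `elem.clear()`.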
[Diagram: databases and their annotation labels. Databases: MEDLINE, Entrez Gene, UniProt, NetAffx, UniGene, ProDom, PDB, OMIM, GEO. Labels: authors, words, MeSH, KW, GO, fold.]
MedlineRanker
http://cbdm.mdc-berlin.de/tools/medlineranker/
Rank MEDLINE according to a topic
Fontaine et al. (2009) Nucleic Acids Research
Jean-Fred Fontaine
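Ranking abstracts by topic can be illustrated with a small word-weighting scheme. The slide does not describe the scoring method, so the sketch below is an assumption: it learns a smoothed log-odds weight per word from a set of topic abstracts and a set of background abstracts, then scores each candidate abstract by the sum of the weights of its words. All function names and the toy abstracts are hypothetical.

```python
import math
import re
from collections import Counter


def words(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train_weights(topic_abstracts, background_abstracts):
    """Smoothed log-odds of each word appearing in topic vs. background abstracts."""
    topic_df = Counter(w for a in topic_abstracts for w in words(a))
    back_df = Counter(w for a in background_abstracts for w in words(a))
    n_topic, n_back = len(topic_abstracts), len(background_abstracts)
    weights = {}
    for w in set(topic_df) | set(back_df):
        p_topic = (topic_df[w] + 1) / (n_topic + 2)  # add-one smoothing
        p_back = (back_df[w] + 1) / (n_back + 2)
        weights[w] = math.log(p_topic / p_back)
    return weights


def rank(candidates, weights):
    """Return (score, abstract) pairs, highest score first."""
    scored = [(sum(weights.get(w, 0.0) for w in words(a)), a) for a in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy usage with invented abstracts
topic = ["stem cell differentiation in mouse embryo", "embryonic stem cell pluripotency"]
background = ["protein folding kinetics", "bacterial genome sequencing"]
ranked = rank(["kinetics of protein folding", "pluripotency of stem cell lines"],
              train_weights(topic, background))
```

Words unseen in training contribute a weight of zero, so they neither raise nor lower an abstract's rank.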
Génie
http://cbdm.mdc-berlin.de/tools/genie/
Ranks a set of genes from a whole genome according to a topic
Fontaine et al. (2011) Nucleic Acids Research
Human
PESCADOR
http://cbdm.mdc-berlin.de/tools/pescador/
Extract interactions and filter by concepts
Barbosa-Silva et al. (2010) BMC Bioinformatics
Adriano Barbosa
Barbosa-Silva et al. (2011) BMC Bioinformatics
Co-occurrence types in PESCADOR
Type 1: Term + [Biointeraction] + Term
Type 2: [Biointeraction] + Term + Term + [Biointeraction]
Type 3: Term + Term
Type 4: co-occurrence in abstract
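The four co-occurrence types can be sketched as a small classifier. This is an illustration under stated assumptions, not PESCADOR's implementation: terms are single tokens, Types 1 to 3 are judged within one sentence, Type 2 is read as a biointeraction word before or after both terms, and the `BIOINTERACTIONS` vocabulary is a hypothetical stand-in for a real dictionary.

```python
import re

# Hypothetical biointeraction vocabulary, for illustration only
BIOINTERACTIONS = {"binds", "phosphorylates", "activates", "inhibits", "interacts"}


def tokenize(text):
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def split_sentences(abstract):
    return [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]


def classify_sentence(sentence, term_a, term_b):
    """Return 1, 2 or 3 if both terms occur in the sentence, else None."""
    tokens = tokenize(sentence)
    a, b = term_a.lower(), term_b.lower()
    if a not in tokens or b not in tokens:
        return None
    first, last = sorted((tokens.index(a), tokens.index(b)))
    positions = [k for k, t in enumerate(tokens) if t in BIOINTERACTIONS]
    if any(first < k < last for k in positions):
        return 1  # Term + [Biointeraction] + Term
    if positions:
        return 2  # biointeraction outside the pair of terms
    return 3      # Term + Term, no biointeraction word


def classify_abstract(abstract, term_a, term_b):
    """Return the strongest (lowest-numbered) type found, 4 for abstract-only, else None."""
    found = [classify_sentence(s, term_a, term_b) for s in split_sentences(abstract)]
    found = [t for t in found if t is not None]
    if found:
        return min(found)
    tokens = tokenize(abstract)
    if term_a.lower() in tokens and term_b.lower() in tokens:
        return 4  # terms appear in different sentences of the same abstract
    return None
```

Returning the lowest type number treats Type 1 as the strongest evidence of an interaction, which is a design choice of this sketch.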
Worldwide scientific publishing activity
Perez-Iratxeta and Andrade (2002) Science
Approximate number of publications for the years 1996–2001 per million inhabitants, by country (map legend: 10,000; 1,000; 100; 10; 1)
Ratio of publications for 1996–2001 / 1989–95, by country (map legend: +++, ++, +, =, -, --, ---)
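The two quantities shown on the maps are simple ratios. A minimal sketch follows; the function names are hypothetical and the numbers in the usage lines are invented for illustration, not taken from the study.

```python
def per_million_inhabitants(publications, population):
    """Publications per million inhabitants."""
    return publications / (population / 1_000_000)


def period_ratio(pubs_1996_2001, pubs_1989_1995):
    """Ratio of publication counts between the two periods."""
    return pubs_1996_2001 / pubs_1989_1995


# Invented example: a country with 10 million inhabitants
density = per_million_inhabitants(5000, 10_000_000)
growth = period_ratio(5000, 4000)
```

Note that the two periods differ in length (six and seven years), so a ratio of 1 does not mean an unchanged yearly output.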
Find referees
peer2ref
Andrade-Navarro et al. (2012) BioData Mining
Carolina Perez-Iratxeta
(OHRI-Ottawa)
http://www.ogic.ca/peer2ref/
Gareth Palidwor (OHRI-Ottawa)
http://www.ogic.ca/mltrends/
Graph historical term usage in MEDLINE
MLTrends
Palidwor and Andrade-Navarro (2010) Journal of Biomedical Discovery and Collaboration
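Graphing term usage over time comes down to computing, for each year, the fraction of records that mention the term. The sketch below assumes records are available as `(year, text)` pairs and uses plain substring matching; both are simplifying assumptions, and `term_trend` is a hypothetical name rather than part of MLTrends.

```python
from collections import defaultdict


def term_trend(records, term):
    """Map each year to the fraction of that year's records containing the term."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for year, text in records:
        totals[year] += 1
        if needle in text.lower():  # case-insensitive substring match
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Invented toy records
records = [
    (1990, "gene therapy trial"),
    (1990, "protein structure"),
    (2000, "gene therapy vectors"),
    (2000, "gene therapy safety"),
]
trend = term_trend(records, "gene therapy")
```

Normalising by the yearly total matters because the number of indexed records grows over time, so raw counts alone would rise for almost any term.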