Mining Medical Mountains: How Bioinformatics Can Help Medical Science David Wishart University of Alberta
Transcript
Page 1: Grandrounds2004.ppt

Mining Medical Mountains: How Bioinformatics Can Help Medical Science

David Wishart

University of Alberta

Page 2: Grandrounds2004.ppt

The Library of Congress

• 120 million items in storage
• 54 million manuscripts
• 18 million books
• 12 million photographs
• 4.5 million maps
• 4.4 million technical reports
• 1.1 million PhD dissertations
• ~20 Terabytes of data

Page 3: Grandrounds2004.ppt

Some Numbers…

• 3 scientific journals in 1750
• 120,000 scientific journals today
• 500,000 medical articles/year
• 4,000,000 scientific articles/year
• 14,000,000 abstracts in PubMed derived from 4600 journals
• 3,307,998,701 web pages on Google
• 500,000,000,000,000 bytes on the Web

Page 4: Grandrounds2004.ppt

Some Numbers…

• A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer.

• Baasiri, R.A., Glasser, S.R., Steffen, D.L. & Wheeler, D.A. Oncogene 18, 7958-7965 (1999)

Page 5: Grandrounds2004.ppt

Some Graphs:

Page 6: Grandrounds2004.ppt

Multiplexed CE with Fluorescent detection

ABI 3700 96x700 bases

Page 7: Grandrounds2004.ppt

Genomes

• 5 vertebrates (human, mouse, rat, fugu)
• 2 plants (Arabidopsis, rice)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (plasmodium, guillardia)
• 4 fungi (S. cerevisiae, S. pombe)
• 140 bacteria and archaebacteria
• 1000+ viruses

Page 8: Grandrounds2004.ppt

The Human Genome

• 3.2 billion bases on 24 chromosomes

• 3,201,762,515 bases sequenced (99%)

• 23,531 - 31,609 genes (predicted)

• 50,000+ named genes (synonyms)

• 4000+ human diseases

• 850-1039 disease causing genes (ID’s)

Page 9: Grandrounds2004.ppt

A Tidal Wave of Data

Made worse by….

Page 10: Grandrounds2004.ppt

The Language of Biology

• The EGF receptor binds epidermal growth factor which triggers the phosphorylation of PLC-gamma followed by the binding and subsequent phosphorylation of Grb2 and SOS which leads to the formation of a Raf1-MEK complex which, in turn, leads to a p21ras auto-phosphorylation cascade. The complex then phosphorylates a MAP kinase which is transported to the nucleus via a nuclear transport signal which triggers the transcription of c-Fos, c-Myc and c-Jun which upon release in the rough ER are transported to…

Page 11: Grandrounds2004.ppt

How To Make Sense of This?

• How to acquire biological or medical knowledge from English text?

• How to build facts and relationships from scientific/medical articles?

• How to put 100+ years of useful data into readily accessible electronic repositories (the back fill problem)?

Page 12: Grandrounds2004.ppt

Some Solutions

• Text Mining…

• Create electronic repositories of abstracts and articles (PubMed/Entrez)

• Create glossaries & thesauri of terms

• Employ machine learning methods to parse electronic text to extract or interpret key pieces of “atomic” information (SVM, Naïve Bayes, Reference Point Logistics, etc.)

Page 13: Grandrounds2004.ppt

PubMed

http://www.ncbi.nlm.nih.gov/PubMed/

Page 14: Grandrounds2004.ppt

PubMed

• Allows users to search by journal, key words, titles etc.

• Uses MeSH (Medical Subject Headings) to allow automated search of synonyms (renal transplant = kidney transplantation)

• API available to query PubMed automatically and remotely

• Few users know how to use PubMed properly or to its full extent
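
A minimal sketch of what an automated, remote PubMed query can look like. It assumes NCBI's present-day E-utilities ESearch endpoint rather than the 2004 address on the slide, and the helper name build_esearch_url is hypothetical. The code only builds the request URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: the current NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Return an ESearch URL that asks PubMed for records matching `term`."""
    params = {
        "db": "pubmed",      # database to search
        "term": term,        # query in ordinary PubMed syntax
        "retmax": retmax,    # maximum number of record IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


# An author-plus-keyword search written in PubMed field-tag syntax.
url = build_esearch_url('"ouellette bf"[au] AND yeast')
print(url)
```

Fetching the URL (for example with urllib.request.urlopen) would return the matching PubMed IDs; NCBI asks automated clients to keep their request rate low.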

Page 15: Grandrounds2004.ppt

“ouellette bf” [au] AND yeast

Details

Page 16: Grandrounds2004.ppt
Page 17: Grandrounds2004.ppt

MeSH: Medical Subject Headings

("ouellette bf"[au] AND (("yeasts"[MeSH Terms] OR "saccharomyces cerevisiae"[MeSH Terms]) OR yeast[Text Word]))
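
The expansion above can be reproduced with a toy synonym table. This is only a sketch of the idea: MESH_MAP and expand_term are hypothetical names, the table holds a single hand-entered entry, and PubMed's real term mapping is far more elaborate.

```python
# Toy mapping from a free-text word to MeSH headings (hand-entered assumption).
MESH_MAP = {"yeast": ["yeasts", "saccharomyces cerevisiae"]}


def expand_term(word, mesh_map):
    """Expand a word into its MeSH headings OR the word as plain text."""
    headings = mesh_map.get(word.lower(), [])
    if not headings:
        # No known heading: search the word as text only.
        return f"{word}[Text Word]"
    mesh = " OR ".join(f'"{h}"[MeSH Terms]' for h in headings)
    return f"(({mesh}) OR {word}[Text Word])"


query = f'("ouellette bf"[au] AND {expand_term("yeast", MESH_MAP)})'
print(query)
```

With this one-entry table, the printed string is identical to the expanded query shown on the slide.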

Page 18: Grandrounds2004.ppt

Integrated Text/Sequence Searching with Entrez

Page 19: Grandrounds2004.ppt

PubCrawler

http://www.pubcrawler.ie/

Page 20: Grandrounds2004.ppt

PubCrawler

• Free "alerting" service that scans daily updates to the NCBI Medline (PubMed) and GenBank databases

• Lists new database entries that match search parameters (keywords, author names, etc.) specified by the user

• Results are presented as an HTML Web page (Entrez-like format)

• Can be downloaded or run as a service
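
The core of any such alerting service is a comparison between today's hits and the hits already reported. The sketch below illustrates that step only; it is not PubCrawler's code, the function name new_entries is hypothetical, and the record IDs are made-up placeholders.

```python
def new_entries(todays_ids, seen_ids):
    """Return IDs not reported before, and remember them for the next run."""
    fresh = [record_id for record_id in todays_ids if record_id not in seen_ids]
    seen_ids.update(fresh)
    return fresh


# Placeholder record IDs, invented for illustration.
seen = {"111", "222"}
today = ["222", "333", "444"]

alerts = new_entries(today, seen)
print(alerts)  # ['333', '444']
```

Running the same search again the next day with the updated `seen` set would report only records added since this run.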

Page 21: Grandrounds2004.ppt
Page 22: Grandrounds2004.ppt
Page 23: Grandrounds2004.ppt

MedMiner

http://discover.nci.nih.gov/textmining/filters.html

Page 24: Grandrounds2004.ppt

MedMiner

• A text miner that filters, extracts and organizes relevant sentences in the literature based on a gene, gene-gene or gene-drug query

• Combines GeneCards and PubMed searches with an integrated text filter

• L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein, (1999) BioTechniques 27:1210-1217.
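
Sentence-level filtering of this kind can be sketched in a few lines: split the text into sentences and keep those that mention both query terms. This is an illustration of the general idea, not MedMiner's actual filter; the function names are hypothetical and GENE1 and DRUGX are invented placeholder terms.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurring_sentences(text, term_a, term_b):
    """Keep only the sentences that mention both terms as whole words."""
    pattern_a = re.compile(rf"\b{re.escape(term_a)}\b", re.IGNORECASE)
    pattern_b = re.compile(rf"\b{re.escape(term_b)}\b", re.IGNORECASE)
    return [
        s for s in split_sentences(text)
        if pattern_a.search(s) and pattern_b.search(s)
    ]


# Invented toy text with placeholder gene and drug names.
abstract = (
    "GENE1 expression was measured in all samples. "
    "Treatment with DRUGX reduced GENE1 levels. "
    "DRUGX was well tolerated."
)

hits = cooccurring_sentences(abstract, "GENE1", "DRUGX")
print(hits)  # ['Treatment with DRUGX reduced GENE1 levels.']
```

A production filter would also need synonym handling and a sentence splitter that copes with abbreviations such as "et al." and "Fig.".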

Page 25: Grandrounds2004.ppt
Page 26: Grandrounds2004.ppt

MedGene

http://hipseq.med.harvard.edu/MEDGENE/login.jsp

Page 27: Grandrounds2004.ppt

MedGene

• A list of human genes associated with a particular human disease in ranking order
• A list of human genes associated with multiple human diseases in ranking order
• A list of human diseases associated with a particular human gene in ranking order
• A list of human genes associated with a particular human gene in ranking order
• The sorted gene list from other disease related high-throughput experiments, (i.e. micro-array

Page 28: Grandrounds2004.ppt
Page 29: Grandrounds2004.ppt

MedGene Performance

• Was able to identify >2400 genes associated with breast cancer in the literature

• Existing databases only list 260 genes (of which MedGene found 240)

• Could save hundreds of hours of literature searching & combing
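
The figures on this slide translate into two simple ratios. The arithmetic below uses only the numbers stated above (rounding ">2400" down to 2400); the variable names are mine.

```python
genes_found = 2400        # genes MedGene linked to breast cancer (">2400")
database_genes = 260      # genes listed in existing databases
database_recovered = 240  # database genes that MedGene also found

# Fraction of the known database genes that MedGene recovered.
recall_vs_databases = database_recovered / database_genes

# How many times larger the MedGene list is than the database list.
expansion = genes_found / database_genes

print(f"{recall_vs_databases:.1%}")  # 92.3%
print(f"{expansion:.1f}x")           # 9.2x
```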

Page 30: Grandrounds2004.ppt

PolySearch

Page 31: Grandrounds2004.ppt

PolySearch

• Searches over 14 million PubMed Records

• Searches against 1622 diseases (and synonyms)

• Searches using 9300 genes with 42,500 synonyms

• Assesses quality using SCI list of impact factors for 8600+ journals

Page 32: Grandrounds2004.ppt

PolySearch

• Supports PubMed text searching for gene & disease associations (user provides disease name)

• Automatically scores & identifies genes and searches for known SNPs or mutations against standard databases

• Grabs gene sequences and generates primers around SNPs

• Archives (MySQL database) or sends results as HTML page to user

Page 33: Grandrounds2004.ppt

Other Examples of Text or Web Mining

Page 34: Grandrounds2004.ppt

http://textomy.iit.nrc.ca/

Page 35: Grandrounds2004.ppt

Pre-BIND

• Donaldson et al. BMC Bioinformatics 2003 4:11

• Used Support Vector Machine (SVM) to scan literature for protein interactions

• Precision, accuracy and recall of 92% for correctly classifying protein interaction (PI) abstracts

• Estimated to capture 60% of all abstracted protein interactions for a given organism
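
Precision, recall and accuracy are three different ratios over the same confusion matrix, so it helps to see them side by side. The sketch below uses invented counts, chosen only so that all three come out at 92%; they are not the counts from the Pre-BIND paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # flagged abstracts that are real
    recall = tp / (tp + fn)                     # real abstracts that were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # all abstracts classified correctly
    return precision, recall, accuracy


# Hypothetical counts for 200 abstracts (assumption, for illustration only).
precision, recall, accuracy = classification_metrics(tp=92, fp=8, fn=8, tn=92)
print(precision, recall, accuracy)  # 0.92 0.92 0.92
```

The three values coincide here only because the invented test set is balanced and the two kinds of error are equally common; in general they differ.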

Page 36: Grandrounds2004.ppt

Proteome Analyst

• Uses Naïve Bayes methods in combination with sequence homology to identify “tokens” or nuggets of important information from text (titles, keywords, InterPro numbers and other data)

• Produces quantitative estimates (queryable reliability scores) of protein function, location, etc.
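
A token-based Naïve Bayes classifier of the general kind described above can be sketched in plain Python. This is not Proteome Analyst's implementation: the class name, the tokens and the two location labels are all hypothetical, and the training set is a four-example toy.

```python
import math
from collections import Counter, defaultdict


class TokenNaiveBayes:
    """Multinomial Naive Bayes over annotation tokens, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()             # training examples per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def fit(self, examples):
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict_proba(self, tokens):
        """Return a normalized probability for every label."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # class prior
            for t in tokens:
                if t in self.vocab:      # ignore tokens never seen in training
                    score += math.log((self.token_counts[label][t] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())   # subtract the max for numerical stability
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(weights.values())
        return {k: w / z for k, w in weights.items()}


# Invented toy training data: (tokens, subcellular location).
training = [
    (["signal_peptide", "secreted"], "extracellular"),
    (["signal_peptide", "glycoprotein"], "extracellular"),
    (["kinase", "atp_binding"], "cytoplasm"),
    (["kinase", "cytoskeleton"], "cytoplasm"),
]

model = TokenNaiveBayes()
model.fit(training)
scores = model.predict_proba(["signal_peptide", "glycoprotein"])
print(max(scores, key=scores.get), scores)
```

The normalized probabilities play the role of the reliability scores mentioned above: each prediction comes with a number that can be thresholded or queried.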

Page 37: Grandrounds2004.ppt

GenePublisher

• Processes raw genechip data and produces a publishable report in 1-2 hours of processor time

• Mines existing databases to build up or extract relationships

• Learns from previous analyses and remembers previous associations

http://www.cbs.dtu.dk/services/GenePublisher/

Page 38: Grandrounds2004.ppt

GenePublisher Output

Page 39: Grandrounds2004.ppt

Continuing Problems in Text Mining Biomedical Literature are…

Page 40: Grandrounds2004.ppt

A Serious Naming Problem

• Sonic Hedgehog
• Draculin
• Profilactin
• Knobhead
• Lunatic Fringe
• Fidgetin
• Mortalin
• Antiquitin
• Accelerin
• Cockeye
• Clootie Dumpling
• SnaFu
• Gleeful
• Bang Senseless
• Bride of Sevenless
• Crack
• Christmas Factor
• Orphanin

Page 41: Grandrounds2004.ppt

And Exotic Terminology…

• J. Med. Genetics 10, 1962-6 (1973) "Mobius Syndrome with Poland’s Anomaly."

• Heavy use of Eponyms (Werner’s syndrome, Down’s syndrome, Angelman’s syndrome, Creutzfeldt-Jakob disease, etc.)

Page 42: Grandrounds2004.ppt

Some Challenges

• How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently?

• How to ascribe and name a function, process or location consistently?

• How to describe interactions, partners, reactions and complexes?

• How to classify genes & proteins (a universal taxonomy of sequences and structures)?

Page 43: Grandrounds2004.ppt

Some Solutions

• Develop controlled or restricted vocabularies (IUPAC-like naming conventions)

• Create thesauri, central repositories or synonym lists (MeSH terms in PubMed)

• Work towards synoptic reporting and structured abstracting
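
A synonym list becomes useful once every variant is mapped to one canonical term. The sketch below shows that lookup with a single entry based on the "renal transplant = kidney transplantation" pair; the table, the extra variant "kidney transplant" and the function names are illustrative assumptions, not real MeSH data.

```python
# One-entry toy thesaurus: canonical term -> known variants (assumption).
THESAURUS = {
    "kidney transplantation": ["renal transplant", "kidney transplant"],
}


def build_lookup(thesaurus):
    """Flatten the thesaurus into a variant -> canonical term dictionary."""
    lookup = {}
    for canonical, variants in thesaurus.items():
        lookup[canonical.lower()] = canonical
        for variant in variants:
            lookup[variant.lower()] = canonical
    return lookup


def normalize(term, lookup):
    """Return the canonical form of a term, or the term itself if unknown."""
    return lookup.get(term.strip().lower(), term)


LOOKUP = build_lookup(THESAURUS)
print(normalize("Renal Transplant", LOOKUP))  # kidney transplantation
```

Unknown terms pass through unchanged, so the table can grow gradually without breaking existing searches.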

Page 44: Grandrounds2004.ppt

Synoptic or Structured Abstract

J Am Acad Dermatol. 2004 Mar;50(3):431-4.

Demand outstrips supply of US pediatric dermatologists: Results from a national survey.

Hester EJ, McNealy KM, Kelloff JN, Diaz PH, Weston WL, Morelli JG, Dellavalle RP.

BACKGROUND: The US pediatric dermatology workforce was last examined in 1986 when limited employment opportunity was found. OBJECTIVE: We sought to re-examine pediatric dermatology workforce issues. METHODS: US dermatology chairpersons and residency program directors were surveyed for: (1) agreement with pediatric dermatology workforce statements; and (2) pediatric dermatology faculty and fellow numbers. RESULTS: Respondents agreed that having a pediatric dermatologist or dermatologists on faculty is important, and that a shortage of pediatric dermatologists exists, but did not agree that increasing pediatric dermatology training requirements will increase this shortage. Almost half of the programs (45/94) employed a full-time pediatric dermatologist, and 24 programs had currently been recruiting a pediatric dermatologist for more than 1 year. Only 6 pediatric dermatology fellows were in training. CONCLUSION: Given that open pediatric dermatology faculty positions greatly exceed the number of fellows in training and that formal training requirements will be increasing, the shortage of pediatric dermatologists will likely continue.

Page 45: Grandrounds2004.ppt

GO-Gene Ontology

• To produce a controlled vocabulary that changes as biological knowledge changes

• Categorizes according to 1) molecular function; 2) biological process; and 3) cellular component

• Represents contributions and consensus opinions from multiple experts in various fields

• Aim is to have every known protein and gene annotated consistently

http://www.geneontology.org/
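
The three-way categorization can be pictured as a small data structure in which every annotation carries exactly one of the three aspects. The class and function names below are hypothetical, GENE1 is a placeholder gene, and the three terms are included purely as illustrations of each aspect.

```python
from dataclasses import dataclass
from enum import Enum


class Aspect(Enum):
    MOLECULAR_FUNCTION = "molecular function"
    BIOLOGICAL_PROCESS = "biological process"
    CELLULAR_COMPONENT = "cellular component"


@dataclass(frozen=True)
class Annotation:
    gene: str       # gene or protein being annotated
    term_id: str    # controlled-vocabulary identifier
    term_name: str  # human-readable term
    aspect: Aspect  # which of the three categories the term belongs to


def group_by_aspect(annotations):
    """Collect term names under each of the three aspects."""
    grouped = {aspect: [] for aspect in Aspect}
    for annotation in annotations:
        grouped[annotation.aspect].append(annotation.term_name)
    return grouped


# Illustrative annotations for a placeholder gene.
annotations = [
    Annotation("GENE1", "GO:0005524", "ATP binding", Aspect.MOLECULAR_FUNCTION),
    Annotation("GENE1", "GO:0006468", "protein phosphorylation", Aspect.BIOLOGICAL_PROCESS),
    Annotation("GENE1", "GO:0005634", "nucleus", Aspect.CELLULAR_COMPONENT),
]

for aspect, terms in group_by_aspect(annotations).items():
    print(f"{aspect.value}: {terms}")
```

Because the aspect is an enumerated value rather than free text, two annotators cannot describe the same category in different words, which is the point of a controlled vocabulary.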

Page 46: Grandrounds2004.ppt

NIH’s Medical Ontology Research Program

http://lhncbc.nlm.nih.gov/lhc/servlet/Turbine/template/home%2CHome.vm

Page 47: Grandrounds2004.ppt

MeSH

Page 48: Grandrounds2004.ppt

OMIM

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

Page 49: Grandrounds2004.ppt

DrugBank

http://redpoll.pharmacy.ualberta.ca/drugbank/

Page 50: Grandrounds2004.ppt

Bioinformatics

Medinformatics

Page 51: Grandrounds2004.ppt

Conquering the Mountain