Linking Diseases and Genes through Informatics Knowledge Bases and Ontologies Joyce A. Mitchell, Ph.D. National Library of Medicine University of Missouri

Dec 28, 2015

Gavin Rose

Linking Diseases and Genes through Informatics Knowledge Bases and Ontologies

Joyce A. Mitchell, Ph.D.

National Library of Medicine

University of Missouri

Research Collaborators

Olivier Bodenreider, M.D., Ph.D.
Alexa T. McCray, Ph.D.
Allen C. Browne

Research Goals

Investigating methods of connecting the disease and genomic information.

Overall goals are to:
– Overcome difficulties traversing multiple information resources
– Examine coverage of Unified Medical Language System® (UMLS®), Gene Ontology™ (GO), LocusLink-OMIM
– Develop methods to use ontologies more effectively
– Present data in understandable manner

Background – UMLS

NLM developed, maintains

Purpose: facilitate retrieval & integration of information from multiple biomedical sources

Interrelates 60 biomedical terminologies
– MeSH, SNOMED, Read Codes, ICD, etc.
– No vocabulary focused on molecular biology

1.5 million English terms; 800,000 concepts; 134 Semantic Types; 54 Semantic Relationships

Background – Gene Ontology

GO Consortium developed, maintains

Purpose:
– promoting cross-species methodologies for functional comparisons
– Allows annotation of molecular information on genes, gene products
– “an essential start to creating a shared language of biology” **

Focused on
– molecular function (5626 terms)
– biological processes (4677 terms)
– cellular components (1077 terms)

Two semantic relations (is-a and part-of)

**Genome Research 2001; 11:1425-33.

Background - LocusLink

Curated, gene-centered resource of National Center for Biotechnology Information (NLM)

Gene names, gene product names, gene product functions, and reference sequences (DNA, RNA, protein)

Associates phenotype (diseases) to the genotype via Online Mendelian Inheritance in Man (OMIM)

Online links to major bioinformatics knowledge bases and the literature

Specific Questions

This study looked at coverage in UMLS of:

1. 1244 genes associated with human diseases

2. 1702 diseases associated with the genes

3. 11,380 Gene Ontology terms

4. 38,832 genes/gene products in GO database (141,071 names)

5. Associations of genes and their functions in UMLS

6. Representation of gene function in GO compared to the UMLS

Methods

LocusLink query:
– human genes whose sequence is known and associated with disease (1244 loci)

LocusLink data:
– Genes/gene products (official names, synonyms, symbols)
– Phenotypes (diseases) (1702 diseases)

GO data:
– all concepts (ontology terms), excluding obsolete terms (11,380 terms)
– Gene products from all species (134,646 unique names, 38,832 genes)

Methods

LocusLink and GO terms mapped to UMLS concepts
– normalization used
– mappings constrained by semantic type

LocusLink loci studied for relationships in UMLS
– Gene/GP – phenotype
– Gene/GP – molecular function
– Gene/GP – biological process
– Gene/GP – cellular component

For specific genes compared annotations in GO to representation in UMLS
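The mapping step above (normalize a term, look it up, keep only hits of an acceptable semantic type) can be sketched roughly as follows. This is a minimal illustration, not the study's pipeline: the `normalize` function is a much simpler stand-in for whatever normalization was actually used, and `TOY_INDEX`, its concept identifiers, and `map_term` are hypothetical names invented for the example.

```python
import re

def normalize(term: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the words.

    Assumption: a simplified stand-in for the normalization the slides mention.
    """
    words = re.sub(r"[^a-z0-9 ]+", " ", term.lower()).split()
    return " ".join(sorted(words))

# Hypothetical miniature index: normalized string -> [(concept id, semantic type)].
# The concept ids are made up; only the semantic type names are real UMLS types.
TOY_INDEX = {
    normalize("Cystic Fibrosis"): [("C-TOY-1", "Disease or Syndrome")],
    normalize("Neurofibromin 2"): [("C-TOY-2", "Amino Acid, Peptide, or Protein")],
    normalize("Cell Proliferation"): [("C-TOY-3", "Cell Function")],
}

def map_term(term, allowed_types, index=TOY_INDEX):
    """Return concept ids whose normalized name matches and whose type is allowed."""
    hits = index.get(normalize(term), [])
    return [cid for cid, stype in hits if stype in allowed_types]

# Word order and punctuation no longer matter after normalization.
print(map_term("Fibrosis, Cystic", {"Disease or Syndrome"}))
# The semantic-type constraint rejects a lexical match of the wrong kind.
print(map_term("Cystic Fibrosis", {"Cell Function"}))
```

The semantic-type filter is what keeps a disease name from being mapped to, say, a protein concept that happens to share the same string.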

Results - 1

For 1244 genes from LocusLink
– 18% found in the UMLS

Name type             Found  Count
Official gene name    20%    244/1244
Official gene symbol  16%    200/1244
Alias symbol          15%    394/2669
Gene product          18%    266/1460
Preferred product     18%    266/1460
Alias protein         24%    339/1425

Results - 2

For 1702 phenotypes (diseases) corresponding to 1244 genes
– 34% found in the UMLS (575/1702)

Most frequent single gene diseases covered
– Huntington Disease
– Cystic Fibrosis
– Marfan Syndrome
– Phenylketonuria
– Achondroplasia

Results - 3

GO terms found in MeSH: 2764 terms
GO terms found in SNOMED: 1366 terms

GO terms found overall: 27% 3062/11,380

Category            Found  Count
Molecular function  44%    2435/5626
Biological process  5%     256/4677
Cellular component  35%    370/1077

Results - 4

For 134,646 unique gene names in GO database

Full name 11% 4392/38,832

Symbol 2% 1167/60,381

Synonym 6% 1964/35,433

Results - 5

LocusLink – UMLS Relationship Categories found overall: 72%

Genes & gene products  Found  Count
Phenotype              64%    754/1182
M. Function            85%    1192/1409
B. Process             61%    762/1240
C. component           76%    841/1107
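The slide does not state how the overall 72% was derived. One plausible reading, and it is only an assumption, is that it pools the found and total counts of the four relationship categories. The short check below recomputes the per-category percentages and the pooled figure from the counts reported on the slide; the variable and function names are illustrative.

```python
# Counts copied from the slide: category -> (found, total).
counts = {
    "Phenotype": (754, 1182),
    "M. Function": (1192, 1409),
    "B. Process": (762, 1240),
    "C. component": (841, 1107),
}

def percent(found: int, total: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * found / total)

per_category = {name: percent(f, t) for name, (f, t) in counts.items()}

# Assumption: the overall figure pools all four categories.
pooled_found = sum(f for f, _ in counts.values())
pooled_total = sum(t for _, t in counts.values())
overall = percent(pooled_found, pooled_total)

for name, pct in per_category.items():
    print(f"{name}: {pct}%")
print(f"Pooled: {pooled_found}/{pooled_total} = {overall}%")
```

Under that assumption the pooled counts are 3549/4938, which rounds to the 72% shown on the slide.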

Results - 5

Type of Relationship
– Associative 613
– Co-occurrence 3353
– Hierarchical 1168

G/GP and      Assoc  Co-oc  Hier
Phenotype     275    724    5
M. Function   206    1069   933
B. Process    57     737    147
C. Component  75     823    83

Results - 6

Representation of gene function in GO compared to the UMLS

Neurofibromin 2 – merlin in GO

[Diagram: GO terms annotated to MERL_HUMAN. Node labels, in extraction order:]
Gene Ontology
Cellular Component | Biological Process | Molecular Function
Cell
Membrane | Intracellular | Cell growth and/or maintenance
Cytoplasm | Plasma Membrane | Cell Proliferation | Obsolete
Negative control of cell proliferation | Structural Protein | Tumor Suppressor
Cytoskeleton
MERL_HUMAN

Proteins

Neoplasm Proteins Cell Cycle Proteins Proteins by Body Part

Tumor Suppressor Proteins Membrane Proteins

Neurofibromin 2

Growth Suppressor Proteins

Merlin, Drosophila

Discussion

Best & Worst Mappings

Best mapping categories
– Molecular function (GO) 44%
– Cellular component (GO) 35%
– Phenotype (LL) 34%

Worst mapping categories
– Gene synonym (GO) 6%
– Biological process (GO) 5%
– Gene symbol (GO) 2%

Only 34% of diseases?

In OMIM-LL, diseases are subdivided by genetic causes but not in UMLS

E.g. Limb Girdle Muscular Dystrophy
– LGMD is represented in UMLS
– A SNOMED term
– in MeSH it is an entry term for muscular dystrophies
– MeSH notes for MD: A general term for a group of inherited disorders which are characterized by progressive degeneration of skeletal muscles (ed, 2000)

Limb Girdle Muscular Dystrophy – genetic types

LGMD type  Gene Name   LGMD type  Gene Name
1A         Myotilin    2C         Sarcoglycan-gamma
1B         Lamin A/C   2D         Sarcoglycan-alpha
1C         Caveolin-3  2E         Sarcoglycan-beta
1D         Unknown     2F         Sarcoglycan-delta
2A         Calpain-3   2G         Telethonin
2B         Dysferlin   2H         TRIM32
                       2I         Fukutin-related protein

Only 5% of Biological Processes?

Only 256 of the biological processes mapped to terms in UMLS.

In GO, processes are elaborated & organism specific

Example:
UMLS - Mitotic spindle
GO
– Mitotic spindle assembly
– Mitotic spindle assembly (sensu Saccharomyces)
– Mitotic spindle assembly (sensu Fungi)
– Mitotic spindle checkpoint
– Mitotic spindle elongation
– Mitotic spindle orientation
– Mitotic spindle positioning
– Mitotic spindle positioning and orientation

Why so few gene names and synonyms mapped?

Official gene names have metadata and comments.
– dystrophin (muscular dystrophy, Duchenne and Becker types), includes DXS143, DXS164, DXS206, DXS230, DXS239, DXS268, DXS269, DXS270, DXS272

No single source has all names and synonyms

GO synonym field contains IPI number for well known genes, does not match UMLS (useful cross reference but not a synonym)

Symbols are short acronyms and match poorly
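To see why such names match poorly as raw strings, consider stripping the embedded comments before lookup. The sketch below is purely illustrative and is not described in the slides: `strip_metadata` is a hypothetical helper that drops a trailing "includes ..." clause and any parenthetical comment, which would also remove GO's "(sensu ...)" qualifiers.

```python
import re

def strip_metadata(official_name: str) -> str:
    """Remove a trailing ', includes ...' clause and parenthetical comments.

    Hypothetical cleanup step for illustration only.
    """
    # Keep only the text before the first ", includes" clause, if present.
    name = re.split(r",\s*includes\b", official_name, maxsplit=1)[0]
    # Drop parenthetical comments such as "(muscular dystrophy, ...)".
    name = re.sub(r"\s*\([^)]*\)", "", name)
    return name.strip()

raw = ("dystrophin (muscular dystrophy, Duchenne and Becker types), "
       "includes DXS143, DXS164, DXS206")
print(strip_metadata(raw))
print(strip_metadata("Mitotic spindle assembly (sensu Fungi)"))
```

A cleanup like this trades precision for recall: the stripped string is more likely to match a vocabulary entry, but the comment it discards sometimes carries the distinction that matters.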

Summary 1

UMLS needs improvement in molecular biology domain but has considerable content:
– 27% of GO concepts map
– 34% of single gene diseases
– Existing UMLS terms come primarily from MeSH and SNOMED

Overall, positive mapping for 13,000 terms

Summary continued

If the terms are in UMLS, it is possible to find a relationship between genes and phenotypes and gene function much of the time.

UMLS does better with the human genes (20%+) than with genes from all organisms (11%)

UMLS and GO representations complement each other.