P-POD The Princeton Protein Orthology Database Literature Discussion Tim Hulsen 2008-05-08

Dec 19, 2015


P-POD

The Princeton Protein Orthology Database

Literature Discussion

Tim Hulsen 2008-05-08


P-POD - Manuscript

The Princeton Protein Orthology Database (P-POD): a comparative genomics analysis tool for biologists

Heinicke S (1,*), Livstone MS (1,*), Lu C (1,*), Oughtred R (1,*), Kang F (1), Angiuoli SV (2,3), White O (2), Botstein D (1), Dolinski K (1)

PLoS ONE. 2007 Aug 22; 2(8): e766

PubMed ID 17712414

(1) Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America

(2) The Institute for Genomic Research, Rockville, Maryland, United States of America

(3) Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America

(*) These authors contributed equally to this work


P-POD - Introduction

• Existing: many biological databases that provide comparative genomics information and tools

• None of these combine results from multiple comparative genomics methods with manually curated information from the literature

• P-POD (Princeton Protein Orthology Database):
  – Visualizes phylogenetic relationships among predicted orthologs
  – Shows the orthologs in a wider evolutionary context
  – Contains experimental results manually collected from the literature, which can be compared to the computational analyses
  – Links to relevant human disease and gene information via OMIM, model organism databases, and sequence databases


P-POD – Ortholog methods

• Orthology is determined using OrthoMCL:
  – Can be run on multiple species at once
  – One of the better performing algorithms in terms of sensitivity and specificity (Alexeyenko et al., 2006 and Chen et al., 2007)

• Evolutionary context is determined using Jaccard:
  – Clustering algorithm to find related proteins
  – Larger groups than just orthologs
  – Manuscript in preparation
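The slide does not describe the Jaccard clustering algorithm itself (the manuscript was still in preparation). As a rough illustration of the underlying idea only, the sketch below computes the Jaccard similarity coefficient between the similarity-search hit sets of two proteins and links proteins whose coefficient reaches a cutoff. The protein names, the hit sets, the 0.5 cutoff, and the function names are all assumptions made for this example, not P-POD's actual parameters.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_similar(hits, cutoff):
    """Return the protein pairs whose hit sets have a Jaccard coefficient >= cutoff."""
    return [
        (p, q)
        for p, q in combinations(sorted(hits), 2)
        if jaccard(hits[p], hits[q]) >= cutoff
    ]


# Hypothetical data: each protein maps to the set of proteins it matches
# in an all-against-all similarity search (itself included).
hits = {
    "P1": {"P1", "P2", "P3"},
    "P2": {"P1", "P2", "P3"},
    "P3": {"P1", "P2", "P3", "P4"},
    "P4": {"P4"},
}

links = link_similar(hits, cutoff=0.5)
print(links)  # [('P1', 'P2'), ('P1', 'P3'), ('P2', 'P3')]
```

Lowering the cutoff links more proteins, which is one way a clustering of this kind can produce larger groups than strict ortholog detection.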


P-POD – Covered species

P-POD contains 8 species:

• Plasmodium falciparum

• Homo sapiens

• Drosophila melanogaster

• Mus musculus

• Arabidopsis thaliana

• Caenorhabditis elegans

• Danio rerio

• Saccharomyces cerevisiae

Most widely studied organisms, from a wide evolutionary range


P-POD – Source Species Databases


P-POD – Supported identifiers

Organism        | Source database | Valid gene/protein identifier(s)   | Examples
P. falciparum   | PlasmoDB        | PlasmoDB ID                        | PF08_0034
H. sapiens      | ENSEMBL         | ENSEMBL peptide ID, peptide name   | ENSP00000266970, CDK2
D. melanogaster | FlyBase         | FlyBase ID                         | CG17520-PA, CkIIalpha-PA
M. musculus     | ENSEMBL         | ENSEMBL peptide ID                 | ENSMUSP00000068896
A. thaliana     | TAIR            | TAIR identifier or gene name       | AT1G25490.1, PAB4
C. elegans      | WormBase        | WormBase identifier or gene name   | C09G4.1, dbr-1
D. rerio        | ENSEMBL         | ENSEMBL peptide ID, ZFIN ID        | ENSDARP00000007117, ZDB-GENE-040808-60
S. cerevisiae   | SGD             | ORF name or gene name              | YNL098C, DPM1

+ OMIM IDs
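Because each source database has its own identifier style, a query string can often be attributed to an organism by its shape alone. The sketch below guesses the source from the example identifiers listed on this slide. The regular expressions are inferred only from those examples and are assumptions, not P-POD's validation rules; plain gene names such as CDK2 or dbr-1 cannot be resolved this way and return None.

```python
import re

# Patterns inferred only from the example identifiers on the slide;
# they are illustrative assumptions, not P-POD's own rules.
# Order matters: the loose WormBase pattern is tried last.
ID_PATTERNS = [
    ("P. falciparum (PlasmoDB)", re.compile(r"PF\d{2}_\d{4}")),
    ("H. sapiens (ENSEMBL)", re.compile(r"ENSP\d{11}")),
    ("M. musculus (ENSEMBL)", re.compile(r"ENSMUSP\d{11}")),
    ("D. rerio (ENSEMBL)", re.compile(r"ENSDARP\d{11}")),
    ("D. rerio (ZFIN)", re.compile(r"ZDB-GENE-\d{6}-\d+")),
    ("D. melanogaster (FlyBase)", re.compile(r"CG\d+-P[A-Z]")),
    ("A. thaliana (TAIR)", re.compile(r"AT[1-5CM]G\d{5}\.\d+")),
    ("S. cerevisiae (SGD)", re.compile(r"Y[A-P][LR]\d{3}[CW]")),
    ("C. elegans (WormBase)", re.compile(r"[A-Z0-9]+\.\d+")),
]


def guess_source(identifier):
    """Return the first label whose pattern matches the whole identifier, or None."""
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return None


for example in ["PF08_0034", "ENSMUSP00000068896", "YNL098C", "CDK2"]:
    print(example, "->", guess_source(example))
```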


P-POD – Orthology and clustering numbers

• 25,271 OrthoMCL families

• 15,050 Jaccard Clustering families

• 165,970 proteins (154,736 OrthoMCL and 152,799 Jaccard)

• 984 families containing proteins in all species (‘omnipresent’)

• 112 families with exactly one protein in each of the 8 species: involved in core biological processes, such as:
  – Translation
  – Transport
  – Cell cycle regulation
  – Cytoskeleton organization
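The two family categories above follow directly from counting members per species. The sketch below shows one way to classify a family from the species of its member proteins. The three-letter species codes, the family names, and the toy memberships are hypothetical; a one-to-one family is by definition also omnipresent, so the function reports the more specific label.

```python
from collections import Counter

# Hypothetical three-letter codes for the eight species covered by P-POD.
ALL_SPECIES = frozenset({"Pfa", "Hsa", "Dme", "Mmu", "Ath", "Cel", "Dre", "Sce"})


def classify_family(member_species, all_species=ALL_SPECIES):
    """Classify a family from the list of species of its member proteins.

    Returns "one-to-one" (exactly one protein in every species),
    "omnipresent" (at least one protein in every species), or "partial".
    """
    counts = Counter(member_species)
    if set(counts) != set(all_species):
        return "partial"
    if all(n == 1 for n in counts.values()):
        return "one-to-one"
    return "omnipresent"


# Toy families, invented for the example.
families = {
    "famA": sorted(ALL_SPECIES),                  # one member per species
    "famB": sorted(ALL_SPECIES) + ["Hsa", "Mmu"],  # paralogs in two species
    "famC": ["Hsa", "Mmu", "Dre"],                # vertebrates only
}

labels = {name: classify_family(species) for name, species in families.items()}
print(labels)
```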


P-POD – Proteins in families, and orphans

• Relatively low percentages of orphans (≤13%, except for S. cerevisiae and P. falciparum)

• These numbers confirm the high conservation of proteins across eukaryotes, with Plasmodium as the notable outlier

• Yeast: the complete protein set was used, including 800 ORFs flagged as “Dubious” by SGD. If these are excluded, the percentage of orphans drops to 20%
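Assuming an orphan is a protein that is not assigned to any family, the overall orphan percentage can be worked out from the totals quoted under "Orthology and clustering numbers" (165,970 proteins, of which 154,736 are in OrthoMCL families and 152,799 in Jaccard families). The sketch below does only that arithmetic; per-species counts are not given in this transcript, so none are computed.

```python
def orphan_percentage(total_proteins, proteins_in_families):
    """Percentage of proteins that are not assigned to any family."""
    if total_proteins <= 0:
        raise ValueError("total_proteins must be positive")
    return 100.0 * (total_proteins - proteins_in_families) / total_proteins


TOTAL = 165_970  # all proteins in P-POD, from the slide
orthomcl_orphans = orphan_percentage(TOTAL, 154_736)
jaccard_orphans = orphan_percentage(TOTAL, 152_799)

print(f"OrthoMCL: {orthomcl_orphans:.1f}%  Jaccard: {jaccard_orphans:.1f}%")
# OrthoMCL: 6.8%  Jaccard: 7.9%
```

Both overall figures sit below the 13% per-species ceiling mentioned above, which is consistent with only two species having markedly higher orphan rates.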


P-POD – Compared to other orthology databases


P-POD - Pipeline


P-POD – Pipeline Components

[4] Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189

[5] Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680

[29] Samuel Lattimore B, van Dongen S, Crabbe MJ (2005) GeneMCL in microarray analysis. Comput Biol Chem 29: 354–359


P-POD – The Database

• P-POD is built on the Generic Model Organism Database (GMOD) package and runs on PostgreSQL

• GMOD is a collection of open source software tools for creating and managing genome-scale biological databases, from a small laboratory database of genome annotations to a large web-accessible community database. GMOD tools are in use at many large and small community databases

• Other popular GMOD tools are Apollo (genome annotation editor), GBrowse (genome annotation viewer), CMap (comparative map viewer), Sybil (comparative genome viewer), Chado (biological database schema) and BioMart (data mining system)
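To make the storage idea concrete, here is a minimal sketch of how family membership could be held in a relational database and looked up by identifier. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The two tables (protein, family_member), the identifiers, and the family names are hypothetical simplifications for illustration; they are not the Chado schema and not P-POD's actual tables or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (
        protein_id INTEGER PRIMARY KEY,
        identifier TEXT UNIQUE NOT NULL,
        species    TEXT NOT NULL
    );
    CREATE TABLE family_member (
        family     TEXT NOT NULL,
        method     TEXT NOT NULL,      -- 'OrthoMCL' or 'Jaccard'
        protein_id INTEGER NOT NULL REFERENCES protein(protein_id)
    );
    """
)
# Placeholder rows, invented for the example.
conn.executemany(
    "INSERT INTO protein (protein_id, identifier, species) VALUES (?, ?, ?)",
    [
        (1, "geneA_Hsa", "H. sapiens"),
        (2, "geneA_Sce", "S. cerevisiae"),
        (3, "geneB_Sce", "S. cerevisiae"),
    ],
)
conn.executemany(
    "INSERT INTO family_member (family, method, protein_id) VALUES (?, ?, ?)",
    [("FAM1", "OrthoMCL", 1), ("FAM1", "OrthoMCL", 2), ("FAM2", "OrthoMCL", 3)],
)


def family_members(conn, identifier, method):
    """Return (identifier, species) for every protein sharing a family with `identifier`."""
    rows = conn.execute(
        """
        SELECT p2.identifier, p2.species
        FROM protein p1
        JOIN family_member m1 ON m1.protein_id = p1.protein_id
        JOIN family_member m2 ON m2.family = m1.family AND m2.method = m1.method
        JOIN protein p2 ON p2.protein_id = m2.protein_id
        WHERE p1.identifier = ? AND m1.method = ?
        ORDER BY p2.identifier
        """,
        (identifier, method),
    )
    return rows.fetchall()


print(family_members(conn, "geneA_Hsa", "OrthoMCL"))
```

Keeping the clustering method as a column means results from OrthoMCL and Jaccard can sit side by side for the same proteins, which is the kind of comparison the database is meant to support.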


P-POD - Web Interface (1)

• The web interface allows users to search and browse the data in several ways

• Results can be queried by various peptide identifiers or gene names

• Searches generate result pages that contain:
  – a hyperlinked phylogenetic tree of predicted orthologs generated by OrthoMCL, or of more distantly related proteins generated by Jaccard clustering
  – a list of diseases and genes associated with the human ortholog(s) as documented in OMIM
  – a manually curated list of papers with cross-complementation experiments involving the yeast ortholog(s), from the SGD database
  – a downloadable ClustalW alignment of family members

• Web address: http://ortholog.princeton.edu
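Once a ClustalW alignment of a family has been downloaded, it can be read with a few lines of code. The sketch below is a minimal parser that assumes the standard CLUSTAL W text layout (a header line starting with "CLUSTAL", then blocks of "name sequence" lines separated by blank lines and conservation lines). The sequence names and residues in the example are invented, and an established bioinformatics library is likely a better choice for real files.

```python
def parse_clustal(text):
    """Parse a CLUSTAL-format alignment into {sequence name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL alignment: missing header line")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines, which start with whitespace.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # fields[0] is the name, fields[1] the aligned chunk; an optional
        # trailing residue count (fields[2]) is ignored.
        name, chunk = fields[0], fields[1]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


# Invented two-sequence alignment split over two blocks.
example = """CLUSTAL W (1.83) multiple sequence alignment

seqA      MKV-LT
seqB      MKVALT
          *** **

seqA      GG
seqB      GG
          **
"""

alignment = parse_clustal(example)
print(alignment)  # {'seqA': 'MKV-LTGG', 'seqB': 'MKVALTGG'}
```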


P-POD – Web Interface (2)

[Screenshot of a result page, with callouts labelled INPUT, OrthoMCL, OMIM, CLUSTALW, and SGD Lit.]


P-POD – Web Interface (3)

[Screenshot of a result page, with callouts labelled JACCARD, CLUSTALW, and SGD Lit.]


P-POD – Comparison of methods

• Orthology/clustering methods OrthoMCL and Jaccard can be compared using P-POD

• Jaccard is far more inclusive than OrthoMCL

• Shown at the right: OrthoMCL family of the alpha tubulins. It contains only the alpha tubulins, while the Jaccard family contains the alpha, beta, and gamma tubulins
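The tubulin case is an instance of one method's family being nested inside the other's. The sketch below checks, for each family from a stricter method, which families from a more inclusive method contain all of its members. The family names and member labels are placeholders invented for the example, not real P-POD identifiers or results.

```python
def containing_families(small_families, large_families):
    """Map each family in `small_families` to the names of the families
    in `large_families` that contain all of its members."""
    return {
        name: sorted(
            big
            for big, big_members in large_families.items()
            if members <= big_members
        )
        for name, members in small_families.items()
    }


# Placeholder memberships mirroring the tubulin example on the slide.
orthomcl = {"alpha_tubulin": {"TUBA_Hsa", "TUBA_Sce"}}
jaccard = {
    "tubulin": {
        "TUBA_Hsa", "TUBA_Sce",
        "TUBB_Hsa", "TUBB_Sce",
        "TUBG_Hsa", "TUBG_Sce",
    }
}

nesting = containing_families(orthomcl, jaccard)
extra = sorted(jaccard["tubulin"] - orthomcl["alpha_tubulin"])

print(nesting)  # {'alpha_tubulin': ['tubulin']}
print(extra)    # ['TUBB_Hsa', 'TUBB_Sce', 'TUBG_Hsa', 'TUBG_Sce']
```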


P-POD – Discussion (1)

• P-POD shows direct orthology (OrthoMCL) and broader evolutionary clustering (Jaccard)

• P-POD uses a generic, modular database schema (GMOD) in combination with a freely available database system (PostgreSQL)

• P-POD provides experimental evidence of conservation curated from the primary literature

• Three sets of users:
  – Molecular biologists who query the database over the web to browse orthology data for their favorite proteins
  – Model organism database developers, who will quickly be able to provide comparative genomics tools for their species of interest by implementing our system
  – Computational biologists developing novel comparative genomics algorithms, who will find the curated information and computational data from other methods extremely useful in assessing their approach


P-POD – Discussion (2)

• P-POD can be downloaded in its entirety for installation on one’s own system

• Software developers can use the P-POD database infrastructure when developing their own comparative genomics resources and database tools


P-POD – Future plans

• Provide regular updates to the data contained within the database

• Add new features to the web interface

• Expand upon the amount of data stored within the database

• Provide curated literature describing experimental confirmation of orthology

• Include literature from species other than S. cerevisiae

• As more refined methods for automatic detection of orthology are developed, they can be incorporated into the P-POD tool, taking advantage of the modular design scheme