Michael Y. Galperin National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, Maryland, USA Databases (“knowledge bases”) used in genome analysis

Dec 16, 2015


Rodney Fisher
Page 1:

Michael Y. Galperin
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA

Databases (“knowledge bases”) used in genome analysis

Page 2:

Growth in genome sequencing

Page 3:

Working Draft Sequence

gaps

Page 4:

J. Smith - a very common name

Structure - a very common term

Glutamine amidotransferase - less common term but not a very good descriptor

Page 5:

A different professor Janet Smith

Another Janet Smith in the news

Page 6:

Glutamine for sale

Page 7:

Tools of trade for the “armchair scientist”

• Databases
– PubMed and other NCBI databases
– Biochemical databases
– Protein domain databases
– Structural databases
– Genome comparison databases

• Tools
– CDD / COGs
– VAST / FSSP

Page 8:

Types of databases

• Archival or Primary Data
– Text: PubMed
– DNA Sequence: GenBank
– Protein Sequence: Entrez Proteins, TREMBL
– Protein Structures: PDB

• Curated or Processed Data
– DNA Sequences: RefSeq, LocusLink, OMIM
– Protein Sequences: SWISS-PROT, PIR
– Protein Structures: SCOP, CATH, MMDB
– Genomes: Entrez Genomes, COGs

Nucleic Acids Research: Database Issue each January 1
Articles on ~100 different databases

Page 9:

http://www.ncbi.nlm.nih.gov

Page 10:

The National Center for Biotechnology Information (NCBI)

• Created as a part of the National Library of Medicine, National Institutes of Health in 1988
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq

Page 11:

What is GenBank?

• Archival nucleotide sequence database
• Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
• Data are shared nightly among three collaborating databases:
– GenBank at NCBI - Bethesda, Maryland, USA
– DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
– European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK

Page 12:

Some guiding principles of working with GenBank

• GenBank is a nucleotide-centric view of the information space

• GenBank is a repository of all publicly available sequences

• In GenBank, records are grouped for various reasons

• Data in GenBank is only as good as what you put in

Page 13:

NCBI databases and their links

[Diagram labels: Word Weight, VAST, BLAST, Phylogeny, Genomes, Taxonomy, Nucleotide Sequences, Protein Sequences, Article Abstracts (Medline), 3-D Structure (MMDB)]

Page 14:

Entrez: An integrated search and retrieval system

Page 15:
Page 16:
Page 17:
Page 18:
Page 19:
Page 20:

PubMed book links

Page 21:

GenBank Record

[Figure labels: Accession Number, gi Number, Locus Name, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence]

[rest of protein sequence deleted for brevity]

[rest of nucleotide sequence deleted for brevity]

Page 22:
Page 23:
Page 24:

Archival databases are unreliable

• Misinterpreted experimental results
• Annotations based on low similarity
– gi|1968785 - cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
– gi|6522905 - very hypothetical protein (S. pombe)
• Biologically senseless annotations
– Deinococcus: head morphogenesis protein
– Arabidopsis: separation anxiety protein-like
– Yersinia: automembrane protein H
– H. pylori - brute force protein
– S. cerevisiae - inside intron 7
• Propagated mistakes of sequence comparison (e.g. ABC1/ABC)

Page 25:

Advanced Neighbors: BLink

Page 26:

BLink

Page 27:
Page 28:
Page 29:
Page 30:
Page 31:
Page 32:
Page 33:
Page 34:
Page 35:
Page 36:
Page 37:
Page 38:

Protein sequence motif is a descriptor of a protein family

• Glutamine amidotransferase class I
[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
[C is the active site residue]

• Glutamine amidotransferase class II
<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
[C is the active site residue]
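
The two motifs above appear to be written in PROSITE pattern syntax, where x is any residue, [..] lists allowed residues, {..} lists forbidden residues, (n) or (n,m) gives a repeat count, and < anchors the match to the N-terminus. A minimal sketch of scanning a sequence with them, assuming that syntax, could look like the following. The function name prosite_to_regex and the test peptides are made up for illustration and are not real protein sequences.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the match at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the match at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as x(0,11) or A(3)
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # [..] alternatives and single letters are already valid regex
        parts.append(prefix + core + ("{" + count + "}" if count else "") + suffix)
    return "".join(parts)

CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

class_i_re = re.compile(prosite_to_regex(CLASS_I))
class_ii_re = re.compile(prosite_to_regex(CLASS_II))

# Hypothetical peptides, constructed only to exercise the patterns
hit_i = class_i_re.search("AAASLVGICLGMQLAAA")
hit_ii = class_ii_re.search("MCGIVGI")
miss_ii = class_ii_re.search("A" * 12 + "CGIVG")  # Cys too far from the N-terminus
```

A match only says that the sequence fits the descriptor; as the annotation examples earlier in the talk suggest, a pattern hit alone is weak evidence of function.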

Page 39:
Page 40:
Page 41:
Page 42:
Page 43:
Page 44:
Page 45:
Page 46:
Page 47:

purF gene neighbors

Page 48:

Searching MMDB

Page 49:

Principles of structural alignment

• Dali: http://www.ebi.ac.uk/dali/
Looks for minimal RMSD between Cα atoms. Calculates Cα-Cα distance matrices, then identifies the longest alignable segments

• VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/
Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
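
The distance-matrix idea mentioned for Dali can be illustrated with a small sketch. This is not Dali's algorithm; it only shows why intra-molecular Cα-Cα distance matrices are a convenient representation: they do not change when a structure is rotated or translated, so two fragments can be compared without superimposing them first. The coordinates and function names below are invented for illustration.

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between C-alpha coordinates (x, y, z)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def matrix_difference(m1, m2):
    """Root-mean-square difference between two equal-sized distance matrices."""
    n = len(m1)
    total = sum((m1[i][j] - m2[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(total / (n * n))

# Toy three-residue fragment (coordinates in angstroms, made up)
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

# The same fragment rotated 90 degrees about the z axis and shifted in space
moved = [(-y + 10.0, x + 5.0, z + 1.0) for (x, y, z) in fragment]

# A different fragment: the three residues laid out in a straight line
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

same = matrix_difference(distance_matrix(fragment), distance_matrix(moved))
different = matrix_difference(distance_matrix(fragment), distance_matrix(straight))
print(same, different)
```

The first value is zero up to floating-point rounding, while the second is clearly positive, which is the property a distance-matrix comparison relies on.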

Page 50:

Dali alignment of Tyr phosphatase

Page 51:

VAST Structure Neighbors

Page 52:

Structure Summary

Cn3D viewer

VAST neighbors

BLAST neighbors

Page 53:

Cn3D : Displaying Structures

Chloroquine

Page 54:

Structure Neighbors

Page 55:

Use of structural alignments

Chloroquine

NADH

Page 56:

Online Mendelian Inheritance in Man

A catalog of human genes and genetic disorders

Page 57:

OMIM record for Presenilin 1 (PSEN1)

Associated LocusLink record

External resources

Additional info in OMIM

Contents

Each record provides a state of the art summary of current knowledge

Extensive references to literature

Page 58:

OMIM Search Results by Titles

alzheimer AND presenilin 1
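
The same Boolean query can, in principle, be sent programmatically. The sketch below only builds the request URL, assuming the NCBI E-utilities ESearch endpoint and the database name omim; neither appears on the slide, and the helper name esearch_url is hypothetical. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint: the NCBI E-utilities ESearch service (not shown on the slide)
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a database name and a Boolean query string."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

url = esearch_url("omim", "alzheimer AND presenilin 1")
print(url)
```

Spaces in the query are encoded as plus signs, and the Boolean operator AND is passed through unchanged in capitals, as in the search box.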

Page 59:

Entrez Genome: Gene Location

View of chromosome 14

Gene Name

Multiple Maps: STSs, ESTs, etc.

Page 60:

Entrez Genomes Map Viewer

Chromosome 7

GenBank Map, Contig Map, STS Map

Integrated View of Chromosome 7

Multiple Maps: STSs, ESTs, etc.

Page 61:

Entrez Genome: Gene Location

View of chromosome 14

Gene Name

Page 62:

Entrez Genome: Gene Location

Entrez Genomes Map Viewer

Chromosome 14 Cytogenetic map

Location of PSEN1 and surrounding genes

Page 63:

LocusLink

Page 64:

LocusLink

Text querying

Multiple Organisms

Alphabetical listings

Stable Locus ID

Approved symbol

Description

Genome Position

External Links

Curated Resource

Central hub of information for human, mouse, rat, zebrafish, and fruit fly loci

alzheimer

Page 65:

OMIM

RefSeq

GenBank

UniGene

dbSNP

LocusLink

Page 66:

LocusLink: LocusID 5663 PSEN1

Page 67:

Directed by Dr. David J. Lipman

National Center for Biotechnology Information

http://www.ncbi.nlm.nih.gov