Michael Y. Galperin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
Databases (“knowledge bases”) used in genome analysis
Dec 16, 2015
J. Smith - a very common name
Structure - a very common term
Glutamine amidotransferase - less common term but not a very good descriptor
• Databases
  – PubMed and other NCBI databases
  – Biochemical databases
  – Protein domain databases
  – Structural databases
  – Genome comparison databases
• Tools
  – CDD / COGs
  – VAST / FSSP
Tools of trade for the “armchair scientist”
• Archival or Primary Data
  – Text: PubMed
  – DNA Sequence: GenBank
  – Protein Sequence: Entrez Proteins, TREMBL
  – Protein Structures: PDB
• Curated or Processed Data
  – DNA Sequences: RefSeq, LocusLink, OMIM
  – Protein Sequences: SWISS-PROT, PIR
  – Protein Structures: SCOP, CATH, MMDB
  – Genomes: Entrez Genomes, COGs
Types of databases
Nucleic Acids Research: Database Issue each January 1
Articles on ~100 different databases
The National Center for Biotechnology Information (NCBI)
• Created as a part of the National Library of Medicine, National Institutes of Health in 1988
  – Establish public databases
  – Research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, Taxonomy, GeneMap, SAGE, LocusLink, RefSeq
What is GenBank?
• Archival nucleotide sequence database
• Sample slogans: “Easy deposits, unlimited withdrawals, high interest”, “All bases covered”, “Billions and billions served”
• Data are shared nightly among three collaborating databases:
• GenBank at NCBI - Bethesda, Maryland, USA
• DNA Database of Japan (DDBJ) at NIG - Mishima, Japan
• European Molecular Biology Laboratory Database (EMBL) at EBI - Hinxton, UK
Some guiding principles of working with GenBank
• GenBank is a nucleotide-centric view of the information space
• GenBank is a repository of all publicly available sequences
• In GenBank, records are grouped for various reasons
• Data in GenBank is only as good as what you put in
NCBI databases and their links
[Figure: diagram of linked NCBI databases: Article Abstracts (Medline), Nucleotide Sequences, Protein Sequences, 3-D Structure (MMDB), Genomes, Taxonomy, Phylogeny. Link labels: Word Weight, BLAST, VAST]
GenBank Record
[Figure: annotated GenBank record with labelled fields: Locus Name, Accession Number, gi Number, Medline ID, GenPept ID, Protein Sequence, Nucleotide Sequence. Rest of the protein and nucleotide sequences deleted for brevity]
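The labelled fields of a GenBank record (locus name, accession number, gi number, nucleotide sequence) can be pulled out of the flat-file text with a few lines of code. The sketch below is only an illustration: the record is a made-up toy entry, not a real GenBank record, the function name `parse_flat_file` is hypothetical, and it assumes the older flat-file layout in which the VERSION line carries a GI number.

```python
import re

# Made-up toy record for illustration only; none of these values are real.
TOY_RECORD = """\
LOCUS       TOYLOCUS1                 20 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up example record, not a real GenBank entry.
ACCESSION   X00000
VERSION     X00000.1  GI:1234567
ORIGIN
        1 atggcgacca ccgagctgaa
//
"""


def parse_flat_file(record):
    """Extract a few header fields and the sequence from one flat-file record."""
    fields = {"sequence": ""}
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_sequence:
            # Sequence lines mix position numbers, spaces and bases; keep letters only.
            fields["sequence"] += re.sub(r"[^a-z]", "", line.lower())
        elif line.startswith("LOCUS"):
            fields["locus_name"] = line.split()[1]
        elif line.startswith("ACCESSION"):
            fields["accession"] = line.split()[1]
        elif line.startswith("VERSION"):
            match = re.search(r"GI:(\d+)", line)
            if match:                      # assumption: GI present on the VERSION line
                fields["gi"] = match.group(1)
        elif line.startswith("ORIGIN"):
            in_sequence = True
    return fields


parsed = parse_flat_file(TOY_RECORD)
print(parsed)
```

For real work an established parser is likely the safer choice; this sketch only shows how the labelled fields map onto lines of the record.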
Archival databases are unreliable
• Misinterpreted experimental results
• Annotations based on low similarity
  gi|1968785 - cDNA 5' end similar to similar to arrest-defective protein isolog (H. sapiens)
  gi|6522905 - very hypothetical protein (S. pombe)
• Biologically senseless annotations
  Deinococcus: head morphogenesis protein
  Arabidopsis: separation anxiety protein-like
  Yersinia: automembrane protein H
  H. pylori - brute force protein
  S. cerevisiae - inside intron 7
• Propagated mistakes of sequence comparison (e.g. ABC1/ABC)
Protein sequence motif is a descriptor of a protein family
• Glutamine amidotransferase class I
  [PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]
  [C is the active site residue]
• Glutamine amidotransferase class II
  <x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]
  [C is the active site residue]
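The two motifs above appear to follow PROSITE-style pattern syntax, which maps fairly directly onto regular expressions. The sketch below shows one way to do that conversion and scan a sequence. The function name `prosite_to_regex` is hypothetical, only the syntax elements used in these two patterns (plus `{...}` exclusions and `>`) are handled, and the test sequences are made-up toy strings, not real proteins.

```python
import re

# One pattern element: optional '<' anchor, a residue specification,
# an optional repeat count such as (3) or (0,11), and an optional '>' anchor.
ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        start, core, low, high, end = match.groups()
        if core == "x":                    # x = any residue
            core = "."
        elif core.startswith("{"):         # {ABC} = any residue except A, B, C
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{%s,%s}" % (low, high) if high is not None else "{%s}" % low
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


CLASS_I = "[PAS]-[LIVMFYT]-[LIVMFY]-G-[LIVMFY]-C-[LIVMFYN]-G-x-[QEH]-x-[LIVMFA]"
CLASS_II = "<x(0,11)-C-[GS]-[IV]-[LIVMFYW]-[AG]"

class_i_regex = prosite_to_regex(CLASS_I)
class_ii_regex = prosite_to_regex(CLASS_II)
print(class_i_regex)
print(class_ii_regex)

# Toy sequences (made up) that were built to satisfy each pattern.
print(bool(re.search(class_i_regex, "MKPLLGLCLGAQALTT")))
print(bool(re.search(class_ii_regex, "MCGIVGAKKK")))
```

A match only says that the sequence fits the descriptor; as the slide on archival databases warns, a pattern hit by itself is not proof of function.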
Principles of structural alignment
• Dali: http://www.ebi.ac.uk/dali/
  Looks for minimal RMSD between Cα atoms. Calculates Cα-Cα distance matrices, then identifies the longest alignable segments
• VAST (Vector Alignment Search Tool): http://www.ncbi.nlm.nih.gov/Structure/
  Looks for pairs of secondary structure elements (α-helices, β-strands) that have similar orientation and connectivity
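One reason distance matrices are convenient for structure comparison is that intramolecular distances do not change when a structure is rotated or translated, so two structures can be compared without superimposing them first. The sketch below illustrates only that property. It is not the Dali or VAST algorithm, the coordinates are made-up toy values, and all function names are hypothetical.

```python
import math


def distance_matrix(ca_coords):
    """All-against-all distances between C-alpha positions (one per residue)."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]


def rotate_z_and_shift(coords, angle, shift):
    """Rotate points about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


def max_abs_difference(m1, m2):
    """Largest element-wise difference between two equally sized matrices."""
    return max(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


# Toy C-alpha coordinates in angstroms (made up for illustration).
TOY_CA = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (8.1, 4.2, 2.1)]

# The same "structure" in a different orientation and position.
moved = rotate_z_and_shift(TOY_CA, math.radians(40), (10.0, -5.0, 2.5))

original_dm = distance_matrix(TOY_CA)
moved_dm = distance_matrix(moved)

# The two matrices agree up to floating-point noise, although the raw
# coordinates differ.
print(max_abs_difference(original_dm, moved_dm))
```

Finding the longest alignable segments between two different proteins is a much harder search problem than this check; the example only shows why the matrix representation is a natural starting point.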
A catalog of human genes and genetic disorders
Online Mendelian Inheritance in Man
OMIM record for Presenilin 1 (PSEN1)
[Screenshot labels: Associated LocusLink record; External resources; Additional info in OMIM; Contents]
Each record provides a state-of-the-art summary of current knowledge
Extensive references to literature
OMIM Search Results by Titles
alzheimer AND presenilin 1
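The Boolean query above was typed into the OMIM web search form. As an assumption that goes beyond the slides, the same query could also be sent programmatically through the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and makes no network call; the helper name `build_esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and a Boolean query string."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


# Same query as on the slide; spaces are encoded as '+' by urlencode.
url = build_esearch_url("omim", "alzheimer AND presenilin 1")
print(url)
```

Fetching that URL would return a list of matching record identifiers, which then have to be retrieved in a separate request.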
Entrez Genome: Gene Location
[Screenshot: view of chromosome 14, showing the gene name and multiple maps (STSs, ESTs, etc.)]
Entrez Genomes Map Viewer
[Screenshot: integrated view of chromosome 7, showing the GenBank map, contig map, and STS map]
Entrez Genome: Gene Location
[Screenshot: Entrez Genomes Map Viewer, chromosome 14 cytogenetic map, showing the location of PSEN1 and surrounding genes]
LocusLink
[Screenshot labels: Text querying; Multiple organisms; Alphabetical listings; Stable Locus ID; Approved symbol; Description; Genome position; External links; Curated resource]
Central hub of information for human, mouse, rat, zebrafish, and fruit fly loci
alzheimer