Protein sequence retrieval AND other database information

Feb 01, 2016


wyanet

Page 1: Protein sequence retrieval AND other database information

Protein sequence retrieval AND other database information

Page 2: Protein sequence retrieval AND other database information

Databases

• Protein sequence (primary)
– SWISS-PROT
– PIR-International

• Protein sequence (composite)
– OWL
– NRDB

Page 3: Protein sequence retrieval AND other database information

Protein sequence (secondary)

– PROSITE
– PRINTS
– Pfam

Page 4: Protein sequence retrieval AND other database information

Macromolecular structures

– Protein Data Bank (PDB)
– Nucleic Acids Database (NDB)
– HIV Protease Database
– ReLiBase
– PDBsum
– CATH
– SCOP
– FSSP

Page 5: Protein sequence retrieval AND other database information

• Nucleotide sequences
– GenBank
– EMBL
– DDBJ

• Genome sequences
– Entrez genomes
– GeneCensus
– COGs

Page 6: Protein sequence retrieval AND other database information

• Integrated databases
– InterPro
– Sequence retrieval system (SRS)
– Entrez

Page 7: Protein sequence retrieval AND other database information

Protein Sequence Alignment and Database Searching

• Alignment of Two Sequences (Pair-wise Alignment)
– The Scoring Schemes or Weight Matrices
– Techniques of Alignments
– DOTPLOT

• Multiple Sequence Alignment (Alignment of > 2 Sequences)
– Extending Dynamic Programming to more sequences
– Progressive Alignment (Tree or Hierarchical Methods)
– Iterative Techniques
  • Stochastic Algorithms (SA, GA, HMM)
  • Non Stochastic Algorithms

• Database Scanning
– FASTA, BLAST, PSI-BLAST, ISS

• Alignment of Whole Genomes
– MUMmer (Maximal Unique Match)

Page 8: Protein sequence retrieval AND other database information

An Overview of BLAST

Input query            Program    Compares against
Amino acid sequence    blastp     Protein sequence database
Amino acid sequence    tblastn    Translated nucleotide sequence database
DNA sequence           blastn     Nucleotide sequence database
DNA sequence           blastx     Protein sequence database
DNA sequence           tblastx    Translated nucleotide sequence database
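
The choice of BLAST program from the overview can be written as a small lookup. This is only an illustrative sketch: the function name `choose_blast_program` and its `translate` flag are assumptions made for this example and are not part of any BLAST distribution.

```python
def choose_blast_program(query_type, db_type, translate=False):
    """Pick a BLAST program name from the query and database types.

    query_type and db_type are "protein" or "nucleotide".
    translate is a hypothetical flag used only when both are nucleotide:
    False compares the nucleotides directly, True compares translations.
    """
    if query_type == "protein" and db_type == "protein":
        return "blastp"      # protein query vs protein database
    if query_type == "protein" and db_type == "nucleotide":
        return "tblastn"     # protein query vs translated nucleotide database
    if query_type == "nucleotide" and db_type == "protein":
        return "blastx"      # DNA query vs protein database
    if query_type == "nucleotide" and db_type == "nucleotide":
        # DNA query vs nucleotide database, directly or via translation
        return "tblastx" if translate else "blastn"
    raise ValueError("types must be 'protein' or 'nucleotide'")


if __name__ == "__main__":
    print(choose_blast_program("protein", "nucleotide"))
    print(choose_blast_program("nucleotide", "nucleotide", translate=True))
```

The function only returns a program name; running a search still requires a BLAST installation or web service.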

Page 9: Protein sequence retrieval AND other database information

Comparison of Whole Genomes

• MUMmer (Salzberg group, 1999, 2002)
– Pair-wise sequence alignment of genomes
– Assumes that the sequences are closely related
– Can detect repeats, inverted repeats, and SNPs
– Can detect inserted or deleted domains
– Identifies the exact matches

• How it works
– Identify the maximal unique matches (MUMs) in the two genomes
– Because the two genomes are similar, large MUMs will be present
– Sort the MUMs and extract the longest set of matches that occur in the same order in both genomes (ordered MUMs)
– A suffix tree is used to identify the MUMs
– Close the gaps between MUMs (SNPs, large inserts)
– Align the regions between MUMs with Smith-Waterman
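
The first two steps (find the MUMs, then keep the longest set that occurs in the same order) can be sketched on toy strings. This is not MUMmer's implementation: a naive quadratic scan stands in for the suffix tree, the chain is chosen by number of matches, and the names `find_mums`, `ordered_mums`, and `min_len` are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for each maximal unique match.

    Naive scan used here instead of a suffix tree (assumption for brevity).
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip pairs that could still be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


def ordered_mums(mums):
    """Longest chain of MUMs whose start positions increase in both sequences."""
    mums = sorted(mums)
    if not mums:
        return []
    best = [1] * len(mums)    # best chain length ending at each MUM
    prev = [-1] * len(mums)   # back-pointers for reconstructing the chain
    for x in range(len(mums)):
        for y in range(x):
            in_order = mums[y][0] < mums[x][0] and mums[y][1] < mums[x][1]
            if in_order and best[y] + 1 > best[x]:
                best[x], prev[x] = best[y] + 1, y
    end = max(range(len(mums)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(mums[end])
        end = prev[end]
    return chain[::-1]


if __name__ == "__main__":
    # Toy sequences: the second swaps the last two blocks of the first.
    seq_a = "ABCDEFGHIJKL"
    seq_b = "ABCDIJKLEFGH"
    all_mums = find_mums(seq_a, seq_b)
    print(all_mums)
    print(ordered_mums(all_mums))
```

The regions left between consecutive ordered MUMs would then be handed to a local aligner such as Smith-Waterman, which this sketch omits.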

Page 10: Protein sequence retrieval AND other database information


Page 11: Protein sequence retrieval AND other database information


Page 12: Protein sequence retrieval AND other database information

Secondary protein database

• SWISS-PROT (1986)
– Best annotated, least redundant

• PIR (Protein Information Resource)
– More automated annotation
– Collaborations with MIPS and JIPID

Page 13: Protein sequence retrieval AND other database information

Secondary protein databases

• SWISS-PROT (1986)
– Best annotated, least redundant

• PIR (Protein Information Resource)
– More automated annotation
– Collaborations with MIPS and JIPID

• UniProt (2003)
– UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

Page 14: Protein sequence retrieval AND other database information

Databases

• Primary (archival)
– GenBank/EMBL/DDBJ
– UniProt
– PDB
– Medline (PubMed)
– BIND

• Secondary (curated)
– RefSeq
– Taxon
– UniProt
– OMIM
– SGD

Page 15: Protein sequence retrieval AND other database information

Organismal Divisions

Code   Division                  Used in which database?
BCT    Bacterial                 DDBJ - GenBank
FUN    Fungal                    EMBL
HUM    Homo sapiens              DDBJ - EMBL
INV    Invertebrate              all
MAM    Other mammalian           all
ORG    Organelle                 EMBL
PHG    Phage                     all
PLN    Plant                     all
PRI    Primate (also see HUM)    all (not same data in all)
PRO    Prokaryotic               EMBL
ROD    Rodent                    all
SYN    Synthetic and chimeric    all
VRL    Viral                     all
VRT    Other vertebrate          all

Page 16: Protein sequence retrieval AND other database information

Functional Divisions

PAT    Patent
EST    Expressed Sequence Tags
STS    Sequence Tagged Site
GSS    Genome Survey Sequence
HTG    High Throughput Genome (unfinished)
HTC    High throughput cDNA (unfinished)
CON    Contig assembly instructions

Organismal divisions:

BCT FUN INV MAM PHG PLN PRI ROD SYN VRL VRT

Page 17: Protein sequence retrieval AND other database information


EST: Expressed Sequence Tag

Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA) that are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.

Also see: http://www.ncbi.nlm.nih.gov/dbEST/ http://www.ncbi.nlm.nih.gov/UniGene/

Page 18: Protein sequence retrieval AND other database information

STS: Sequence Tagged Site

Sequence Tagged Sites are operationally unique sequences. Each one identifies the combination of primer pairs used in a PCR assay to generate a mapping reagent that maps to a single position within the genome.

Also see: http://www.ncbi.nlm.nih.gov/dbSTS/ http://www.ncbi.nlm.nih.gov/genemap/

Page 19: Protein sequence retrieval AND other database information

GSS: Genome Survey Sequences

Genome Survey Sequences are similar in nature to ESTs, except that the sequences are genomic in origin rather than cDNA (mRNA).

The GSS division contains:
• random "single pass read" genome survey sequences
• single pass reads from cosmid/BAC/YAC ends (these could be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences

Also see: http://www.ncbi.nlm.nih.gov/dbGSS/

Page 20: Protein sequence retrieval AND other database information

HTG: High Throughput Genome

High Throughput Genome Sequences are records from unfinished genome sequencing efforts. Unfinished records have gaps in the nucleotide sequence, low accuracy, and no annotations.

Also see: http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7:952-955

Page 21: Protein sequence retrieval AND other database information

Which tool?

[Flowchart; only the box labels survive]
Sequence type: mRNA (EST, Other); Genomic (Other, STS/GSS, HTGS)
Record type: dbEST; Simple; Better control of annotations, pop/phylo, segmented sets; Simple; dbSTS, dbGSS
Tool: Customized software or tbl2asn; WWW BankIt; Sequin or tbl2asn
Delivery: E-mail or FTP; E-mail