
Exercise: BIOINFORMATIC DATABASES and BLAST

Jan 02, 2016


kaydence-rojas

Page 1: Exercise: BIOINFORMATIC DATABASES and BLAST

Exercise: BIOINFORMATIC DATABASES and BLAST

Page 2: Exercise: BIOINFORMATIC DATABASES and BLAST

Outline

- NCBI and Entrez
- PubMed
- Google Scholar
- RefSeq
- Swissprot
- Fasta format
- PDB: Protein Data Bank
- Organism specific databases
- Summary
- Pairwise Sequence Alignment and BLAST
- Overview
- Query type: DNA or Protein

Page 3: Exercise: BIOINFORMATIC DATABASES and BLAST

What's in a database?

Sequences – genes, proteins, etc.

Full genomes

Annotation – information about genes/proteins:
- function
- cellular location
- chromosomal location
- introns/exons
- protein structure
- phenotypes, diseases

Publications

Page 4: Exercise: BIOINFORMATIC DATABASES and BLAST

NCBI and Entrez

National Center for Biotechnology Information

- One of the largest and most comprehensive databases, belonging to the NIH (National Institutes of Health), the primary Federal agency for conducting and supporting medical research in the USA
- Entrez is the search engine of NCBI
- Search for: genes, proteins, genomes, structures, diseases, publications and more

http://www.ncbi.nlm.nih.gov/

Page 5: Exercise: BIOINFORMATIC DATABASES and BLAST

PubMed: search for published papers

Yang X, Kurteva S, Ren X, Lee S, Sodroski J. "Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells", J Virol. 2006 May;80(9):4388-95.

Page 6: Exercise: BIOINFORMATIC DATABASES and BLAST

Use fields!

Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]

For the full list of field tags: go to help -> Search Field Descriptions and Tags
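A fielded query like the one above can also be assembled programmatically. The following is a minimal sketch, not part of the original exercise; the helper name `pubmed_query` is hypothetical, and only the field tags shown on the slide are used.

```python
def pubmed_query(terms):
    """Join (value, tag) pairs into a fielded PubMed query string.

    `terms` is a list of (value, field_tag) tuples; all terms are
    combined with the boolean operator AND.
    """
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


# The query from the slide: author, title word, publication date, journal.
query = pubmed_query([
    ("Yang", "AU"),
    ("glycoprotein", "TI"),
    ("2006", "DP"),
    ("J virol", "TA"),
])
print(query)
```

The resulting string can be pasted into the PubMed search box as is.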

Page 7: Exercise: BIOINFORMATIC DATABASES and BLAST

Exercise

Retrieve all publications in which the first author is Pe'er I and the last author is Shamir R.

Page 8: Exercise: BIOINFORMATIC DATABASES and BLAST

Using limits

Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years

Page 9: Exercise: BIOINFORMATIC DATABASES and BLAST

Google Scholar

http://scholar.google.com/

Page 10: Exercise: BIOINFORMATIC DATABASES and BLAST


Page 11: Exercise: BIOINFORMATIC DATABASES and BLAST

NCBI gene & protein databases: GenBank

GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations).

Holds 99 billion bases (2008).

Page 12: Exercise: BIOINFORMATIC DATABASES and BLAST

Searching NCBI for the protein human CD4

Search demonstration

Page 13: Exercise: BIOINFORMATIC DATABASES and BLAST


Page 14: Exercise: BIOINFORMATIC DATABASES and BLAST

Using field descriptions, qualifiers, and boolean operators

Cd4[GENE] AND human[ORGN]
or
Cd4[gene name] AND human[organism]

List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

Boolean operators: AND, OR, NOT

Note: do not use the field Protein name [PROT], only GENE!
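The same Entrez query can be sent to NCBI from a script through the E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact NCBI; the helper name `esearch_url` and the `retmax` default of 20 are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and a fielded query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


url = esearch_url("protein", "CD4[GENE] AND human[ORGN]")
print(url)

# Decoding the query string recovers the original search term.
print(parse_qs(urlsplit(url).query)["term"][0])
```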

Page 15: Exercise: BIOINFORMATIC DATABASES and BLAST

This time we search directly in the protein database.

Page 16: Exercise: BIOINFORMATIC DATABASES and BLAST

RefSeq

RefSeq: sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products).

Page 17: Exercise: BIOINFORMATIC DATABASES and BLAST


Page 18: Exercise: BIOINFORMATIC DATABASES and BLAST

An explanation of GenBank records

Page 19: Exercise: BIOINFORMATIC DATABASES and BLAST

Swissprot

A protein sequence database which strives to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants

One entry for each protein

Page 20: Exercise: BIOINFORMATIC DATABASES and BLAST


Page 21: Exercise: BIOINFORMATIC DATABASES and BLAST

GenBank vs. Swissprot

GenBank results Swiss-Prot results

Page 22: Exercise: BIOINFORMATIC DATABASES and BLAST

Fasta format

>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

- header: the line starting with ">", holding the ID/accession and a description
- sequence: the lines that follow the header

Save accession numbers for future use (makes searching quicker):
RefSeq accession number: NP_000607.1
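The header/sequence layout of a Fasta record can be illustrated with a small parser. This is a minimal sketch rather than a full Fasta reader; the function names are hypothetical, and the example record uses only the first 30 residues of the CD4 sequence to keep it short.

```python
def parse_fasta_record(text):
    """Split one Fasta record into (identifier, description, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a Fasta record must start with '>'")
    header = lines[0][1:].strip()
    # The identifier runs up to the first space; the rest is the description.
    identifier, _, description = header.partition(" ")
    sequence = "".join(lines[1:])
    return identifier, description, sequence


def refseq_accession(identifier):
    """Return the field following 'ref' in a '|'-separated identifier, if any."""
    fields = identifier.split("|")
    if "ref" in fields:
        position = fields.index("ref")
        if position + 1 < len(fields):
            return fields[position + 1]
    return None


# Shortened record: only the first 30 residues of the sequence are kept.
record = """>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVL
"""
identifier, description, sequence = parse_fasta_record(record)
print(refseq_accession(identifier), "-", description, "-", len(sequence), "residues")
```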

Page 23: Exercise: BIOINFORMATIC DATABASES and BLAST

Downloading

Page 24: Exercise: BIOINFORMATIC DATABASES and BLAST

PDB: Protein Data Bank

- Main database of 3D structures
- Includes ~56,000 entries (proteins, nucleic acids, others)
- Proteins organized in groups, families, etc.
- Is highly redundant: different conformations (e.g., ligand dependent)

http://www.rcsb.org

Page 25: Exercise: BIOINFORMATIC DATABASES and BLAST

Human CD4 in complex with HIV gp120

gp120

CD4

PDB ID 1G9M

Page 26: Exercise: BIOINFORMATIC DATABASES and BLAST

Organism specific databases

Model organisms have independent databases:

HIV database http://hiv-web.lanl.gov/content/index

http://gmod.org/wiki/Main_Page?q=node/71

Page 27: Exercise: BIOINFORMATIC DATABASES and BLAST

Summary

- General and comprehensive databases: NCBI, EMBL, DDBJ
- Genome specific databases: ENSEMBL, UCSC genome browser
- Highly annotated databases:
  - Proteins: Swissprot, RefSeq
  - Structures: PDB

Page 28: Exercise: BIOINFORMATIC DATABASES and BLAST

And always remember:

1. Google (or any search engine)
2. RTFM - Read the manual!!! (/help/FAQ)

Page 29: Exercise: BIOINFORMATIC DATABASES and BLAST

Pairwise Sequence Alignment and BLAST

Page 30: Exercise: BIOINFORMATIC DATABASES and BLAST

What is sequence alignment?

Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
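The identity between two already aligned sequences can be counted column by column. The sketch below is an illustration only; the function names are hypothetical, and it marks identical residues only, not similar ones.

```python
def match_line(a, b):
    """Return a line with '|' under identical columns and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def percent_identity(a, b):
    """Percentage of alignment columns holding identical residues."""
    return 100.0 * match_line(a, b).count("|") / len(a)


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(top)
print(match_line(top, bottom))
print(bottom)
print(round(percent_identity(top, bottom), 1), "% identity")
```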

Page 31: Exercise: BIOINFORMATIC DATABASES and BLAST

Local vs. Global

Global alignment – finds the best alignment across the whole two sequences.

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment – finds regions of high similarity in parts of the sequences.

ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
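The difference between the two modes can be seen in the dynamic programming recurrences usually used for them (Needleman-Wunsch for global, Smith-Waterman for local). The sketch below computes scores only, without traceback; the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration and do not come from the slides.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best global (default) or local alignment score of a and b."""
    cols = len(b) + 1
    # Global: leading gaps are penalised. Local: an alignment may start anywhere.
    prev = [0] * cols if local else [j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # Local scores never drop below zero; track the best cell.
                score = max(score, 0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


x, y = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(x, y))
print("local: ", alignment_score(x, y, local=True))
```

Because a local alignment is free to ignore poorly matching ends, its score is never lower than the global score under the same scoring scheme.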

Page 32: Exercise: BIOINFORMATIC DATABASES and BLAST

Evolutionary changes in sequences

In the course of evolution, the sequences changed from the ancestral sequence by random mutations.

Three types of mutations:

1. Insertion - AAGA -> AAGTA
2. Deletion - AAGA -> AGA
3. Substitution - AAGA -> AACA

Insertion + Deletion -> Indel
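The three mutation types correspond to simple string operations. This is a small illustrative sketch with hypothetical function names; positions are 0-based, and the examples reproduce the AAGA cases from the slide.

```python
def insertion(seq, pos, bases):
    """Insert `bases` before position `pos`."""
    return seq[:pos] + bases + seq[pos:]


def deletion(seq, pos, length=1):
    """Remove `length` characters starting at position `pos`."""
    return seq[:pos] + seq[pos + length:]


def substitution(seq, pos, base):
    """Replace the character at position `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]


ancestor = "AAGA"
print(insertion(ancestor, 3, "T"))     # AAGTA
print(deletion(ancestor, 1))           # AGA
print(substitution(ancestor, 2, "C"))  # AACA
```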

Page 33: Exercise: BIOINFORMATIC DATABASES and BLAST

Scoring scheme

- Match/mismatch scores: substitution matrices
  - Nucleic acids: transition-transversion
  - Amino acids:
    - Evolution (empirical data) based: PAM, BLOSUM
    - Physico-chemical properties based: Grantham, McLachlan
- Gap penalty
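A transition-transversion scheme with a gap penalty can be sketched as follows. The numeric values (match +1, transition -1, transversion -2, gap -2) are assumptions chosen for illustration, not values from the slides or from any particular tool.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides."""
    if x == y:
        return match
    # A transition stays within purines or within pyrimidines.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(a, b, gap=-2):
    """Sum column scores of two aligned sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        total += gap if "-" in (x, y) else nucleotide_score(x, y)
    return total


print(score_alignment("AAGA", "AACA"))    # one transversion (G/C)
print(score_alignment("AAG-A", "AAGTA"))  # one gap
```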

Page 34: Exercise: BIOINFORMATIC DATABASES and BLAST

Computation time: how do we search a database?

If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.6 days to complete one search.

150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.
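The estimate above is a one-line calculation, written out here as a check. The variable names are illustrative; the inputs are the figures from the slide.

```python
SECONDS_PER_ALIGNMENT = 0.1   # 1/10 of a second per pairwise alignment
DATABASE_SIZE = 10**7         # number of sequences in the database
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = SECONDS_PER_ALIGNMENT * DATABASE_SIZE
days = total_seconds / SECONDS_PER_DAY
print(f"{total_seconds:.0f} seconds = {days:.2f} days per search")
```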

Page 35: Exercise: BIOINFORMATIC DATABASES and BLAST

Conclusion

Using the exact comparison pairwise alignment algorithm between the query and all DB entries is too slow.

Page 36: Exercise: BIOINFORMATIC DATABASES and BLAST

Heuristic

Definition: a heuristic is a method designed to solve a problem that does not provide an exact solution (but one that is not too bad) and that reduces the time complexity of the exact solution.

Page 37: Exercise: BIOINFORMATIC DATABASES and BLAST

BLAST

BLAST - Basic Local Alignment Search Tool

A heuristic for searching a database for similar sequences.

The heuristic is based on restrictions of the similarity (such as using ungapped word matching instead of single character matching).
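The word-matching idea can be sketched by indexing all words of length k in one sequence and looking up the words of the other. This is a simplified illustration with hypothetical function names: it finds exact word hits only, whereas the real BLAST program also considers similar words and then extends each hit, which is not shown here.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every word of length k in `seq` to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-word is shared."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


print(find_seeds("ADLGAVF", "RTADLGQ", k=3))
```

Only the regions around such seeds need a closer look, which is what saves time compared with aligning the query against every database entry in full.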

Page 38: Exercise: BIOINFORMATIC DATABASES and BLAST

Query type: DNA or Protein

All types of searches are possible:

Query: DNA or Protein
Database: DNA or Protein

- blastn – nuc vs. nuc
- blastp – prot vs. prot
- blastx – translated query vs. protein database
- tblastn – protein vs. translated nuc. DB
- tblastx – translated query vs. translated database
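The choice of program follows from the query type and the database type. The lookup below is a minimal sketch of that decision; the function name and the `translate_both` flag are hypothetical and only encode the five cases listed above.

```python
def choose_blast_program(query, database, translate_both=False):
    """Pick a BLAST program name from query and database types.

    `query` and `database` are "dna" or "protein". `translate_both`
    requests a translated-vs-translated search, which only makes
    sense when both are DNA.
    """
    kinds = {"dna", "protein"}
    if query not in kinds or database not in kinds:
        raise ValueError("types must be 'dna' or 'protein'")
    if translate_both:
        if query == database == "dna":
            return "tblastx"
        raise ValueError("translate_both needs a DNA query and a DNA database")
    table = {
        ("dna", "dna"): "blastn",
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",
        ("protein", "dna"): "tblastn",
    }
    return table[(query, database)]


print(choose_blast_program("dna", "protein"))
```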

Page 39: Exercise: BIOINFORMATIC DATABASES and BLAST

Query type

Information content in the letters:
- Nucleotides: 4 letter alphabet
- Amino acids: 20 letter alphabet

- Two random DNA sequences will, on average, have 25% identity
- Two random protein sequences will, on average, have 5% identity

The amino-acid sequence is often preferable for homology search.

Selection (and hence conservation) works (mostly) at the protein level.
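The 25% and 5% figures follow from the alphabet sizes if every letter is assumed to be equally frequent: the chance that two random positions agree is the sum of the squared letter frequencies. The sketch below works under that uniform-frequency assumption; real sequences have uneven composition, so the actual chance identity differs somewhat.

```python
def expected_identity(frequencies):
    """Probability that two independently drawn letters are identical."""
    return sum(p * p for p in frequencies)


dna_identity = expected_identity([1 / 4] * 4)        # 4 nucleotides
protein_identity = expected_identity([1 / 20] * 20)  # 20 amino acids
print(f"DNA: {dna_identity:.0%}, protein: {protein_identity:.0%}")
```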

Page 40: Exercise: BIOINFORMATIC DATABASES and BLAST

E-value

The number of times we will theoretically find an alignment with a score ≥ Y of a random sequence vs. a random database.

Theoretically, we could trust any result with an E-value ≤ 1.

In practice, BLAST uses estimations:
- E-values of 10^-4 and lower indicate a significant homology.
- E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
- E-values between 10^-2 and 1 do not indicate a good homology.
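These rules of thumb can be written as a small classifier. The thresholds are the ones given above; the function name and the label for E-values greater than 1, a range the slide does not cover, are assumptions.

```python
def interpret_evalue(evalue):
    """Classify a BLAST E-value using the rule-of-thumb thresholds above."""
    if evalue < 0:
        raise ValueError("an E-value cannot be negative")
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if evalue <= 1:
        return "not a good indication of homology"
    # Not covered by the rules above; treated here as no evidence (assumption).
    return "no evidence of homology"


for e in (1e-30, 5e-3, 0.5, 10):
    print(e, "->", interpret_evalue(e))
```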

Page 41: Exercise: BIOINFORMATIC DATABASES and BLAST

Filtering low complexity

Low complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA).

Regions of low complexity generate high alignment scores, BUT this does not indicate homology.
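One simple way to see what "low complexity" means is to measure the Shannon entropy of a sliding window: a window dominated by one or two letters has low entropy. This is a stand-in illustration and not the filter BLAST itself applies; the function names, window size of 12, and threshold of 2.2 bits are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the letter composition of `window`."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, size=12, threshold=2.2):
    """Start positions of windows whose entropy falls below `threshold`."""
    return [
        i
        for i in range(len(seq) - size + 1)
        if shannon_entropy(seq[i:i + size]) < threshold
    ]


print(shannon_entropy("PPPPPPPPPPPP"))   # a proline-only stretch
print(shannon_entropy("MNRGVPFRHLLL"))   # a more varied stretch
```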

Page 42: Exercise: BIOINFORMATIC DATABASES and BLAST

BLAST 2 sequences at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment.

Does not use an optimal algorithm but a heuristic.

Page 43: Exercise: BIOINFORMATIC DATABASES and BLAST

Back to NCBI

Page 44: Exercise: BIOINFORMATIC DATABASES and BLAST

BLAST – bl2seq

Page 45: Exercise: BIOINFORMATIC DATABASES and BLAST

Bl2seq - query

blastn – nucleotide

blastp – protein

Page 46: Exercise: BIOINFORMATIC DATABASES and BLAST

Bl2seq results

Page 47: Exercise: BIOINFORMATIC DATABASES and BLAST

Bl2seq results

- Match
- Dissimilarity
- Gaps
- Similarity
- Low complexity

Page 48: Exercise: BIOINFORMATIC DATABASES and BLAST

BLAST – programs

Query: DNA Protein

Database: DNA Protein

Page 49: Exercise: BIOINFORMATIC DATABASES and BLAST

BLAST – Blastp

Page 50: Exercise: BIOINFORMATIC DATABASES and BLAST

Blastp - results

Page 51: Exercise: BIOINFORMATIC DATABASES and BLAST

Blastp – results (cont')

Page 52: Exercise: BIOINFORMATIC DATABASES and BLAST

Blastp – acquiring sequences

Page 53: Exercise: BIOINFORMATIC DATABASES and BLAST

Blastp – acquiring sequences (cont')

Page 54: Exercise: BIOINFORMATIC DATABASES and BLAST

Fasta format – multiple sequences

>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH

>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH

>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
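A file in this multi-sequence format can be read record by record: every line starting with ">" opens a new record, and the lines that follow are joined into its sequence. The reader below is a minimal sketch with a hypothetical function name; the example text keeps only short prefixes of two globin sequences from the slide.

```python
def read_fasta(text):
    """Parse multi-record Fasta text into a list of (header, sequence)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened example: only the start of each sequence is kept.
example = """>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGE
ALGRLLVVYPWTQRFF

>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGE
"""
records = read_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```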