Top Banner
Bioinformatics Resources and Tools on the Web: A Primer
24

Bioinformatics Resources and Tools on the Web: A Primer.

Dec 22, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Bioinformatics Resources and Tools on the Web: A Primer.

Bioinformatics Resources and Tools on the Web: A Primer

Page 2: Bioinformatics Resources and Tools on the Web: A Primer.

Outline• Introduction: What is bioinformatics?• The basics

– The five sites that all biologists should know

• Some examples– Using the tools in a somewhat less-than-naïve manner

• Questions/comments are welcome at all points• Much of this material comes from the Boston

University course: BF527 Bioinformatic Applications (http://matrix.bu.edu/BF527/)

Page 3: Bioinformatics Resources and Tools on the Web: A Primer.

What is bioinformatics?

Page 4: Bioinformatics Resources and Tools on the Web: A Primer.

Examples of Bioinformatics

• Database interfaces– Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, …

• Sequence alignment– BLAST, FASTA

• Multiple sequence alignment– Clustal, MultAlin, DiAlign

• Gene finding– Genscan, GenomeScan, GeneMark, GRAIL

• Protein Domain analysis and identification– pfam, BLOCKS, ProDom,

• Pattern Identification/Characterization– Gibbs Sampler, AlignACE, MEME

• Protein Folding prediction– PredictProtein, SwissModeler

Page 5: Bioinformatics Resources and Tools on the Web: A Primer.

Things to know and remember about using web server-based tools

• You are using someone else’s computer

• You are (probably) getting a reduced set of options or capacity

• Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally

Page 7: Bioinformatics Resources and Tools on the Web: A Primer.

NCBI (http://www.ncbi.nlm.nih.gov/)

• Entrez interface to databases– Medline/OMIM– Genbank/Genpept/Structures

• BLAST server(s)– Five-plus flavors of blast

• Draft Human Genome• Much, much more…

Page 8: Bioinformatics Resources and Tools on the Web: A Primer.

EBI (http://www.ebi.ac.uk/)

• SRS database interface– EMBL, SwissProt, and many more

• Many server-based tools– ClustalW, DALI, …

Page 9: Bioinformatics Resources and Tools on the Web: A Primer.

SwissProt (http://expasy.cbr.nrc.ca/sprot/)

• Curation!!!– Error rate in the information is greatly reduced in

comparison to most other databases.

• Extensive cross-linking to other data sources• SwissProt is the ‘gold-standard’ by which

other databases can be measured, and is the best place to start if you have a specific protein to investigate

Page 10: Bioinformatics Resources and Tools on the Web: A Primer.

A few more resources to be aware of

• Human Genome Working Draft– http://genome.ucsc.edu/

• TIGR (The Institute for Genomics Research)– http://www.tigr.org/

• Celera– http://www.celera.com/

• (Model) Organism specific information:– Yeast: http://genome-www.stanford.edu/Saccharomyces/– Arabidopis: http://www.tair.org/– Mouse: http://www.jax.org/– Fruitfly: http://www.fruitfly.org/– Nematode: http://www.wormbase.org/

• Nucleic Acids Research Database Issue– http://nar.oupjournals.org/ (First issue every year)

Page 11: Bioinformatics Resources and Tools on the Web: A Primer.

Example 1: Searching a new genome for a specific protein

• Specific problem: We want to find the closest match in C. elegans of D. melanogaster protein NTF1, a transcription factor

• First- understanding the different forms of blast

Page 12: Bioinformatics Resources and Tools on the Web: A Primer.

The different versions of BLAST

Page 13: Bioinformatics Resources and Tools on the Web: A Primer.

1st Step: Search the proteins

• blastp is used to search for C. elegans proteins that are similar to NTF1

• Two reasonable hits are found, but the hits have suspicious characteristics – besides the fact that they weren’t included in the

complete genome!

Page 14: Bioinformatics Resources and Tools on the Web: A Primer.

2nd Step: Search the nucleotides

• tblastn is used to search for translations of C. elegans nucleotide that are similar to NTF1

• Now we have only one hit – How are they related?

Page 15: Bioinformatics Resources and Tools on the Web: A Primer.

Conclusion: Incorrect gene prediction/annotation

• The two predicted proteins have essentially identical annotation

• The protein-protein alignments are disjoint and consecutive on the protein

• The protein-nucleotide alignment includes both protein-protein alignments in the proper order

• Why/how does this happen?

Page 16: Bioinformatics Resources and Tools on the Web: A Primer.

Final(?) Check: Gene prediction

• Genscan is the best available ab initio gene predictor– http://genes.mit.edu/GENSCAN.html

• Genscan’s prediction spans both protein-protein alignments, reinforcing our conclusion of a bad prediction

Page 17: Bioinformatics Resources and Tools on the Web: A Primer.

Ab initio vs. similarity vs. hybrid models for gene finding

• Ab initio: The gene looks like the average of many genes– Genscan, GeneMark, GRAIL…

• Similarity: The gene looks like a specific known gene– Procrustes,…

• Hybrid: A combination of both– Genomescan (http://genes.mit.edu/genomescan/)

Page 18: Bioinformatics Resources and Tools on the Web: A Primer.

A similar example: Fruitfly homolog of mRNA localization protein VERA

• Similar procedure as just described– Tblastn search with BLOSUM45 produces an unexpected exon

• Conclusion: Incomplete (as opposed to incorrect) annotation– We have verified the existence of the rare isoform through RT-

PCR

Page 19: Bioinformatics Resources and Tools on the Web: A Primer.

Another example: Find all genes with pdz domains

• Multiple methods are possible

• The ‘best’ method will depend on many things– How much do you know about the domain?– Do you know the exact extent of the domain?– How many examples do you expect to find?

Page 20: Bioinformatics Resources and Tools on the Web: A Primer.

Some possible methods if the domain is a known domain:

• SwissProt – text search capabilities– good annotation of known domains– crosslinks to other databases (domains)

• Databases of known domains:– BLOCKS (http://blocks.fhcrc.org/)– Pfam (http://pfam.wustl.edu/)– Others (ProDom, ProSite, DOMO,…)

Page 21: Bioinformatics Resources and Tools on the Web: A Primer.

Determination of the nature of conservation in a domain

• For new domains, multiple alignment is your best option– Global: clustalw– Local: DiAlign– Hidden Markov Model: HMMER

• For known domains, this work has largely been done for you– BLOCKS– Pfam

Page 22: Bioinformatics Resources and Tools on the Web: A Primer.

If you have a protein, and want to search it to known domains

• Search/Analysis tools– Pfam– BLOCKS– PredictProtein

(http://cubic.bioc.columbia.edu/predictprotein/predictprotein.html)

Page 23: Bioinformatics Resources and Tools on the Web: A Primer.

Different representations of conserved domains

• BLOCKS– Gapless regions– Often multiple blocks for one domain

• PFAM– Statistical model, based on HMM– Since gaps are allowed, most domains have only

one pfam model

Page 24: Bioinformatics Resources and Tools on the Web: A Primer.

Conclusions

• We have only touched small parts of the elephant

• Trial and error (intelligently) is often your best tool

• Keep up with the main five sites, and you’ll have a pretty good idea of what is happening and available