Bioinformatics Workshop 1: Sequences and Similarity Searches
• Open a web browser and type in the URL:
  - informatics.gurdon.cam.ac.uk/online/workshops
  - Bookmark this page
• Click on the link to the file:
  - useful-websites.html
  - Bookmark this page too
  - It also contains links to the example sequence files used in the workshop, and the presentations themselves
Gene Predictions
Given:
- coding sequence must run from ATG to a STOP codon, in-frame
- introns (GT ... AG) can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
We note that, in the absence of EST evidence, it is only really possible to predict coding sequence with any confidence (and even then...).
So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.
Theoretical/Predicted Sequences
[Figure: genome -> predicted gene model (exons 1 2 3 4) -> predicted transcript -> predicted protein]
We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.
Sequences for a model organism
ESTs - millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels
cDNAs - tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences
Genomes - one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation
If the genetic difference means they can no longer interbreed to produce fertile offspring, then we have a new species...
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence...
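The two comparisons above can be put on a numerical footing by counting identical positions. This is only a minimal sketch for ungapped, equal-length sequences; the function name and the variable names (`reference`, `closer`, `more_diverged`) are labels chosen here for illustration, not terms from the workshop.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions carrying the same base in two ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Sequences copied from the slide; the names are illustrative only.
reference = "ATGCATGCTGCCAACGGATGCCCTG"
closer = "ATGAAAGCCGCCTACGACAGTCCTG"
more_diverged = "ATGGAAGGCGCTTAGGATAGTCCAG"

print(percent_identity(reference, closer))         # first pair on the slide
print(percent_identity(reference, more_diverged))  # second pair on the slide
```

Identity alone says nothing about whether a match could have arisen by chance, which is the question the E-value sections address.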
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
[Figure: gene A duplicated within one genome, giving A and A']
Gene duplication through internal genome duplication.
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
[Figure: copies of gene A (A, A') in two frog species, orange and green]
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Figure labels: gene A in the ancestor and in the modern species; cyclin b1]
Function Conserved Longer than Detectable Similarity
[Figure labels: start from first self-replicating sequence; living organisms; whole genome duplication; local duplication; same function; detectable similarity]
Redundancy in the Genetic Code
Codons              Letter  Amino acid
GCA GCC GCG GCT     A       alanine
TGC TGT             C       cysteine
GAC GAT             D       aspartate
GGA GGC GGG GGT     G       glycine
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
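As a small illustration of this point, the sketch below translates two made-up DNA sequences that differ only at third codon positions. The codon table is limited to the four amino acids shown on the slide, and the two sequences are invented for the example.

```python
# Codon table restricted to the amino acids listed on the slide.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna: str) -> str:
    """Translate complete codons in frame 1 using the partial table above."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_mismatches(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical sequences: every third base differs between them.
dna_1 = "GCATGCGACGGA"
dna_2 = "GCCTGTGATGGG"

print(count_mismatches(dna_1, dna_2), "DNA differences out of", len(dna_1))
print(translate(dna_1), translate(dna_2))
```

A third of the DNA positions differ, yet the two translations are identical, which is why the protein-level comparison retains signal that the DNA-level comparison loses.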
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison:
- Go to NCBI ORF Finder
- paste the sequence
- hit OrfFind
- identify the longest ORF and click on it
- on the next screen hit Accept
- change View to Fasta protein and hit View
- copy the sequence to Blast2Seqs
- do the same with the other sequence
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers: Exercise 1
The Essential Task
[Figure labels: experiment; data mining; gene sequence: what is its function?; database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3; predicted protein: Gravin-like]
We can only do this because of implied function based on orthology.
Functional Orthologs
[Figure: Human gene (function known, annotation 'Gravin' available) and Xenopus gene (function unknown), linked by sequence similarity as orthologs, hence same function]
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
[Figure: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
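The reciprocal step can be sketched as follows, assuming the two BLAST searches have already been run and reduced to a "best hit" per query. The dictionaries, gene names and the function name are all hypothetical placeholders; this is not the output format of any BLAST program.

```python
def reciprocal_best_hits(best_a_to_b: dict, best_b_to_a: dict) -> list:
    """Return (a, b) pairs where b is a's best hit AND a is b's best hit."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best hits: frog proteins searched against human proteins...
frog_to_human = {
    "frog_cyclinB1": "human_CCNB1",
    "frog_geneX": "human_CCNB1",   # also hits CCNB1, but not reciprocally
    "frog_geneY": "human_geneQ",
}
# ...and human proteins searched back against frog proteins.
human_to_frog = {
    "human_CCNB1": "frog_cyclinB1",
    "human_geneQ": "frog_geneZ",
}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Only the pair that is each other's best hit survives, which is why the method needs the complete protein sets of both organisms: a missing sequence can change which hit is "best".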
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
[Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2]
These of course represent chromosomal re-arrangements since these organisms diverged.
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although an alignment padded with gaps may have no mismatches, it is not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
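To make the trade-off concrete, here is a rough sketch that scores an alignment that has already been written out, with '-' marking gaps. The match, mismatch and gap values are arbitrary illustrative numbers, not the defaults of BLAST or any other program.

```python
def alignment_score(row_a: str, row_b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -2) -> int:
    """Score two aligned rows; '-' marks a gap. Penalty values are illustrative."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row the previous column's gap was in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # First position of a gap pays gap_open; each further position pays gap_extend.
            score += gap_extend if previous_gap_row == gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            previous_gap_row = None
    return score


# Same two sequences aligned two ways: one mismatch, or two gaps and no mismatch.
print(alignment_score("ACGTTGCA", "ACGATGCA"))
print(alignment_score("ACGT-TGCA", "ACG-ATGCA"))
```

With these penalties the mismatch-free but gapped alignment scores lower than the ungapped one with a single mismatch, which is the behaviour the gap cost is meant to produce.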
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score of (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
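The word-lookup idea can be sketched with an ordinary dictionary. This is a toy illustration of exact word seeding only, with a deliberately small word size and an invented two-sequence database; real BLAST indexing and extension are considerably more involved.

```python
from collections import defaultdict


def build_word_index(database: dict, word_size: int) -> dict:
    """Map every word of length word_size to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query: str, index: dict, word_size: int) -> list:
    """Return (sequence name, query position, database position) for each exact word hit."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for name, dpos in index.get(word, []):
            seeds.append((name, qpos, dpos))
    return seeds


# Hypothetical miniature database and query.
database = {"seq1": "TTACGTACGGA", "seq2": "GGGGCCCCTTT"}
index = build_word_index(database, word_size=4)
seeds = find_seeds("ACGTAC", index, word_size=4)
print(seeds)
```

Building the index is done once per database; each query then needs only dictionary lookups, and only sequences with at least one seed are passed on for the more expensive extension step. Seeds that share the same offset (database position minus query position) lie on the same diagonal and would be extended together.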
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
Here is a 'typical' weak alignment from BLASTp:
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
Also "expect value" or "expectation".
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: database sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with chance matches to 'A' marked beneath]
For a 1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
For a 2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
For a 3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' (60-mer):
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
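The arithmetic above can be reproduced in a few lines. This is the same deliberately naive model (exact, ungapped matches, all four bases equally likely), not the statistics BLAST itself uses; the function name is just a label for this sketch.

```python
def expected_exact_matches(database_letters: float, query_length: int,
                           alphabet_size: int = 4) -> float:
    """Naive expected number of exact chance matches of a query in a database."""
    return database_letters / (alphabet_size ** query_length)


refseq_letters = 5.0e8  # rounded RefSeq mRNA size quoted above

for length in (1, 2, 3, 60):
    print(length, f"{expected_exact_matches(refseq_letters, length):.1e}")
```

For the 60-mer this gives a value slightly below the 5.0 x 10^-28 quoted above, because 4^60 is about 1.3 x 10^36 rather than exactly 10^36; the order of magnitude is what matters here.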
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get...
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many matches to the motif ACC[TG]TA would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
- Expect 'A' every 4 nt
- Expect 'ACC' every 4 x 4 x 4 = 64 nt
- Expect 'T or G' every 2nd nt
- Expect 'ACC[TG]' every 64 x 2 = 128 nt
- Expect 'TA' every 4 x 4 = 16 nt
- Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If ACC[TG]TAA is also allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
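The ACC[TG]TA calculation can be checked with a short script. It assumes all four bases are equally likely, looks at one strand only and ignores end effects; the function name and the tiny motif parser are written just for this sketch and only understand plain bases and [..] alternatives.

```python
import re


def motif_probability(motif: str) -> float:
    """Chance that a random position starts a match to the motif (equal base frequencies)."""
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        allowed = len(token) - 2 if token.startswith("[") else 1
        probability *= allowed / 4
    return probability


p = motif_probability("ACC[TG]TA")
print("one chance match every", 1 / p, "nt")
print("expected matches in 10,000 nt:", 10_000 * p)
```

Allowing a second, equally likely motif as an alternative simply adds its expected count to the first.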
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
For a 1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
For a 2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
For a 3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' (60-mer):
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26 (was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
Database  Size                                E-value
RefSeq    5.0 x 10^8 letters                  5.0e-28
nr        1.4 x 10^10 letters (30 x bigger)   1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because:
[Figure labels: BLAST: Similarity + Probability; match length, PI, size of database...; ORTHOLOGY; biological knowledge; nature of query sequence; phylogenetic relationship]
Are there any useful guidelines though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
[Figure: E-value scale running from larger/worse to smaller/better, marked at 10^-5, 10^-10, 10^-40, 10^-100 and 0.0, with labels 'fantasy', 'borderline', 'encouraging', 'pretty good' and 'can't get better']
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI...
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue  Can be substituted by
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y
N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W
H        N Q Y
K        R Q E
R        Q K
D        N E
E        D Q K
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th (3 acceptable residues out of 20 is roughly 1 in 7).
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
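The two protein estimates above can be compared side by side. As before this is a back-of-envelope model, not how BLAST computes E-values; the 1/7 per-position chance is the rough figure from the slide, and the function name is chosen for this sketch.

```python
def expected_matches(database_letters: int, query_length: int,
                     per_position_chance: float) -> float:
    """Naive expected number of chance matches, given a per-position match probability."""
    return database_letters * per_position_chance ** query_length


refseq_protein_letters = 347_895_532

exact_only = expected_matches(refseq_protein_letters, 20, 1 / 20)
with_substitutions = expected_matches(refseq_protein_letters, 20, 1 / 7)

print(f"exact matches only:     {exact_only:.1e}")
print(f"allowing substitutions: {with_substitutions:.1e}")
print(f"ratio:                  {with_substitutions / exact_only:.1e}")
```

Relaxing what counts as a 'match' at each position changes the estimate by about nine orders of magnitude for a 20-residue query, which is the scale of discrepancy being explained.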
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In its simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
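As a rough illustration of how these flags fit together, the sketch below assembles a blastall command line as a list of arguments and prints it, without running anything. The query file name `query.fasta` and database name `refseq_mrna` are hypothetical placeholders, and the helper function is written for this example only.

```python
import shlex


def blastall_command(program: str, query_file: str, database: str, **options) -> list:
    """Build (but do not run) a blastall argument list from the flags described above."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args


# Hypothetical file and database names; -e, -W and -F are taken from the list above.
cmd = blastall_command("blastn", "query.fasta", "refseq_mrna", e="1e-10", W=11, F="F")
print(shlex.join(cmd))
```

Keeping the command as a list of arguments, rather than one long string, avoids quoting problems if it is later passed to something like `subprocess.run`.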
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
2. Low complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B: ON / OFF
Results for Exercise 2C: ON / OFF
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out
Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)
So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences
exons 1 2 3 4
TheoreticalPredicted Sequences
genome
predicted gene modelexons 1 2 3 4
Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent
predicted transcript
predicted protein
Sequences for a model organism
ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels
cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences
Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
lsquoOtherrsquo-logsWhat about gene duplication after speciation
How can we describe the relationship(s) between the various copies of gene A in the two frogs
Bear in mind that understanding gene function is more important than semanticshellip
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
A
A
A
Arsquo A
The Essential Paradigm
1 any group of modern species can be traced back to some extinct common ancestor
A
A
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
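The trade-off between matches and gaps can be made concrete with a toy scoring function. The score values (+1 match, -1 mismatch, -2 per gap position) are assumptions chosen for illustration, not BLAST's defaults, and a single per-position gap cost is used instead of separate opening and extension costs.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # pay the cost of introducing a gap
        elif a == b:
            score += match      # reward for a matching position
        else:
            score += mismatch   # penalty for a mismatch
    return score

print(score_alignment("ACGTACGT", "ACGTTCGT"))    # one mismatch: 7 - 1 = 6
print(score_alignment("ACGTA-CGT", "ACGT-TCGT"))  # two gaps instead: 7 - 4 = 3
```

With these costs the ungapped alignment wins, even though the gapped one has no mismatches at all.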
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know (from a pre-built index of the database) which sequences have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest-scoring alignments are reported.
But we can potentially miss alignments with no word-sized stretch in common: consider BLASTn with a default word size of 11.
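A minimal sketch of the word-lookup step, assuming a simple in-memory index. This is not BLAST's actual data structure; the sequences are invented, and the word size is 4 only because BLASTn's default of 11 is too long for a toy example.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word of length word_size to (sequence name, offset) pairs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].append((name, i))
    return index

def find_seeds(query, index, word_size):
    """Return (name, query offset, database offset) for each shared word."""
    seeds = []
    for q in range(len(query) - word_size + 1):
        for name, d in index.get(query[q:q + word_size], []):
            seeds.append((name, q, d))
    return seeds

# Toy database (hypothetical sequences).
db = {"seq1": "TTACGTACGGA", "seq2": "GGGGCCCCTTTT"}
index = build_word_index(db, 4)
print(find_seeds("ACGTAC", index, 4))  # three seeds, all in seq1
print(find_seeds("ACGAAC", index, 4))  # no shared word, so no seed at all
```

The second query differs from the first at a single position, yet it shares no 4-letter word with the database, so it would never reach the extension step.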
Here is a 'typical' weak alignment from BLASTp:
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Match   F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA, 272619 sequences, 503566580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3329110 sequences, 14601814750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4800000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
[Figure: CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with each 'A' marked as a match.]
Query of 1 nt: Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Query of 2 nt: Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Query of 3 nt: Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
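The same first-principles arithmetic as a short Python sketch. The function name is made up for illustration, and the model is the deliberately naive one used above: exact, ungapped matches with all four nucleotides equally likely.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Chance expectation for an exact, ungapped match of the whole query."""
    return database_letters / alphabet_size ** query_length

REFSEQ_MRNA_LETTERS = 503566580  # from the database statistics above

for n in (1, 2, 3, 60):
    print(n, f"{expected_exact_matches(REFSEQ_MRNA_LETTERS, n):.1e}")
```

Because the code uses the exact letter count and the exact value of 4^60 (about 1.3 x 10^36) rather than the rounded 10^36, the 60-mer figure comes out a little smaller than the hand calculation, but of the same order of magnitude.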
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
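The motif arithmetic can be checked with a small Python sketch. The helper name and the tiny motif syntax it accepts (plain letters plus bracketed alternatives) are assumptions made for this example.

```python
import re

def expected_spacing(motif, alphabet_size=4):
    """Average number of nt between chance occurrences of a simple motif.

    Supports plain letters and bracketed alternatives such as [TG].
    """
    spacing = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        # A bracketed token matches any of its letters; a plain letter only itself.
        choices = len(token) - 2 if token.startswith("[") else 1
        spacing *= alphabet_size / choices
    return spacing

spacing = expected_spacing("ACC[TG]TA")
print(spacing)          # 2048.0
print(10000 / spacing)  # about 4.9 chance sites per 10k
```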
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
[Figure: CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with each 'A' marked as a match.]
Query of 1 nt: Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Query of 2 nt: Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Query of 3 nt: Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger), E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLASTn: BLAST a 500 nt sequence against a database.
Get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query, get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
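As a quick check that the two alignments really do share the same percent identity, here is a small Python sketch (the function name is an assumption). Both rows of each alignment are taken from the examples above.

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned columns where both rows carry the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    same = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * same / len(row_a)

q1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
s1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
q2 = "ACGTACGTACGTACGT"
s2 = "ACGCACCTTCGTAGGT"
print(percent_identity(q1, s1), percent_identity(q2, s2))  # 75.0 75.0
```

Identity alone cannot tell these two matches apart; only the match length, and hence the E-value, does.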
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
The difficulty is because:
[Figure: BLAST vs ORTHOLOGY. BLAST: similarity + probability (match length, PI, size of database...). ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship.]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
[Figure: E-value scale, larger = worse, smaller = better. Tick marks: 10^-5, 10^-10, 10^-40, 10^-100, 0.0. Labels along the scale: fantasy, borderline, encouraging, pretty good, can't get better.]
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347895532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue: acceptable substitutes
A: S; C: none; F: L W Y; G: none; I: L M V; L: I M F V; M: I L V; P: none; V: I L M; W: F Y
N: D H S; Q: R E H K; S: A N T; T: S; Y: H F W
H: N Q Y; K: R Q E; R: Q K
D: N E; E: D Q K
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others (so about 3 of the 20 amino acids are acceptable at each position), giving each position about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
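A short Python sketch of the same estimate. The substitution table is transcribed from the Amino Acid Substitutions slide as read here, so treat the exact groupings as an assumption; real BLAST uses BLOSUM or PAM scores rather than a yes/no table.

```python
# Residue -> acceptable substitutes, as read from the substitution slide.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV", "L": "IMFV",
    "M": "ILV", "P": "", "V": "ILM", "W": "FY", "N": "DHS", "Q": "REHK",
    "S": "ANT", "T": "S", "Y": "HFW", "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

DB_LETTERS = 347895532  # RefSeq protein letters quoted above
QUERY_LENGTH = 20

# Average number of acceptable substitutes per residue ("about 2 others").
mean_subs = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)
exact = DB_LETTERS / 20 ** QUERY_LENGTH    # exact matches only
relaxed = DB_LETTERS / 7 ** QUERY_LENGTH   # ~1/7 chance per position

print(mean_subs)
print(f"{exact:.1e}", f"{relaxed:.1e}")
```

The exact-match figure comes out slightly below the rounded hand calculation because 20^20 is a little more than 10^26; the relaxed figure agrees with the ~4.4 x 10^-9 estimate.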
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
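As an illustration of that simplest form, the Python sketch below assembles a blastall command from the three options just listed. The query file and database names are hypothetical placeholders, and the command is only printed, not run.

```python
import shlex

def blastall_command(program, query_file, database):
    """Build the argument list for the simplest blastall call (-p, -i, -d)."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]

# File and database names are hypothetical placeholders.
cmd = blastall_command("blastn", "query.fasta", "refseq_rna")
print(shlex.join(cmd))
# To actually run it (needs the legacy NCBI BLAST tools installed):
#   import subprocess; subprocess.run(cmd, check=True)
```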
Not All Parameters are...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page / Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware.
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B: filtering ON vs OFF
Results for Exercise 2C: filtering ON vs OFF
There is a sequence error: an extra G at position 117 in the sequence. cDNA (from position 117) above, genomic sequence below:
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page / SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-value scale (larger = worse, smaller = better): 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale, from worse to better: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI:
Really too big a discrepancy to easily explain with hand waving…
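Amino Acid Substitutions
Residue: residues that can substitute for it
A: S
C: (none)
F: L, W, Y
G: (none)
I: L, M, V
L: I, M, F, V
M: I, L, V
P: (none)
V: I, L, M
W: F, Y
N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W
H: N, Q, Y
K: R, Q, E
R: Q, K
D: N, E
E: D, Q, K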
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th (the residue itself plus about 2 substitutes is 3 acceptable residues out of 20).
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
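The two back-of-envelope estimates above can be reproduced in a few lines. This is only a sketch of the slide arithmetic, assuming every position matches independently with the same chance; it is not how BLAST itself computes E-values, and the function name is made up for illustration.

```python
def naive_evalue(db_letters, query_len, match_chance):
    """Expected number of chance matches: database size multiplied by the
    probability that every query position 'matches' independently."""
    return db_letters * match_chance ** query_len

DB_LETTERS = 347_895_532  # RefSeq protein letters quoted above

exact = naive_evalue(DB_LETTERS, 20, 1 / 20)   # identical residues only
relaxed = naive_evalue(DB_LETTERS, 20, 1 / 7)  # ~3 acceptable residues out of 20

print(f"exact matches only : {exact:.1e}")
print(f"with substitutions : {relaxed:.1e}")
```

The two results land close to the rounded slide values; the small differences come from the slide rounding 20^20 to 10^26.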
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options.
In its simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
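As an illustration of the three options just described, the sketch below assembles a blastall command line in Python and prints it rather than running it. The query file name and database name are placeholders invented for this example, and the helper function is hypothetical.

```python
import shlex

def blastall_command(program, query_file, database):
    """Build the argument list for the simplest blastall invocation."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]

# Hypothetical file and database names, for illustration only.
cmd = blastall_command("blastn", "my_query.fasta", "refseq_rna")
print(shlex.join(cmd))
```

On a machine where blastall and a formatted database are actually installed, the same list could be handed to subprocess.run(cmd) to launch the search.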
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
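To show how the -e and -m options fit together, here is a small sketch that reads tabular BLAST output and keeps only the hits below an E-value cut-off. It assumes the 12-column tab-separated layout written by blastall -m 8; the two rows of data are invented purely to exercise the function, and the column names are labels chosen for this example.

```python
import csv
import io

# Column order assumed for blastall -m 8 tabular output.
COLUMNS = ["query", "subject", "pct_identity", "length", "mismatches",
           "gap_opens", "q_start", "q_end", "s_start", "s_end",
           "evalue", "bit_score"]

def hits_below(tabular_text, max_evalue):
    """Return (subject, evalue) pairs whose E-value is <= max_evalue."""
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    kept = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) <= max_evalue:
            kept.append((hit["subject"], float(hit["evalue"])))
    return kept

# Made-up rows: one strong hit and one that would be expected by chance.
example = (
    "q1\tsubjA\t98.5\t200\t3\t0\t1\t200\t11\t210\t2e-90\t350\n"
    "q1\tsubjB\t75.0\t16\t4\t0\t1\t16\t5\t20\t3.0\t22.3\n"
)
print(hits_below(example, 1e-5))
```

Passing -e to BLAST applies the same kind of cut-off before the results are written, so filtering afterwards is mainly useful when you want to try several thresholds on one set of results.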
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn – low complexity filtering ON
Results for Exercise 2B: filtering ON and OFF
Results for Exercise 2C: filtering ON and OFF
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117) AGAAAAGAAGAAACATGGCAATGGATCAGAA
Match      |||||||||||||||| ||||||||||||||
Genomic    AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
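A short sketch of how such a limit string can be put together when several organisms are wanted. The helper name is invented for this example, and the species are arbitrary; the output is simply text to paste into the Entrez query box.

```python
def organism_query(*organisms):
    """Join organism names into one Entrez limit using the [ORGN] field,
    combined with logical OR."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)

print(organism_query("Drosophila melanogaster"))
print(organism_query("Xenopus tropicalis", "Danio rerio"))
```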
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Gene Predictions
Given:
- coding sequence must run from ATG – STOP codon, in-frame
- introns (GT … AG) can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…).
So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.
Theoretical/Predicted Sequences
Figure labels: genome; predicted gene model (exons 1 2 3 4); predicted transcript; predicted protein
We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist – especially where supporting EST evidence is weak or non-existent.
Sequences for a model organism
ESTs – millions, £10 each
Cheap to sequence – so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels
cDNAs – tens of thousands, £1000 each
Expensive – but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences
Genomes – one, £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring – then we have a new species…
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences; this is what we call homology – detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
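As a rough check on the two alignments above, percent identity can be counted directly. This is a minimal sketch for ungapped sequences of equal length; the function and variable names are chosen just for this example.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions carrying the same letter in two ungapped
    sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

reference = "ATGCATGCTGCCAACGGATGCCCTG"
closer = "ATGAAAGCCGCCTACGACAGTCCTG"
further = "ATGGAAGGCGCTTAGGATAGTCCAG"

print(percent_identity(reference, closer))   # 17 of 25 positions match
print(percent_identity(reference, further))  # 13 of 25 positions match
```

For comparison, two unrelated random DNA sequences would be expected to agree at about 25% of positions.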
Computers Can Detect Homology
In fact computers are very good at this task – the two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…
Alignment 1:
Seq 1  ATGCATGCTGCCAACGGATGCCCTG
Match  ||| | || ||| |||   | ||||
Seq 2  ATGAAAGCCGCCTACGACAGTCCTG
Alignment 2:
Seq 1  GCTGACTCGTAGCGCTTAGCTAGCT
Match   |    ||    |   |   |   |
Seq 2  CCAACATCTAGCCAGATTAGTTAGT
Orthologs
Gene duplication through speciation.
The two copies of Gene A will now evolve independently, but will continue to have the ~same function.
They are ORTHOLOGS.
Paralogs
Gene duplication through internal genome duplication (A becomes A and A').
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
Example in the figure: cyclin b1 in both organisms
Function Conserved Longer than Detectable Similarity
Figure labels: start from first self-replicating sequence; same function; detectable similarity; living organisms; whole genome duplication; local duplication
Redundancy in the Genetic Code
GCA, GCC, GCG, GCT: A (alanine)
TGC, TGT: C (cysteine)
GAC, GAT: D (aspartate)
GGA, GGC, GGG, GGT: G (glycine)
'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for – so there is no evolutionary pressure against them…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
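The point can be seen with a toy translation. The sketch below uses only the codons listed above (not the full genetic code), and the two DNA sequences are made up so that they differ at every third position.

```python
# Partial codon table: only the four amino acids listed above.
CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}

def translate(dna):
    """Translate in-frame DNA, three letters at a time, with the partial table."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))

dna_1 = "GCAGGATGCGAC"
dna_2 = "GCTGGGTGTGAT"   # differs from dna_1 at every third position

differences = sum(a != b for a, b in zip(dna_1, dna_2))
print(translate(dna_1), translate(dna_2), differences)
```

The two sequences differ at 4 of 12 nucleotides (67% identity) yet encode exactly the same four residues.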
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align' – this will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers: Exercise 1
The Essential Task
experiment / data mining
gene sequence: what is its function?
database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wnt8, Troponin T3
match: Gravin-like
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.
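Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.
That is, the frog protein's best match among the human proteins should, when searched back against the frog proteins, return the original frog protein.
Figure labels: frog protein; database of human proteins; best match human protein; database of frog proteins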
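The reciprocal test itself is simple once the two BLAST runs have been reduced to 'best hit' tables. The sketch below assumes that reduction has already been done; the protein names and the tables are invented for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b AND b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return pairs

# Invented best-hit tables (protein -> top BLAST hit in the other species).
frog_to_human = {"frog_gravin": "human_gravin", "frog_x": "human_gravin"}
human_to_frog = {"human_gravin": "frog_gravin"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here 'frog_x' also hits the human protein best, but the human protein's best frog hit is 'frog_gravin', so only that pair is kept as a candidate ortholog pair.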
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, an alignment that has been given enough gaps can have no mismatches at all, yet it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
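A minimal sketch of the trade-off, scoring two already-aligned rows. The score values (+1, -1, -2) are arbitrary choices for this example rather than BLAST defaults, and a single flat cost per gap position is used instead of separate gap-opening and gap-extension costs.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# Ungapped: 6 matches and 2 mismatches.
print(alignment_score("ACGTACGT", "ACGAACTT"))
# Gappy: 6 matches and no mismatches, but four gap positions.
print(alignment_score("AC-GT-ACGT", "ACG-TA-CGT"))
```

With these values the ungapped alignment scores higher than the gappy one, even though the gappy one has no mismatches at all.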
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
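The word-matching step, and the way it can miss things, can be sketched as follows. This is a toy version with a word size of 4 and made-up sequences; real BLAST is considerably more elaborate, and the function names here are invented for the example.

```python
from collections import defaultdict

def word_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index

def seed_hits(query, target, word_size):
    """Return (query_pos, target_pos) pairs that share an exact word."""
    index = word_index(target, word_size)
    hits = []
    for q in range(len(query) - word_size + 1):
        for t in index.get(query[q:q + word_size], []):
            hits.append((q, t))
    return hits

target = "TTGACGTACGGATCCA"
print(seed_hits("ACGTACGG", target, 4))   # exact substring: a run of seeds
print(seed_hits("ACCTAGGGA", target, 4))  # 7 of 9 bases match, but no seed
```

The second query matches the target at 7 of 9 positions, yet because the mismatches are spread out it shares no 4-letter word, so a word-based search would never try to extend it.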
Here is a 'typical' weak alignment from BLASTp:
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Match   F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
Also "expect value" or "expectation".
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example database sequence (the slide highlights each chance match to 'A'):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
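The same first-principles estimate written out as a small function. It is a sketch of the slide reasoning only (exact, ungapped matches and equally likely letters), not the statistics BLAST really uses; the function name is invented for this example.

```python
def expected_chance_matches(db_letters, query_len, alphabet_size=4):
    """Database letters divided by alphabet_size ** query_len."""
    return db_letters / alphabet_size ** query_len

REFSEQ_MRNA = 5.0e8  # approximate letter count quoted above

for n in (1, 2, 3, 60):
    print(n, f"{expected_chance_matches(REFSEQ_MRNA, n):.1e}")
```

Note that 4^60 is about 1.3 x 10^36, so the exact figure for the 60-mer comes out a little below the value obtained by rounding to 10^36.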
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many ACC[TG]TA sites would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
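The answer above can be checked with a few lines of code. This sketch handles only plain letters and bracketed alternatives such as [TG], assumes all four bases are equally likely, and ignores the reverse strand and edge effects; the function name is made up for the example.

```python
import re

def expected_sites(motif, sequence_len, alphabet_size=4):
    """Expected chance occurrences of a simple motif in random sequence."""
    # Split the motif into tokens: a bracketed class or a single letter.
    tokens = re.findall(r"\[[A-Z]+\]|[A-Z]", motif)
    spacing = 1.0
    for token in tokens:
        allowed = len(token) - 2 if token.startswith("[") else 1
        spacing *= alphabet_size / allowed
    return sequence_len / spacing

print(expected_sites("ACC[TG]TA", 10_000))
```

The motif is expected once every 2048 nt, which gives just under 5 chance sites in 10 kb.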
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example database sequence (the slide highlights each chance match to 'A'):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26 (was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out
Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)
So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences
[Figure: gene model with exons 1-4]
Theoretical/Predicted Sequences
[Figure: genome, predicted gene model (exons 1-4), predicted transcript, predicted protein]
We’ve now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn’t lose sight of the fact that we don’t really know whether these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.
Sequences for a model organism
ESTs – millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels
cDNAs – tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences
Genomes – one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation
If the genetic difference means they can no longer interbreed to produce fertile offspring, then we have a new species…
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist’s attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
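As a rough illustration of challenge (b), the two ungapped comparisons on this slide can be scored by simply counting identical positions. The sketch below is not how BLAST scores alignments; the function name count_matches is invented here, and the only inputs are the four 25 nt sequences shown above.

```python
# Minimal sketch: count identical positions in two equal-length,
# already-aligned (ungapped) nucleotide sequences.
def count_matches(a, b):
    """Number of positions at which the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x == y for x, y in zip(a, b))


# The related pair and the second pair from the slide
related = ("ATGCATGCTGCCAACGGATGCCCTG", "ATGAAAGCCGCCTACGACAGTCCTG")
second = ("GCTGACTCGTAGCGCTTAGCTAGCT", "CCAACATCTAGCCAGATTAGTTAGT")

for name, (a, b) in (("related", related), ("second", second)):
    matches = count_matches(a, b)
    print(f"{name}: {matches}/{len(a)} identical ({100 * matches / len(a):.0f}%)")
```

With four equally likely nucleotides, two unrelated sequences are expected to agree at about one position in four, so an identity close to 25% carries no evidence of relatedness on its own.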
Orthologs
[Figure: gene A in an ancestral species and in each of the two species descended from it]
Gene duplication through speciation.
The two copies of Gene A will now evolve independently, but will continue to have the ~same function.
They are ORTHOLOGS.
Paralogs
[Figure: gene A duplicated within a single genome to give A and A’]
Gene duplication through internal genome duplication.
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
‘Other’-logs
What about gene duplication after speciation?
[Figure: gene A in the ancestor species, a single copy A in the green frog, and two copies A and A’ in the orange frog]
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Figure: gene A (e.g. cyclin b1) in the extinct common ancestor and in each of the modern species]
Function Conserved Longer than Detectable Similarity
[Figure: timeline starting from the first self-replicating sequence and ending at living organisms; labels: same function, detectable similarity, whole genome duplication, local duplication]
Redundancy in the Genetic Code
GCA, GCC, GCG, GCT = A (alanine)
TGC, TGT = C (cysteine)
GAC, GAT = D (aspartate)
GGA, GGC, GGG, GGT = G (glycine)
‘Synonymous’ or ‘silent’ mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
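A small Python sketch of this point, using only the twelve codons listed on the slide. The two DNA sequences are invented for the example; they differ at every third position yet translate to the same peptide.

```python
# Partial codon table: just the four amino acids shown on the slide.
CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna):
    """Translate codon by codon (only works for codons in the table above)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


dna1 = "GCATGCGACGGA"  # made-up example
dna2 = "GCCTGTGATGGG"  # same codons with the third base changed

differences = sum(x != y for x, y in zip(dna1, dna2))
print(translate(dna1), translate(dna2), f"{differences} nucleotide differences")
```

At the DNA level the two sequences share only 8 of 12 positions, while at the protein level they are identical.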
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, ‘surfeit1’ for frog and fly.
Go to the NCBI BLAST home page, then ‘Align two sequences’ (bottom left, ‘special’ panel). Paste one sequence into each window and hit ‘Align’. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – on the next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit ‘Align’, change the ‘Program’ (top left) to blastp…
Answers: Exercise 1
The Essential Task
experiment, data mining
gene sequence: what is its function?
[Figure: the gene sequence is compared against a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3); the match is Gravin-like]
We can only do this because of implied function based on orthology.
Functional Orthologs
[Figure: Human gene (function known, annotation ‘Gravin’ available) and Xenopus gene (function unknown); labels: sequence similarity, orthologs, same function, similar shape]
But we know that function is largely determined by shape, which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
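[Figure: frog protein, searched against a database of human proteins, gives a best match human protein; that protein is searched against a database of frog proteins; an x marks one of the arrows]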
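The reciprocal best hit test itself is simple once the two sets of BLAST results are in hand. The sketch below assumes the best hit for every protein has already been extracted into two dictionaries; the protein names are hypothetical placeholders, not real search results.

```python
# Minimal sketch of the Reciprocal Best BLAST test.
# best_a_to_b maps each protein of organism A to its best hit in organism B,
# and best_b_to_a does the same in the other direction.
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep only the pairs where each protein is the other's best hit."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best-hit tables (placeholder names, for illustration only)
frog_to_human = {"frog_p1": "human_p1", "frog_p2": "human_p7"}
human_to_frog = {"human_p1": "frog_p1", "human_p7": "frog_p9"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_p2 is rejected because its best human match points back to a different frog protein. This is also why the method needs complete protein sets for both organisms: a missing protein can silently change which hit counts as ‘best’.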
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
[Figure: syntenic sections shared between human chromosome 5, mouse chromosome 10 and mouse chromosome 2]
These of course represent chromosomal re-arrangements since these organisms diverged.
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as the top hit, and have a look at the Metazome alignment(s)…
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, an alignment that avoids every mismatch only by inserting lots of gaps is not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of ‘finding gapped alignments’.
We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
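The trade-off between matches and gap costs can be sketched with a small dynamic-programming routine in the style of Needleman-Wunsch global alignment. This is one of the slow, exact methods, not what BLAST does, and the scores used (+1 match, -1 mismatch, -2 per gap position) are arbitrary values chosen for the illustration.

```python
# Minimal sketch: best global alignment score with a linear gap penalty.
# Only the score is computed, one row of the scoring table at a time.
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all gapped alignments of a against b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))         # one gap must be paid for
print(global_alignment_score("ACGT", "AGT", gap=0))  # gaps are free
```

With free gaps the score is simply the number of letters that can be matched in order, however many gaps that takes, which is exactly the kind of meaningless alignment the gap cost is there to prevent.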
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We ‘instantly’ know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
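The word-matching step can be sketched as a simple lookup table of every word in the target sequence. This is a simplified illustration of the idea only, not BLAST's actual implementation; the function names and the short test sequences are invented for the sketch.

```python
from collections import defaultdict


def build_word_index(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_seeds(query, target, word_size=11):
    """Return (query_position, target_position) pairs of exact word matches."""
    index = build_word_index(target, word_size)
    seeds = []
    for q in range(len(query) - word_size + 1):
        for t in index.get(query[q:q + word_size], []):
            seeds.append((q, t))
    return seeds


# Two 12 nt sequences that agree at 9 of 12 positions, but never for
# 11 letters in a row, so no seed is found at the default word size.
print(find_seeds("ACGTACGTACGT", "ACGAACGAACGA"))
print(find_seeds("ACGTACGTACGT", "ACGAACGAACGA", word_size=3))
```

Only pairs that share a seed go forward to the slower extension step, which is where the speed comes from and also why a similar pair with no shared word is never examined.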
Here is a ‘typical’ weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also “expect value” or “expectation”.
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: a stretch of database sequence, CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA, with occurrences of ‘A’ marked]
Query of 1 nucleotide: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Query of 2 nucleotides: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Query of 3 nucleotides: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = ‘ACGTCGA…CTGATTCG’, a 60-mer:
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
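The same first-principles arithmetic can be written as a one-line function. This is the simplified model from the slides (exact, ungapped matches and equally likely letters), not the statistics BLAST actually uses; the function name is made up for the sketch.

```python
# Simplified model from the slides: every extra query letter divides the
# number of expected chance matches by the alphabet size.
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches of a query in a database."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # RefSeq mRNA size quoted on the slide
for length in (1, 2, 3, 60):
    print(length, expected_chance_matches(refseq_letters, length))
```

For the 60-mer the exact divisor is 4^60, roughly 1.3 x 10^36, so the computed value is a little below the slide's figure, which rounds the divisor to 10^36.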
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
[BLAST result shown on slide]
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10 kb promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. either ACC[TG]TA, or the same motif with one extra base inserted after the [TG].
E-value Exercise Answer
ACC[TG]TA
Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4 x 4 x 4 = 64 nt
Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64 x 2 = 128 nt
Expect ‘TA’ every 4 x 4 = 16 nt
Expect ‘ACC[TG]TA’ every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10 kb by chance.
If ACC[TG]TAA is also allowed:
The two motifs independently have the same E-value.
To allow either means we expect twice as many.
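The answer above can be checked with a short Python sketch that reads the motif position by position. It assumes equally likely nucleotides and ignores edge effects at the ends of the sequence; the function name and the small motif syntax it accepts (plain letters plus [..] alternatives) are choices made for this illustration.

```python
import re


def expected_motif_count(motif, sequence_length):
    """Expected chance occurrences of a motif such as 'ACC[TG]TA'."""
    probability = 1.0
    # Each token is either a bracketed set of alternatives or a single base.
    for group, single in re.findall(r"\[([ACGT]+)\]|([ACGT])", motif):
        choices = len(set(group)) if group else 1
        probability *= choices / 4
    return sequence_length * probability


print(expected_motif_count("ACC[TG]TA", 10_000))
```

Each fixed base contributes a factor of 1/4 and the [TG] position a factor of 2/4, giving 1 in 2048 overall, in line with the hand calculation.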
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the same stretch of database sequence, with occurrences of ‘A’ marked]
Query of 1 nucleotide: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Query of 2 nucleotides: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Query of 3 nucleotides: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = ‘ACGTCGA…CTGATTCG’, a 60-mer:
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters, ~30 x bigger
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28.
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40 x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our ‘match’ criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full-length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again (a shorter match), but at an E-value = 5.0e-80.
Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~5.0 x 10^-19.
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~30.
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
So what’s the real problem?
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because:
[Figure labels: BLAST; ORTHOLOGY; BLAST: Similarity + Probability; biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database…]
Are there any useful guidelines, though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-value scale (larger = worse, smaller = better): 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale, from worse to better: fantasy, borderline, encouraging, pretty good, can’t get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.3 x 10^-18
But this is what we get if we run the BLAST at NCBI:
[BLAST result shown on slide]
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Residue: residues that can substitute for it
A: S; C: none; F: L, W, Y; G: none; I: L, M, V; L: I, M, F, V; M: I, L, V; P: none; V: I, L, M; W: F, Y
N: D, H, S; Q: R, E, H, K; S: A, N, T; T: S; Y: H, F, W
H: N, Q, Y; K: R, Q, E; R: Q, K
D: N, E; E: D, Q, K
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’, rather than 1/20th.
So now we get:
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
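The effect of allowing substitutions can be reproduced with the same simplified model, changing only the per-position chance of a ‘match’. The 1/20 and 1/7 figures are the rough values from the slides; real BLAST scoring uses a full substitution matrix, and the function name here is invented for the sketch.

```python
# Simplified model: expected chance matches when each position 'matches'
# with a given probability (1/20 exact, about 1/7 with substitutions).
def expected_protein_matches(database_letters, match_length, match_probability):
    """Expected number of chance matches of a given length."""
    return database_letters * match_probability ** match_length


refseq_protein_letters = 3.5e8  # rounded database size from the slide
exact = expected_protein_matches(refseq_protein_letters, 20, 1 / 20)
relaxed = expected_protein_matches(refseq_protein_letters, 20, 1 / 7)
print(f"exact matching:      {exact:.1e}")
print(f"with substitutions:  {relaxed:.1e}")
print(f"ratio:               {relaxed / exact:.1e}")
```

A modest change in the per-position probability, compounded over 20 positions, shifts the expected count by many orders of magnitude.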
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any ‘significant’ hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
[command line example shown on slide]
This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers
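As a rough sketch of how these options fit together, the Python below assembles a blastall command line from the flags listed above and prints it without running anything. The helper name, the file name query.fasta and the database name refseq_mrna are placeholders invented for the example, and the option values are arbitrary.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list; extra options map flag letter to value."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Placeholder file and database names, for illustration only
command = blastall_command("blastn", "query.fasta", "refseq_mrna", e=1e-5, W=11)
print(shlex.join(command))
```

Building the command as a list keeps each option and its value together, and the list could be handed to subprocess.run on a machine where blastall and a formatted database are actually installed.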
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page: Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page: TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the “Choose filter [ ] Low complexity” box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn – low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn – low complexity filtering ON
Results for Exercise 2B
[Figure: results with filtering ON and OFF]
Results for Exercise 2C
[Figure: results with filtering ON and OFF]
There is a sequence error, an extra G at position 117 in the cDNA sequence:
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the “Limit by entrez query” box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
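A tiny Python sketch of how such a limit string can be put together. The helper name is invented, and the second organism is just an example of combining two terms with OR; the resulting text is what would be typed into the query box.

```python
# Minimal sketch: build an Entrez organism limit from one or more names,
# using the [ORGN] field tag and OR as described above.
def organism_limit(*organisms):
    """Join organism names into a single Entrez query string."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


print(organism_limit("Drosophila melanogaster"))
print(organism_limit("Drosophila melanogaster", "Xenopus tropicalis"))
```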
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page: TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out
Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)
So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences
exons 1 2 3 4
TheoreticalPredicted Sequences
genome
predicted gene modelexons 1 2 3 4
Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent
predicted transcript
predicted protein
Sequences for a model organism
ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels
cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences
Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
lsquoOtherrsquo-logsWhat about gene duplication after speciation
How can we describe the relationship(s) between the various copies of gene A in the two frogs
Bear in mind that understanding gene function is more important than semanticshellip
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
A
A
A
Arsquo A
The Essential Paradigm
1 any group of modern species can be traced back to some extinct common ancestor
A
A
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word-Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest-scoring alignments are reported.
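The word-lookup step can be sketched as follows. This is a simplified illustration of the seeding idea only (exact words, no extension or scoring), not BLAST's actual implementation; the function names and the toy database are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) for every shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for name, dpos in index.get(word, []):
            seeds.append((name, qpos, dpos))
    return seeds


# Toy database: only s1 shares a 7-letter word with the query
database = {"s1": "TTACGTACGTTT", "s2": "GGGGGGGGGGGG"}
index = build_word_index(database, word_size=7)
print(find_seeds("ACGTACG", index, word_size=7))
```

Only sequences that yield a seed would go on to the more expensive extension step, which is where the speed-up comes from.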
But we can potentially miss alignments with no word-sized stretches in common: consider BLASTn with a default word size of 11.
Here is a 'typical' weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
Also called the "expect value" or "expectation".
The E-value is the number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
E-values From First Principles
Some database statistics (23rd July 2005):
Database NCBI RefSeq mRNA: 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database NCBI nr: 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
A A A A AA A A A AA AA
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
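The arithmetic above is easy to reproduce. The sketch below assumes the same simple model as the slide (exact matches only, all four bases equally likely); the function name is made up for the example. Note that the slide rounds 4^60 to 10^36, so the exact figure comes out slightly smaller than 5.0 x 10^-28.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches for a query of the given length."""
    return db_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # approximate size of RefSeq mRNA quoted above

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(refseq_letters, length)
    print(f"{length:>2}-letter query: {expected:.1e} expected chance matches")
```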
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many copies of a motif such as ACC[TG]TA would you expect to find by chance in a 10 kb promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA, or ACC[TG]TA with one additional base after the [TG].
E-value Exercise Answer: ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10 kb by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
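The exercise answer can be checked with a short calculation. This sketch assumes equal base frequencies and treats the optional extra base as 'N' (any nucleotide); that notation, the motif string 'ACC[TG]NTA' and the function names are assumptions made for the example.

```python
def allowed_bases_per_position(motif):
    """Number of acceptable bases at each motif position; [..] lists alternatives, N means any base."""
    counts, i = [], 0
    while i < len(motif):
        if motif[i] == "[":
            end = motif.index("]", i)
            counts.append(end - i - 1)  # number of letters inside the brackets
            i = end + 1
        else:
            counts.append(4 if motif[i] == "N" else 1)
            i += 1
    return counts


def spacing_between_chance_matches(motif):
    """Average number of nucleotides between chance occurrences of the motif."""
    spacing = 1.0
    for allowed in allowed_bases_per_position(motif):
        spacing *= 4 / allowed
    return spacing


promoter_length = 10_000
for motif in ("ACC[TG]TA", "ACC[TG]NTA"):
    spacing = spacing_between_chance_matches(motif)
    print(motif, spacing, promoter_length / spacing)
```

Both forms occur by chance about once every 2048 nt, so allowing either roughly doubles the expected count to about 10 per 10 kb.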
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
A A A A AA A A A AA AA
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
nr: E-value = 1.4e-26
RefSeq: E-value = 5.0e-28
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
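Under the same first-principles model, moving a search to a different database just rescales the E-value by the ratio of the database sizes. A minimal sketch, with a hypothetical helper name and the letter counts quoted in the database statistics above:

```python
def rescale_evalue(evalue, old_db_letters, new_db_letters):
    """Scale an E-value linearly with database size (simple first-principles model)."""
    return evalue * new_db_letters / old_db_letters


refseq_letters = 503_566_580
nr_letters = 14_601_814_750

print(f"size ratio:     {nr_letters / refseq_letters:.1f}")
print(f"nr E-value est: {rescale_evalue(5.0e-28, refseq_letters, nr_letters):.1e}")
```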
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value of ~5.0 x 10^-19.
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value of ~30.
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
So what's the real problem?
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because:
BLAST gives similarity + probability (match length, PI, size of database...)
ORTHOLOGY also depends on biological knowledge, the nature of the query sequence and the phylogenetic relationship
Are there any useful guidelines, though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-value scale (larger = worse, smaller = better):
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value to 1/20th of its value (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = (~3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI:
Really too big a discrepancy to easily explain with hand waving...
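Amino Acid Substitutions
Residue: acceptable substitutes
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K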
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others (so roughly 3 of the 20 amino acids are acceptable), which means each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get
E-value = (~3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
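The two protein estimates can be reproduced as below. The substitution table is transcribed from the Amino Acid Substitutions slide as best it can be read, so treat it as an assumption; the function name is hypothetical, and real BLAST statistics use scoring matrices rather than this all-or-nothing model.

```python
# Acceptable substitutes per residue, as read from the substitution slide (assumed transcription)
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

# Average number of alternative residues accepted at a position ("about 2 others")
mean_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)


def naive_protein_evalue(db_letters, query_length, choices_per_position):
    """Expected chance matches when each position matches with probability 1/choices_per_position."""
    return db_letters / choices_per_position ** query_length


db_letters = 347_895_532  # RefSeq protein letters quoted above
exact_only = naive_protein_evalue(db_letters, 20, 20)   # identical residues only
with_substitutes = naive_protein_evalue(db_letters, 20, 7)  # ~1/7 chance per position

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"exact matches only:           {exact_only:.1e}")
print(f"allowing substitutions:       {with_substitutes:.1e}")
```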
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
blastall -p <which blast flavour to run> -i <file with query sequence in> -d <pre-indexed database name>
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
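If you script your searches, a command line like this can be assembled programmatically. The sketch below only builds and prints the command; it does not run BLAST. The file name 'query.fa', the database name 'refseq_mrna' and the helper function are hypothetical, and the option values (including -m 8 for tabular output) are illustrative choices rather than recommendations.

```python
import shlex


def build_blastall_command(program, query_file, database, **options):
    """Assemble a blastall command line as a list of arguments."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        # each keyword becomes a single-letter option followed by its value
        command += [f"-{flag}", str(value)]
    return command


command = build_blastall_command(
    "blastn", "query.fa", "refseq_mrna",  # hypothetical query file and database names
    e=1e-5,  # max expected value
    W=11,    # word size
    m=8,     # output format (tabular)
)
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run` would execute it on a machine where blastall and the indexed database are actually installed.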
BLAST Parameters Exercises: 1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises: 2. Low-complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low-complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low-complexity filtering ON
Results for Exercise 2B (filtering ON and OFF)
Results for Exercise 2C (filtering ON and OFF)
There is a sequence error, an extra G at position 117 in the cDNA sequence:
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises: 3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises: 4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL, 'Align two sequences' section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Sequences for a model organism
ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels
cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences
Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence...
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
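Challenge (b) can be made concrete by counting identities in the two ungapped pairs shown above. This is a minimal sketch with a made-up function name; it compares position by position only. With four equally likely bases, unrelated sequences are expected to agree at about 25% of positions.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences carry the same letter."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * matches / len(a)


related = ("ATGCATGCTGCCAACGGATGCCCTG", "ATGAAAGCCGCCTACGACAGTCCTG")
unrelated = ("GCTGACTCGTAGCGCTTAGCTAGCT", "CCAACATCTAGCCAGATTAGTTAGT")

print(percent_identity(*related))    # 68.0
print(percent_identity(*unrelated))  # 28.0
```

The second pair sits close to the 25% chance level, which is why identity on its own cannot separate weak homology from coincidence.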
Orthologs
Gene duplication through speciation.
The two copies of Gene A will now evolve independently, but will continue to have ~the same function.
They are ORTHOLOGS.
Paralogs
Gene duplication through internal genome duplication.
The two copies of Gene A (A and A') will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
(Figure labels: gene A; cyclin b1)
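Function Conserved Longer than Detectable Similarity
(Figure: timeline starting from the first self-replicating sequence and running to living organisms, marked with whole genome duplication and local duplication events, and with the spans labelled 'same function' and 'detectable similarity')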
Redundancy in the Genetic Code
GCA, GCC, GCG, GCT    A    alanine
TGC, TGT              C    cysteine
GAC, GAT              D    aspartate
GGA, GGC, GGG, GGT    G    glycine
Many mutations in the third position of the codon triplet are 'synonymous' or 'silent': they have no effect on the amino acid coded for, so there is little evolutionary pressure against them...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
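A small sketch shows the effect. It uses only the twelve codons listed in the table above (so it is not a full genetic code), and the two example DNA sequences and the function name are invented for the illustration.

```python
# Partial codon table: only the codons listed on the slide
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
    "TGC": "C", "TGT": "C",
    "GAC": "D", "GAT": "D",
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


seq1 = "GCAGGTTGCGAC"
seq2 = "GCTGGGTGTGAT"  # differs from seq1 at every third codon position

dna_matches = sum(1 for x, y in zip(seq1, seq2) if x == y)
print(f"DNA identity:     {dna_matches}/{len(seq1)}")
print(f"protein from seq1: {translate(seq1)}")
print(f"protein from seq2: {translate(seq2)}")
```

The DNA sequences agree at only 8 of 12 positions, yet both translate to the same four residues.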
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers: Exercise 1
The Essential Task
(Figure: an experiment gives a gene sequence; what is its function? Data mining searches a database of proteins in other species. Protein labels shown: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3, Gravin-like)
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> best match should be the original frog protein
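Once the two sets of best hits are available, the reciprocal test itself is a simple lookup. The sketch below assumes you have already reduced each BLAST run to a 'query -> best hit' mapping; the function name and the protein names are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep only the pairs where each protein is the other's best hit."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best hits from frog-vs-human and human-vs-frog searches
frog_to_human = {"frogA": "humanA", "frogB": "humanA", "frogC": "humanC"}
human_to_frog = {"humanA": "frogA", "humanC": "frogD"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frogB and frogC are rejected because their human best hits point back to a different frog protein.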
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The differences between them of course represent chromosomal rearrangements since these organisms diverged.
(Figure: Human chromosome 5 shown alongside Mouse chromosome 10 and Mouse chromosome 2)
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as the top hit, and have a look at the Metazome alignment(s)...
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g.
blastall -p <flavour> -i <query file> -d <database>
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
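If you want to script such a run, one way is to assemble the argument list in Python first. The sketch below only builds and prints the command, it does not run BLAST; the file name 'query.fasta' and the database name 'refseq_rna' are hypothetical placeholders, and the helper function is not part of BLAST.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build an argument list for the legacy 'blastall' program (nothing is executed)."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Extra single-letter options, e.g. e=1e-10 becomes "-e 1e-10".
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args


# Hypothetical query file and database name, for illustration only.
cmd = blastall_command("blastn", "query.fasta", "refseq_rna", e=1e-10, W=11)
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one long string) makes it easy to hand to subprocess.run later without worrying about quoting.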
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful at anything but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options.
Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B (panels: ON, OFF)
Results for Exercise 2C (panels: ON, OFF)
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use the logical operators AND, OR or NOT.
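If you find yourself typing these limits often, a tiny helper can assemble them. This is just string building under the [ORGN] convention described above; the function is hypothetical and the organisms are arbitrary examples.

```python
def organism_filter(*organisms, operator="OR"):
    """Join organism names into an Entrez limit such as 'X[ORGN] OR Y[ORGN]'."""
    terms = [f"{name}[ORGN]" for name in organisms]
    return f" {operator} ".join(terms)


single = organism_filter("Drosophila melanogaster")
either = organism_filter("Homo sapiens", "Xenopus tropicalis")

print(single)
print(either)
```

The resulting text is what you would paste into the 'Limit by entrez query' box.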
Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html
Also open the NCBI BLAST Home Page, SPECIAL section, 'Align two sequences'.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment.
Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties…
What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Gene Predictions
Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT … AG) can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…).
So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.
[Figure: gene model with exons 1 2 3 4]
Theoretical/Predicted Sequences
[Figure: genome; predicted gene model (exons 1 2 3 4); predicted transcript; predicted protein]
We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.
Sequences for a model organism
ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels
cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences
Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
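To put numbers on the two pairs above, here is a small sketch that counts identical positions in an ungapped alignment. It is only meant to show why challenge (b) is hard: two unrelated DNA sequences still agree at roughly one position in four by chance. The function name is made up for illustration.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


related = percent_identity("ATGCATGCTGCCAACGGATGCCCTG",
                           "ATGAAAGCCGCCTACGACAGTCCTG")
chance = percent_identity("GCTGACTCGTAGCGCTTAGCTAGCT",
                          "CCAACATCTAGCCAGATTAGTTAGT")

print(f"related pair:   {related:.0f}% identical")
print(f"unrelated pair: {chance:.0f}% identical")
```

With four nucleotides, about 25% identity is the background level, so a match only becomes interesting when it rises clearly above that.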
Orthologs
[Figure: gene A in the ancestor, and a copy of gene A in each of the two species]
Gene duplication through speciation.
The two copies of Gene A will now evolve independently but will continue to have ~the same function.
They are ORTHOLOGS.
Paralogs
[Figure: gene A duplicated within one genome, giving A and A']
Gene duplication through internal genome duplication.
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
[Figure: gene A in the ancestor; A in the green frog; A and A' in the orange frog]
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Figure: gene A in the ancestor and in each modern species; example labels: cyclin b1, cyclin b1]
Function Conserved Longer than Detectable Similarity
[Figure: timeline starting from the first self-replicating sequence and running to living organisms, with whole genome duplication and local duplication marked, and spans labelled 'same function' and 'detectable similarity']
Redundancy in the Genetic Code
Codon  Letter  Amino acid
GCA    A       alanine
GCC    A
GCG    A
GCT    A
TGC    C       cysteine
TGT    C
GAC    D       aspartate
GAT    D
GGA    G       glycine
GGC    G
GGG    G
GGT    G
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against them…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
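A short sketch makes the point concrete, using only the codons listed in the table above. The two DNA sequences are invented for illustration: they differ at every third position yet translate to the same peptide.

```python
# Partial codon table: only the codons shown in the table above.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


original = "GCATGCGACGGA"  # made-up sequence: A C D G
silent = "GCTTGTGATGGG"    # same codons with the third position changed

dna_identity = sum(a == b for a, b in zip(original, silent)) / len(original)

print(translate(original), translate(silent))
print(f"DNA identity: {dna_identity:.0%}")
```

The DNA sequences are only two-thirds identical while the protein sequences are identical, which is why the comparison is better made at the amino acid level.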
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
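For the curious, the idea behind finding and translating the longest ORF can be sketched in a few lines. This is a simplified illustration, not how NCBI ORF Finder is implemented: it uses the standard genetic code, looks only at the three forward reading frames, and ignores ORFs that run off the end without a stop codon. The test sequence is invented.

```python
from itertools import product

# Standard genetic code, built in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def longest_orf(dna):
    """Return the protein of the longest ATG...STOP ORF in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        protein = ""  # empty while we are not inside an ORF
        for i in range(frame, len(dna) - 2, 3):
            aa = CODONS.get(dna[i:i + 3], "X")  # X for codons with unknown bases
            if not protein:
                if aa == "M":  # start codon opens an ORF
                    protein = "M"
            elif aa == "*":  # stop codon closes it
                if len(protein) > len(best):
                    best = protein
                protein = ""
            else:
                protein += aa
    return best


print(longest_orf("CCATGGCATGCGACGGATAAGG"))
```

A real tool would also search the reverse complement, which is left out here to keep the sketch short.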
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers Exercise 1
The Essential Task
[Figure: experiment and data mining. A gene sequence (what is its function?) is compared against a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3), giving the match Gravin-like]
We can only do this because of implied function based on orthology.
Functional Orthologs
[Figure: Human gene (function known, annotation 'Gravin' available) and Xenopus gene (function unknown); sequence similarity -> orthologs -> same function]
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
[Figure: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
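The reciprocal test itself is simple once you have the best hit in each direction. The sketch below assumes the two BLAST runs have already been reduced to 'query -> best hit' dictionaries; all the identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit AND a is b's best hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


# Hypothetical best hits from frog-vs-human and human-vs-frog searches.
frog_to_human = {"frog_cyclinB1": "human_CCNB1", "frog_geneX": "human_GENE1"}
human_to_frog = {"human_CCNB1": "frog_cyclinB1", "human_GENE1": "frog_geneY"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here frog_geneX is dropped because its best human hit points back to a different frog protein, which is exactly the situation the reciprocal step is designed to catch.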
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
These of course represent chromosomal re-arrangements since these organisms diverged.
[Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2]
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
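The trade-off between matches and gaps can be sketched as a simple scoring function. The reward and penalty values below are arbitrary illustrations, not BLAST's defaults, and the function is hypothetical; it scores an alignment that has already been written out with '-' for gaps.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score two aligned strings of equal length, where '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned strings must be the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # First position of a gap costs gap_open, later ones gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


one_gap = alignment_score("ACGTACGT", "ACGT-CGT")  # 7 matches, 1 gap
gappy = alignment_score("A-C-G-T", "ATCAGCT")      # no mismatches, 3 gaps

print(one_gap, gappy)
```

The second alignment has no mismatches at all, yet scores far worse than the first, because every gap has to pay for itself in matches.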
There are many programs used to find similarities between sequences.
They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work? The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score of (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
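The word-based shortcut can be illustrated with a toy index. This is a sketch of the general idea only, not BLAST's actual data structures: every word in the database is indexed once, and a query is then looked up word by word. The sequences and the word size of 4 are made up to keep the example small.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def seed_hits(query, index, word_size):
    """Return (sequence name, offset) pairs sharing at least one word with the query."""
    hits = set()
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            hits.add((name, dpos - qpos))  # offset = relative position of the match
    return hits


database = {"seq1": "TTTACGTACGTTT", "seq2": "GGGGGGGGGGGG"}
index = build_word_index(database, word_size=4)

found = seed_hits("ACGTACG", index, word_size=4)
missed = seed_hits("ACGAACG", index, word_size=4)  # one mismatch in the middle

print(found)
print(missed)
```

The second query differs from seq1 at a single position, but because that mismatch breaks every 4-letter word, no seed is found and the alignment would never be expanded. That is the price of the speed-up.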
Here is a 'typical' weak alignment from BLASTp:
Query RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Match  F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also called "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: the A's in this stretch of database sequence are marked as matches]
Expected number of matches (1 nt query) = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Expected number of matches (2 nt query) = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Expected number of matches (3 nt query) = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
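The same first-principles arithmetic is easy to reproduce. The sketch below is the slide calculation only (exact, ungapped matches), not BLAST's real statistics, and the function name is made up.

```python
def expected_exact_matches(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches of a query in a database of random letters."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, as quoted above

for length in (1, 2, 3, 60):
    expect = expected_exact_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2} nt query: {expect:.1e}")
```

For the 60-mer the computed value is a little smaller than 5.0 x 10^-28, because the slide rounds 4^60 (about 1.3 x 10^36) down to 10^36. At this scale the difference does not matter.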
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value.
To allow either means we expect twice as many.
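The motif answer can be checked by multiplying the odds at each position. The sketch below assumes equal base frequencies and understands only plain bases and [..] choice brackets, as in the motif above; the function name is made up.

```python
def expected_spacing(motif):
    """Average number of nt between chance occurrences of a motif such as 'ACC[TG]TA'."""
    spacing = 1.0
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            close = motif.index("]", i)
            allowed = len(set(motif[i + 1:close]))  # number of bases accepted here
            i = close + 1
        else:
            allowed = 1
            i += 1
        spacing *= 4 / allowed  # 4 possible bases, 'allowed' of them match
    return spacing


spacing = expected_spacing("ACC[TG]TA")
per_10k = 10_000 / spacing

print(f"one site every {spacing:.0f} nt, so about {per_10k:.1f} per 10k")
```

Allowing a second, equally likely motif simply doubles the expected count.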
E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: the A's in this stretch of database sequence are marked as matches]
Expected number of matches (1 nt query) = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Expected number of matches (2 nt query) = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Expected number of matches (3 nt query) = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger), E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28.
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~5.0 x 10^-19.
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~3.0.
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
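One way to see why the same identity means such different things is to ask how likely 75% identity is at a single random ungapped alignment position. The sketch below uses a plain binomial model with a 1 in 4 chance of a match per base; this is an assumption made for illustration and not how BLAST derives its E-values, so the numbers will not reproduce the E-values quoted above.

```python
from math import comb


def chance_of_identity(length, matches, p=0.25):
    """Probability of at least 'matches' identical bases out of 'length' by chance alone."""
    return sum(
        comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(matches, length + 1)
    )


p_long = chance_of_identity(60, 45)   # Query1: 45 of 60 identical (75%)
p_short = chance_of_identity(16, 12)  # Query2: 12 of 16 identical (75%)

print(f"60 nt at 75% identity: {p_long:.1e}")
print(f"16 nt at 75% identity: {p_short:.1e}")
```

The short match is many orders of magnitude more likely to arise by chance, so in a large database we should expect to see plenty of them. Percent identity alone hides this.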
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
The difficulty is because:
ORTHOLOGY = biological knowledge (nature of query sequence, phylogenetic relationship)
BLAST = similarity + probability (match length, PI, size of database…)
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
[Figure: E-value scale running from larger (worse) to smaller (better): 10^-5, 10^-10, 10^-40, 10^-100, 0.0; labelled along the scale: fantasy, borderline, encouraging, pretty good, can't get better]
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out
Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)
So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences
exons 1 2 3 4
TheoreticalPredicted Sequences
genome
predicted gene modelexons 1 2 3 4
Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent
predicted transcript
predicted protein
Sequences for a model organism
ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels
cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences
Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
'Other'-logs
What about gene duplication after speciation?
[Diagram: two frogs (orange and green) carrying copies A and A' of gene A]
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor
2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
[Diagram: gene A in the common ancestor and in each modern species; example label in both modern species: cyclin b1]
Function Conserved Longer than Detectable Similarity
[Diagram: timeline starting from the first self-replicating sequence and ending at living organisms, with whole genome duplication and local duplication events marked; two spans are compared, "same function" and "detectable similarity"]
Redundancy in the Genetic Code
Codons           Code  Amino acid
GCA GCC GCG GCT  A     alanine
TGC TGT          C     cysteine
GAC GAT          D     aspartate
GGA GGC GGG GGT  G     glycine
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
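To see why this matters, here is a small sketch that uses only the codons listed on this slide. The two DNA sequences dna1 and dna2 are invented for the example: they differ at every third codon position, yet translate to the same peptide.

```python
# Only the codons shown on the slide (a real translation needs all 64)
CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
    "TGC": "C", "TGT": "C",
    "GAC": "D", "GAT": "D",
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
}


def translate(dna):
    """Translate in frame 1, three letters at a time."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of identical positions in two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna1 = "GCATGCGACGGA"  # hypothetical sequence
dna2 = "GCTTGTGATGGG"  # same codons with silent third-position changes

print("DNA identity:    ", identity(dna1, dna2))
print("protein identity:", identity(translate(dna1), translate(dna2)))
```

The DNA sequences agree at only two thirds of their positions while the proteins are identical, so a protein-level comparison keeps signal that a DNA-level comparison loses.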
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder: paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, then copy the sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
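For anyone curious what an ORF search does under the bonnet, the sketch below finds the longest ATG-to-STOP reading frame and translates it. It is a simplified illustration, not the NCBI ORF Finder algorithm: it assumes the standard genetic code, looks only at the three forward frames, and ignores ORFs that run off the end without a stop codon. The test sequence is invented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def longest_orf_protein(dna):
    """Longest ATG...STOP translation in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        protein = ""
        for i in range(frame, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X for unknown codons
            if protein or aa == "M":  # start translating at the first ATG
                if aa == "*":  # stop codon closes the ORF
                    if len(protein) > len(best):
                        best = protein
                    protein = ""
                else:
                    protein += aa
    return best


example = "CCATGGCATGCGACGGATAAGG"  # hypothetical sequence
print(longest_orf_protein(example))
```

The protein sequence that comes out is what you would paste into the blastp comparison in the exercise.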
Answers Exercise 1
The Essential Task
[Diagram: experiment vs data mining. Gene sequence: what is its function? It is compared against a database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3; Gravin-like is shown again as the result]
We can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in
[Diagram: frog protein searched against a database of human proteins gives the best match human protein; that human protein is then searched against a database of frog proteins]
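The reciprocal test itself is simple once both sets of BLAST results are in hand. A minimal sketch, assuming the results have already been reduced to dictionaries of scores; the protein names and scores below are entirely hypothetical.

```python
def best_hit(scores):
    """Name of the highest-scoring hit, or None if there are no hits."""
    return max(scores, key=scores.get) if scores else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a, hits in a_vs_b.items():
        b = best_hit(hits)
        if b is not None and best_hit(b_vs_a.get(b, {})) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical scores: query -> {hit: score}
frog_vs_human = {
    "frog_gravin": {"human_gravin": 950, "human_akap": 300},
    "frog_x": {"human_gravin": 200},
}
human_vs_frog = {
    "human_gravin": {"frog_gravin": 940, "frog_x": 210},
    "human_akap": {"frog_gravin": 310},
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))
```

Here frog_x finds human_gravin as its best match, but the reverse search prefers frog_gravin, so only the gravin pair passes the test.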
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections
These of course represent chromosomal re-arrangements since these organisms diverged
[Diagram labels: Human chromosome 5, Mouse chromosome 10, Mouse chromosome 2]
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments
Clearly, although the alignment has no mismatches, it is not biologically meaningful
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
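The trade-off between matches and gaps can be made concrete with a toy scoring function. The scores used here (match +1, mismatch -1, gap -2) and the two short sequences are illustrative assumptions, not BLAST's defaults.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length alignment rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # pay the gap cost
        elif x == y:
            score += match      # reward a match
        else:
            score += mismatch   # penalise a mismatch
    return score


# The same two sequences aligned two ways
ungapped = score_alignment("ATGCATGC", "ATGAATGC")    # one mismatch
gapped = score_alignment("ATGC-ATGC", "ATG-AATGC")    # no mismatches, two gaps

print("ungapped:", ungapped)
print("gapped:  ", gapped)
```

With these costs the single mismatch is cheaper than the two gaps needed to avoid it, so the ungapped alignment wins. Making gaps free would reverse that and favour meaningless alignments.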
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best
How does it work? The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for a tiny proportion of the sequences in the database, so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11
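The word-lookup step can be sketched as follows. This is only an illustration of the idea of indexing words and recording their relative positions, not BLAST's actual implementation; the toy database, the query and the word size of 4 (rather than BLASTn's 11) are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def build_word_index(database, w):
    """Map every word of length w to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((name, pos))
    return index


def find_seeds(query, index, w):
    """Return (sequence name, relative offset) for every shared word."""
    seeds = []
    for qpos in range(len(query) - w + 1):
        for name, dpos in index.get(query[qpos:qpos + w], []):
            seeds.append((name, dpos - qpos))
    return seeds


database = {"seq1": "TTACGTACGGA", "seq2": "GGGGGGGG"}  # hypothetical
index = build_word_index(database, 4)
seeds = find_seeds("ACGTAC", index, 4)
print(seeds)
```

Only seq1 shares any word with the query, so only seq1 would go forward to the slower extension step. A database sequence with no exact word in common is never looked at, which is how weak matches can be missed.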
Here is a 'typical' weak alignment from BLASTp
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also "expect value" or "expectation"
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(example database sequence; on the slide the chance matches to 'A' are marked underneath)
Query of 1 letter: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Query of 2 letters: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Query of 3 letters: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
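The same back-of-envelope sum is easy to do in Python. This sketch follows the simple model on the slide (every letter equally likely, exact ungapped matches only); it is not how BLAST itself computes E-values, and the function name is made up.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Database size divided by the number of possible words of this length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8, from the database statistics slide

for length in (1, 2, 3, 60):
    e = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: {e:.1e}")
```

Note that 4 multiplied by itself 60 times is about 1.3 x 10^36, which the slide rounds to 10^36, so the exact figure for the 60-mer comes out a little below 5.0 x 10^-28.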
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value
To allow either means we expect twice as many
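The answer above can be checked with a short function that multiplies up the odds position by position. A minimal sketch, assuming equal base frequencies and a motif written with single letters and [..] choice brackets; the function name motif_period is invented for this example.

```python
def motif_period(motif, alphabet_size=4):
    """Average spacing (in nt) between chance occurrences of a motif."""
    period = 1.0
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            end = motif.index("]", i)
            choices = end - i - 1   # number of letters allowed here
            i = end + 1
        else:
            choices = 1             # a single fixed letter
            i += 1
        period *= alphabet_size / choices
    return period


period = motif_period("ACC[TG]TA")
print("one chance match every", period, "nt")
print("expected in 10k:", 10000 / period)
```

Allowing a second, equally likely motif simply doubles the expected count, as stated on the slide.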
E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(example database sequence; on the slide the chance matches to 'A' are marked underneath)
Query of 1 letter: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Query of 2 letters: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Query of 3 letters: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn:
Get a full length match with sequence XYZ at an E-value = 5.0e-160
With a shorter query against the same database:
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance?
The E-value is simply dependent on match length
Why not just use identity?
At some levels this is a good question
But consider two very different searches, both of which give a 75% identity match
Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
Are there any useful guidelines though, at least for biological meaningfulness?
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because
[Diagram contrasting BLAST (Similarity + Probability) with ORTHOLOGY; factors labelled on the diagram: biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database…]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
[Scale running from larger/worse to smaller/better]
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale, in order: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Residue  Can be substituted by
A        S
C        (none)
F        L W Y
G        (none)
I        L M V
L        I M F V
M        I L V
P        (none)
V        I L M
W        F Y

N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W

H        N Q Y
K        R Q E
R        Q K

D        N E
E        D Q K
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th
So now we get
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
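The effect of substitutability on the estimate can be checked numerically. This sketch simply encodes the substitution table from the previous slide and repeats the slide's arithmetic; it is a rough model for illustration and is not how BLOSUM or PAM scores are actually used by BLAST.

```python
# Substitution table from the slide: residue -> residues that can replace it
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

mean_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)
print("average number of acceptable substitutes:", mean_substitutes)

REFSEQ_PROTEIN_LETTERS = 347895532  # from the Protein BLAST slide
QUERY_LENGTH = 20

e_exact = REFSEQ_PROTEIN_LETTERS / 20 ** QUERY_LENGTH  # 1/20 chance per position
e_subst = REFSEQ_PROTEIN_LETTERS / 7 ** QUERY_LENGTH   # ~1/7 chance per position

print(f"exact matches only:     {e_exact:.1e}")
print(f"allowing substitutions: {e_subst:.1e}")
```

The average works out at a little over 2 substitutes per residue, in line with the "about 2 others" on the slide, and moving from a 1/20 to a 1/7 chance per position changes the estimate by roughly nine orders of magnitude.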
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database
Did you find any 'significant' hits?
Repeat with a second sequence
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
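As a rough illustration of how these options fit together, the sketch below assembles a blastall command line in Python without running it. The helper function, the file name query.fasta and the database name refseq_rna are hypothetical; the flags are the ones listed above, with -m 8 being the tabular output format in the legacy blastall program.

```python
def blastall_command(program, query_file, database, **options):
    """Build the argument list for a legacy blastall run (does not execute it)."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


# Hypothetical query file and database name
cmd = blastall_command("blastn", "query.fasta", "refseq_rna", e=1e-10, W=11, m=8)
print(" ".join(cmd))
```

Anything not passed explicitly is left to blastall's own defaults, which is the behaviour described on the previous slide.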
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default options
Then hit [format] to wait for the results in a new page
(hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
[Screenshots: results with filtering ON and OFF]
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the sequence
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
[Screenshots: results with filtering ON and OFF]
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT
Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section
There are several Xenopus tropicalis cyclins in the examples file
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out
Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)
So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences
exons 1 2 3 4
TheoreticalPredicted Sequences
genome
predicted gene modelexons 1 2 3 4
Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent
predicted transcript
predicted protein
Sequences for a model organism
ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels
cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences
Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
lsquoOtherrsquo-logsWhat about gene duplication after speciation
How can we describe the relationship(s) between the various copies of gene A in the two frogs
Bear in mind that understanding gene function is more important than semanticshellip
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
A
A
A
Arsquo A
The Essential Paradigm
1 any group of modern species can be traced back to some extinct common ancestor
A
A
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (N = any base)
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.
If the motif with the additional base is also allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many (~10 every 10k).
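The motif arithmetic above can be checked with a short Python sketch. It assumes random sequence with equal base frequencies, as the slides do; the function name and the list encoding of the motif (number of bases allowed at each position) are illustrative choices, not part of the workshop material.

```python
def expected_spacing(allowed_per_position):
    """Average distance (nt) between chance occurrences of a motif.
    Each entry is how many of the 4 bases are allowed at that position."""
    spacing = 1.0
    for allowed in allowed_per_position:
        spacing *= 4 / allowed
    return spacing


# ACC[TG]TA: five fixed positions plus one position allowing 2 bases
motif = [1, 1, 1, 2, 1, 1]
# Same motif with an extra "any base" position between the 4th and 5th
motif_with_extra_base = [1, 1, 1, 2, 4, 1, 1]

spacing = expected_spacing(motif)
per_10k = 10_000 / spacing
print(f"one site every {spacing:.0f} nt, ~{per_10k:.1f} per 10k")
print(f"with extra base: one site every {expected_spacing(motif_with_extra_base):.0f} nt")
```

A position where any base is allowed contributes a factor of 1, which is why the longer variant is expected just as often as the original motif.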
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: occurrences of 'A' marked underneath the sequence]
1 nt query: Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2 nt query: Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3 nt query: Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters. BLAST the same sequence against it: E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger). BLAST the same sequence against it: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
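To make the point concrete, here is a small Python sketch that computes percent identity for the two alignments shown above. The function name is an illustrative choice; it simply counts identical columns in two pre-aligned, equal-length strings, so it says nothing about how likely the match is by chance.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences carry
    the same (non-gap) letter. Inputs must already be aligned."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Both pairs score the same identity even though one alignment is nearly four times longer than the other, which is exactly why identity alone is a poor guide to significance.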
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
[Diagram: BLAST (similarity + probability) versus ORTHOLOGY. The difficulty is because of: biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database...]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger/worse ... smaller/better
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Scale: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level between protein sequences than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.3 x 10^-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand waving...
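Amino Acid Substitutions
Residue: possible substitutes
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K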
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
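The two protein estimates above can be reproduced with a few lines of Python. This is only the back-of-envelope model from the slides (a flat 1/20 or 1/7 chance of a 'match' at each position); real BLAST statistics use substitution matrices and are more involved. The variable names are illustrative.

```python
# RefSeq protein database size as quoted in the slides
DB_LETTERS = 347_895_532
QUERY_LEN = 20  # amino acids

# Exact matches only: 1 in 20 chance per position
exact_only = DB_LETTERS / 20 ** QUERY_LEN

# Allowing ~2 acceptable substitutes per residue: ~1 in 7 chance per position
with_substitutions = DB_LETTERS / 7 ** QUERY_LEN

print(f"exact matches only:      {exact_only:.1e}")
print(f"allowing substitutions:  {with_substitutions:.1e}")
print(f"ratio:                   {with_substitutions / exact_only:.1e}")
```

The ratio shows how much a modest per-position relaxation compounds over 20 positions.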
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In the simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
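As a rough illustration of that simplest form, the Python sketch below assembles a blastall command line from the three options just listed. It only builds and prints the command; it does not run BLAST. The query file name and database name are hypothetical placeholders, and the helper function is invented for this example.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the argument list for the simplest blastall invocation:
    -p (blast flavour), -i (query file), -d (pre-indexed database)."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# "query.fasta" and "refseq_mrna" are made-up names for illustration
cmd = blastall_command("blastn", "query.fasta", "refseq_mrna")
print(shlex.join(cmd))
```

Keeping the command as a list of arguments (rather than one string) makes it easy to hand to `subprocess.run` later, on a machine where blastall and a formatted database are actually installed.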
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
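A hedged sketch of how a few of these optional parameters might be appended to a basic command line, again in Python and again without running anything. The helper function, the file and database names, and the particular values chosen (an E-value cut-off of 1e-5, word size 11, 10 reported sequences) are illustrative assumptions, not recommendations from the workshop.

```python
def add_options(base_cmd, **options):
    """Return a copy of base_cmd with '-flag value' pairs appended,
    in the order the keyword arguments were given."""
    cmd = list(base_cmd)
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


# Hypothetical query file and database names
base = ["blastall", "-p", "blastn", "-i", "query.fasta", "-d", "refseq_mrna"]

# -e max expected value, -W word size, -b number of sequences to report
cmd = add_options(base, e=1e-5, W=11, b=10)
print(" ".join(cmd))
```

Any option left out of the call simply keeps the default that the chosen blast flavour would use.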
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low complexity filtering
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B (ON / OFF)
Results for Exercise 2C (ON / OFF)
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
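As a small illustration of combining items, the Python sketch below builds query strings of the kind described above. The [ORGN] tag and the AND/OR/NOT operators come from the text; the helper functions are invented for this example, and the resulting string is simply what you would paste into the 'Limit by entrez query' box.

```python
def organism_term(name):
    """Format an organism name as an Entrez organism term."""
    return f"{name}[ORGN]"


def combine(terms, operator="OR"):
    """Join Entrez terms with a logical operator (AND, OR or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)


query = combine(
    [organism_term("Drosophila melanogaster"), organism_term("Xenopus tropicalis")]
)
print(query)
```

Using OR here widens the search to either organism; AND between two different organism terms would normally match nothing, since a sequence belongs to one organism.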
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Gene Predictions
Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT ... AG) can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then...)
So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.
Theoretical/Predicted Sequences
[Diagram: genome; predicted gene model (exons 1 2 3 4); predicted transcript; predicted protein]
We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.
Sequences for a model organism
ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels
cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences
Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence...
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
[Diagram: gene A in two species]
Gene duplication through speciation.
The two copies of Gene A will now evolve independently, but will continue to have the ~same function.
They are ORTHOLOGS.
Paralogs
[Diagram: gene A and A' in one genome]
Gene duplication through internal genome duplication.
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
[Diagram: copies of gene A (A, A') in the two frogs]
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Diagram: gene A (labelled cyclin b1) in the ancestor and in the modern species]
Function Conserved Longer than Detectable Similarity
[Diagram: timeline starting from the first self-replicating sequence and ending at living organisms, marking whole genome duplication and local duplication events, with the spans of 'same function' and 'detectable similarity']
Redundancy in the Genetic Code
GCA, GCC, GCG, GCT: A (alanine)
TGC, TGT: C (cysteine)
GAC, GAT: D (aspartate)
GGA, GGC, GGG, GGT: G (glycine)
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
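A minimal Python sketch of this point, using only the codons listed on the slide (so it is a deliberately partial codon table, not the full genetic code). The two example DNA strings are made up for illustration: they differ at every third position but translate to the same peptide.

```python
# Partial codon table: only the codons shown on the slide
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna):
    """Translate DNA codon by codon, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "GCAGATGGCTGT"
mutated = "GCGGACGGGTGC"  # every third base changed

dna_differences = sum(1 for x, y in zip(original, mutated) if x != y)
print(dna_differences, translate(original), translate(mutated))
```

At the DNA level the two sequences are only two-thirds identical, while at the protein level they are indistinguishable.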
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two sequences comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers: Exercise 1
The Essential Task
Experiment / data mining: gene sequence - what is its function?
Database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3
Gravin-like
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity / orthologs / same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.
[Diagram: a frog protein, searched against a database of human proteins, gives a best match human protein; that protein, searched against a database of frog proteins, gives x]
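The Reciprocal Best BLAST idea can be sketched in Python as follows. This assumes the two BLAST searches have already been run and reduced to "best hit" tables; the function name and the protein identifiers are invented for illustration.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return the pairs (a, b) where b is the best hit for a AND
    a is the best hit for b. Inputs map each query to its best hit."""
    return {a: b for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Hypothetical best-hit tables from two all-against-all searches
frog_to_human = {"frog_x": "human_1", "frog_y": "human_2"}
human_to_frog = {"human_1": "frog_x", "human_2": "frog_z"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here frog_y is not reported: its best human match points back to a different frog protein, so the pair fails the reciprocal test and is not treated as a candidate ortholog.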
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
[Diagram: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar: http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
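The trade-off between matches and gap costs can be sketched as a simple scoring function in Python. The reward and penalty values below are arbitrary illustrative numbers, not BLAST defaults, and the function only scores an alignment that is already given; it does not search for the best one.

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-3,
                    gap_open=-5, gap_extend=-2):
    """Score a pre-computed pairwise alignment ('-' marks a gap).
    The first column of a gap run costs gap_open, later ones gap_extend."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# No mismatches at all, but only because gaps were inserted freely
print(alignment_score("A-C-G-T", "AGCTGAT"))
# One mismatch, no gaps
print(alignment_score("ACGTACGT", "ACGAACGT"))
# One gap of length two
print(alignment_score("ACGTACGT", "AC--ACGT"))
```

With gap costs in place, the first alignment scores badly despite having no mismatches, which is the behaviour the gap penalty is meant to produce.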
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11.
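The word-based lookup, and the way it can miss a real similarity, can be illustrated with a toy Python sketch. This is a much simplified stand-in for what BLAST does (exact word matches only, tiny word sizes, no extension step); the function names and the example sequences are invented for illustration.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence name, position)
    pairs where it occurs in the database."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) for every
    exact word shared between the query and the indexed database."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds


database = {"target": "ACGTACGTACGT"}
query = "ACGAACGAACGA"  # 75% identical to the target, mismatch every 4th base

seeds_w4 = find_seeds(query, build_word_index(database, 4), 4)
seeds_w3 = find_seeds(query, build_word_index(database, 3), 3)
print(len(seeds_w4), len(seeds_w3))
```

With a word size of 4 the two sequences share no exact word, so the similarity would never be examined; dropping the word size to 3 recovers seeds, at the cost of more lookups. The same effect applies to BLASTn with its default word size of 11.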
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware.
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B (panels: ON, OFF)
Results for Exercise 2C (panels: ON, OFF)
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
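To make the combining rule concrete, here is a small sketch that builds such a query string. The helper names organism_term and combine are made up for illustration, and the second organism is an arbitrary choice; the resulting text is what would be typed into the "Limit by entrez query" box.

```python
def organism_term(name):
    """Wrap an organism name in the [ORGN] field tag."""
    return f"{name}[ORGN]"


def combine(terms, operator="OR"):
    """Join Entrez terms with one logical operator (AND, OR or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)


query = combine(
    [organism_term("Drosophila melanogaster"), organism_term("Xenopus tropicalis")],
    operator="OR",
)
print(query)
```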
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL section, Align two sequences.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Gene Predictions
Given:
- coding sequence must run from ATG to a STOP codon, in-frame
- introns (GT … AG) can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…).
So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.
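The first rule above (ATG to an in-frame STOP codon) can be sketched in a few lines. This is only a toy scan of the forward strand with no splicing and no statistical scoring, so it is far simpler than a real gene predictor; the function name find_orfs and the test sequence are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) slices of ATG...STOP open reading frames.

    Only the three forward-strand frames are scanned; introns and the
    reverse strand are ignored in this sketch.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next ORF
    return orfs


toy = "CCATGAAATTTTAGGG"   # made-up sequence with one ORF in frame 2
orfs = find_orfs(toy)
print([toy[s:e] for s, e in orfs])
```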
(Figure labels: exons 1 2 3 4)
Theoretical/Predicted Sequences
(Figure labels: genome; predicted gene model, exons 1 2 3 4; predicted transcript; predicted protein)
We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.
Sequences for a model organism
ESTs: millions, £10 each
- Cheap to sequence, so we get millions per organism
- But lots of errors
- And incomplete gene sequences
- Can give us relative expression levels
cDNAs: tens of thousands, £1000 each
- Expensive, but only need to do one (or a small number) per gene
- Few errors with multipass sequencing
- Gives us protein sequences
Genomes: one, £30,000,000
- Extremely expensive
- But the only way to get the whole picture
- Gives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
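The match lines and identity levels for the two example pairs can be reproduced with a short sketch. The function names are illustrative, and the sequences are assumed to be already aligned without gaps, as they are on the slide.

```python
def match_line(seq_a, seq_b):
    """Return a '|' for every identical position, a space otherwise."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    return "".join("|" if a == b else " " for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that are identical."""
    return 100.0 * match_line(seq_a, seq_b).count("|") / len(seq_a)


reference = "ATGCATGCTGCCAACGGATGCCCTG"
closer = "ATGAAAGCCGCCTACGACAGTCCTG"
further = "ATGGAAGGCGCTTAGGATAGTCCAG"

for other in (closer, further):
    print(reference)
    print(match_line(reference, other))
    print(other, f"{percent_identity(reference, other):.0f}% identity")
```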
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
Gene duplication through speciation. The two copies of Gene A will now evolve independently but will continue to have approximately the same function.
They are ORTHOLOGS.
Paralogs
Gene duplication through internal genome duplication.
The two copies of Gene A (A and A') will now evolve independently but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog (A and A') are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
(Figure labels: gene A in each species; cyclin b1, cyclin b1)
Function Conserved Longer than Detectable Similarity
(Figure labels: start from first self-replicating sequence; same function; detectable similarity; living organisms; whole genome duplication; local duplication)
Redundancy in the Genetic Code
alanine (A): GCA, GCC, GCG, GCT
cysteine (C): TGC, TGT
aspartate (D): GAC, GAT
glycine (G): GGA, GGC, GGG, GGT
'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is little or no evolutionary pressure against them…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
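A small sketch makes the point: two DNA sequences that differ at every third position can still encode the same peptide. Only the twelve codons listed on the slide are included in the table, and the two test sequences are invented for the example.

```python
# Partial codon table: just the codons shown on the slide.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}


def translate(dna):
    """Translate frame 0 of a DNA string using the partial table above."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna_1 = "GCATGCGACGGA"
dna_2 = "GCTTGTGATGGG"   # differs from dna_1 at each third codon position

dna_identity = sum(a == b for a, b in zip(dna_1, dna_2)) / len(dna_1)
print(f"DNA identity: {dna_identity:.0%}")
print(translate(dna_1), translate(dna_2))
```

A codon outside this partial table would raise a KeyError; a full 64-codon table would be needed for real sequences.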
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder: paste the sequence, hit OrfFind, identify the longest ORF, click on it, on the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers: Exercise 1
The Essential Task
experiment, data mining
gene sequence: what is its function?
database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3
Gravin-like
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity: orthologs
same function: similar shape
But we know that function is largely determined by shape, which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
(Figure labels: frog protein; database of human proteins; best match human protein; database of frog proteins; x)
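The reciprocal best BLAST idea can be sketched once the two sets of best hits are in hand. The sketch below assumes the BLAST searches have already been run and reduced to "best hit" dictionaries; the protein identifiers are invented for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Hypothetical best hits: frog proteins searched against human proteins...
frog_to_human = {"frog_gravin": "human_gravin", "frog_other": "human_gravin"}
# ...and the human best matches searched back against frog proteins.
human_to_frog = {"human_gravin": "frog_gravin"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here frog_other also hits human_gravin best, but the search back does not return it, so only one pair survives as a candidate ortholog.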
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The breaks between these sections of course represent chromosomal re-arrangements since these organisms diverged.
(Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2)
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although an alignment padded with enough gaps can have no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
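The trade-off between matches and gap costs can be sketched with a standard dynamic-programming score for a global alignment (the Needleman-Wunsch recurrence). This is not how BLAST itself works, and the scoring values below are arbitrary choices for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a linear gap cost."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur.append(max(diagonal, gap_in_b, gap_in_a))
        prev = cur
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))   # four matches
print(global_alignment_score("ACGT", "AGT"))    # three matches, one gap
```

Making the gap cost less negative lets the algorithm open more gaps to pick up extra matches, which is exactly the behaviour the gap penalty is meant to keep in check.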
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
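The word-lookup step can be sketched as a dictionary from every word in the database to the places it occurs. This is a simplified illustration of the idea rather than BLAST's actual data structure; the tiny database, the word size of 4 and the function names are all chosen for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map each word of length word_size to its (sequence name, position) list."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) word hits."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds


database = {"seq1": "TTACGTACGGA", "seq2": "GGGGGGGG"}
index = build_word_index(database, word_size=4)

print(find_seeds("ACGTAC", index, word_size=4))   # exact words shared with seq1
print(find_seeds("ACCTAC", index, word_size=4))   # one mismatch: no shared word
```

The second query differs from the first by a single base, yet no 4-letter word survives intact, so it would never be extended; this is the same effect as missing a weak match with a word size of 11.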
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
Also "expect value" or "expectation".
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~$5.0 \times 10^{8}$)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~$1.4 \times 10^{10}$)
Notation:
1.2e-35 = $1.2 \times 10^{-35}$
$4.8 \times 10^{6}$ = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~$5.0 \times 10^{8}$ letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example sequence (the slide marks each A beneath it):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: Expected number of matches = $(5.0 \times 10^{8}) / 4 \approx 1.2 \times 10^{8}$
2-letter query: Expected number of matches = $(5.0 \times 10^{8}) / (4 \times 4) \approx 3.1 \times 10^{7}$
3-letter query: Expected number of matches = $(5.0 \times 10^{8}) / (4 \times 4 \times 4) \approx 7.8 \times 10^{6}$
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = $(5.0 \times 10^{8}) / 4^{60} \approx (5.0 \times 10^{8}) / 10^{36} = 5.0 \times 10^{-28}$
E-value = $5.0 \times 10^{-28}$
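The arithmetic above is easy to check with a few lines of code. The function below implements only this first-principles estimate (database letters divided by the number of possible sequences of the query's length), not BLAST's real statistics; note that the slide rounds $4^{60}$ down to $10^{36}$, so the unrounded figure comes out slightly smaller.

```python
def naive_evalue(database_letters, query_length, alphabet_size=4):
    """Expected chance exact matches: database size / alphabet_size ** length."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8   # approximate size of RefSeq mRNA used on the slide

for length in (1, 2, 3, 60):
    print(length, f"{naive_evalue(refseq_letters, length):.2e}")
```

Because the estimate is proportional to database_letters, running the same query against a database N times larger simply multiplies the result by N.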
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer: ACC[TG]TA
Expect 'A' every 4 nt.
Expect 'ACC' every 4x4x4 = 64 nt.
Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64x2 nt = 128 nt.
Expect 'TA' every 4x4 = 16 nt.
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4).
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
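The motif calculation can be written as a small function that multiplies the per-position probabilities. It assumes equal base frequencies and, like the slide, uses the sequence length rather than the exact number of windows; the function names and the minimal bracket parser are illustrative only.

```python
def match_probability(motif, alphabet_size=4):
    """Chance that a random site matches a motif such as 'ACC[TG]TA'."""
    prob = 1.0
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            close = motif.index("]", i)
            allowed = len(set(motif[i + 1:close]))   # e.g. [TG] allows 2 bases
            i = close + 1
        else:
            allowed = 1                              # a fixed base
            i += 1
        prob *= allowed / alphabet_size
    return prob


def expected_hits(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of given length."""
    return sequence_length * match_probability(motif)


print(1 / match_probability("ACC[TG]TA"))    # one site expected every 2048 nt
print(expected_hits("ACC[TG]TA", 10_000))    # roughly 5 per 10k
```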
E-values: Effect of Database Size
The nr database has ~$1.4 \times 10^{10}$ letters (was RefSeq and $5.0 \times 10^{8}$). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example sequence (the slide marks each A beneath it):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: Expected number of matches = $(1.4 \times 10^{10}) / 4 \approx 3.5 \times 10^{9}$
2-letter query: Expected number of matches = $(1.4 \times 10^{10}) / (4 \times 4) \approx 8.8 \times 10^{8}$
3-letter query: Expected number of matches = $(1.4 \times 10^{10}) / (4 \times 4 \times 4) \approx 2.2 \times 10^{8}$
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = $(1.4 \times 10^{10}) / 4^{60} \approx (1.4 \times 10^{10}) / 10^{36} = 1.4 \times 10^{-26}$
E-value = $1.4 \times 10^{-26}$
(was E-value = $5.0 \times 10^{-28}$)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: $5.0 \times 10^{8}$ letters
nr: $1.4 \times 10^{10}$ letters (~30x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was $5.0 \times 10^{-28}$. But our actual BLAST search at NCBI gave a value of $2.0 \times 10^{-26}$, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ $5.0 \times 10^{-19}$
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
ORTHOLOGY
BLAST: Similarity + Probability
- biological knowledge
- nature of query sequence
- phylogenetic relationship
- match length, PI, size of database…
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-value scale, from larger/worse to smaller/better: $10^{-5}$, $10^{-10}$, $10^{-40}$, $10^{-100}$, 0.0
Labels along the scale, from worse to better: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~$(3.5 \times 10^{8}) / 20^{20}$ = ~$3.5 \times 10^{-18}$
But this is what we get if we run the BLAST at NCBI:
Really too big a discrepancy to easily explain with hand waving…
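Amino Acid Substitutions
Residue: acceptable substitutes
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K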
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others (so roughly 3 acceptable residues out of 20), which means each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get
E-value = ~$(3.5 \times 10^{8}) / 7^{20}$ = ~$4.4 \times 10^{-9}$
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
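The two protein estimates can be checked numerically. The substitution table below is a reading of the slide's table (it comes out symmetric, which suggests the reading is right, but treat it as an assumption), and the calculation is the same first-principles estimate as before, not what BLAST computes from a BLOSUM or PAM matrix.

```python
# Acceptable substitutes per residue, as read from the slide's table.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

average_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)


def naive_protein_evalue(database_letters, query_length, effective_alphabet):
    """Database size divided by effective_alphabet ** query_length."""
    return database_letters / effective_alphabet ** query_length


exact = naive_protein_evalue(3.5e8, 20, 20)      # exact matches only
tolerant = naive_protein_evalue(3.5e8, 20, 7)    # ~1/7 chance per position

print(f"average substitutes per residue: {average_substitutes:.1f}")
print(f"exact: {exact:.1e}  with substitutions: {tolerant:.1e}")
```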
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Gene PredictionsGiven- coding sequence must run from ATG ndash STOP codon in-frame- introns GT AG can be spliced out
Also take a statistical approach- coding and non-coding sequence are slightly different in composition- some lsquopossiblersquo splice sites are more likely than others
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even thenhellip)
So predicted genes based on computational gene models alone will usually lack UTR regions which has some important consequences
exons 1 2 3 4
TheoreticalPredicted Sequences
genome
predicted gene modelexons 1 2 3 4
Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent
predicted transcript
predicted protein
Sequences for a model organism
ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels
cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences
Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
lsquoOtherrsquo-logsWhat about gene duplication after speciation
How can we describe the relationship(s) between the various copies of gene A in the two frogs
Bear in mind that understanding gene function is more important than semanticshellip
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
A
A
A
Arsquo A
The Essential Paradigm
1 any group of modern species can be traced back to some extinct common ancestor
A
A
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Task
[Figure: a gene sequence (what is its function?) can be investigated by experiment or by data mining against a database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3. The match shown is Gravin-like]
We can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
Sequence similarity, orthologs, same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one in which you wish to find an ortholog
[Figure: a frog protein is searched against a database of human proteins; the best-match human protein is then searched back against a database of frog proteins]
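The reciprocal test can be sketched in a few lines. This is only an illustration under stated assumptions: the protein names and scores below are invented, and in practice the two score tables would come from two real BLAST runs (frog against human, human against frog).

```python
def best_hit(scores, query):
    """Return the subject with the highest score for this query."""
    hits = scores[query]
    return max(hits, key=hits.get)


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a in a_vs_b:
        b = best_hit(a_vs_b, a)
        if b in b_vs_a and best_hit(b_vs_a, b) == a:
            pairs.append((a, b))
    return pairs


# Invented bit scores, purely for illustration.
frog_vs_human = {
    "frog_1": {"human_1": 950.0, "human_2": 410.0},
    "frog_2": {"human_1": 400.0, "human_2": 380.0},
}
human_vs_frog = {
    "human_1": {"frog_1": 940.0, "frog_2": 395.0},
    "human_2": {"frog_1": 405.0, "frog_2": 370.0},
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))
```

Here frog_2 also picks human_1 as its best match, but human_1 points back to frog_1, so only the frog_1/human_1 pair passes the reciprocal test.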
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections
[Figure: Human chromosome 5 compared with syntenic sections of Mouse chromosome 10 and Mouse chromosome 2]
These of course represent chromosomal re-arrangements since these organisms diverged
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments
Although an alignment stuffed with gaps may have no mismatches, it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of 'finding gapped alignments'
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
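A minimal sketch of how a gap cost is traded off against matches. The scoring values (match +1, mismatch -1, gap open -5, gap extend -2) are assumptions chosen for illustration, not BLAST's defaults, and the two short sequences are invented.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score two equal-length aligned rows; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_in = None  # which row the current run of gaps is in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening a new one.
            score += gap_extend if gap_in == row else gap_open
            gap_in = row
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


# The same two sequences aligned with one mismatch, or with two gaps and no mismatch.
print(score_alignment("ACGTTGCA", "ACGATGCA"))
print(score_alignment("ACGT-TGCA", "ACG-ATGCA"))
```

With these penalties the ungapped alignment scores 6 and the gapped one scores -3, so the version that simply accepts one mismatch wins even though the gapped version has no mismatches at all.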
There are many programs used to find similarities between sequences
They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position
Next the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database – so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11
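The word-lookup step can be sketched as follows. This is a simplified illustration of exact word seeding, not BLAST's actual implementation; the toy database, the query and the word size of 4 are all assumptions made to keep the example small.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].append((name, i))
    return index


def find_seeds(query, index, word_size):
    """Return (name, query offset, subject offset) for every exact word match."""
    seeds = []
    for q in range(len(query) - word_size + 1):
        for name, s in index.get(query[q:q + word_size], []):
            seeds.append((name, q, s))
    return seeds


# Hypothetical two-sequence database; it is indexed once and reused for every query.
database = {"seq_a": "TTGACGTACGGA", "seq_b": "CCCCCCCCCCCC"}
index = build_word_index(database, word_size=4)

print(find_seeds("ACGTAC", index, word_size=4))
```

All three seeds fall in seq_a at the same relative offset (subject offset minus query offset is 3), so they would be expanded into a single alignment. seq_b shares no 4-letter word with the query and is never examined, which is both the source of the speed-up and the reason weak matches can be missed.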
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also "expect value" or "expectation"
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with chance occurrences of 'A' marked beneath it]
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG', a 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
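The same arithmetic as a short sketch, assuming (as the slide does) ungapped exact matches and equal base frequencies. The function name is hypothetical; the database size is the RefSeq mRNA figure quoted in the statistics.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Chance occurrences of one exact word: database size / alphabet_size ** length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 503_566_580  # from the database statistics slide

for length in (1, 2, 3, 60):
    expected = expected_exact_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2}-letter query: {expected:.1e} expected matches")
```

Because 4^60 is really about 1.3 x 10^36 rather than the rounded 10^36, the exact answer for the 60-mer comes out a little below 5.0 x 10^-28, at roughly 3.8 x 10^-28. It is the same order of magnitude, which is all this back-of-envelope estimate is meant to give.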
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA with one additional base between the 4th and 5th positions
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-value
To allow either means we expect twice as many
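The motif answer can be checked with a short sketch. The assumptions are mine: equal base frequencies, one strand only, edge effects ignored, and the optional extra base taken to be any of the four nucleotides (which is what makes the two motifs equally likely, as the answer states).

```python
def expected_motif_hits(allowed_per_position, sequence_length, alphabet_size=4):
    """Expected chance hits, given how many bases are allowed at each motif position."""
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / alphabet_size
    return sequence_length * probability


# ACC[TG]TA: one allowed base everywhere except the [TG] position.
core = expected_motif_hits([1, 1, 1, 2, 1, 1], 10_000)

# Same motif with one extra, unconstrained base between positions 4 and 5.
with_extra_base = expected_motif_hits([1, 1, 1, 2, 4, 1, 1], 10_000)

print(round(core, 2))                     # about 5 per 10k
print(round(core + with_extra_base, 2))   # allowing either form doubles it
```

The unconstrained position contributes a factor of 4/4 = 1, so the longer motif is exactly as likely as the shorter one and the two expectations simply add.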
E-values: Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8)
There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the example sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with chance occurrences of 'A' marked beneath it]
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG', a 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger), E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger
E-values Effect of Query Length
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length
BLAST a 500 nt sequence against a database
BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Why not just use identity?
At some levels this is a good question
But consider two very different searches, both of which give a 75% identity match
Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
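A quick check that the two example alignments really do have the same identity, using the sequences as they appear on the slide. The helper name is hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned positions at which the two rows carry the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(x == y for x, y in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

for name, a, b in (("Query1", query1, match1), ("Query2", query2, match2)):
    print(name, len(a), "nt,", percent_identity(a, b), "% identity")
```

Both come out at 75% (45 of 60 and 12 of 16 positions), yet the slide's E-values for them differ by many orders of magnitude. Identity alone says nothing about how long the match is, and a short match is far easier to hit by chance.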
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
The difficulty is because:
[Figure: BLAST gives similarity + probability (match length, PI, size of database…), whereas ORTHOLOGY also depends on biological knowledge, nature of query sequence and phylogenetic relationship]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-values, from larger/worse to smaller/better: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Verdicts along the same scale: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
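Residue: residues that can substitute for it
A: S
C: (none)
D: N, E
E: D, Q, K
F: L, W, Y
G: (none)
H: N, Q, Y
I: L, M, V
K: R, Q, E
L: I, M, F, V
M: I, L, V
N: D, H, S
P: (none)
Q: R, E, H, K
R: Q, K
S: A, N, T
T: S
V: I, L, M
W: F, Y
Y: H, F, W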
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th
So now we get
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value
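The effect of substitutability on the estimate can be sketched as a change of 'effective alphabet size'. This is only the slide's back-of-envelope model (a residue plus about 2 acceptable substitutes out of 20 gives roughly 1 chance in 7 per position), not how BLAST actually derives protein E-values.

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532  # figure quoted on the slide
QUERY_LENGTH = 20


def expected_matches(database_letters, query_length, effective_alphabet):
    """Chance matches when each position 'matches' with probability 1/effective_alphabet."""
    return database_letters / effective_alphabet ** query_length


exact_only = expected_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 20)
with_substitutions = expected_matches(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 7)

print(f"exact residues only:  {exact_only:.1e}")
print(f"allowing substitutes: {with_substitutions:.1e}")
print(f"ratio:                {with_substitutions / exact_only:.1e}")
```

Loosening what counts as a match at each position raises the expected number of chance hits by around nine orders of magnitude for a 20-residue query.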
These substitutabilities are dealt with by the BLOSUM and PAM matrices
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database
Did you find any 'significant' hits?
Repeat with a second sequence
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
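As an illustration of how these options fit together, here is a small sketch that assembles a 'blastall' command line from the flags listed above. The file name query.fa and the database name refseq_mrna are hypothetical, and the sketch only builds and prints the command; it does not run BLAST.

```python
import shlex


def blastall_command(program, query_file, database, evalue=None, word_size=None):
    """Build an argument list for the legacy 'blastall' program."""
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if evalue is not None:
        argv += ["-e", str(evalue)]       # max expected value
    if word_size is not None:
        argv += ["-W", str(word_size)]    # word size
    return argv


# Hypothetical query file and database name.
simple = blastall_command("blastn", "query.fa", "refseq_mrna")
tuned = blastall_command("blastn", "query.fa", "refseq_mrna", evalue=1e-10, word_size=11)

print(shlex.join(simple))
print(shlex.join(tuned))
```

Anything left out of the command falls back to the default for the chosen blast flavour, as described above.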
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default options
Then hit [format] to wait for the results in a new page
(hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C
C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware
Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn – low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence
cDNA (117) AGAAAAGAAGAAACATGGCAATGGATCAGAA
           |||||||||||||||| ||||||||||||||
Genomic    AGAAAAGAAGAAACAT-GCAATGGATCAGAA
ON
OFF
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT
Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
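A tiny sketch of how such query strings are put together, using the [ORGN] tag and the logical operators mentioned above. The helper names are hypothetical, and the organism pair is just an illustration.

```python
def organism_term(name):
    """Tag an organism name as an Entrez organism term."""
    return f"{name}[ORGN]"


def combine(terms, operator="OR"):
    """Join Entrez terms with a logical operator (AND, OR or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)


fly_or_frog = combine(
    [organism_term("Drosophila melanogaster"), organism_term("Xenopus tropicalis")]
)
print(fly_or_frog)
```

The printed string is what would be pasted into the Limit by entrez query box.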
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html
Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section
There are several Xenopus tropicalis cyclins in the examples file
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window
(i) Run the default comparison, which should be BLASTn. Note the alignment
Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties…
What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
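One way to see why the same 75% identity means such different things is to ask how likely it is for two random sequences to agree at that many positions. The sketch below is not BLAST's statistic: `prob_at_least` is a hypothetical helper that looks at a single fixed, ungapped alignment and assumes each position matches independently with probability 1/4.

```python
from math import comb

def prob_at_least(matches: int, length: int, p: float = 0.25) -> float:
    """P(at least `matches` identical positions) for one fixed, ungapped
    alignment of two random sequences of `length` nt (binomial tail)."""
    return sum(comb(length, k) * p**k * (1 - p)**(length - k)
               for k in range(matches, length + 1))

p_short = prob_at_least(12, 16)   # 75% identity over 16 nt
p_long = prob_at_least(45, 60)    # 75% identity over 60 nt
print(f"16 nt: {p_short:.1e}   60 nt: {p_long:.1e}")
```

The short match is many orders of magnitude more likely to arise by chance, and that is before multiplying by the huge number of possible alignment positions in a database.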
So what's the real problem?
Are there any useful guidelines though, at least for biological meaningfulness?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because:
BLAST gives similarity + probability (match length, PI, size of database…)
ORTHOLOGY depends on biological knowledge, nature of query sequence, phylogenetic relationship
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger/worse <-> smaller/better
E-values: 10^-5 | 10^-10 | 10^-40 | 10^-100 | 0.0
fantasy | borderline | encouraging | pretty good | can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.3 x 10^-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
(residue: residues that can substitute for it)
A: S; C: none; F: LWY; G: none; I: LMV; L: IMFV; M: ILV; P: none; V: ILM; W: FY
N: DHS; Q: REHK; S: ANT; T: S; Y: HFW
H: NQY; K: RQE; R: QK
D: NE; E: DQK
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9, which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
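The two protein estimates above can be reproduced in a couple of lines. This is a back-of-envelope sketch with a made-up helper name, `expected_chance_matches`; it is not how BLAST itself derives E-values, which relies on scoring matrices and gapped-alignment statistics.

```python
def expected_chance_matches(db_letters: float, match_len: int, choices_per_position: int) -> float:
    """Chance matches when each position matches 1 time in `choices_per_position`."""
    return db_letters / choices_per_position ** match_len

DB_LETTERS = 3.5e8  # ~347,895,532 letters in the RefSeq protein database

exact = expected_chance_matches(DB_LETTERS, 20, 20)      # exact residue matches only
with_subs = expected_chance_matches(DB_LETTERS, 20, 7)   # ~1 in 7 once substitutions count
print(f"exact: {exact:.1e}   with substitutions: {with_subs:.1e}")
```

Allowing substitutions raises the chance expectation by roughly nine orders of magnitude for a 20-residue match.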
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In its simplest form the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
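To make the option syntax concrete, here is a small sketch that assembles a 'blastall' command line from the options listed above. The helper `blastall_command` and the file name `query.fasta` are hypothetical, and the command is only built and printed, not run; actually running it would need a local installation of the legacy blastall program and a formatted database.

```python
import shlex

def blastall_command(program, query_file, database, **options):
    """Return the argument list for a 'blastall' run (nothing is executed)."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        # each extra option is written as -x followed by its value
        args += [f"-{flag}", str(value)]
    return args

# blastn with a max expected value of 10 and the default blastn word size
cmd = blastall_command("blastn", "query.fasta", "nr", e=10, W=11)
print(shlex.join(cmd))
```

Anything not passed explicitly is left to the program's own default for that blast flavour.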
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
2. Low complexity filtering
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B (filter ON vs OFF)
Results for Exercise 2C (filter ON vs OFF)
There is a sequence error: an extra G at position 117 in the cDNA sequence.
AGAAAAGAAGAAACATGGCAATGGATCAGAA  cDNA (117)
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA  Genomic sequence
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
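If you build such queries often, a tiny helper keeps the [ORGN] tags and logical operators consistent. The function name `entrez_organism_query` is made up for this sketch; it only produces the text you would paste into the query box and does not contact NCBI.

```python
def entrez_organism_query(organisms, exclude=()):
    """Combine organism names into an Entrez query string using OR / NOT."""
    include = " OR ".join(f"{name}[ORGN]" for name in organisms)
    # brackets keep the OR group together when NOT terms follow
    query = f"({include})" if len(organisms) > 1 else include
    for name in exclude:
        query += f" NOT {name}[ORGN]"
    return query

print(entrez_organism_query(["Drosophila melanogaster"]))
print(entrez_organism_query(["Drosophila melanogaster", "Xenopus tropicalis"]))
```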
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
Theoretical/Predicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
Sequences for a model organism
ESTs - millions, £10 each
Cheap to sequence, so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels
cDNAs - tens of thousands, £1000 each
Expensive, but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences
Genomes - one, £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences: this is what we call homology, detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
Computers Can Detect Homology
In fact computers are very good at this task; the two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
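Counting matching positions is the simplest thing a computer can do with the two pairs above. A minimal sketch follows; `percent_identity` is a hypothetical helper for ungapped, equal-length sequences only.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

related = percent_identity("ATGCATGCTGCCAACGGATGCCCTG", "ATGAAAGCCGCCTACGACAGTCCTG")
unrelated = percent_identity("GCTGACTCGTAGCGCTTAGCTAGCT", "CCAACATCTAGCCAGATTAGTTAGT")
print(related, unrelated)
```

Two random DNA sequences are expected to agree at about 25% of positions, so the second pair is indistinguishable from chance while the first is well above it.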
Orthologs
Gene duplication through speciation.
The two copies of Gene A will now evolve independently but will continue to have the ~same function.
They are ORTHOLOGS.
Paralogs
Gene duplication through internal genome duplication.
The two copies of Gene A (A and A') will now evolve independently but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Figure labels: gene A; cyclin b1]
Function Conserved Longer than Detectable Similarity
[Figure labels: start from first self-replicating sequence; living organisms; same function; detectable similarity; whole genome duplication; local duplication]
Redundancy in the Genetic Code
GCA, GCC, GCG, GCT: A (alanine)
TGC, TGT: C (cysteine)
GAC, GAT: D (aspartate)
GGA, GGC, GGG, GGT: G (glycine)
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
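A short sketch makes the point: two DNA sequences that differ at every third position can still translate to the same peptide. Only the twelve codons listed above are included in the table, and the two example sequences are invented for illustration.

```python
# Partial codon table: just the four amino acids shown above
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}

def translate(dna: str) -> str:
    """Translate an in-frame DNA string using the partial table above."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

dna_1 = "GCATGCGACGGA"
dna_2 = "GCTTGTGATGGG"   # differs from dna_1 at every third base
dna_differences = sum(a != b for a, b in zip(dna_1, dna_2))
print(translate(dna_1), translate(dna_2), dna_differences)
```

The DNA is only two-thirds identical, yet the protein sequences are the same.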
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers Exercise 1
The Essential Task
experiment / data mining
gene sequence: what is its function?
database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wnt8, Troponin T3
match: Gravin-like
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine; it is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.
[Figure: frog protein -> database of human proteins -> best match human protein -> database of frog proteins]
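The reciprocal test itself is simple once the best hits in each direction are known. In this sketch the two dictionaries stand in for the top BLAST hit of each protein; the identifiers (`frog_1`, `human_X`, and so on) and the function name are invented for illustration.

```python
def reciprocal_best_hits(best_a_to_b: dict, best_b_to_a: dict) -> dict:
    """Keep pairs where a's best hit is b AND b's best hit is a."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}

# Hypothetical best hits: frog proteins searched against human, and back again
frog_to_human = {"frog_1": "human_X", "frog_2": "human_Y", "frog_3": "human_X"}
human_to_frog = {"human_X": "frog_1", "human_Y": "frog_9"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Only frog_1 and human_X pick each other, so only that pair is kept as a candidate ortholog.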
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
[Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2]
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches; this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
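The trade-off between matches and gap costs can be sketched as a scoring function over an already-aligned pair of sequences. The function name and the score values used here (match +1, mismatch -3, gap open -5, gap extend -2) are illustrative assumptions, not a statement of BLAST's defaults.

```python
def score_alignment(top, bottom, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two aligned strings of equal length, with '-' marking a gap.

    Simplification: a run of gap columns is charged one opening cost and then
    the extension cost, regardless of which sequence carries the gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

one_gap = score_alignment("ACGTACGT", "ACG--CGT")    # one 2-base gap
many_gaps = score_alignment("A-C-G-T", "ATCTGTT")    # no mismatches, three gaps
print(one_gap, many_gaps)
```

The second alignment has no mismatches at all, yet scores far worse, because every gap has to be paid for.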
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
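The word-indexing idea can be sketched in a few lines: index every word in the database once, then look up each word of the query. This is a simplified illustration with made-up function names and toy sequences, using exact word matches and a word size of 4 so the example stays small; the real program is considerably more elaborate.

```python
from collections import defaultdict

def build_word_index(sequences: dict, word_size: int) -> dict:
    """Map every word of length `word_size` to the (name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index

def seed_hits(query: str, index: dict, word_size: int) -> list:
    """Return (name, query_pos, db_pos) for each exact word shared with the query."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            hits.append((name, qpos, dpos))
    return hits

database = {"seq1": "TTACGTACGTT", "seq2": "GGGGCCCC"}   # toy database
index = build_word_index(database, word_size=4)
hits = seed_hits("ACGTA", index, word_size=4)
print(hits)
```

Only seq1 shares any word with the query, so only seq1 would go forward to the slower extension step; a sequence with no shared word is never examined at all.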
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Match   F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
For a 1-letter query: Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
For a 2-letter query: Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
For a 3-letter query: Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
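The same calculation as a short loop, for anyone who wants to try other query lengths. This is a sketch under the slide's assumptions (random database, equal base frequencies, exact matches only), and `expected_matches` is a made-up name; for the 60-mer it uses the exact 4^60, so the result is a little below the rounded figure above.

```python
DB_LETTERS = 5.0e8   # RefSeq mRNA, ~5.0 x 10^8 letters

def expected_matches(query_len: int, db_letters: float = DB_LETTERS) -> float:
    """Expected number of chance exact matches for a query of this length."""
    return db_letters / 4 ** query_len

for n in (1, 2, 3, 60):
    print(f"{n:>2}-mer: {expected_matches(n):.1e}")
```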
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10 kb promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] and wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
2. Low-complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
ON
OFF
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment.
Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
Theoretical/Predicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
Supporting Evidence
EST evidence
genome
gene model
We note that in the absence of EST evidence it is only really possible to predict coding sequence with any confidence (and even then…)
So predicted genes based on computational gene models alone will usually lack UTR regions, which has some important consequences.
exons 1 2 3 4
Theoretical/Predicted Sequences
genome
predicted gene model
exons 1 2 3 4
We've now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence, but we shouldn't lose sight of the fact that we don't really know if these predicted proteins exist, especially where supporting EST evidence is weak or non-existent.
predicted transcript
predicted protein
Sequences for a model organism
ESTs - millions, £10 each
Cheap to sequence, so we get millions per organism
But lots of errors
And incomplete gene sequences
Can give us relative expression levels
cDNAs - tens of thousands, £1000 each
Expensive, but only need to do one (or a small number) per gene
Few errors with multipass sequencing
Gives us protein sequences
Genomes - one, £30,000,000
Extremely expensive
But the only way to get the whole picture
Gives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species…
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences. This is what we call homology: detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence…
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences on a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
Gene duplication through speciation
The two copies of Gene A will now evolve independently, but will continue to have roughly the same function.
They are ORTHOLOGS
Paralogs
Gene duplication through internal genome duplication
The two copies of Gene A (A and A') will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Diagram: gene A in both species; example label: cyclin b1]
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
Codon  Letter  Amino acid
GCA    A       alanine
GCC    A
GCG    A
GCT    A
TGC    C       cysteine
TGT    C
GAC    D       aspartate
GAT    D
GGA    G       glycine
GGC    G
GGG    G
GGT    G
Many mutations in the third position of a codon triplet are 'synonymous' or 'silent': they have no effect on the amino acid coded for, so there is little evolutionary pressure against them…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
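As a small illustration of this point, the following Python sketch translates two DNA sequences that differ only at third codon positions. The codon table holds just the four amino acids listed above, and the two example sequences are made up for illustration; they are not from the workshop files.

```python
# Partial codon table: only the codons listed above.
CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical sequences differing only in third codon positions.
seq1 = "GCATGCGACGGA"
seq2 = "GCCTGTGATGGG"

print(translate(seq1), translate(seq2))
print("DNA identity:    ", identity(seq1, seq2))
print("protein identity:", identity(translate(seq1), translate(seq2)))
```

The DNA sequences differ at a third of their positions, yet the translated proteins are identical.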
Exercise 1
nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison:
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers Exercise 1
The Essential Task
experiment, data mining
gene sequence: what is its function?
database of proteins in other species
Cyclin-A
FoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity, orthologs: same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one in which you wish to find an ortholog.
frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> is the best match the original frog protein?
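The reciprocal test can be sketched in a few lines of Python. This is only an illustration of the logic, assuming the best hit for each protein has already been extracted from two BLAST runs; the protein names and the two best-hit tables are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best-hit tables, as might be parsed from two BLAST runs.
frog_to_human = {
    "frog_p1": "human_p7",
    "frog_p2": "human_p3",
    "frog_p3": "human_p3",   # best hit is not reciprocated
}
human_to_frog = {
    "human_p7": "frog_p1",
    "human_p3": "frog_p2",
}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog_p3 is dropped because its best human match points back to a different frog protein.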
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
These of course represent chromosomal re-arrangements since these organisms diverged.
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
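The trade-off between matches and gap costs can be sketched as a simple scoring function. This is a minimal illustration, not BLAST's own code: the reward and penalty values are arbitrary assumptions, the sequences are invented, and the convention that the first gap position pays the opening cost while later ones pay only the extension cost is one of several in use.

```python
def score_alignment(row1, row2, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two equal-length alignment rows; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            # The first gap position pays the opening cost, later ones only extension.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# Hypothetical alignments: one mismatch, a one-base gap, a two-base gap.
print(score_alignment("ACGTTACGT", "ACGTAACGT"))
print(score_alignment("ACGTTACGT", "ACGT-ACGT"))
print(score_alignment("ACGTTTACGT", "ACGT--ACGT"))
```

With these values a gap is only worth introducing if it rescues more matches than it costs.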
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class the best known, and by far the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments that have no word-sized stretch in common: consider BLASTn with its default word size of 11.
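The word-indexing idea can be sketched as follows. This is a toy illustration of the principle rather than BLAST's actual implementation: the database, the query and the word size of 4 are all made up so that the output stays small.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].append((name, i))
    return index


def find_seeds(query, index, word_size):
    """Return (query offset, sequence name, database offset) for every exact word match."""
    seeds = []
    for q in range(len(query) - word_size + 1):
        for name, d in index.get(query[q:q + word_size], []):
            seeds.append((q, name, d))
    return seeds


# Hypothetical toy database, indexed once and then reused for every query.
database = {"seq1": "TTACGTACGGA", "seq2": "GGGGGGGGGGG"}
index = build_word_index(database, word_size=4)
seeds = find_seeds("ACGTAC", index, word_size=4)
print(seeds)
```

Only seq1 produces seeds, so only seq1 would go forward to the slower extension step; a sequence sharing no complete word with the query is never examined at all.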
Here is a 'typical' weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1 letter: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2 letters: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3 letters: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
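The same first-principles arithmetic can be written as a short Python sketch. It assumes, as the calculation above does, a random database with equal base frequencies and exact ungapped matches only; the function name is ours, not part of any BLAST tool.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact, ungapped chance matches of a query in a random database."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # approximate RefSeq mRNA size quoted above

for n in (1, 2, 3, 60):
    print(n, expected_chance_matches(refseq_letters, n))
```

Because 4^60 is about 1.3 x 10^36 rather than exactly 10^36, the exact figure for the 60-mer comes out a little below the rounded value used in the text; the order of magnitude is what matters here.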
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many occurrences of a promoter motif such as ACC[TG]TA would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
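The motif calculation generalises to any pattern written as one set of allowed bases per position. The sketch below assumes equal base frequencies and ignores edge effects and the reverse strand; the list representation of the motif is our own convention for the example.

```python
def motif_probability(motif_positions):
    """Probability that a random position starts the motif.

    Each entry is a string holding the bases allowed at that position.
    """
    p = 1.0
    for allowed in motif_positions:
        p *= len(allowed) / 4
    return p


# ACC[TG]TA written as one string of allowed bases per position.
motif = ["A", "C", "C", "TG", "T", "A"]

p = motif_probability(motif)
spacing = 1 / p                 # expect one occurrence every this many nt
expected_in_10k = 10_000 * p    # expected chance occurrences in 10,000 nt

print(spacing, expected_in_10k)
```

Allowing a second, non-overlapping motif with the same probability simply doubles the expected count.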
E-values Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq with 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1 letter: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2 letters: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3 letters: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
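The proportionality can be checked with a couple of lines of Python. This sketch assumes nothing changes except the number of letters in the database; the helper name is hypothetical.

```python
def scaled_e_value(e_value, old_db_letters, new_db_letters):
    """Rescale an E-value for a different database size, all else being equal."""
    return e_value * new_db_letters / old_db_letters


refseq = 5.0e8   # letters, as quoted above
nr = 1.4e10      # letters, as quoted above

print(nr / refseq)
print(scaled_e_value(5.0e-28, refseq, nr))
```

The exact ratio is 28, which the text rounds to ~30.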
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28, but our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
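A quick Python check confirms that the two alignments above really do have the same percent identity despite their very different lengths. The sequences are copied from the text; the helper function is our own.

```python
def percent_identity(a, b):
    """Percentage of identical positions in an ungapped, equal-length alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Identity alone cannot separate the two cases: 45 matching positions out of 60 is far less likely to arise by chance than 12 out of 16, and that difference is what the E-value captures.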
So what's the real problem?
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database…
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger/worse <-> smaller/better
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Residue  Can be substituted by
A        S
C        (none)
F        L W Y
G        (none)
I        L M V
L        I M F V
M        I L V
P        (none)
V        I L M
W        F Y
N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W
H        N Q Y
K        R Q E
R        Q K
D        N E
E        D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
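The effect of allowing substitutions on the first-principles estimate can be reproduced in Python. This is the same back-of-envelope model as in the text (independent positions, a flat 1-in-20 or 1-in-7 chance of a 'match'), not the statistics BLAST actually uses.

```python
db_letters = 3.5e8  # approximate RefSeq protein size quoted above
length = 20         # length of the query in amino acids

strict = db_letters / 20 ** length    # only identical residues count as a match
relaxed = db_letters / 7 ** length    # roughly 1 in 7 residues counts as a match

print(strict)
print(relaxed)
print(relaxed / strict)
```

Loosening the definition of a match from 1 in 20 to 1 in 7 raises the expected number of chance hits by about nine orders of magnitude.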
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by a flag of the form -x followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
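As an illustration of how these options fit together, here is a small Python sketch that assembles a blastall command line from the three basic options, plus optional extras such as the -e expectation cutoff. The file and database names are hypothetical, and the sketch only prints the command; it does not run it.

```python
import shlex


def blastall_command(program, query_file, database, extra=()):
    """Assemble a blastall argument list from the three basic options plus any extras."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database, *extra]


# Hypothetical file and database names, for illustration only.
cmd = blastall_command("blastn", "query.fasta", "my_database", extra=["-e", "1e-10"])
print(shlex.join(cmd))

# To actually run it (requires a local blastall installation and database):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list of separate arguments avoids quoting problems if a file name contains spaces.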
TheoreticalPredicted Sequences
genome
predicted gene modelexons 1 2 3 4
Wersquove now reversed the process of working out exon structure from aligning cDNA sequences against the genome sequence but we shouldnrsquot lose sight of the fact that we donrsquot really know if these predicted proteins exists ndash especially where supporting EST evidence is weak or non-existent
predicted transcript
predicted protein
Sequences for a model organism
ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels
cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences
Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
[Diagram: Gene A present in each of two species]
Gene duplication through speciation. The two copies of Gene A will now evolve independently but will continue to have ~the same function.
They are ORTHOLOGS.
Paralogs
Gene duplication through internal genome duplication.
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function.
They are PARALOGS.
[Diagram: Gene A duplicated within one genome, giving A and A']
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
[Diagram: copies A and A' of gene A in the orange and green frogs]
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Diagram labels: gene A in each species; cyclin b1, cyclin b1]
Function Conserved Longer than Detectable Similarity
[Diagram: timeline starting from the first self-replicating sequence and ending at living organisms; labels: same function, detectable similarity, whole genome duplication, local duplication]
Redundancy in the Genetic Code
Codons               Letter  Amino acid
GCA GCC GCG GCT      A       alanine
TGC TGT              C       cysteine
GAC GAT              D       aspartate
GGA GGC GGG GGT      G       glycine
'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
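To make the point concrete, here is a small sketch using only the twelve codons listed on the slide. The two DNA strings are invented for illustration; they differ at every third codon position yet translate to the same peptide.

```python
# Partial codon table: only the codons shown on the slide, not the full genetic code.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string using the partial table above."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Hypothetical example sequences (not from the workshop files).
dna_1 = "GCATGCGACGGA"
dna_2 = "GCTTGTGATGGG"

identical_bases = sum(a == b for a, b in zip(dna_1, dna_2))
print(f"DNA identity:     {identical_bases}/{len(dna_1)}")
print(f"Protein 1: {translate(dna_1)}")
print(f"Protein 2: {translate(dna_2)}")
```

At the DNA level only 8 of 12 bases agree, while the two translations are identical.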
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align' - this will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers Exercise 1
The Essential Task
experiment, data mining
gene sequence: what is its function?
database of proteins in other species
Cyclin-A
FoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity, orthologs, same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine - but it is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.
[Diagram: frog protein; database of human proteins; best match human protein; database of frog proteins; x]
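The reciprocal test itself is simple once the two sets of best BLAST hits are in hand. The sketch below assumes the searches have already been run and summarised as two dictionaries; the gene identifiers are invented, and a real pipeline would also need E-value cut-offs and tie handling.

```python
def reciprocal_best_hits(best_a_to_b: dict, best_b_to_a: dict) -> dict:
    """Keep only pairs where each protein is the other's best hit.

    best_a_to_b maps each protein of species A to its best match in species B;
    best_b_to_a is the same in the opposite direction.
    """
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical best hits, e.g. parsed from two all-against-all BLASTp runs.
frog_to_human = {
    "frog_gene_1": "human_gene_X",
    "frog_gene_2": "human_gene_Y",
    "frog_gene_3": "human_gene_Y",
}
human_to_frog = {
    "human_gene_X": "frog_gene_1",
    "human_gene_Y": "frog_gene_3",
}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here `frog_gene_2` is dropped: its best human match points back to a different frog gene, so the pair is not reciprocal.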
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
These of course represent chromosomal re-arrangements since these organisms diverged.
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as the best top hit, and have a look at the Metazome alignment(s)...
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
If gaps cost nothing, we can produce an alignment that has no mismatches but is clearly not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
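The trade-off between matches and gap costs can be sketched by scoring an alignment that has already been written out with '-' characters. The scoring values below are arbitrary illustrative choices, not BLAST's defaults, and the function name is our own.

```python
def score_alignment(row_a: str, row_b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -3, gap_extend: int = -1) -> int:
    """Score a pre-aligned pair of rows; '-' marks a gap.

    The first position of a run of gaps costs gap_open, each further
    position in the same run (in the same row) costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # which row the current run of gaps is in, if any
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            current = "a" if a == "-" else "b"
            score += gap_extend if gap_row == current else gap_open
            gap_row = current
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


# With free gaps, a mismatch-free but meaningless alignment looks perfect ...
print(score_alignment("A-C-G-T", "AGCTGAT", gap_open=0, gap_extend=0))
# ... but once gaps carry a cost it scores worse than a plain ungapped match.
print(score_alignment("A-C-G-T", "AGCTGAT"))
print(score_alignment("ACGTACGT", "ACGTTCGT"))
```

With gap costs switched on, the heavily gapped alignment drops from 4 to -5, while the ungapped alignment with one mismatch scores 6.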
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common; consider BLASTn with a default word-size of 11.
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
A A A A AA A A A AA AA
1 letter: Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2 letters: Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3 letters: Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
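The same back-of-envelope calculation, written as a short Python sketch. The function name is our own, and the model is the deliberately naive one from the slide (exact, ungapped match; all four bases equally likely), not the statistics BLAST itself uses.

```python
def naive_e_value(database_letters: float, query_length: int, alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches of a query of the given length."""
    return database_letters / float(alphabet_size) ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8   # ~503,566,580 letters, rounded as on the slide
NR_LETTERS = 1.4e10           # ~14,601,814,750 letters, rounded as on the slide

for length in (1, 2, 3, 60):
    print(f"{length:>2} nt: RefSeq {naive_e_value(REFSEQ_MRNA_LETTERS, length):.1e}, "
          f"nr {naive_e_value(NR_LETTERS, length):.1e}")
```

Computed exactly, 4^60 is about 1.3 x 10^36, so the 60-mer value comes out a little under 4 x 10^-28; the slide's 5.0 x 10^-28 uses the rounder 10^36. Either way the order of magnitude is the same, and the nr figure is larger by the ratio of the two database sizes.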
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10 kb promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. either ACC[TG]TA or the same motif with one extra base inserted after the [TG] position?
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10 kb by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-value.
To allow either means we expect twice as many.
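The exercise answer can be checked with a short sketch. The function name is our own, and one assumption is made loudly: the optional extra base is taken to be any nucleotide, since that is the reading under which the two motifs have the same expected count. Only one strand is counted and edge effects are ignored.

```python
def expected_motif_hits(allowed_per_position: list, sequence_length: int,
                        alphabet_size: int = 4) -> float:
    """Expected chance occurrences of a motif in random sequence.

    allowed_per_position lists, for each motif position, how many of the
    four bases are acceptable there (1 for a fixed base, 2 for [TG], 4 for any).
    """
    probability = 1.0
    for n_allowed in allowed_per_position:
        probability *= n_allowed / alphabet_size
    return sequence_length * probability


motif = [1, 1, 1, 2, 1, 1]                   # A C C [TG] T A
motif_with_extra_base = [1, 1, 1, 2, 4, 1, 1]  # ASSUMPTION: the extra base may be any nucleotide

hits = expected_motif_hits(motif, 10_000)
hits_extra = expected_motif_hits(motif_with_extra_base, 10_000)
print(f"ACC[TG]TA in 10 kb:        {hits:.2f}")
print(f"with one extra base:       {hits_extra:.2f}")
print(f"either form allowed:       {hits + hits_extra:.2f}")
```

Each form is expected about 4.9 times in 10 kb, so allowing either roughly doubles the count, as the answer slide says.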
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq mRNA and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
A A A A AA A A A AA AA
1 letter: Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2 letters: Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3 letters: Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28.
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
database
BLAST a 500 nt sequence against a database (BLASTn)
Get a full-length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
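The contrast between the two 75% matches can be reproduced with the same naive model used earlier, counting only the identical positions. This is our own reading of how the slide's approximate figures arise, and the function name is hypothetical.

```python
def chance_matches(database_letters: float, alignment_length: int,
                   fraction_identical: float, alphabet_size: int = 4) -> float:
    """Naive expected number of chance hits with this many identical positions."""
    identical_positions = round(alignment_length * fraction_identical)
    return database_letters / float(alphabet_size) ** identical_positions


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size from the earlier slide

long_match = chance_matches(REFSEQ_MRNA_LETTERS, 60, 0.75)   # 45 identical positions
short_match = chance_matches(REFSEQ_MRNA_LETTERS, 16, 0.75)  # 12 identical positions
print(f"60 nt at 75% identity: {long_match:.1e}")
print(f"16 nt at 75% identity: {short_match:.1f}")
```

Both alignments have the same percentage identity, yet the expected number of chance hits differs by roughly twenty orders of magnitude, which is why identity alone is not a useful measure of significance.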
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database...
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger/worse ... smaller/better
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue  Substitutes
A        S
C        -
F        L W Y
G        -
I        L M V
L        I M F V
M        I L V
P        -
V        I L M
W        F Y
N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W
H        N Q Y
K        R Q E
R        Q K
D        N E
E        D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
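The effect of allowing substitutions can be seen by re-running the naive calculation with the two per-position probabilities quoted on the slide. As before this is a rough sketch with our own function name, and it is not how BLAST derives its E-values (which depend on the scoring matrix and gap costs).

```python
def naive_protein_e_value(database_letters: float, query_length: int,
                          per_position_match_probability: float) -> float:
    """Expected chance matches if each position 'matches' with the given probability."""
    return database_letters * per_position_match_probability ** query_length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # ~347,895,532 letters, rounded as on the slide

exact_only = naive_protein_e_value(REFSEQ_PROTEIN_LETTERS, 20, 1 / 20)
with_substitutions = naive_protein_e_value(REFSEQ_PROTEIN_LETTERS, 20, 1 / 7)
print(f"exact matches only (1/20 per position): {exact_only:.1e}")
print(f"allowing substitutions (1/7 per position): {with_substitutions:.1e}")
```

Moving from 1/20 to 1/7 per position raises the expected number of chance matches by about nine orders of magnitude for a 20-residue query.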
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
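As an illustration of how these options combine, the sketch below assembles a `blastall` command line from the flags listed above without running it. The helper name, the query file `query.fasta` and the database name `refseq_rna` are placeholders chosen for the example; note also that `blastall` is the legacy program named on these slides, and newer NCBI BLAST+ releases use different program names and flags.

```python
def build_blastall_command(program: str, query_file: str, database: str,
                           extra_options: dict = None) -> list:
    """Build an argument list for the legacy 'blastall' program.

    extra_options maps single-letter flags (without the dash) to values,
    e.g. {"e": "1e-5"} for a maximum expected value.
    """
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical file and database names, for illustration only.
command = build_blastall_command(
    "blastn", "query.fasta", "refseq_rna",
    {"e": "1e-5", "W": 11, "F": "F"},  # max E-value, word size, low-complexity filter off
)
print(" ".join(command))
```

Building the command as a list keeps each flag and value separate, so it could be handed to `subprocess.run` on a machine where the program and a formatted database are actually installed.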
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
ON
OFF
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Sequences for a model organism
ESTs ndash millions pound10 eachCheap to sequence ndash so we get millions per organismBut lots of errorsAnd incomplete gene sequencesCan give us relative expression levels
cDNAs ndash tens of thousands pound1000 eachExpensive ndash but only need to do one (or a small number) per geneFew errors with multipass sequencingGives us protein sequences
Genomes ndash one pound30000000Extremely expensiveBut the only way to get the whole pictureGives us gene regulation
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
lsquoOtherrsquo-logsWhat about gene duplication after speciation
How can we describe the relationship(s) between the various copies of gene A in the two frogs
Bear in mind that understanding gene function is more important than semanticshellip
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
A
A
A
Arsquo A
The Essential Paradigm
1 any group of modern species can be traced back to some extinct common ancestor
A
A
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also "expect value" or "expectation".
E-values: From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
Every 'A' in a stretch of database sequence like this is a chance match to a 1-letter query.
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
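The same back-of-envelope arithmetic can be written as a short Python sketch. It assumes, as the slides do, exact ungapped matches and equal base frequencies; the function name expected_chance_matches is made up for the example.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Naive expectation: database size divided by alphabet_size ** query_length."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # approximate RefSeq mRNA size quoted in the statistics above

for length in (1, 2, 3, 60):
    print(length, f"{expected_chance_matches(refseq_letters, length):.1e}")
```

Note that 4^60 is really about 1.3 x 10^36, so the exact figure for the 60-mer comes out a little below the rounded value that you get by treating 4^60 as 10^36.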
E-values: In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get...
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise: Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If ACC[TG]TAA is also allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
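The motif arithmetic above generalises to any motif written as a list of allowed bases per position. The sketch below is illustrative only: it assumes equal base frequencies, and it assumes that the optional extra base may be any nucleotide (which is what makes the two motifs equally likely). The function name expected_motif_hits is made up for the example.

```python
def expected_motif_hits(motif_positions, sequence_length):
    """Expected chance occurrences of a motif given as a list of allowed-base sets."""
    probability = 1.0
    for allowed in motif_positions:
        probability *= len(allowed) / 4  # assumes A, C, G, T are equally frequent
    return sequence_length * probability


ANY_BASE = {"A", "C", "G", "T"}

# ACC[TG]TA
motif = [{"A"}, {"C"}, {"C"}, {"T", "G"}, {"T"}, {"A"}]
# Assumption: the optional additional base between positions 4 and 5 can be any base.
motif_with_extra_base = [{"A"}, {"C"}, {"C"}, {"T", "G"}, ANY_BASE, {"T"}, {"A"}]

hits = expected_motif_hits(motif, 10_000)
hits_with_extra_base = expected_motif_hits(motif_with_extra_base, 10_000)
print(hits, hits_with_extra_base, hits + hits_with_extra_base)
```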
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' (a 60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters; E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger); E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLASTn a 500 nt sequence against a database: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~3.0
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
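To check that the two alignments really do have the same percent identity, here is a minimal Python sketch. The helper percent_identity is made up for the example and simply counts identical columns in two equal-length aligned strings.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue (gap columns never count)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Identity alone cannot tell these two matches apart; only the length (45 identical positions out of 60 versus 12 out of 16) does, and that is what the E-value captures.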
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database...
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger = worse, smaller = better
E-value 10^-5: fantasy
E-value 10^-10: borderline
E-value 10^-40: encouraging
E-value 10^-100: pretty good
E-value 0.0: can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI...
Really too big a discrepancy to easily explain with hand waving...
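Amino Acid Substitutions
Residue: can be substituted by
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K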
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
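The two protein estimates, and the 'about 2 others' figure, can be reproduced with a short Python sketch. The substitution groups are my reading of the slide's table, and the helper name expected_matches is made up for the example; this is the same hand-waving arithmetic as above, not how BLAST actually computes E-values.

```python
# Substitution groups as read from the slide's table (an assumption of this sketch).
SUBSTITUTIONS = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}


def expected_matches(database_letters, query_length, chance_per_position):
    """Naive expectation for an ungapped match of query_length residues."""
    return database_letters * chance_per_position ** query_length


average_substitutes = sum(len(v) for v in SUBSTITUTIONS.values()) / len(SUBSTITUTIONS)
refseq_protein_letters = 3.5e8  # ~347,895,532 letters

exact_only = expected_matches(refseq_protein_letters, 20, 1 / 20)
with_substitutions = expected_matches(refseq_protein_letters, 20, 1 / 7)

print(f"average substitutes per residue: {average_substitutes:.1f}")
print(f"exact matches only: {exact_only:.1e}")
print(f"allowing substitutions: {with_substitutions:.1e}")
```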
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
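A minimal sketch of how such a command line is put together, written in Python so the pieces are explicit. The file name query.fasta and the database name refseq_mrna are hypothetical placeholders, and the helper build_blastall_command is made up for the example; -p, -i, -d, -e and -W are blastall options described in these slides. The sketch only assembles and prints the command, it does not run BLAST.

```python
import shlex


def build_blastall_command(program, query_file, database, **options):
    """Assemble a blastall argument list from the three basic options plus any extras."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical file and database names, for illustration only.
command = build_blastall_command("blastn", "query.fasta", "refseq_mrna", e="1e-5", W=11)
print(shlex.join(command))
```

Building the arguments as a list, rather than one long string, keeps each option and its value together and avoids quoting problems if the command is later passed to a process runner.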
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters: Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B: ON, OFF
Results for Exercise 2C: ON, OFF
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters: Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
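As a small illustration of combining items, the Python sketch below joins organism names into one query string using the [ORGN] tag shown above. The helper name organism_filter and the choice of organisms are made up for the example.

```python
def organism_filter(*organisms, operator="OR"):
    """Join organism names into a single Entrez-style query using the [ORGN] field tag."""
    return f" {operator} ".join(f"{name}[ORGN]" for name in organisms)


query = organism_filter("Drosophila melanogaster", "Xenopus tropicalis")
print(query)
```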
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters: Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
If the genetic difference means they can no longer interbreed with fertile offspring, then we have a new species...
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences; this is what we call homology: detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence...
Computers Can Detect Homology
In fact computers are very good at this task. The two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
Gene duplication through speciation.
The two copies of Gene A will now evolve independently, but will continue to have the ~same function.
They are ORTHOLOGS.
Paralogs
Gene duplication through internal genome duplication (A becomes A and A').
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Figure labels: gene A; cyclin b1]
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
Codons            Letter  Amino acid
GCA GCC GCG GCT   A       alanine
TGC TGT           C       cysteine
GAC GAT           D       aspartate
GGA GGC GGG GGT   G       glycine
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for, so there is no evolutionary pressure against this...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
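A minimal Python sketch of this point, using only the codons listed in the table above (so it is deliberately not a full genetic code). The two DNA strings are invented examples that differ only at third codon positions.

```python
# Partial codon table: only the codons shown in the slide's table.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
    "TGC": "C", "TGT": "C",
    "GAC": "D", "GAT": "D",
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
}


def translate(dna):
    """Translate in frame 1, codon by codon, ignoring any incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna_1 = "GCATGCGACGGA"
dna_2 = "GCTTGTGATGGG"  # differs from dna_1 only at third codon positions

dna_matches = sum(a == b for a, b in zip(dna_1, dna_2))
print(f"DNA identity: {dna_matches}/{len(dna_1)}")
print(translate(dna_1), translate(dna_2))
```

The nucleotide sequences differ at a third of their positions, yet the translated sequences are identical.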
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF, click on it, hit Accept on the next screen, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers: Exercise 1
The Essential Task
experiment / data mining
gene sequence: what is its function?
database of proteins in other species
Cyclin-A
FoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wnt8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
frog protein -> database of human proteins -> best match human protein -> database of frog proteins
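The reciprocal best hit test itself is simple once the two sets of BLAST searches have been run. The Python sketch below assumes you already have the best hit for each protein in each direction; the protein names and the helper reciprocal_best_hits are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit and a is also b's best hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


# Hypothetical best hits from frog-vs-human and human-vs-frog searches.
frog_to_human = {"frog_1": "human_1", "frog_2": "human_1", "frog_3": "human_7"}
human_to_frog = {"human_1": "frog_1", "human_7": "frog_9"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here frog_2 also hits human_1 best, but human_1's best frog match is frog_1, so only the frog_1/human_1 pair survives as a candidate ortholog pair.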
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
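The trade-off between matches and gap costs can be seen in a small Python sketch. This is a standard global-alignment dynamic programme (Needleman-Wunsch style) with a simple per-gap cost, not the method BLAST itself uses; the scoring values (+1 match, -1 mismatch, -2 gap) and the function name are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a fixed cost per gap position."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


# One deleted base: a single gap recovers three matches.
print(global_alignment_score("ACGT", "AGT"))
# Unrelated sequences: with free gaps every mismatch can be avoided, which is meaningless.
print(global_alignment_score("AAAA", "TTTT", gap=0))
print(global_alignment_score("AAAA", "TTTT", gap=-2))
```

With gaps costing nothing, two completely unrelated sequences can be 'aligned' without a single mismatch; charging for gaps is what forces the alignment to reflect real similarity.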
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low-complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B (filtering ON and OFF)
Results for Exercise 2C (filtering ON and OFF)
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
If the genetic difference means they can no longer interbreed with fertile offspring - then we have a new species...
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG
We can still easily detect residual similarity between these sequences; this is what we call homology - detectable similarity because of common evolutionary origin.
ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution, homology may no longer be detectable in the DNA sequence...
Computers Can Detect Homology
In fact computers are very good at this task - the two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
[Diagram: gene A before and after speciation]
Gene duplication through speciation.
The two copies of Gene A will now evolve independently, but will continue to have the ~same function.
They are ORTHOLOGS.
Paralogs
[Diagram: gene A duplicated within one genome, giving A and A']
Gene duplication through internal genome duplication.
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
[Diagram: copies of gene A (A and A') in two frog species, one orange and one green]
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Diagram: gene A in the ancestor and in each modern species; example: cyclin b1 in both]
Function Conserved Longer than Detectable Similarity
[Diagram labels: start from first self-replicating sequence; living organisms; whole genome duplication; local duplication; same function; detectable similarity]
Redundancy in the Genetic Code
GCA, GCC, GCG, GCT  ->  A (alanine)
TGC, TGT            ->  C (cysteine)
GAC, GAT            ->  D (aspartate)
GGA, GGC, GGG, GGT  ->  G (glycine)
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for - so there is no evolutionary pressure against this...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
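To make the point concrete, here is a small Python sketch using only the codons from the table above. The two DNA sequences are invented for the example: they differ at every third position, yet translate to the same peptide.

```python
# Partial codon table: just the four amino acids shown above.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}

def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def fraction_identical(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Made-up sequences differing only at third codon positions.
seq1 = "GCAGGATGCGAC"
seq2 = "GCTGGGTGTGAT"

print(fraction_identical(seq1, seq2))                        # DNA level
print(fraction_identical(translate(seq1), translate(seq2)))  # protein level
```

The DNA comparison loses a third of its matches while the protein comparison loses none, which is why the protein-level search is the more sensitive one.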
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align' - this will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers: Exercise 1
The Essential Task
Gene sequence: what is its function?
Two routes: experiment, or data mining.
[Diagram: the gene sequence is compared against a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wnt8, Troponin T3); the match is Gravin-like]
We can only do this because of implied function based on orthology.
Functional Orthologs
[Diagram: Human gene (function known, annotation 'Gravin' available) and Xenopus gene (function unknown), linked by sequence similarity -> orthologs -> same function]
But we know that function is largely determined by shape (similar shape), which in general we cannot determine - but it is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
[Diagram: frog protein -> BLAST against database of human proteins -> best match human protein -> BLAST against database of frog proteins -> x]
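The reciprocal step can be sketched in a few lines of Python. This assumes the two BLAST searches have already been run and reduced to "best hit" dictionaries; the function name and all protein identifiers below are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit AND a is b's best hit."""
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }

# Hypothetical best hits: frog protein -> best human match, and the reverse.
frog_to_human = {
    "frog_cyclinB1": "human_CCNB1",
    "frog_geneX": "human_CCNB1",   # also hits CCNB1, but not reciprocally
}
human_to_frog = {
    "human_CCNB1": "frog_cyclinB1",
}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Only the pair that points back at itself survives. Note that this is a candidate ortholog list, not proof of orthology, and it can only work if both protein sets are complete.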
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
[Diagram: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]
Where the order is broken, this of course represents chromosomal re-arrangements since these organisms diverged.
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although an alignment with enough gaps can have no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
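A minimal Python sketch of this trade-off follows. It scores an alignment that has already been written out with '-' for gaps. The scoring values are illustrative defaults, not taken from the slides, and the function treats any run of gap columns as one gap, which is a simplification.

```python
def alignment_score(a, b, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two aligned strings of equal length; '-' marks a gap."""
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # First gap column pays open + extend; later ones only extend.
            score += gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

# One gap in an otherwise perfect alignment.
print(alignment_score("ACGTACGT", "ACG-ACGT"))

# No mismatches at all, but only because gaps were added freely.
print(alignment_score("A-C-G-T", "ATCAGCT"))
print(alignment_score("A-C-G-T", "ATCAGCT", gap_open=0, gap_extend=0))
```

With free gaps the second alignment looks perfect; with a gap cost it scores far worse than the first, which matches our biological intuition.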
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work? The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database - so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
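The word-lookup step can be sketched in Python as below. This is a toy illustration of the idea only, not BLAST's actual implementation; the function names, the tiny 'database' and the word size of 4 are all chosen for the example.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word of length word_size to the (name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index

def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) for each shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds

# Toy database with made-up sequences.
database = {"seq1": "TTACGTACGTAGG", "seq2": "GGGGCCCCAAAA"}
index = build_word_index(database, word_size=4)

print(find_seeds("CGTAG", index, word_size=4))
print(find_seeds("CATAG", index, word_size=4))  # similar, but shares no 4-letter word
```

The index is built once, so each query only has to look up its own words. The second query shows the cost of the shortcut: a sequence that is similar overall but shares no complete word is never seeded, and so is never extended.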
Here is a 'typical' weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
Also "expect value" or "expectation".
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches. Which is, of course, what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA (on the slide, each chance match is marked under the sequence)
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
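The same arithmetic as a short Python sketch. It assumes, as the slides do, exact ungapped matches and equal base frequencies; the function name is made up for the example.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches to a query of the given length."""
    return db_letters / alphabet_size ** query_length

# RefSeq mRNA size from the database statistics above.
REFSEQ_MRNA_LETTERS = 503_566_580

for n in (1, 2, 3, 60):
    print(n, f"{expected_chance_matches(REFSEQ_MRNA_LETTERS, n):.1e}")
```

Note that 4^60 is about 1.3 x 10^36, which the slide rounds to 10^36, so the computed figure for the 60-mer comes out slightly below 5.0 x 10^-28. At this level of approximation the difference does not matter.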
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
[BLAST output shown on slide]
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value.
To allow either means we expect twice as many.
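The answer above can be checked with a short Python sketch. The motif is written as a list of allowed bases per position; this representation and the function name are choices made for the example, and the calculation assumes equal base frequencies and a single strand.

```python
from math import prod

def chance_probability(motif):
    """Probability that a random position matches the motif.

    Each element of motif is a string of the bases allowed at that position.
    """
    return prod(len(allowed) / 4 for allowed in motif)

motif = ["A", "C", "C", "TG", "T", "A"]   # ACC[TG]TA

spacing = 1 / chance_probability(motif)            # one site every N nt
expected_in_10k = 10_000 * chance_probability(motif)

print(spacing, expected_in_10k)
print(2 * expected_in_10k)   # allowing either of two equally likely motifs
```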
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (RefSeq had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA (on the slide, each chance match is marked under the sequence)
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply proportional to database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
[Diagram: ACGTACGTACGT aligned with gaps]
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full length match with sequence XYZ at an E-value = 5.0e-160
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
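A quick Python check that the two alignments above really do have the same percent identity, even though their lengths (and hence their E-values) are very different. The helper name is made up for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions at which the two sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100 * matches / len(aligned_a)

query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
hit1   = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
hit2   = "ACGCACCTTCGTAGGT"

for query, hit in ((query1, hit1), (query2, hit2)):
    print(len(query), percent_identity(query, hit))
```

Identity alone cannot tell these two cases apart; the length of the match has to come into it somewhere, and the E-value is where it does.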
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
[Diagram labels: BLAST; ORTHOLOGY; BLAST = Similarity + Probability; biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database...]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-values, from larger/worse to smaller/better: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Verdicts, in the same order: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI:
[BLAST output shown on slide]
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue: commonly accepted substitutes
A: S
C: (none)
F: L, W, Y
G: (none)
I: L, M, V
L: I, M, F, V
M: I, L, V
P: (none)
V: I, L, M
W: F, Y
N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W
H: N, Q, Y
K: R, Q, E
R: Q, K
D: N, E
E: D, Q, K
In fact we need to take into account both amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices
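The effect of substitutability on the naive estimate can be reproduced with a short Python sketch. The 'effective alphabet' of 7 is the slide's rough figure (about 3 acceptable residues out of 20), not something BLAST uses directly, and the function name is made up for the example.

```python
def naive_protein_evalue(db_letters, query_length, effective_alphabet):
    """Chance matches if each position matches with probability 1/effective_alphabet."""
    return db_letters / effective_alphabet ** query_length

DB_LETTERS = 3.5e8   # approximate RefSeq protein size quoted above

strict = naive_protein_evalue(DB_LETTERS, 20, 20)   # exact residue matches only
relaxed = naive_protein_evalue(DB_LETTERS, 20, 7)   # ~3 acceptable residues per position

print(f"{strict:.1e}", f"{relaxed:.1e}", f"{relaxed / strict:.1e}")
```

Allowing substitutions raises the expected number of chance matches by roughly nine orders of magnitude for a 20-residue query, which is why the real scoring has to weight each substitution rather than simply count matches.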
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
[Example command line shown on slide]
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
If the genetic difference means they can no longer interbreed with fertile offspring ndash then we have a new specieshellip
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
ATGCATGCTGCCAACGGATGCCCTG
ATGGAAGGCGCTTAGGATAGTCCAG||| | | || | | | || |
After longer periods of evolution homology may no longer be detectable in the DNA sequencehellip
We can still easily detect residual similarity between these sequences this is what we call homology ndash detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task ndash the two primary challenges are
(a) performing the search fast enough to look through millions of sequence in a timescale compatible with a lab scientistrsquos attention span
(b) at low levels of similarity being able to distinguish between biologically related sequences and chance matcheshellip
ATGCATGCTGCCAACGGATGCCCTG
ATGAAAGCCGCCTACGACAGTCCTG||| | || ||| ||| | ||||
GCTGACTCGTAGCGCTTAGCTAGCT
CCAACATCTAGCCAGATTAGTTAGT | || | | | |
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
lsquoOtherrsquo-logsWhat about gene duplication after speciation
How can we describe the relationship(s) between the various copies of gene A in the two frogs
Bear in mind that understanding gene function is more important than semanticshellip
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
A
A
A
Arsquo A
The Essential Paradigm
1 any group of modern species can be traced back to some extinct common ancestor
A
A
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11
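The word-lookup step can be sketched in a few lines. This is only an illustration of the indexing idea, not BLAST's actual implementation; the toy database, the word size of 4 and the function names are all assumptions.

```python
from collections import defaultdict

def build_word_index(database: dict[str, str], word_size: int):
    """Map every word of length word_size to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index

def find_seeds(query: str, index, word_size: int):
    """Return (name, query offset, database offset) for every exact word match."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds

# Hypothetical two-sequence database, indexed once and reused for every query.
database = {"seq1": "TTACGTACGGA", "seq2": "GGGGCCCCTTTT"}
index = build_word_index(database, word_size=4)

seeds = find_seeds("ACGTAC", index, word_size=4)
no_seeds = find_seeds("AAAAAA", index, word_size=4)
print(seeds, no_seeds)
```

Only sequences with at least one seed would go on to the (slower) extension step. A query with no shared word, like the second one here, is never looked at again, which is exactly how a real but patchy similarity can be missed.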
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also "expect value" or "expectation"
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: chance matches to a one-letter query 'A' marked beneath the sequence]
Expected number of matches (1-letter query) = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Expected number of matches (2-letter query) = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Expected number of matches (3-letter query) = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28 (4^60 is ~1.3 x 10^36, rounded here to 10^36)
E-value = 5.0 x 10^-28
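The arithmetic above is easy to reproduce. The sketch below assumes a random database with equal base frequencies and counts only exact, ungapped matches; the function name is hypothetical and this is the first-principles estimate from the text, not how BLAST itself computes E-values.

```python
def naive_expected_matches(db_letters: float, alphabet_size: int, query_length: int) -> float:
    """Expected number of exact chance matches of a query in a random database."""
    return db_letters / alphabet_size ** query_length

REFSEQ_MRNA_LETTERS = 5.0e8  # 503,566,580 letters, rounded as in the text

for length in (1, 2, 3, 60):
    expected = naive_expected_matches(REFSEQ_MRNA_LETTERS, 4, length)
    print(f"{length:2d}-mer: {expected:.2e}")
```

For the 60-mer this gives a slightly smaller number than 5.0 x 10^-28, because the code uses the exact value of 4^60 rather than rounding it to 10^36.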
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-value
To allow either means we expect twice as many
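The exercise answer can be checked with a short calculation. This sketch assumes equal base frequencies and ignores edge effects at the ends of the 10k sequence. The second motif, 'ACC[TG][ACGT]TA', is my own assumption about what the optional extra base looks like (any base between the 4th and 5th positions); the function names are hypothetical.

```python
import re

def motif_match_probability(motif: str, alphabet_size: int = 4) -> float:
    """Chance that a random position matches a motif such as 'ACC[TG]TA'."""
    probability = 1.0
    # Each token is either a bracketed set of allowed bases or a single base.
    for token in re.findall(r"\[[A-Z]+\]|[A-Z]", motif):
        allowed = len(token) - 2 if token.startswith("[") else 1
        probability *= allowed / alphabet_size
    return probability

def expected_hits(motif: str, sequence_length: int) -> float:
    """Expected number of chance occurrences in a random sequence."""
    return sequence_length * motif_match_probability(motif)

basic = expected_hits("ACC[TG]TA", 10_000)
with_extra_base = expected_hits("ACC[TG][ACGT]TA", 10_000)  # assumed form of the longer motif
print(basic, with_extra_base, basic + with_extra_base)
```

Because a position that accepts any base contributes a factor of 1, the longer motif is just as likely as the shorter one, and allowing either roughly doubles the expected count.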
E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: chance matches to a one-letter query 'A' marked beneath the sequence]
Expected number of matches (1-letter query) = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Expected number of matches (2-letter query) = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Expected number of matches (3-letter query) = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger
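A rough way to see why gaps inflate the number of chance matches: count how many different database strings would now be accepted. The sketch below only considers the simplest case, a single one-letter gap that skips one base of the query, and is an illustration rather than anything BLAST computes; the function name is hypothetical.

```python
def single_deletion_variants(query: str) -> set[str]:
    """All distinct strings obtained by skipping exactly one letter of the query."""
    return {query[:i] + query[i + 1:] for i in range(len(query))}

query = "ACGTACGTACGT"
variants = single_deletion_variants(query)

# Without gaps only one database string matches; with one gap allowed,
# each of these shorter strings also counts as a 'match'.
print(len(variants))
for variant in sorted(variants)[:3]:
    print(variant)
```

Each extra acceptable pattern adds its own chance matches to N, and the patterns are shorter than the original query, so each one is individually more likely to turn up by chance.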
E-values Effect of Query Length
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length
BLAST a 500 nt sequence against a database with BLASTn
Get a full length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Why not just use identity?
At some levels this is a good question
But consider two very different searches, both of which give a 75% identity match
Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
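Percent identity itself is trivial to compute, which is part of its appeal. A minimal sketch (hypothetical function name) applied to the two alignments above shows that it cannot tell them apart:

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of aligned columns where both rows carry the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)

query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

pi_long = percent_identity(query1, match1)
pi_short = percent_identity(query2, match2)
print(len(query1), pi_long, len(query2), pi_short)
```

Both come out at 75%, yet one alignment is 60 nt long and the other only 16 nt. Identity on its own carries no information about match length or database size, which is what the E-value adds.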
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
The difficulty is because:
BLAST: similarity + probability (match length, PI, size of database...)
ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger = worse, smaller = better
E-value 10^-5: fantasy
E-value 10^-10: borderline
E-value 10^-40: encouraging
E-value 10^-100: pretty good
E-value 0.0: can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue  Can be substituted by
A  S
C  (none)
F  L W Y
G  (none)
I  L M V
L  I M F V
M  I L V
P  (none)
V  I L M
W  F Y
N  D H S
Q  R E H K
S  A N T
T  S
Y  H F W
H  N Q Y
K  R Q E
R  Q K
D  N E
E  D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (3 acceptable residues out of 20) rather than 1/20th
So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
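The two protein estimates can be reproduced directly. As before this is the first-principles approximation from the text (random database, ungapped matches), with a hypothetical function name; real BLAST scores substitutions through a matrix rather than a flat 1-in-7 chance.

```python
def naive_protein_e_value(db_letters: float, query_length: int,
                          choices_per_position: float) -> float:
    """Expected chance matches when each position matches with probability 1/choices."""
    return db_letters / choices_per_position ** query_length

REFSEQ_PROTEIN_LETTERS = 3.5e8  # 347,895,532 letters, rounded as in the text

exact_only = naive_protein_e_value(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = naive_protein_e_value(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matches only:      {exact_only:.1e}")
print(f"allowing substitutions:  {with_substitutions:.1e}")
print(f"ratio:                   {with_substitutions / exact_only:.1e}")
```

Allowing about 3 acceptable residues per position makes chance matches roughly a billion times more likely for a 20-residue query, which accounts for most of the discrepancy.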
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database
Did you find any 'significant' hits?
Repeat with a second sequence
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
blastall -p <flavour> -i <query file> -d <database>
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
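If you end up scripting searches, it helps to assemble the command line in one place. The sketch below only builds and prints a blastall command using flags from the list above; it does not run anything. The file name 'query.fasta', the database name 'refseq_mrna' and the helper function are hypothetical placeholders.

```python
import shlex

def build_blastall_command(program: str, query_file: str, database: str,
                           **options: str) -> list[str]:
    """Assemble a blastall argument list; extra options become single-letter flags."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command

# Hypothetical query file and database name; -e and -W are from the list above.
command = build_blastall_command("blastn", "query.fasta", "refseq_mrna",
                                 e="1e-5", W="11")
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run on a machine where blastall and a formatted database are actually installed; anything not given explicitly falls back to the program's defaults.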
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default options
Then hit [format] to wait for the results in a new page
(hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database
What do you feel about these alignments?
Re-run but leave the low-complexity filter ON this time
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C
C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware
Results for Exercise 2A (OFF) BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn - low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the sequence
cDNA (117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT
Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section
There are several Xenopus tropicalis cyclins in the examples file
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window
(i) Run the default comparison, which should be BLASTn. Note the alignment
Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties...
What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are ...
The Many Parameters of BLAST
BLAST Parameters Exercises
Results for Exercise 1
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
END
Residual Similarity
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

ATGCATGCTGCCAACGGATGCCCTG
||| | |  ||  | |   | || |
ATGGAAGGCGCTTAGGATAGTCCAG
After longer periods of evolution homology may no longer be detectable in the DNA sequence...
We can still easily detect residual similarity between these sequences: this is what we call homology - detectable similarity because of common evolutionary origin
Computers Can Detect Homology
In fact computers are very good at this task - the two primary challenges are
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist's attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches...
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
Orthologs
Gene duplication through speciation
The two copies of Gene A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
Gene duplication through internal genome duplication
The two copies of Gene A (A and A') will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
[Figure: gene A in the ancestor and in the two frogs, one of which carries both A and A']
The Essential Paradigm
1. any group of modern species can be traced back to some extinct common ancestor
2. in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3. if we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
[Figure: gene A (cyclin b1) traced from the common ancestor to two modern species]
Function Conserved Longer than Detectable Similarity
[Figure: timeline starting from the first self-replicating sequence up to living organisms, marking whole genome duplication and local duplication, and comparing how far back 'same function' and 'detectable similarity' extend]
Redundancy in the Genetic Code
Codons  Letter  Amino acid
GCA GCC GCG GCT  A  alanine
TGC TGT  C  cysteine
GAC GAT  D  aspartate
GGA GGC GGG GGT  G  glycine
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for - so there is no evolutionary pressure against this...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
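A small sketch makes the point with the codons listed above. The two DNA sequences are invented for illustration, and the table covers only the four amino acids shown here, not the full genetic code.

```python
# Partial codon table: only the codons listed above.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}

def translate(dna: str) -> str:
    """Translate a DNA string codon by codon using the partial table."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Two made-up sequences that differ only at third codon positions.
seq_1 = "GCATGCGACGGA"
seq_2 = "GCTTGTGATGGG"

dna_matches = sum(1 for a, b in zip(seq_1, seq_2) if a == b)
print(f"DNA: {dna_matches}/{len(seq_1)} positions identical")
print(f"protein: {translate(seq_1)} vs {translate(seq_2)}")
```

A third of the DNA positions differ, yet the translated sequences are identical, so the protein comparison sees a perfect match where the DNA comparison sees a degraded one.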
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly
Go to NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align' - this will do a direct DNA/DNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences, then repeat the two sequences comparison
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence
Before you hit 'Align' change the 'Program' (top left) to blastp...
Answers Exercise 1
The Essential Task
experiment / data mining
gene sequence: what is its function?
database of proteins in other species
Cyclin-A
FoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity / orthologs
same function
But we know that function is largely determined by shape
similar shape
Which in general we cannot determine - but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in
[Figure: frog protein -> database of human proteins -> best match human protein -> database of frog proteins]
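Once the two BLAST runs have been done, the reciprocal test itself is just a lookup in both directions. The sketch below assumes you have already reduced each search to a 'best hit' table; the protein names are invented and the function name is hypothetical.

```python
def reciprocal_best_hits(best_in_b: dict[str, str],
                         best_in_a: dict[str, str]) -> list[tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is also b's best hit."""
    pairs = []
    for protein_a, protein_b in best_in_b.items():
        if best_in_a.get(protein_b) == protein_a:
            pairs.append((protein_a, protein_b))
    return pairs

# Hypothetical best-hit tables: frog proteins vs human database, and back again.
frog_to_human = {"frog_gravin_like": "human_gravin",
                 "frog_cyclin_x": "human_cyclin_a"}
human_to_frog = {"human_gravin": "frog_gravin_like",
                 "human_cyclin_a": "frog_cyclin_a"}

orthologs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(orthologs)
```

Here only the gravin pair survives: the frog cyclin's best human hit points back to a different frog protein, so it is not reported. This is also why the method needs complete protein sets for both organisms, since a missing sequence can make the wrong protein look like the best hit.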
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of ‘functional’ orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It’s (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value by a factor of 20 (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI
Really too big a discrepancy to easily explain with hand waving…
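Amino Acid Substitutions
Each residue is followed by the residues that can substitute for it (‘-’ = none)
A S     C -     F LWY   G -     I LMV   L IMFV  M ILV   P -     V ILM   W FY
N DHS   Q REHK  S ANT   T S     Y HFW
H NQY   K RQE   R QK
D NE    E DQK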
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of ‘matching’ rather than 1/20th
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
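The two protein estimates above are easy to reproduce. The sketch below is only the back-of-envelope arithmetic from the slides, not how BLAST computes E-values; the function and variable names are invented for illustration.

```python
PROTEIN_DB_LETTERS = 347_895_532  # RefSeq protein letters quoted above
QUERY_LENGTH = 20                 # amino acids in the example query


def chance_matches(db_letters, choices_per_position, length):
    """Database size divided by the number of equally likely words of this length."""
    return db_letters / (choices_per_position ** length)


# Exact matching: 1 in 20 chance per position.
exact = chance_matches(PROTEIN_DB_LETTERS, 20, QUERY_LENGTH)
# Allowing roughly 2 acceptable substitutes per residue: about 1 in 7 per position.
with_substitutions = chance_matches(PROTEIN_DB_LETTERS, 7, QUERY_LENGTH)

print(f"{exact:.1e}")
print(f"{with_substitutions:.1e}")
```

The exact arithmetic gives about 3.3 x 10^-18 for the first figure (the slide rounds 20^20 to 10^26) and about 4.4 x 10^-9 for the second.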
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database
Did you find any ‘significant’ hits?
Repeat with a second sequence
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by –x and followed by its value
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for –W <word size>, blastn uses –W 11 where blastx uses –W 3
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers
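As a rough illustration of how the basic options fit together, the sketch below assembles a ‘blastall’ command line from -p, -i, -d and (optionally) -e. The file name and database name are hypothetical, and the helper function is invented for this example; it only builds and prints the command, it does not run BLAST.

```python
import shlex


def build_blastall_command(program, query_file, database, max_evalue=None):
    """Assemble a 'blastall' command line from the basic options listed above."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    if max_evalue is not None:
        # -e sets the maximum expected value to report
        command += ["-e", str(max_evalue)]
    return command


# Hypothetical query file and database name, for illustration only.
command = build_blastall_command("blastn", "query.fasta", "refseq_mrna", max_evalue=1e-5)
print(shlex.join(command))
```

Keeping the command as a list of separate arguments makes it easy to add further options (for example -W or -F) without worrying about quoting.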
BLAST Parameters Exercises
1 BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default options
Then hit [format] to wait for the results in a new page
(hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates?
2 Low complexity filtering
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box
Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section, and then run BLASTx against the nr database
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and –C
C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware
Results for Exercise 2A (OFF)
BLASTn – low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn – low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the sequence
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
ON
OFF
BLAST Parameters Exercises
3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the ‘Limit by entrez query’ box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT
Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html
Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section
There are several Xenopus tropicalis cyclins in the examples file
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window
(i) Run the default comparison, which should be BLASTn. Note the alignment
Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties…
What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What’s in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
‘Other’-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST – Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what’s the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
BLAST Parameters Exercises
Results for Exercise 1
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
END
Computers Can Detect Homology
In fact, computers are very good at this task – the two primary challenges are:
(a) performing the search fast enough to look through millions of sequences in a timescale compatible with a lab scientist’s attention span
(b) at low levels of similarity, being able to distinguish between biologically related sequences and chance matches…
ATGCATGCTGCCAACGGATGCCCTG
||| | || ||| |||   | ||||
ATGAAAGCCGCCTACGACAGTCCTG

GCTGACTCGTAGCGCTTAGCTAGCT
 |    ||    |   |   |   |
CCAACATCTAGCCAGATTAGTTAGT
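The difference between the two pairs above can be put into numbers by counting matching positions. This is a minimal sketch of an ungapped percent-identity calculation; the function name is invented for illustration, and real programs such as BLAST do considerably more than this.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions that are identical in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length for an ungapped comparison")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


first_pair = percent_identity("ATGCATGCTGCCAACGGATGCCCTG", "ATGAAAGCCGCCTACGACAGTCCTG")
second_pair = percent_identity("GCTGACTCGTAGCGCTTAGCTAGCT", "CCAACATCTAGCCAGATTAGTTAGT")
print(first_pair, second_pair)  # 68.0 28.0
```

With four equally likely bases, two unrelated sequences are expected to agree at about 25% of positions, so 28% is close to chance while 68% is not.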
Orthologs
Gene duplication through speciation
The two copies of Gene A will now evolve independently, but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
Gene duplication through internal genome duplication
The two copies of Gene A (A and A’) will now evolve independently, but will probably not continue to have exactly the same function
They are PARALOGS
‘Other’-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS
The Essential Paradigm
1 Any group of modern species can be traced back to some extinct common ancestor
2 In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
[Figure: gene A in the ancestor and in the two modern species, labelled cyclin b1 in both]
Function Conserved Longer than Detectable Similarity
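[Figure: timeline starting from the first self-replicating sequence and ending at living organisms, with whole genome duplication and local duplication events marked, comparing how far back ‘same function’ and ‘detectable similarity’ extend]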
Redundancy in the Genetic Code
GCA, GCC, GCG, GCT    A    alanine
TGC, TGT              C    cysteine
GAC, GAT              D    aspartate
GGA, GGC, GGG, GGT    G    glycine
‘Synonymous’ or ‘silent’ mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is no evolutionary pressure against them…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences
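A small sketch makes the point concrete. It uses only the codons listed on this slide (so the codon table is deliberately partial), and the two DNA sequences are invented for illustration: they differ at every third position but translate to the same peptide.

```python
# Partial codon table: only the codons shown on the slide above.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna):
    """Translate DNA codon by codon using the partial table above."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Invented example sequences that differ only at third codon positions.
dna_1 = "GCATGCGACGGA"
dna_2 = "GCCTGTGATGGC"

dna_identity = sum(a == b for a, b in zip(dna_1, dna_2)) / len(dna_1)
print(translate(dna_1), translate(dna_2), round(dna_identity, 2))
```

Here the DNA sequences are only about 67% identical, yet the translated peptides are identical, which is the reason for comparing at the protein level.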
Exercise 1 nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, ‘surfeit1’ for frog and fly
Go to the NCBI BLAST home page, then ‘Align two sequences’ (bottom left ‘special’ panel), paste one sequence into each window and hit ‘Align’ – this will do a direct DNA/DNA comparison
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison
Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence
Before you hit ‘Align’, change the ‘Program’ (top left) to blastp…
Answers Exercise 1
The Essential Task
[Figure: experiment / data mining. Gene sequence – what is its function? Database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3. Result: Gravin-like]
We can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation ‘Gravin’ available
Xenopus gene: function unknown
sequence similarity – orthologs – same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in
[Figure: frog protein → database of human proteins → best match human protein → database of frog proteins → x]
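The reciprocal best BLAST idea can be sketched in a few lines once the best hit in each direction is known. The sketch below assumes the two BLAST runs have already been reduced to simple ‘query: best hit’ tables; the protein names and the function name are hypothetical, and no BLAST output parsing is shown.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical best-hit tables, as if summarised from two BLAST runs.
frog_to_human = {"frog_1": "human_A", "frog_2": "human_B", "frog_3": "human_B"}
human_to_frog = {"human_A": "frog_1", "human_B": "frog_2"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

In this toy data frog_3 also hits human_B, but human_B's best frog match is frog_2, so only frog_1/human_A and frog_2/human_B are reported as candidate orthologs.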
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections
These of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments
Clearly, although the alignment has no mismatches, it is not biologically meaningful
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches – and this is the essence of ‘finding gapped alignments’
We want to find the ‘alignment’ between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
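The trade-off between matches and gap costs can be illustrated by scoring two candidate alignments of the same pair of sequences. This is a sketch only: the scoring values (+1 match, -1 mismatch, -2 per gap position) and the sequences are arbitrary assumptions chosen for illustration, not BLAST defaults, and the function scores a given alignment rather than searching for the best one.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# Same two sequences, aligned without and with a single gap.
ungapped = score_alignment("ACGTTACG", "ACGTACGT")
gapped = score_alignment("ACGTTACGT", "ACGT-ACGT")
print(ungapped, gapped)  # 0 6
```

One well-placed gap rescues the alignment here, but because every gap costs something, an alignment cannot be made ‘perfect’ simply by inserting gaps everywhere.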
There are many programs used to find similarities between sequences
They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or ‘database’ of target sequences, looking for the one(s) that match the query sequence the best
How does it work
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is
Expand from Word Based Matches
We ‘instantly’ know which sequences in the database have at least a word length match with our query sequence, and at what relative position
Next the potential alignments are expanded, adding up a score for (total matches – mismatches – gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database – so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11
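The word-lookup step can be sketched with a simple dictionary index. This is a toy illustration of the idea only, not BLAST's actual data structures: the database sequences are invented, and a word size of 4 is used instead of the BLASTn default of 11 so that the example stays small.

```python
from collections import defaultdict


def build_word_index(sequences, word_size):
    """Map every word of length word_size to the (name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (name, query_position, database_position) for every shared word."""
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for name, db_pos in index.get(word, []):
            seeds.append((name, q_pos, db_pos))
    return seeds


# Invented toy database and query.
database = {"seq1": "TTACGTACGGA", "seq2": "GGGGCCCCAAAA"}
index = build_word_index(database, word_size=4)
seeds = find_seeds("ACGTAC", index, word_size=4)
print(seeds)
```

Only seq1 shares any 4-letter word with the query, so only seq1 would go forward to the slower extension step; a related sequence with no shared word at all would be missed entirely.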
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a ‘typical’ weak alignment from BLASTp
In fact, the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also “expect value” or “expectation”
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence (‘ACGTAGACGT’) against a nucleotide database, e.g. the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: occurrences of ‘A’ marked beneath the database sequence as chance matches]
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = ‘ACGTCGA…CTGATTCG’ - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
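The same first-principles arithmetic can be written as a one-line function. This is a sketch of the slide's reasoning only (exact matches, equally likely bases, no gaps); the names are invented for illustration and this is not how BLAST itself computes E-values.

```python
DB_LETTERS = 5.0e8  # approximate size of RefSeq mRNA, as quoted above


def expected_exact_matches(db_letters, query_length, alphabet_size=4):
    """Chance matches expected for one specific word of the given length."""
    return db_letters / (alphabet_size ** query_length)


for length in (1, 2, 3, 60):
    print(length, f"{expected_exact_matches(DB_LETTERS, length):.1e}")
```

For the 60-mer the exact value is a little under 4 x 10^-28; the slide's 5.0 x 10^-28 comes from rounding 4^60 to 10^36.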
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]NTA (N standing for the additional base)
E-value Exercise Answer
ACC[TG]TA
Expect ‘A’ every 4 nt
Expect ‘ACC’ every 4x4x4 = 64 nt
Expect ‘T or G’ every 2nd nt
Expect ‘ACC[TG]’ every 64x2 nt = 128 nt
Expect ‘TA’ every 4x4 = 16 nt
Expect ‘ACC[TG]TA’ every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance
If ACC[TG]NTA is also allowed
The two motifs independently have the same E-value
To allow either means we expect twice as many
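The answer can be checked with a short calculation. The sketch below assumes equally likely bases and, for the longer motif, that the optional extra base may be any nucleotide (an assumption, but one consistent with both motifs having the same expectation); the function name is invented for illustration.

```python
def expected_motif_count(allowed_per_position, sequence_length):
    """Expected chance occurrences, given how many bases are allowed at each position."""
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / 4
    return probability * sequence_length


PROMOTER_LENGTH = 10_000

# ACC[TG]TA: one allowed base everywhere except the [TG] position.
six_mer = expected_motif_count([1, 1, 1, 2, 1, 1], PROMOTER_LENGTH)
# Same motif with an extra position where any of the 4 bases is allowed.
seven_mer = expected_motif_count([1, 1, 1, 2, 4, 1, 1], PROMOTER_LENGTH)

print(round(six_mer, 2), round(seven_mer, 2), round(six_mer + seven_mer, 2))
```

Each form is expected about 4.9 times in 10k by chance, so allowing either roughly doubles the expectation to just under 10.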
E-values Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8)
There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: occurrences of ‘A’ marked beneath the database sequence as chance matches]
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = ‘ACGTCGA…CTGATTCG’ - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger
Why were the values different
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence ‘ACGTACGTACGT’, imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our ‘match’ criteria
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger
E-values Effect of Query Length
BLAST 500 nt sequence against a database (BLASTn)
Get a full length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it’s the same match. Does it mean we are any less sure that this match didn’t occur by chance? The E-value is simply dependent on match length
Why not just use identity
At some levels this is a good question
But consider two very different searches, both of which give a 75% identity match
Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
Orthologs
A A
A Gene duplication though speciation The two copies of Gene
A will now evolve independently but will continue to have the ~same function
They are ORTHOLOGS
Paralogs
Gene duplication through internal genome duplication.
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics...
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species), they would be OUT-PARALOGS.
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
[Figure: gene A in each species, labelled cyclin b1 in both]
Function Conserved Longer than Detectable Similarity
[Figure: timeline starting from the first self-replicating sequence and ending at living organisms, comparing "same function" with "detectable similarity"; whole genome duplication and local duplication are marked]
Redundancy in the Genetic Code
Codon  Letter  Amino acid
GCA    A       alanine
GCC    A
GCG    A
GCT    A
TGC    C       cysteine
TGT    C
GAC    D       aspartate
GAT    D
GGA    G       glycine
GGC    G
GGG    G
GGT    G
'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against them...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
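A minimal sketch of this point, assuming Python and two made-up 12 nt sequences that differ only at third codon positions. The codon table is just the partial one shown on the slide; the sequences and function names are hypothetical.

```python
# Partial codon table from the slide (DNA codon -> one-letter amino acid).
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[c] for c in codons)


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences: every third base differs, nothing else does.
seq1 = "GCATGCGACGGA"
seq2 = "GCCTGTGATGGG"

dna_identity = identity(seq1, seq2)
protein_identity = identity(translate(seq1), translate(seq2))
print(f"DNA identity:     {dna_identity:.0%}")
print(f"protein identity: {protein_identity:.0%}")
```

The DNA comparison loses a third of its matches while the protein comparison loses none, which is the reason for preferring protein-level searches.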
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers Exercise 1
The Essential Task
[Figure: an experiment yields a gene sequence ("what is its function?"); data mining against a database of proteins in other species returns names such as Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3]
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available.
Xenopus gene: function unknown.
Sequence similarity -> orthologs -> same function.
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one in which you wish to find an ortholog.
[Figure: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
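The reciprocal-best-hit logic can be sketched as follows. This is only an illustration in Python: it assumes the two BLAST searches have already been run and summarised as score tables, and all protein names and scores below are invented.

```python
def best_hit(scores):
    """Return the target with the highest score, or None if there are no hits."""
    return max(scores, key=scores.get) if scores else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a, hits in a_vs_b.items():
        b = best_hit(hits)
        if b is not None and best_hit(b_vs_a.get(b, {})) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical scores: frog proteins searched against human proteins ...
frog_vs_human = {
    "frog_cyclinA": {"human_CCNA2": 410.0, "human_CCNB1": 150.0},
    "frog_orphan": {"human_CCNB1": 60.0},
}
# ... and the human proteins searched back against frog proteins.
human_vs_frog = {
    "human_CCNA2": {"frog_cyclinA": 405.0},
    "human_CCNB1": {"frog_cyclinB": 380.0, "frog_orphan": 58.0},
}

orthologs = reciprocal_best_hits(frog_vs_human, human_vs_frog)
print(orthologs)
```

Here 'frog_orphan' finds a human best match, but that human protein prefers a different frog protein on the way back, so the pair is not reported. The check only works if both protein sets are complete, as noted above.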
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
These of course represent chromosomal re-arrangements since these organisms diverged.
[Figure: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
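To see why gaps need a cost, here is a small sketch of a global alignment score computed by dynamic programming. It is not how BLAST itself works; the scoring values (match +1, mismatch -1, gap -2) and the test sequences are arbitrary choices for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score by dynamic programming (linear gap cost)."""
    # prev[j] holds the best score for aligning the current prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# A related pair: one base deleted, so one gap recovers seven matches.
related = global_alignment_score("ACGTACGT", "ACGTCGT")

# An unrelated pair, scored with free gaps and then with costly gaps.
unrelated_free_gaps = global_alignment_score("ACGTACGT", "TTGCA", gap=0)
unrelated_costly_gaps = global_alignment_score("ACGTACGT", "TTGCA", gap=-2)

print(related, unrelated_free_gaps, unrelated_costly_gaps)
```

With free gaps the unrelated pair can always be lined up without a single mismatch and still collects a positive score; once gaps cost something, that score drops below zero while the genuinely related pair still scores well.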
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively more inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
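The word-matching step can be sketched like this. It is a toy illustration of the indexing idea only, not BLAST's actual implementation, and the two-sequence database and the query are made up.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].append((name, i))
    return index


def find_seeds(query, index, word_size):
    """Return (name, query_offset, target_offset) for each exact word match."""
    seeds = []
    for q in range(len(query) - word_size + 1):
        for name, t in index.get(query[q:q + word_size], []):
            seeds.append((name, q, t))
    return seeds


# Hypothetical database: one exact copy of part of the query, and one diverged
# copy that is 14/16 identical but has its mismatches spread out.
database = {
    "exact_region": "TTTTACGTACGTACGTTTTT",
    "diverged": "ACGTACCTACGAACGT",
}
query = "ACGTACGTACGTACGT"

seeds_w11 = find_seeds(query, build_word_index(database, 11), 11)
seeds_w4 = find_seeds(query, build_word_index(database, 4), 4)

print(sorted({name for name, _, _ in seeds_w11}))
print(sorted({name for name, _, _ in seeds_w4}))
```

With a word size of 11 only 'exact_region' produces seeds, so the diverged sequence would never be extended or reported; dropping the word size to 4 finds seeds in both, at the cost of many more candidate alignments to expand.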
Query: RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Match:  F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct: NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: each 'A' in the sequence above is marked]
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' (60-mer)
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
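The same arithmetic as a short Python sketch, assuming (as the slide does) that all four bases are equally likely and that only exact, ungapped matches count. The function name is our own.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of chance exact matches, assuming equally likely letters."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # approximate size quoted above

for k in (1, 2, 3, 60):
    expected = expected_exact_matches(REFSEQ_MRNA_LETTERS, k)
    print(f"{k:>2}-mer: {expected:.1e}")
```

Note that 4^60 is about 1.3 x 10^36, so the unrounded answer for the 60-mer is about 3.8 x 10^-28; the slide rounds 4^60 to 10^36, which gives the same order of magnitude.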
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.
Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt.
Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
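The calculation for the first motif can be checked with a few lines of Python. This sketch assumes equally likely, independent bases and handles only plain bases and bracketed choices such as [TG]; the function names are our own.

```python
import re


def motif_probability(motif):
    """Chance that a random position starts a match to the motif.

    Supports plain bases and bracketed choices such as [TG], assuming the
    four bases are equally likely and independent.
    """
    p = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        choices = len(token) - 2 if token.startswith("[") else 1
        p *= choices / 4
    return p


def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of the given length."""
    return sequence_length * motif_probability(motif)


spacing = 1 / motif_probability("ACC[TG]TA")
sites_per_10k = expected_sites("ACC[TG]TA", 10_000)
print(f"one chance site every {spacing:.0f} nt")
print(f"about {sites_per_10k:.1f} chance sites per 10k")
```

This reproduces the 2048 nt spacing and the roughly 5 chance sites per 10k quoted above.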
E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
[Figure: each 'A' in the sequence above is marked]
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full-length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
Query: CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
Match: ||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
Sbjct: CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
Query: ACGTACGTACGTACGT
Match: ||| || | |||| ||
Sbjct: ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
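As a quick check that the two alignments above really do share the same percent identity, here is a small Python sketch. The sequences are copied from the slide; the function name is our own.

```python
def percent_identity(a, b):
    """Percentage of aligned positions that are identical (ungapped, equal length)."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

id1 = percent_identity(query1, match1)  # 45 of 60 positions agree
id2 = percent_identity(query2, match2)  # 12 of 16 positions agree
print(id1, id2)
```

Both come out at 75%, yet one match is 60 nt long and the other only 16 nt, so identity on its own says nothing about how surprising the match is.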
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
[Figure labels: BLAST; ORTHOLOGY; BLAST similarity + probability; biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database...]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
Scale runs from larger/worse to smaller/better:
E-values:  10^-5    10^-10      10^-40       10^-100      0.0
Verdict:   fantasy  borderline  encouraging  pretty good  can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.3 x 10^-18
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue: acceptable substitutes
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K
In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
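The "about 2 others" and "1/7th" figures can be checked directly from the substitution groups above. This is a back-of-envelope sketch in Python, not how BLAST scores proteins (that uses the full BLOSUM or PAM matrices); the table is transcribed from the slide and the function name is our own.

```python
# Substitution groups transcribed from the slide: residue -> acceptable substitutes.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

mean_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)


def expected_protein_matches(database_letters, length, options_per_position):
    """Expected chance matches if each position matches 1 time in options_per_position."""
    return database_letters / options_per_position ** length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # ~347,895,532 letters, as quoted above

exact_only = expected_protein_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_protein_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"exact matches only:           {exact_only:.1e}")
print(f"allowing substitutions:       {with_substitutions:.1e}")
```

Each residue plus its roughly 2.3 substitutes covers about 3 of the 20 amino acids, hence the 1 in 7 chance per position, and the expected number of chance matches rises by about nine orders of magnitude.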
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In its simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
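As an illustration of what such a command looks like, the sketch below assembles a 'blastall' command line in Python without running it. The file name 'query.fasta' and database name 'refseq_rna' are hypothetical placeholders, and 'blastall' belongs to the older NCBI toolkit (since superseded by the BLAST+ programs), so treat this as a sketch of the option style rather than a recipe.

```python
import shlex


def blastall_command(program, query_file, database, extra_options=None):
    """Build the argument list for a legacy 'blastall' run (nothing is executed)."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        args += [flag, str(value)]
    return args


# Hypothetical query file and database name, for illustration only.
cmd = blastall_command("blastn", "query.fasta", "refseq_rna", {"-e": "1e-10"})
print(shlex.join(cmd))
```

On a machine with the legacy toolkit and a formatted database installed, the resulting list could be handed to subprocess.run(cmd) to perform the search.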
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
[Figure: results with filtering ON and OFF]
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
Match       |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
[Figure: results with filtering ON and OFF]
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So Whatrsquos in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
lsquoOtherrsquo-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST ndashTypical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So whatrsquos the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are hellip
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
Paralogs
A
Gene duplication though internal genome duplication
The two copies of Gene A will now evolve independently but will probably not continue to have exactly the same function
They are PARALOGS
A
A Arsquo
A
lsquoOtherrsquo-logsWhat about gene duplication after speciation
How can we describe the relationship(s) between the various copies of gene A in the two frogs
Bear in mind that understanding gene function is more important than semanticshellip
The two copies of A in the orange frog are sometimes called IN-PARALOGS
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS
A
A
A
Arsquo A
The Essential Paradigm
1 any group of modern species can be traced back to some extinct common ancestor
A
A
2 in all likelihood they share orthologous genes which have the same function in the modern animal as in the extinct ancestor
3 If we can experimentally determine the function of a gene in one of these organisms then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function
A A
cyclin b1
cyclin b1
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
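E-value Exercise
How many occurrences of the motif ACC[TG]TA would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA with one extra base (any nucleotide) inserted after [TG].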
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.
If the variant with the additional base is also allowed:
The two motifs independently have the same E-value (the extra position can be any base, so it does not reduce the chance of a match). To allow either means we expect twice as many.
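The same reasoning can be sketched in a few lines of Python. This is only an illustration under simple assumptions: all four bases are equally likely, only one strand is counted, and edge effects at the ends of the 10k region are ignored. Describing a motif as a list of "number of allowed bases per position" is my own shorthand, not a standard motif format.

```python
def expected_sites(allowed_per_position, region_len):
    """Expected chance occurrences of a motif in a random region.

    allowed_per_position: how many of the 4 bases are accepted at each
    motif position (1 for a fixed base, 2 for [TG], 4 for any base).
    """
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / 4
    return region_len * probability


base_motif = [1, 1, 1, 2, 1, 1]        # A C C [TG] T A
with_extra = [1, 1, 1, 2, 4, 1, 1]     # same motif plus one any-base position

either = expected_sites(base_motif, 10_000) + expected_sites(with_extra, 10_000)
print(expected_sites(base_motif, 10_000), either)
```

Because the extra position accepts any base, it contributes a factor of 1, which is why the two forms have the same expectation and allowing either doubles the total.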
E-values Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(On the slide, chance occurrences of 'A' in this example sequence are marked.)
1-letter query: Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG', a 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
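If the E-value scales in direct proportion to the number of letters searched, as the naive calculation here assumes, then moving a result from one database to another is a single multiplication. The helper below is a hypothetical illustration of that scaling, not a BLAST function.

```python
def rescale_evalue(evalue, old_db_letters, new_db_letters):
    """Rescale a naive E-value from one database size to another."""
    return evalue * new_db_letters / old_db_letters


refseq_letters = 5.0e8   # rounded sizes quoted in the workshop
nr_letters = 1.4e10

size_ratio = nr_letters / refseq_letters
nr_estimate = rescale_evalue(5.0e-28, refseq_letters, nr_letters)
print(f"nr is {size_ratio:.0f}x bigger, E-value becomes {nr_estimate:.1e}")
```

The BLAST parameter for an effective database size does the reverse job: it fixes the size used in the calculation so that E-values from different databases can be compared.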
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST (BLASTn) a 500 nt sequence against a database: get a full length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
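To see that identity alone cannot separate these two cases, here is a small sketch that computes percent identity for the two ungapped alignments shown on this slide. The function is my own illustration and assumes the two strings are already aligned and of equal length.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned positions that carry the same letter."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

for name, q, m in (("Query1", query1, match1), ("Query2", query2, match2)):
    print(name, len(q), "nt", percent_identity(q, m), "% identity")
```

Both alignments come out at the same identity, yet one is 60 nt long and the other only 16 nt, which is exactly the information the E-value adds.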
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
(Diagram labels: BLAST; ORTHOLOGY; BLAST Similarity + Probability; biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database…)
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger/worse <-> smaller/better
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Residue: acceptable substitutes
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (roughly 3 acceptable residues out of 20) rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
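The effect of substitutability on the naive estimate can be sketched as below. This is my own back-of-envelope illustration of the slide's argument, not the statistics BLAST really uses; the 1/20 and 1/7 per-position chances and the rounded database size are the assumptions stated in the text.

```python
def naive_protein_evalue(db_letters, query_len, match_chance):
    """Naive E-value when each position 'matches' with a fixed chance."""
    return db_letters * match_chance ** query_len


refseq_protein_letters = 3.5e8   # rounded from 347,895,532 letters

exact_only = naive_protein_evalue(refseq_protein_letters, 20, 1 / 20)
with_substitutions = naive_protein_evalue(refseq_protein_letters, 20, 1 / 7)

print(f"exact matches only:    {exact_only:.1e}")
print(f"allowing substitutes:  {with_substitutions:.1e}")
print(f"ratio:                 {with_substitutions / exact_only:.1e}")
```

Relaxing the per-position chance from 1/20 to 1/7 raises the estimate by about nine orders of magnitude for a 20-residue query.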
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page, with boxes to place data in and tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, indicated by the -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
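As a minimal sketch of what such a command looks like, the snippet below assembles a blastall command line from the three options just described and prints it rather than running it. The query file name and database name are hypothetical placeholders; substitute whatever exists on your own system.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the simplest blastall argument list: -p, -i and -d only."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# "query.fasta" and "refseq_rna" are made-up example names.
command = blastall_command("blastn", "query.fasta", "refseq_rna")
print(shlex.join(command))
```

Keeping the command as a list of arguments means it could later be handed to subprocess.run without worrying about shell quoting.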
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
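Any of the options in this list can be appended to the basic command in the same "-x value" style. The helper below is a hypothetical sketch that only builds the argument list; the file name, database name and option values are examples I have chosen, not recommended settings.

```python
def blastall_with_options(program, query_file, database, options=None):
    """Build a blastall argument list with extra single-letter options."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (options or {}).items():
        command += [f"-{flag}", str(value)]
    return command


# Example: stricter E-value cut-off, 20 reported sequences, word size 3.
command = blastall_with_options(
    "blastx", "query.fasta", "nr", {"e": "1e-10", "b": 20, "W": 3}
)
print(" ".join(command))
```

Anything not given explicitly is left to the program's defaults, as described above.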
BLAST Parameters Exercises
1 BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page / Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful).
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
(Panels: ON, OFF)
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
(Panels: ON, OFF)
BLAST Parameters Exercises
3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
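If you build such limits often, the query text can be assembled programmatically. The function below is a hypothetical convenience of my own; it only produces the string you would paste into the Limit by entrez query box, using the [ORGN] tag shown above.

```python
def organism_query(organisms, operator="OR"):
    """Join organism names into an Entrez limit using the [ORGN] field tag."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(f"{name}[ORGN]" for name in organisms)


query = organism_query(["Drosophila melanogaster", "Xenopus tropicalis"])
print(query)
```

A single organism gives back just the tagged name, matching the fruit fly example in the text.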
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page / SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
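To get a feel for why the mismatch and gap settings matter, here is a simplified scoring sketch. It is my own illustration of "matches minus mismatches minus gap penalties", not BLAST's real code. The default values (reward 1, mismatch -3, gap open 5, gap extension 2) are assumptions about typical nucleotide settings and should be checked against your own installation.

```python
def alignment_score(aligned_a, aligned_b, reward=1, mismatch=-3,
                    gap_open=5, gap_extend=2):
    """Score a pre-aligned pair of sequences, with '-' marking gaps.

    A gap run of length k costs gap_open + k * gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            if not in_gap:
                score -= gap_open      # pay once to open the gap
            score -= gap_extend        # pay for every gapped position
            in_gap = True
        else:
            score += reward if a == b else mismatch
            in_gap = False
    return score


query = "ACGTACGTACGTACGT"
match = "ACGCACCTTCGTAGGT"   # 12 matches, 4 mismatches
print(alignment_score(query, match))               # strict mismatch penalty
print(alignment_score(query, match, mismatch=-1))  # relaxed mismatch penalty
```

With a gentler mismatch penalty the same weak alignment scores much higher, so more borderline matches survive, which is the trade-off the exercise asks you to explore.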
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are hellip
The Many Parameters of BLAST
BLAST Parameters Exercises
Results for Exercise 1
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
END
'Other'-logs
What about gene duplication after speciation?
How can we describe the relationship(s) between the various copies of gene A in the two frogs?
Bear in mind that understanding gene function is more important than semantics…
The two copies of A in the orange frog are sometimes called IN-PARALOGS.
If they were also present in the green frog (and therefore were in the ancestor species) they would be OUT-PARALOGS.
(Figure labels: copies of gene A, including the duplicate A', in the two frogs.)
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
(Figure labels: gene A in each species; cyclin b1.)
Function Conserved Longer than Detectable Similarity
(Figure labels: start from first self-replicating sequence; same function; detectable similarity; living organisms; whole genome duplication; local duplication.)
Redundancy in the Genetic Code
GCA A alanine
GCC A
GCG A
GCT A
TGC C cysteine
TGT C
GAC D aspartate
GAT D
GGA G glycine
GGC G
GGG G
GGT G
'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for, so there is no evolutionary pressure against this…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
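A tiny sketch makes the point concrete. Using only the codons listed on this slide (so this is deliberately not a full genetic code table), two DNA sequences that differ at every third position still translate to the same peptide. The function and the example sequences are my own.

```python
# Only the codons shown on the slide; a real table has 64 entries.
CODONS = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


dna_1 = "GCAGATGGTTGC"
dna_2 = "GCCGACGGGTGT"   # differs from dna_1 at every third position

differences = sum(a != b for a, b in zip(dna_1, dna_2))
print(differences, "nucleotide differences")
print(translate(dna_1), translate(dna_2))
```

At the DNA level the two sequences are only two-thirds identical, while the protein sequences are identical.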
Exercise 1
nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two sequences comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers Exercise 1
The Essential Task
experiment, data mining
gene sequence: what is its function?
database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3
Gravin-like
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. But it is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and the one you wish to find an ortholog in.
(Diagram: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x)
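The reciprocal step can be sketched as follows, assuming you have already run BLAST in both directions and kept only the top hit for each protein. The dictionaries and the protein names are invented for illustration; how you obtain the best hits is outside this sketch.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Keep pairs where each protein is the other's best hit.

    a_to_b maps each protein in species A to its best hit in species B;
    b_to_a is the same in the opposite direction.
    """
    return {a: b for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Hypothetical best hits from frog-vs-human and human-vs-frog searches.
frog_to_human = {"frog_cyclinA": "human_cyclinA", "frog_geneX": "human_cyclinA"}
human_to_frog = {"human_cyclinA": "frog_cyclinA"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

Here frog_geneX also hits human_cyclinA best, but the human protein does not point back to it, so only the cyclin pair is kept as a candidate ortholog.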
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
(Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2.)
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
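The word-based shortcut can be illustrated with a toy index. This is a much simplified sketch of the idea, assuming exact word matches only; real BLAST differs in many details, and the sequence names and the small word size are invented for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def seed_hits(query, index, word_size):
    """Return (sequence name, query position, database position) seeds."""
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for name, db_pos in index.get(word, []):
            hits.append((name, q_pos, db_pos))
    return hits


database = {"seq1": "TTACGTACGTAA", "seq2": "GGGGCCCC"}
index = build_word_index(database, word_size=4)

print(seed_hits("ACGTACG", index, 4))    # shares several words with seq1
print(seed_hits("ACGAACGA", index, 4))   # similar, but no shared 4-letter word
```

The second query resembles seq1 yet shares no complete word with it, so it produces no seeds at all, which is how an alignment can be missed when the word size is large relative to the spacing of mismatches.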
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B
(Figure panels: ON, OFF)
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
(Figure panels: ON, OFF)
BLAST Parameters Exercises: 3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR, or NOT.
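As a small illustration of combining Entrez terms, the sketch below assembles a query string in Python. The helper names are hypothetical and the second organism is only an example that is not part of the workshop; the printed text is what would be typed into the 'Limit by entrez query' box.

```python
def organism_term(organism):
    """Wrap an organism name in the Entrez organism field tag used on the slide."""
    return f"{organism}[ORGN]"


def combine_terms(terms, operator):
    """Join Entrez terms with a logical operator: AND, OR, or NOT."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR, or NOT")
    return f" {operator} ".join(terms)


fly = organism_term("Drosophila melanogaster")
human = organism_term("Homo sapiens")  # example organism, an assumption for illustration
query = combine_terms([fly, human], "OR")
print(query)
```

Using OR widens the search to either organism; AND between two different organisms would normally match nothing, since a sequence belongs to one organism.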
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises: 4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison; this should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs, or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
Theoretical/Predicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are ...
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
The Essential Paradigm
1. Any group of modern species can be traced back to some extinct common ancestor.
2. In all likelihood they share orthologous genes, which have the same function in the modern animal as in the extinct ancestor.
3. If we can experimentally determine the function of a gene in one of these organisms, then there is a good chance the ORTHOLOGOUS gene in another organism will have the same function.
(Diagram labels: A; cyclin b1)
Function Conserved Longer than Detectable Similarity
(Diagram labels: start from first self-replicating sequence; same function; detectable similarity; living organisms; whole genome duplication; local duplication)
Redundancy in the Genetic Code
Codons            Letter  Amino acid
GCA GCC GCG GCT   A       alanine
TGC TGT           C       cysteine
GAC GAT           D       aspartate
GGA GGC GGG GGT   G       glycine
'Synonymous' or 'silent' mutations, typically in the third position of the codon triplet, have no effect on the amino acid coded for - so there is no evolutionary pressure against them...
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
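To make the point concrete, here is a minimal Python sketch using only the codons listed in the table above. The two DNA strings are made-up examples, and the partial codon table is an assumption for illustration (a real translation needs all 64 codons).

```python
# Partial codon table: only the four amino acids shown on the slide.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",  # alanine
    "TGC": "C", "TGT": "C",                          # cysteine
    "GAC": "D", "GAT": "D",                          # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",  # glycine
}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna_1 = "GCAGGATGCGAC"  # made-up sequence
dna_2 = "GCTGGGTGTGAT"  # same codons with the third base changed

dna_differences = sum(x != y for x, y in zip(dna_1, dna_2))
protein_1 = translate(dna_1)
protein_2 = translate(dna_2)
print(dna_differences, protein_1, protein_2)
```

The two DNA strings differ at every third base, yet both translate to the same four-residue peptide, which is why a protein-level comparison still sees a perfect match.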
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align' - this will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid protein sequences, then repeat the two sequences comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers Exercise 1
The Essential Task
experiment / data mining
gene sequence: what is its function?
database of proteins in other species
Cyclin-A
FoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity / orthologs
same function
But we know that function is largely determined by shape.
similar shape
Which in general we cannot determine - but it is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
frog protein
database of human proteins
best match human protein
database of frog proteins
x
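The reciprocal best hit idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the BLAST results are taken to be already parsed into dictionaries of {query: {subject: E-value}}, and all gene names and E-values below are invented.

```python
def best_hit(hits):
    """Return the subject with the lowest E-value, or None if there are no hits."""
    if not hits:
        return None
    return min(hits, key=hits.get)


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for query, hits in a_vs_b.items():
        candidate = best_hit(hits)
        if candidate is None:
            continue
        if best_hit(b_vs_a.get(candidate, {})) == query:
            pairs.append((query, candidate))
    return pairs


# Invented example data: frog proteins searched against human, and the reverse.
frog_vs_human = {
    "frog_cyclinA": {"human_CCNA2": 1e-120, "human_CCNB1": 1e-40},
    "frog_geneX": {"human_CCNA2": 1e-10},
}
human_vs_frog = {
    "human_CCNA2": {"frog_cyclinA": 1e-118, "frog_geneX": 1e-9},
    "human_CCNB1": {"frog_cyclinA": 1e-38},
}

pairs = reciprocal_best_hits(frog_vs_human, human_vs_frog)
print(pairs)
```

Here frog_geneX finds human_CCNA2 as its best match, but the reverse search prefers frog_cyclinA, so only the cyclin pair is reported as a candidate ortholog pair.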
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
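The trade-off between matches and gap costs can be sketched as a simple scoring function. This is an illustrative Python sketch, not BLAST's own code: the default reward and penalty values are assumptions chosen for the example, and a gap of length k is charged gap_open + k x gap_extend.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two aligned rows of equal length; '-' marks a gap position.

    Simplification: consecutive gap columns count as one gap run, even if the
    gap switches from one row to the other.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            if not in_gap:
                score += gap_open  # charged once per gap run
                in_gap = True
            score += gap_extend    # charged for every gap position
        else:
            in_gap = False
            score += match if x == y else mismatch
    return score


ungapped = score_alignment("ACGT", "ACGT")
gapped = score_alignment("ACGTACGTAC", "ACG--CGTAT")
print(ungapped, gapped)
```

In the second example seven matches are outweighed by one mismatch and a two-position gap, so the alignment scores below zero; making the gap costs smaller makes such alignments look better, which is the effect explored in the parameter exercises.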
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively more inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database - so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
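The word-indexing step can be illustrated with a toy Python sketch. It is a simplification under stated assumptions: exact word matches only, an in-memory dictionary as the index, and invented sequences; real BLAST is considerably more elaborate.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence name, position) where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) for each shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds


database = {"seq1": "TTACGTACGGA", "seq2": "GGGGCCCC"}  # invented sequences
index = build_word_index(database, word_size=4)
seeds = find_seeds("ACGTAC", index, word_size=4)
print(seeds)
```

All three seeds lie on the same diagonal (database position minus query position is 2), which is the 'relative position' from which an alignment would be extended. A query shorter than the word size, or one with no exact word in common with the database, yields no seeds at all and is never extended.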
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA, 272619 sequences, 503566580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3329110 sequences, 14601814750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4800000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(Figure: occurrences of 'A' marked beneath the sequence)
1 letter: Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2 letters: Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3 letters: Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
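The first-principles arithmetic above can be reproduced with a short Python sketch. The function name is hypothetical, and the model is the slide's deliberately naive one: every database letter is treated as a possible start position and every base is equally likely. It is not how BLAST itself computes E-values.

```python
def naive_expected_matches(database_letters, query_length, alphabet_size=4):
    """Chance exact matches: database size divided by alphabet_size ** query_length."""
    return database_letters / alphabet_size ** query_length


refseq_letters = 5.0e8  # rounded RefSeq mRNA size from the slide

one_letter = naive_expected_matches(refseq_letters, 1)
three_letters = naive_expected_matches(refseq_letters, 3)
sixty_mer = naive_expected_matches(refseq_letters, 60)
print(one_letter, three_letters, sixty_mer)
```

Evaluating 4 to the power 60 exactly gives about 1.3 x 10^36 rather than the rounded 10^36, so the computed 60-mer value comes out slightly below 5.0 x 10^-28; the order of magnitude is what matters here.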
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt. Expect 'ACC' every 4x4x4 = 64 nt.
Expect 'T or G' every 2nd nt. Expect 'ACC[TG]' every 64x2 nt = 128 nt.
Expect 'TA' every 4x4 = 16 nt. Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4). We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
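The answer for the single motif ACC[TG]TA can be checked with a small Python sketch. The function name and the list-of-allowed-bases representation are assumptions for illustration; the model assumes equal base frequencies and a single strand, as the slide does.

```python
def motif_chance_probability(allowed_bases_per_position):
    """Probability that a random position matches the motif, with all four bases equally likely."""
    probability = 1.0
    for allowed in allowed_bases_per_position:
        probability *= len(set(allowed)) / 4
    return probability


motif = ["A", "C", "C", "TG", "T", "A"]  # ACC[TG]TA
probability = motif_chance_probability(motif)
spacing = 1 / probability            # average distance between chance matches
expected_in_10k = 10_000 * probability
print(spacing, expected_in_10k)
```

The spacing comes out at 2048 nt, so roughly 5 chance matches are expected in a 10,000 nt sequence, in line with the slide.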
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(Figure: occurrences of 'A' marked beneath the sequence)
1 letter: Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2 letters: Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3 letters: Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
nr: E-value = 1.4e-26
RefSeq: E-value = 5.0e-28
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
database
BLAST 500 nt sequence against a database
BLASTn: Get a full length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 3.0
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
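The identity figure for the two alignments above can be checked directly. The Python sketch below simply counts matching columns; the function name is hypothetical and the sequences are copied from the slide.

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned columns where the two rows carry the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(x == y for x, y in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)
identity2 = percent_identity(query2, match2)
print(len(query1), identity1, len(query2), identity2)
```

Both alignments come out at 75% identity (45 of 60 and 12 of 16 positions), yet only the longer one is unlikely to arise by chance, which is why identity alone is not enough.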
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database...
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger = worse, smaller = better
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its previous value (there are 20 different amino acids). And as there are 347895532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue  Can be substituted by
A        S C
F        L W Y
G        (none listed)
I        L M V
L        I M F V
M        I L V
P        (none listed)
V        I L M
W        F Y
N        D H S
Q        R E H K
S        A N T
T        S
Y        H F W
H        N Q Y
K        R Q E
R        Q K
D        N E
E        D Q K
In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
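The two protein estimates can be compared side by side in Python. As before this is only the slide's naive model, with a hypothetical function name; the 'effective alphabet' of 7 stands in for the roughly 1-in-7 chance of an acceptable residue and is not a parameter of BLAST itself.

```python
def naive_protein_evalue(database_letters, match_length, effective_alphabet):
    """Chance matches: database size divided by effective_alphabet ** match_length."""
    return database_letters / effective_alphabet ** match_length


refseq_protein_letters = 3.5e8  # rounded from 347895532 letters

exact_only = naive_protein_evalue(refseq_protein_letters, 20, 20)
with_substitutions = naive_protein_evalue(refseq_protein_letters, 20, 7)
print(exact_only, with_substitutions)
```

Allowing substitutions raises the estimate by about nine orders of magnitude. Note that evaluating 20 to the power 20 exactly gives about 1.05 x 10^26, so the exact-match figure computes to roughly 3.3 x 10^-18 rather than the rounded 3.5 x 10^-18.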
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In the simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
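A minimal Python sketch of how such a command line could be assembled from the flags listed above. The helper function and the file name query.fasta are hypothetical; the command is only built and printed here, not run, and it assumes the legacy 'blastall' program that these slides describe is what is installed.

```python
def build_blastall_command(program, query_file, database, extra_options=None):
    """Assemble a blastall argument list from -p, -i, -d and any extra flags."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command


# Hypothetical example: BLASTx against nr with an E-value limit and word size 3.
command = build_blastall_command(
    "blastx",
    "query.fasta",  # hypothetical query file name
    "nr",
    {"-e": 1e-5, "-W": 3},
)
print(" ".join(command))
```

Keeping the command as a list of separate arguments makes it straightforward to hand to something like subprocess.run later, without worrying about shell quoting.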
Function Conserved Longer than Detectable Similarity
start from first self-replicating sequence
same function detectable similarity
living organisms
whole genome duplication local duplication
Redundancy in the Genetic Code
GCA A alanine GCC A GCG A GCT A
TGC C cystine TGT C
GAC D aspartate GAT D
GGA G glycine GGC G GGG G GGT G
lsquoSynonymousrsquo or lsquosilentrsquo mutations in the third position of the codon triplets have no effect on the amino acid coded for ndash so there is no evolutionary pressure against thishellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
Example stretch of database sequence (on the original slide each 'A' was marked):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
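The same back-of-envelope arithmetic can be written as a short Python sketch. The function name is made up for this example, and it assumes exact matches only, equal base frequencies and no edge effects, which is the simplified argument used here rather than what BLAST itself computes.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches of a query in a random database."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters, from the database statistics

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2}-letter query: {expected:.1e}")
```

Done exactly, the 60-mer comes out nearer 3.8 x 10^-28 than 5.0 x 10^-28, because 4^60 is about 1.3 x 10^36 rather than the round 10^36 used in the hand calculation. The order of magnitude is the same.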
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get…
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA with one additional base (of any kind) inserted between the 4th and 5th positions.
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.
If ACC[TG]TAA is also allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
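The motif arithmetic above can be checked with a small Python sketch. The function name and the tiny motif syntax it accepts (single bases plus bracketed alternatives such as [TG]) are assumptions for this example; it also assumes all four bases are equally likely.

```python
import re


def chance_spacing(motif):
    """Average number of nucleotides between chance occurrences of a motif."""
    spacing = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        # A bracketed token such as [TG] allows len(token) - 2 different bases.
        allowed = len(token) - 2 if token.startswith("[") else 1
        spacing *= 4 / allowed
    return spacing


spacing = chance_spacing("ACC[TG]TA")
per_10k = 10_000 / spacing
print(spacing, per_10k, 2 * per_10k)
```

This gives one chance site every 2048 nt, so about 4.9 in 10k for one motif, and about 9.8 if a second motif with the same chance frequency is also allowed.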
E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
Example stretch of database sequence (on the original slide each 'A' was marked):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
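To make the point concrete, this short Python sketch computes the percent identity of the two alignments above. The function name is an assumption for the example; the sequences are the Query1 and Query2 pairs as given.

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned columns in which both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Both come out at exactly 75% (45 of 60 and 12 of 16 positions), so identity alone cannot tell a match that is very unlikely by chance from one we would expect to see many times.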
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database…
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger/worse <-> smaller/better
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18 (taking 20^20 as ~10^26)
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
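Residue: residues that can substitute for it
A: S
C: -
F: L, W, Y
G: -
I: L, M, V
L: I, M, F, V
M: I, L, V
P: -
V: I, L, M
W: F, Y
N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W
H: N, Q, Y
K: R, Q, E
R: Q, K
D: N, E
E: D, Q, K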
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
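The effect of substitutability on the estimate can be reproduced with a minimal Python sketch. The dictionary is my transcription of the substitution table in the Amino Acid Substitutions slide, and the function and variable names are assumptions for the example; real BLAST scores come from the BLOSUM or PAM matrices, not from a simple count like this.

```python
# Residue -> residues treated as acceptable substitutes (transcribed from the table).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV", "L": "IMFV",
    "M": "ILV", "P": "", "V": "ILM", "W": "FY", "N": "DHS", "Q": "REHK",
    "S": "ANT", "T": "S", "Y": "HFW", "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}

average_substitutes = sum(len(subs) for subs in SUBSTITUTES.values()) / len(SUBSTITUTES)


def expected_chance_matches(db_letters, query_length, effective_alphabet):
    """Expected chance matches when each position matches with probability 1/effective_alphabet."""
    return db_letters / effective_alphabet ** query_length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # ~3.5 x 10^8 letters

exact_only = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)
print(average_substitutes, f"{exact_only:.1e}", f"{with_substitutions:.1e}")
```

The table averages 2.3 substitutes per residue, which is where 'about 2 others' and so roughly 20/3, or about 7, comes from. Allowing substitutions moves the estimate from around 10^-18 to around 4.4 x 10^-9, some nine orders of magnitude.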
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
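If you drive BLAST from a script rather than by hand, the command can be assembled as a list of arguments. The sketch below only builds and prints the command and does not run anything; the helper name, the query file name 'query.fasta' and the database name 'refseq_mrna' are hypothetical, and only the flags named in this workshop (-p, -i, -d, -W, -b) are used.

```python
def blastall_command(program, query_file, database, **options):
    """Build the argument list for a 'blastall' run; nothing is executed here."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        # Each extra option becomes '-<flag> <value>', e.g. W=11 -> '-W 11'.
        command += ["-" + flag, str(value)]
    return command


command = blastall_command("blastn", "query.fasta", "refseq_mrna", W=11, b=10)
print(" ".join(command))
```

Any parameter left out of the call is simply not passed, so the program falls back on its own default for that option.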
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful).
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn – low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL – Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Redundancy in the Genetic Code
GCA  A  alanine
GCC  A
GCG  A
GCT  A
TGC  C  cysteine
TGT  C
GAC  D  aspartate
GAT  D
GGA  G  glycine
GGC  G
GGG  G
GGT  G
'Synonymous' or 'silent' mutations in the third position of the codon triplets have no effect on the amino acid coded for – so there is no evolutionary pressure against them…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
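A small Python sketch shows the effect. It uses only the twelve codons listed above, and the two DNA sequences are invented for the example so that they differ only at third codon positions.

```python
# Partial codon table: just the codons listed on this slide.
CODON_TABLE = {
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",   # alanine
    "TGC": "C", "TGT": "C",                           # cysteine
    "GAC": "D", "GAT": "D",                           # aspartate
    "GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",   # glycine
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Two made-up sequences that differ at every third codon position.
dna_1 = "GCATGCGACGGA"
dna_2 = "GCTTGTGATGGG"

shared_bases = sum(1 for a, b in zip(dna_1, dna_2) if a == b)
print(translate(dna_1), translate(dna_2), f"{shared_bases}/{len(dna_1)} bases identical")
```

The two sequences share only 8 of 12 nucleotides, yet both translate to the same peptide, ACDG, which is why the protein comparison holds up better than the DNA one.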
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI Blast home page, then 'Align two sequences' (bottom left 'special' panel), paste one sequence into each window and hit 'Align' – this will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder – paste sequence – hit OrfFind – identify longest ORF – click on it – next screen hit Accept – change View to Fasta protein – hit View – copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers Exercise 1
The Essential Task
experiment, data mining
gene sequence: what is its function?
database of proteins in other species
Cyclin-A
FoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known, annotation 'Gravin' available
Human gene
Xenopus gene
function unknown
sequence similarity
orthologs
same function
But we know that function is largely determined by shape
similar shape
Which in general we cannot determine – but it is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
frog protein
database of human proteins
best match human protein
database of frog proteins
x
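The reciprocal test itself is simple once the two sets of BLAST searches have been run. The Python sketch below assumes you have already reduced each search to a 'best hit' table; the function name and all the protein identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep the pairs (a, b) where b is a's best hit and a is also b's best hit."""
    return {a: b for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a}


# Hypothetical best-hit tables, e.g. parsed from two sets of BLASTp results.
frog_to_human = {
    "frog_protein_1": "human_protein_A",
    "frog_protein_2": "human_protein_B",
}
human_to_frog = {
    "human_protein_A": "frog_protein_1",   # points back: reciprocal
    "human_protein_B": "frog_protein_3",   # points elsewhere: not reciprocal
}

candidate_orthologs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(candidate_orthologs)
```

Only the first pair survives, because the human protein's best frog match is the protein we started from. Pairs like this are candidate orthologs, not proof of orthology.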
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutations, and they can really mess up our simple alignments.
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
2. Low-complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B: filter ON vs OFF
Results for Exercise 2C: filter ON vs OFF
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
match       |||||||||||||||| ||||||||||||||
genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
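If you build these queries often, a tiny helper keeps the syntax straight. This is only a sketch: the function names are invented for illustration, and it does nothing more than assemble the text you would paste into the "Limit by entrez query" box, using the [ORGN] field and the AND/OR/NOT operators mentioned above.

```python
def organism(name):
    """Entrez organism term, e.g. 'Drosophila melanogaster[ORGN]'."""
    return f"{name}[ORGN]"


def combine(operator, *terms):
    """Join terms with AND or OR, bracketing the result so it can be nested."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return "(" + f" {operator} ".join(terms) + ")"


def exclude(base, unwanted):
    """Return 'base NOT unwanted'."""
    return f"{base} NOT {unwanted}"


fly = organism("Drosophila melanogaster")
human = organism("Homo sapiens")
print(fly)
print(combine("OR", fly, human))
print(exclude(organism("Drosophila"), fly))
```

The last line would limit a search to Drosophila species other than D. melanogaster.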
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL section, "Align two sequences".
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.
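A small Python sketch shows how far two coding sequences can drift apart at the DNA level while still encoding the same peptide. The two sequences are made up for illustration; the codon table is the standard genetic code written out in the usual TCAG order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring a trailing partial codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Two made-up coding sequences that use different synonymous codons.
dna_1 = "ATGCTGAGCCGTGGAAAA"
dna_2 = "ATGTTATCACGCGGTAAG"

differences = sum(x != y for x, y in zip(dna_1, dna_2))
print(f"{differences} of {len(dna_1)} nucleotides differ")
print(translate(dna_1), translate(dna_2))
```

Nearly half the nucleotides differ here, yet both sequences translate to the same six residues, which is why a protein-level comparison can still find what a DNA-level comparison misses.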
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI BLAST home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder, paste the sequence, hit OrfFind, identify the longest ORF and click on it. On the next screen hit Accept, change View to Fasta protein, hit View, and copy the sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp…
Answers: Exercise 1
The Essential Task
experiment, data mining: gene sequence, what is its function?
[Diagram: the gene sequence is compared against a database of proteins in other species (Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3); 'Gravin-like' is picked out as the match]
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity, orthologs, same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make the assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the ones most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
[Diagram: frog protein, BLAST against database of human proteins, best match human protein, BLAST against database of frog proteins, x]
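The reciprocal test itself is simple once the two BLAST runs are done. Here is a minimal sketch, assuming you have already reduced each run to a table of "best hit for each query"; the protein names are placeholders, not real results.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best-hit tables from the two searches.
frog_to_human = {"frog_P1": "human_H1", "frog_P2": "human_H2", "frog_P3": "human_H2"}
human_to_frog = {"human_H1": "frog_P1", "human_H2": "frog_P3"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Note that frog_P2 also hits human_H2 but is dropped, because human_H2 points back to frog_P3. That is exactly the situation in which a one-way best hit would mislead us.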
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. Where a region maps to different chromosomes in the two species, this of course represents chromosomal rearrangements since these organisms diverged.
[Figure: human chromosome 5 compared with mouse chromosome 10 and mouse chromosome 2]
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as the best top hit, and have a look at the Metazome alignment(s)…
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although an alignment padded out with enough gaps can have no mismatches at all, it is not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
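To see the trade-off in numbers, here is a minimal scoring sketch. The default values (match +1, mismatch -3, gap open 5, gap extend 2) and the convention that a gap of k positions costs gap_open + k x gap_extend are assumptions chosen for illustration, and for simplicity adjacent gaps in either row are treated as one run. The first pair of rows is the cDNA/genomic alignment from Exercise 2C; the second pair is made up.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Score two equal-length aligned rows in which '-' marks a gap.

    A run of k gap characters costs gap_open + k * gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must be the same length")
    score = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            if not in_gap:
                score -= gap_open      # pay once to open the gap
            score -= gap_extend        # pay for every gapped position
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# One well-supported gap in a long run of matches.
print(score_alignment("AGAAAAGAAGAAACATGGCAATGGATCAGAA",
                      "AGAAAAGAAGAAACAT-GCAATGGATCAGAA"))
# A 'no mismatches' alignment that only works by inserting gaps everywhere.
print(score_alignment("A-C-G-T", "ATCAGGT"))
print(score_alignment("A-C-G-T", "ATCAGGT", gap_open=0, gap_extend=0))
```

With gaps costing nothing the second alignment looks perfect; once gaps carry a cost its score goes negative, while the single real indel in the first alignment barely dents an otherwise good match.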
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively more inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
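The word-lookup step can be sketched in a few lines of Python. This is a toy version of the idea, not BLAST's actual implementation: it indexes every word in one target sequence and looks up each query word in that index. The two sequences are the 16 nt, 75% identical pair used as Query2 later in this workshop.

```python
from collections import defaultdict


def index_words(sequence, word_size):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - word_size + 1):
        index[sequence[i:i + word_size]].append(i)
    return index


def find_seeds(query, target, word_size):
    """Return (query_pos, target_pos) pairs where an identical word starts."""
    target_index = index_words(target, word_size)
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for t_pos in target_index.get(word, []):
            seeds.append((q_pos, t_pos))
    return seeds


query = "ACGTACGTACGTACGT"
target = "ACGCACCTTCGTAGGT"   # 75% identical to the query

print(find_seeds(query, target, word_size=11))       # no shared 11-mer, so no seed
print(len(find_seeds(query, target, word_size=3)))   # plenty of 3-mer seeds
```

With a word size of 11 this pair would never even be considered for extension, whereas a small word size finds seeds everywhere (and costs more time sorting out the chance ones).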
Here is a 'typical' weak alignment from BLASTp:
Query: RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Match:  F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct: NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
Also "expect value" or "expectation".
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA, 272,619 sequences, 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr, 3,329,110 sequences, 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4800000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: a stretch of database sequence, CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA, with the occurrences of the single-letter query 'A' marked underneath]
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
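The same back-of-envelope sum as a Python sketch. The assumptions are the ones made on the slide: every letter is equally likely and independent, and only exact, ungapped matches count. The function name is my own.

```python
def expected_exact_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of chance exact matches to a query in random sequence.

    Assumes all letters are equally likely and independent, and ignores
    edge effects, gaps and mismatches.
    """
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # 503,566,580 letters, rounded as in the text

for length in (1, 2, 3, 60):
    print(length, f"{expected_exact_matches(REFSEQ_MRNA_LETTERS, length):.1e}")
```

Strictly, 4^60 is about 1.3 x 10^36 rather than 10^36, so the unrounded figure for the 60-mer is nearer 3.8 x 10^-28; rounding 4^60 to 10^36 gives the 5.0 x 10^-28 quoted above. At this level of approximation the difference is not worth worrying about.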
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10 kb promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10 kb by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
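The arithmetic in this answer generalises to any motif written with plain letters and bracketed alternatives. Here is a small sketch (function names are my own) that assumes random sequence with equal base frequencies and counts one strand only.

```python
import re


def motif_probability(motif):
    """Chance that a random position matches a motif such as 'ACC[TG]TA'.

    Each plain letter matches with probability 1/4; a bracketed class such
    as [TG] matches with probability (number of letters in the class) / 4.
    """
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        allowed = len(token.strip("[]"))
        probability *= allowed / 4
    return probability


def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences on one strand of random sequence."""
    return sequence_length * motif_probability(motif)


print(1 / motif_probability("ACC[TG]TA"))    # average spacing between chance sites, in nt
print(expected_sites("ACC[TG]TA", 10_000))   # chance sites in a 10 kb promoter
```

This gives one site per 2048 nt and a little under 5 sites per 10 kb, in line with the worked answer. Allowing a second, equally likely motif simply adds its expected count to the first.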
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (it was RefSeq, with 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the same stretch of database sequence as before, with the occurrences of 'A' marked]
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value scales directly with database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
nr: E-value = 1.4e-26
RefSeq: E-value = 5.0e-28
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query, get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
Query1: CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
Match:  ||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
Sbjct:  CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19.
And Query2 only 16 nt long:
Query2: ACGTACGTACGTACGT
Match:  ||| || | |||| ||
Sbjct:  ACGCACCTTCGTAGGT
Which would have an E-value ~ 3.0.
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
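It is easy to check that the two alignments really do share the same percent identity. This short sketch counts identical positions in each pair of aligned rows, using the sequences shown above.

```python
def identity_summary(row_a, row_b):
    """Return (identical positions, alignment length, percent identity)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must be the same length")
    identical = sum(x == y for x, y in zip(row_a, row_b))
    return identical, len(row_a), 100.0 * identical / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(identity_summary(query1, match1))
print(identity_summary(query2, match2))
```

Both come out at 75% (45 of 60 and 12 of 16), so identity on its own cannot tell the convincing match from the one we would expect to see by chance; the length of the match has to come into it as well.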
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
The difficulty is because:
[Diagram: BLAST gives similarity + probability (match length, PI, size of database…); ORTHOLOGY involves biological knowledge, the nature of the query sequence and the phylogenetic relationship]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger = worse, smaller = better
E-value: 10^-5    10^-10      10^-40       10^-100      0.0
Verdict: fantasy  borderline  encouraging  pretty good  can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Residue: can be substituted by
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching', rather than 1/20th.
So now we get:
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
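Both steps of this argument can be checked with a short sketch. The substitution groups below are my reading of the table on the slide, so treat them as an assumption; the database size is the RefSeq protein figure quoted earlier, and the 1-in-7 match chance is the slide's own rounding.

```python
# Substitution groups as read from the slide's table (an assumption).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV", "L": "IMFV",
    "M": "ILV", "P": "", "V": "ILM", "W": "FY", "N": "DHS", "Q": "REHK",
    "S": "ANT", "T": "S", "Y": "HFW", "H": "NQY", "K": "RQE", "R": "QK",
    "D": "NE", "E": "DQK",
}


def expected_matches(database_letters, query_length, choices_per_position):
    """Chance matches expected if each position matches with probability
    1/choices_per_position (independent positions, no gaps)."""
    return database_letters / choices_per_position ** query_length


average_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)
print(f"average substitutes per residue: {average_substitutes:.1f}")

REFSEQ_PROTEIN_LETTERS = 347_895_532
exact = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)    # identical residues only
relaxed = expected_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)   # about 1 in 7 counts as a match
print(f"exact: {exact:.1e}   with substitutions: {relaxed:.1e}")
```

With roughly 2 acceptable substitutes per residue there are about 3 acceptable letters out of 20 at each position, which is where the figure of about 1 in 7 comes from. Over 20 positions that loosening raises the expected number of chance matches by around nine orders of magnitude.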
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
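If you would like more random sequences to try than the 20 provided, they are easy to make. This sketch assumes equal base frequencies; the sequence length, the seed and the name 'random-1' are arbitrary choices for illustration, and it is not necessarily how the workshop's file was generated.

```python
import random


def random_dna(length, seed=None):
    """Return a random nucleotide string with A, C, G and T equally likely."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def to_fasta(name, sequence, width=60):
    """Format a sequence as FASTA with a '>name' definition line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)


print(to_fasta("random-1", random_dna(300, seed=1)))
```

Passing a seed makes the sequence reproducible, which helps if you want to compare results with a neighbour.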
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g. a call of the form
blastall -p <which blast flavour to run> -i <file with query sequence in> -d <pre-indexed database name>
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
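As a sketch of how such a command line is put together, the following Python assembles the argument list for the legacy blastall program described here. The file name and database name are hypothetical, and nothing is actually run; the extra options shown (-e and -W) are ones from the parameter list in this workshop.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a legacy 'blastall' argument list; nothing is executed here."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        args += ["-" + flag, str(value)]
    return args


# Hypothetical file and database names.
cmd = blastall_command("blastn", "query.fasta", "refseq_mrna", e=1e-10, W=11)
print(shlex.join(cmd))
```

Any option left out of the call simply takes BLAST's own default for that flavour.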
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
The DNA sequence can change while the amino acid sequence stays the same so always look for similarities by comparing amino acid sequences
Exercise 1nucleotide vs amino acid search
Go to the file example-sequenceshtml and locate the section for this exercise There should be two sequences lsquosurfeit1rsquo for frog and fly
Go to NCBI Blast home page then lsquoAlign two sequencesrsquo (bottom left lsquospecialrsquo panel) paste one sequence into each window and hit lsquoAlignrsquo ndash this will do a direct DNADNA comparison
Now find the open reading frames of the two genes and translate them into amino acid protein sequences then repeat the two sequences comparison
Go to NCBI ORF Finder ndash paste sequence ndash hit OrfFind ndash identify longest ORF ndash click on it ndash next screen hit Accept ndash change View to Fasta protein ndash hit View ndash copy sequence to Blast2Seqs Do the same with the other sequence
Before you hit lsquoAlignrsquo change the lsquoProgramrsquo (top left) to blastphellip
Answers Exercise 1
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values: Effect of Query Length
BLAST (BLASTn) a 500 nt sequence against a database: get a full-length match with sequence XYZ at an E-value = 5.0e-160
Search again with a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance?
The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
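To make the point concrete, the two alignments above can be compared position by position. The following Python sketch (the function name is illustrative) simply counts identical positions in two equal-length, ungapped rows; it says nothing about significance, which is exactly the limitation of using identity alone.

```python
def percent_identity(row_a, row_b):
    """Percentage of positions that are identical in two equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    matches = sum(a == b for a, b in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)

query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))   # 60 nt, 45 matches
print(len(query2), percent_identity(query2, match2))   # 16 nt, 12 matches
```

Both calls report the same identity even though one alignment is nearly four times longer, so identity on its own cannot separate a meaningful match from a chance one.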
So what's the real problem?
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because:
ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship
BLAST: Similarity + Probability (match length, PI, size of database...)
Are there any useful guidelines though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-value scale, from larger (worse) to smaller (better): 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale, in the same direction: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = (~3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI:
Really too big a discrepancy to easily explain with hand waving...
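Amino Acid Substitutions
Residue: acceptable substitutes
A: S
C: (none)
F: L, W, Y
G: (none)
I: L, M, V
L: I, M, F, V
M: I, L, V
P: (none)
V: I, L, M
W: F, Y
N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W
H: N, Q, Y
K: R, Q, E
R: Q, K
D: N, E
E: D, Q, K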
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = (~3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
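The two protein estimates can be reproduced with a short Python sketch. It uses the database size quoted above and treats each position as an independent 1-in-20 (exact match) or roughly 1-in-7 (substitutions tolerated) event; the function and variable names are illustrative, and this is the slides' back-of-envelope model, not how BLAST computes its statistics.

```python
def expected_chance_matches(db_letters, match_length, choices_per_position):
    """Naive expected count of chance matches of a given length."""
    return db_letters / choices_per_position ** match_length

REFSEQ_PROTEIN_LETTERS = 347_895_532   # letters quoted for RefSeq protein

exact_only = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = expected_chance_matches(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact 20-mer match:       {exact_only:.1e}")
print(f"allowing ~2 substitutes:  {with_substitutions:.1e}")
print(f"difference: about {with_substitutions / exact_only:.0e}-fold")
```

The jump of roughly nine orders of magnitude comes entirely from changing 20 to 7 in the per-position term, which shows how strongly the tolerated substitutions drive the result.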
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In its simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a dash and a letter (-x) and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
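As an illustration of how these options fit together, here is a small Python sketch that only assembles a 'blastall' command line; it does not run anything. The file name query.fasta and the database name refseq_rna are hypothetical placeholders, and the helper function is not part of BLAST. The flags used (-p, -i, -d, -e, -W) are the ones described in these slides.

```python
import shlex

def build_blastall_command(program, query_file, database, **options):
    """Assemble an argument list for blastall.

    Extra keyword arguments become single-letter flags, e.g. W=11 -> '-W 11'.
    """
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv

# Hypothetical file and database names, for illustration only
cmd = build_blastall_command("blastn", "query.fasta", "refseq_rna",
                             e="1e-10", W=11)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each option and its value together, which makes it easy to add or drop parameters one at a time when experimenting.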
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options.
Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low complexity filtering
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C.
C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
(panels: filter ON, filter OFF)
Results for Exercise 2C
(panels: filter ON, filter OFF)
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
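Composing such a query by hand is easy to get wrong once several terms are involved, so here is a minimal Python sketch that builds the string. The helper names are illustrative and the second organism is just an example; only the '[ORGN]' field tag and the AND/OR/NOT operators come from the text above.

```python
def organism_term(name):
    """Wrap an organism name in the Entrez organism field tag."""
    return f"{name}[ORGN]"

def combine(terms, operator="OR"):
    """Join Entrez terms with one logical operator (AND, OR or NOT)."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)

query = combine([organism_term("Drosophila melanogaster"),
                 organism_term("Xenopus tropicalis")])
print(query)
```

The resulting text is what would be pasted into the "Limit by entrez query" box.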
Open the file example-sequences.html
Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment.
Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties...
What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
Theoretical/Predicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are hellip
The Many Parameters of BLAST
BLAST Parameters Exercises
Results for Exercise 1
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
END
Exercise 1: nucleotide vs amino acid search
Go to the file example-sequences.html and locate the section for this exercise. There should be two sequences, 'surfeit1' for frog and fly.
Go to the NCBI Blast home page, then 'Align two sequences' (bottom left, 'special' panel), paste one sequence into each window and hit 'Align'. This will do a direct DNA/DNA comparison.
Now find the open reading frames of the two genes and translate them into amino acid (protein) sequences, then repeat the two-sequence comparison.
Go to NCBI ORF Finder - paste sequence - hit OrfFind - identify longest ORF - click on it - on the next screen hit Accept - change View to Fasta protein - hit View - copy sequence to Blast2Seqs. Do the same with the other sequence.
Before you hit 'Align', change the 'Program' (top left) to blastp...
Answers Exercise 1
The Essential Task
experiment, data mining
gene sequence: what is its function?
database of proteins in other species: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3
match: Gravin-like
We can only do this because of implied function based on orthology.
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the ones that have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
frog protein -> BLAST against database of human proteins -> best match human protein -> BLAST against database of frog proteins -> is the best match the original frog protein?
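The reciprocal test can be written down compactly. The Python sketch below assumes the two BLAST searches have already been run and reduced to "best hit" tables; the protein identifiers are made up for illustration, and the function is not part of any BLAST distribution.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    return {a: b for a, b in best_a_to_b.items()
            if best_b_to_a.get(b) == a}

# Hypothetical best-hit tables from frog->human and human->frog searches
frog_to_human = {"frog_gravin": "human_gravin", "frog_p2": "human_q7"}
human_to_frog = {"human_gravin": "frog_gravin", "human_q7": "frog_p9"}

pairs = reciprocal_best_hits(frog_to_human, human_to_frog)
print(pairs)
```

In this toy data frog_p2 is dropped because its best human match points back to a different frog protein, which is the situation the reciprocal check is designed to catch. It also shows why complete protein sets for both organisms matter: a missing sequence changes which hit is "best".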
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
These, of course, represent chromosomal re-arrangements since these organisms diverged.
(diagram: Human chromosome 5, Mouse chromosome 10, Mouse chromosome 2)
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
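The trade-off between matches and gap costs can be sketched in Python by scoring an alignment that has already been written out with '-' for gaps. The scoring values used here (match +1, mismatch -3, gap open 5, gap extend 2) are illustrative assumptions, not defaults quoted in these slides, and the function is a toy: it scores a given alignment rather than searching for the best one.

```python
def alignment_score(row_a, row_b, match=1, mismatch=3, gap_open=5, gap_extend=2):
    """Score two aligned rows of equal length; '-' marks a gap.

    A run of k gap columns costs gap_open + k * gap_extend. Adjacent gap
    columns are treated as one run even if they switch rows (a simplification).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score -= gap_extend if in_gap else gap_open + gap_extend
            in_gap = True
        else:
            score += match if a == b else -mismatch
            in_gap = False
    return score

gapped = alignment_score("ACGTACGTACGT", "ACGTAC--ACGT")
free_gaps = alignment_score("ACGTACGTACGT", "ACGTAC--ACGT",
                            gap_open=0, gap_extend=0)
print(gapped, free_gaps)
```

With free gaps the alignment keeps the full credit for its ten matched columns, which is why unpenalised gaps let almost anything align; the gap costs pull the score back down so that gaps are only accepted when the surrounding matches justify them.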
There are many programs used to find similarities between sequences.
They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
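The word-lookup idea, and the way it can miss a real but patchy match, can be illustrated with a toy Python index. This is a sketch of the general technique only: the function names and the tiny example database are invented, and it is not a description of how BLAST is actually implemented.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index

def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) for each shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds

# Made-up database: short words give plenty of seeds to extend from
database = {"seq1": "TTACGTACGTAA", "seq2": "GGGGCCCCGGGG"}
seeds = find_seeds("ACGTACG", build_word_index(database, 4), 4)
print(len(seeds), seeds[0])

# A 16 nt pair that is 75% identical but shares no 11-letter word
target = {"hit": "ACGCACCTTCGTAGGT"}
missed = find_seeds("ACGTACGTACGTACGT", build_word_index(target, 11), 11)
print(missed)
```

The second search returns no seeds even though three quarters of the positions agree, because the mismatches are spread out so that no run of 11 identical letters exists. A smaller word size would find it, at the cost of many more seeds to extend.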
Here is a 'typical' weak alignment from BLASTp:
Query: RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
Match:  F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct: NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
Also "expect value" or "expectation"
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(on the slide, each occurrence of the letter 'A' is marked beneath this sequence)
1-letter query: Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many would you expect to find by chance in a 10 kb promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Answers Exercise 1
The Essential Task
[Diagram: an experiment or data mining gives a gene sequence (what is its function?), which is searched against a database of proteins in other species. Example labels in the diagram: Cyclin-A, FoxA1, cdc25, alpha-tubulin, Predicted protein, Gravin-like, Sprouty-2, calmodulin, KIAA10786568, frizzled, Wint8, Troponin T3, Gravin-like]
We can only do this because of implied function based on orthology.
Functional Orthologs
[Diagram: a human gene (function known, annotation 'Gravin' available) and a Xenopus gene (function unknown) are linked by sequence similarity as orthologs, leading to: same function, similar shape]
But we know that function is largely determined by shape, which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
[Diagram: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
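The reciprocal check in the diagram can be sketched in a few lines of Python. This is only an illustration, assuming the two BLAST searches have already been run and reduced to tables of scores; the function names, protein names and scores below are all made up for the example.

```python
def best_hit(hits):
    """Return the subject with the highest score from {subject: score}, or None."""
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(forward, reverse):
    """forward: {frog_protein: {human_protein: score}}; reverse: the opposite search.

    Returns the (frog, human) pairs where each is the other's best-scoring match.
    """
    pairs = []
    for frog, hits in forward.items():
        human = best_hit(hits)
        if human is not None and best_hit(reverse.get(human, {})) == frog:
            pairs.append((frog, human))
    return pairs


# Toy scores, invented for illustration (not real BLAST output).
forward = {"frogA": {"humanA": 250.0, "humanB": 90.0}, "frogB": {"humanA": 120.0}}
reverse = {"humanA": {"frogA": 245.0, "frogB": 118.0}, "humanB": {"frogA": 88.0}}
rbh = reciprocal_best_hits(forward, reverse)
print(rbh)
```

Here frogB's best human match is humanA, but humanA's best frog match is frogA, so frogB is not reported as a reciprocal best hit.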
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections.
[Diagram: Human chromosome 5 compared with Mouse chromosome 10 and Mouse chromosome 2]
These of course represent chromosomal re-arrangements since these organisms diverged.
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable vertebrate orthologs from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as the top hit, and have a look at the Metazome alignment(s)...
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
If gaps cost nothing, though, we can build an alignment that has no mismatches at all and is still obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the alignment between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
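To make the trade-off concrete, here is a minimal scoring sketch. The match, mismatch and gap values are arbitrary assumptions chosen for the example, not BLAST's defaults, and a single flat cost per gap position is used rather than separate opening and extension costs.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two equal-length alignment rows; '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap        # pay the gap cost
        elif a == b:
            score += match      # reward a match
        else:
            score += mismatch   # penalise a mismatch
    return score


# Same two sequences, aligned two ways.
ungapped = alignment_score("ACGTACGT", "ACGAACGT")    # 7 matches, 1 mismatch
gapped = alignment_score("ACGT-ACGT", "ACG-AACGT")    # 7 matches, 2 gaps, no mismatch
print(ungapped, gapped)
```

With these costs the mismatch-free gapped alignment scores lower than the ungapped one, which is the behaviour the gap cost is meant to produce.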
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection, or 'database', of target sequences, looking for the one(s) that match the query sequence best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
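The word-lookup step can be sketched as below. This is a toy illustration of the indexing idea only, not BLAST's actual implementation: the database, the query and the word size of 4 (rather than 11) are assumptions made to keep the example small, and the extension step is left out.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].append((name, i))
    return index


def find_seeds(query, index, word_size):
    """Return (name, query_offset, db_offset) for every exact word shared with the query."""
    seeds = []
    for q in range(len(query) - word_size + 1):
        for name, d in index.get(query[q:q + word_size], []):
            seeds.append((name, q, d))
    return seeds


# Toy database, indexed once in advance; only sequences with a seed get extended.
database = {"seq1": "TTACGTACGTTAGC", "seq2": "GGGGCCCCAAAA"}
index = build_word_index(database, word_size=4)
seeds = find_seeds("CGTACG", index, word_size=4)
print(seeds)
```

A database sequence that shares no complete word with the query (seq2 here) is never looked at again, which is both why the search is fast and why some real alignments can be missed.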
Here is a 'typical' weak alignment from BLASTp (the column spacing of the middle match line was lost in extraction):
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
Also "expect value" or "expectation".
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with each occurrence of 'A' marked beneath it]
1-nt query: Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-nt query: Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-nt query: Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG', a 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
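The arithmetic above is easy to reproduce. The sketch below uses the same simplifying assumptions as the slide (exact matches only, all four bases equally likely, no gaps); the function name is just a label for this example and this is not how BLAST itself computes E-values.

```python
def naive_evalue(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches: database letters / alphabet_size ** length."""
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size quoted above

for n in (1, 2, 3, 60):
    print(n, f"{naive_evalue(REFSEQ_MRNA_LETTERS, n):.2e}")
```

Note that 4^60 is about 1.3 x 10^36, so the unrounded figure for the 60-mer is nearer 3.8 x 10^-28; the slide's 5.0 x 10^-28 comes from rounding 4^60 down to 10^36, which makes no practical difference at this scale.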
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
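The same calculation can be written as a small function. This is a sketch under the slide's assumptions (equally likely, independent bases); the function name and the restriction to plain bases plus [..] choice classes are choices made for this example.

```python
import re


def expected_spacing(motif):
    """Average number of nt between chance occurrences of a motif.

    Each plain base multiplies the spacing by 4; a class such as [TG]
    multiplies it by 4 / (number of allowed bases).
    """
    spacing = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        choices = len(token) - 2 if token.startswith("[") else 1
        spacing *= 4 / choices
    return spacing


spacing = expected_spacing("ACC[TG]TA")
expected_in_10k = 10_000 / spacing
print(spacing, round(expected_in_10k, 2))
```

Allowing a second, equally likely motif as an alternative simply doubles the expected count.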
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq mRNA and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with each occurrence of 'A' marked beneath it]
1-nt query: Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-nt query: Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-nt query: Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG', a 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
nr: E-value = 1.4e-26
RefSeq: E-value = 5.0e-28
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
[Diagram: BLAST a 500 nt sequence against a database with BLASTn]
Get a full-length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
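A quick check confirms that the two alignments really do have the same percent identity, even though their E-values differ enormously. The helper below is only an illustration for ungapped rows of equal length; its name is made up for this example.

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned positions that are identical (ungapped, equal-length rows)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    matches = sum(a == b for a, b in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)  # 45 of 60 positions
identity2 = percent_identity(query2, match2)  # 12 of 16 positions
print(identity1, identity2)
```

Identity alone says nothing about how long the match is, which is exactly the information the E-value adds.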
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
[Diagram: BLAST gives Similarity + Probability. ORTHOLOGY also involves biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database...]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
[Scale running from larger/worse to smaller/better]
E-values along the scale: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale, in the same direction: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expect values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~(3.5 x 10^8) / 10^26 = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI:
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue: residues that can substitute for it ('-' means none)
A: S | C: - | F: LWY | G: - | I: LMV | L: IMFV | M: ILV | P: - | V: ILM | W: FY
N: DHS | Q: REHK | S: ANT | T: S | Y: HFW
H: NQY | K: RQE | R: QK
D: NE | E: DQK
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9, which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
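The two protein estimates can be reproduced as follows. The dictionary is one reading of the substitution groups listed on the slide (an interpretation of a flattened table, so treat it as an assumption), and the 'effective alphabet' of 7 is the slide's own rough figure of 20 / (1 + ~2 substitutes), not a value taken from a BLOSUM or PAM matrix.

```python
# Residue -> residues treated as acceptable substitutes (as read from the slide).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV", "L": "IMFV", "M": "ILV",
    "P": "", "V": "ILM", "W": "FY", "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S",
    "Y": "HFW", "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

mean_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)


def naive_protein_evalue(db_letters, length, effective_alphabet):
    """Expected chance matches: database letters / effective_alphabet ** length."""
    return db_letters / effective_alphabet ** length


DB_LETTERS = 3.5e8  # rounded RefSeq protein size quoted above
exact = naive_protein_evalue(DB_LETTERS, 20, 20)     # identical residues only
tolerant = naive_protein_evalue(DB_LETTERS, 20, 7)   # allowing ~2 substitutes each
print(mean_substitutes, f"{exact:.1e}", f"{tolerant:.1e}")
```

The gap of roughly nine orders of magnitude between the two estimates is why substitutability cannot be ignored when judging protein matches.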
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In the simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a dash and a letter (-x) and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
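As a rough illustration of how such a command line is put together, the sketch below assembles the argument list for the legacy 'blastall' program without running it. The helper name, the query file 'query.fasta' and the database name 'refseq_mrna' are hypothetical; only the option letters (-p, -i, -d, -e, -W) come from the slides.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a 'blastall' argument list from -p, -i, -d plus extra single-letter options."""
    cmd = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        cmd += [f"-{flag}", str(value)]
    return cmd


# Hypothetical file and database names, for illustration only.
cmd = blastall_command("blastn", "query.fasta", "refseq_mrna", e=1e-5, W=11)
print(shlex.join(cmd))
```

On a machine with legacy BLAST installed and a formatted database of that name, the printed line is what would be typed at the prompt (or passed to subprocess.run as the list).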
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they take a default value appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful at anything but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
[Figure: results with filtering ON and OFF]
Results for Exercise 2C
[Figure: results with filtering ON and OFF]
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
The Essential Taskexperiment data mining
gene sequence what is its function
database of proteins in other species
Cyclin-AFoxA1
cdc25
alpha-tubulin
Predicted protein
Gravin-like
Sprouty-2
calmodulin
KIAA10786568
frizzled
Wint8
Troponin T3
Gravin-like
we can only do this because of implied function based on orthology
Functional Orthologs
function known annotation lsquoGravinrsquo available
Human geneXenopus genefunction unknown
sequence similarityorthologs
same function But we know that function is largely determined by shape
similar shape
Which in general we cannot determine ndash but it is probably SHAPE not SEQUENCE that is conserved
We make an assumption that the same gene function is likely to be present in the two organisms and the ones that have this function are likely to be the most similar in sequence
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others (so about 3 of the 20 residues are acceptable), which gives each position about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
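The two back-of-envelope estimates above can be checked with a few lines of Python. This is only a sketch of the slide arithmetic, not of how BLAST actually computes E-values; the function name and its parameters are made up for illustration.

```python
def naive_protein_evalue(db_letters, match_length, choices_per_position):
    """Expected chance matches = database letters / choices^length.

    Illustrative helper (not part of BLAST): 'choices_per_position' is 20
    for exact matching, or about 7 once substitutions are tolerated.
    """
    return db_letters / (choices_per_position ** match_length)


# Letter count for the RefSeq protein database, as quoted in the text.
REFSEQ_PROTEIN_LETTERS = 347_895_532

exact = naive_protein_evalue(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = naive_protein_evalue(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"exact matching:     {exact:.1e}")
print(f"with substitutions: {with_substitutions:.1e}")
```

The first figure comes out at roughly 3 x 10^-18 and the second at roughly 4 x 10^-9, so allowing substitutions alone moves the estimate by about nine orders of magnitude.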
These substitutabilities are dealt with by the BLOSUM and PAM matrices
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
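As a rough illustration of that simplest form, the sketch below assembles a 'blastall' command line in Python without running it. The file name query.fasta and database name refseq_rna are hypothetical placeholders, and the helper function is invented for this example; only the -p, -i, -d, -e and -W flags come from the workshop text.

```python
import shlex


def build_blastall_command(program, query_file, database, extra_options=None):
    """Return the argument list for a legacy 'blastall' run (illustrative helper)."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Any further '-x value' pairs are appended; unlisted ones keep their defaults.
    for flag, value in (extra_options or {}).items():
        args += [flag, str(value)]
    return args


# Hypothetical query file and database name, for illustration only.
cmd = build_blastall_command(
    "blastn", "query.fasta", "refseq_rna", {"-e": 1e-5, "-W": 11}
)
print(shlex.join(cmd))
```

Building the arguments as a list rather than one string keeps each flag paired with its value, which makes it easier to see which defaults have been overridden.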
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B (panels: ON, OFF)
Results for Exercise 2C (panels: ON, OFF)
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
Functional Orthologs
Human gene: function known, annotation 'Gravin' available
Xenopus gene: function unknown
sequence similarity -> orthologs -> same function
But we know that function is largely determined by shape (similar shape), which in general we cannot determine. It is probably SHAPE, not SEQUENCE, that is conserved.
We make an assumption that the same gene function is likely to be present in the two organisms, and that the genes which have this function are likely to be the most similar in sequence.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism and of the one you wish to find an ortholog in.
frog protein -> BLAST against database of human proteins -> best match human protein -> BLAST against database of frog proteins -> x (is x the frog protein we started with?)
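The reciprocal check can be sketched in Python as follows. This is a minimal illustration which assumes the two BLAST runs have already been reduced to "best hit" dictionaries; the function name and the protein identifiers are invented for the example.

```python
def reciprocal_best_hits(best_hit_a_to_b, best_hit_b_to_a):
    """Keep only the pairs where each protein is the other's best hit."""
    pairs = {}
    for a, b in best_hit_a_to_b.items():
        # Accept the pair only if searching back with b returns a again.
        if best_hit_b_to_a.get(b) == a:
            pairs[a] = b
    return pairs


# Hypothetical best-hit tables (identifiers are made up).
frog_to_human = {"frog1": "humanA", "frog2": "humanB", "frog3": "humanB"}
human_to_frog = {"humanA": "frog1", "humanB": "frog2"}

print(reciprocal_best_hits(frog_to_human, human_to_frog))
```

Here frog3 is dropped: its best human match is humanB, but humanB's best frog match is frog2, which is the situation one might expect for a paralog.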
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. The breaks, of course, represent chromosomal re-arrangements since these organisms diverged.
[Figure labels: Human chromosome 5; Mouse chromosome 10; Mouse chromosome 2]
Metazome
Fortunately someone has done all the hard work for us… Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the BLAST window, open two versions of it, and BLAST your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)…
Part 3: Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches. This is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
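To make the trade-off between matches and gap costs concrete, here is a small Python sketch that scores an alignment which has already been written out as two equal-length rows. The scoring values (match +1, mismatch -1, gap open -5, gap extend -2) are arbitrary illustrative assumptions and are not BLAST's defaults.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score two aligned rows of equal length, where '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            # Opening a gap costs more than extending one already open.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


print(score_alignment("ACATGGCAAT", "ACAT-GCAAT"))  # nine matches, one gap
print(score_alignment("ACGT", "A--T"))              # two matches, one 2-letter gap
```

With these assumed costs a single gap wipes out the benefit of five matches, so an alignment only "pays" for a gap when it brings a reasonably long matching stretch into register.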
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best-matching alignment, through to ones which take progressively less exact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
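The word-lookup step can be sketched in Python as below. This is a deliberately simplified illustration using exact words only (real BLAST also handles 'neighbourhood' words for proteins and then extends each seed); the function names and the toy database are assumptions made for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) for each shared word."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds


# Toy database; the index is built once and reused for every query.
database = {"seq1": "TTACGTACGGA", "seq2": "GGGGCCCCAAAA"}
index = build_word_index(database, word_size=4)

seeds = find_seeds("CGTACG", index, word_size=4)
print(seeds)
print(find_seeds("ATATAT", index, word_size=4))  # shares no 4-letter word
```

Only seq1 produces seeds, so only seq1 would go forward to the slower extension step. The second query shows the downside: a sequence with no exact word in common is never examined at all.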
Here is a 'typical' weak alignment from BLASTp:
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance. Also called "expect value" or "expectation".
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
E-values From First Principles
Some database statistics (23rd July 2005):
Database NCBI RefSeq mRNA: 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database NCBI nr: 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(chance matches to a single 'A' marked: A A A A AA A A A AA AA)
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = ~(5.0 x 10^8) / 10^36 = 5.0 x 10^-28 (rounding 4^60 to 10^36)
E-value = 5.0 x 10^-28
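The same first-principles arithmetic can be reproduced in Python. This sketch assumes equal base frequencies and exact, ungapped matching, exactly as the slide does; the function name is illustrative.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches = letters / alphabet^length."""
    return db_letters / (alphabet_size ** query_length)


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded figure used in the text

for k in (1, 2, 3, 60):
    estimate = expected_chance_matches(REFSEQ_MRNA_LETTERS, k)
    print(f"{k:>2}-letter query: {estimate:.1e}")
```

Without rounding 4^60 to 10^36, the 60-mer estimate comes out a little lower than the slide's figure (about 3.8 x 10^-28 rather than 5.0 x 10^-28), which makes no practical difference at this order of magnitude.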
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
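The motif arithmetic generalises easily. The Python sketch below takes, for each motif position, the number of bases allowed there, and assumes all four bases are equally likely; the function name is an illustrative assumption.

```python
def expected_motif_spacing(allowed_per_position):
    """Average number of nucleotides between chance occurrences of a motif."""
    spacing = 1.0
    for allowed in allowed_per_position:
        # A position accepting 'allowed' of the 4 bases matches 1 time in 4/allowed.
        spacing *= 4 / allowed
    return spacing


# ACC[TG]TA: every position allows one base except [TG], which allows two.
motif = [1, 1, 1, 2, 1, 1]
spacing = expected_motif_spacing(motif)
per_10k = 10_000 / spacing

print(spacing)      # nucleotides per chance occurrence
print(per_10k)      # chance occurrences in a 10k promoter
print(2 * per_10k)  # if a second, equally likely motif is also allowed
```

The spacing is 2048 nt, giving just under 5 chance sites per 10k, and just under 10 when either of two equally likely motifs is accepted.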
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(chance matches to a single 'A' marked: A A A A AA A A A AA AA)
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = ~(1.4 x 10^10) / 10^36 = 1.4 x 10^-26 (rounding 4^60 to 10^36)
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
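Because the estimate is directly proportional to the number of letters searched, an E-value obtained against one database can be rescaled to another. The Python sketch below shows this for the two figures in the text; the function name is an illustrative assumption, and real BLAST statistics include further corrections that are ignored here.

```python
def rescale_evalue(evalue, old_db_letters, new_db_letters):
    """Scale an E-value linearly with the size of the database searched."""
    return evalue * (new_db_letters / old_db_letters)


# RefSeq mRNA (~5.0e8 letters) to nr (~1.4e10 letters), figures from the text.
nr_estimate = rescale_evalue(5.0e-28, 5.0e8, 1.4e10)
print(f"{nr_estimate:.1e}")
```

This is essentially what the -z (effective database size) parameter is for: fixing the size used in the calculation so that E-values from different databases can be compared.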
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~ 5.0 x 10^-19.
And Query2 was only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~ 30.
And intuitively we feel we would expect to see that sort of number of matches to the short query in the database just by chance…
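The point that identity alone hides the difference can be checked directly. The short Python sketch below computes percent identity for the two alignments shown above; the function name is an illustrative assumption, and the sequences are copied from the slide.

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned columns in which both rows carry the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * matches / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
hit1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
hit2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, hit1))
print(len(query2), percent_identity(query2, hit2))
```

Both alignments score exactly 75% identity, yet one covers 60 positions and the other only 16, which is why the E-value (which accounts for length and database size) separates them so sharply.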
So what's the real problem?
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
Finding OrthologsSo how do we find orthologs and can we know when we have
The simplest is Reciprocal Best BLAST but it implicitly relies on having all the protein sequences of you own organism and the one you wish to find an ortholog in
frog proteindatabase of human proteins
best match human protein
database of frog proteins
x
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another
And we find the same genes (ie orthologs) in more or less the same order in the syntenic sectionsThese of course represent chromosomal re-arrangements since these organisms diverged
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
MetazomeFortunately someone has done all the hard work for ushellip Dan Rokhsar httpwwwmetazomenet
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again
Pick probable ortholog vertebrate genes from common organisms (human mouse rat chicken frog fish) and paste their protein sequences into a temporary space
Go to Metazome (httpwwwmetazomenet) find the blast window open two versions of it and blast your sequences against the Tetrapod or Jawed vertebrate node
See if you get the same cluster ID as best top hit and have a look at the Metazome alignment(s)hellip
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments looking for a degree of similarity greater than we would expect to find by chance
But first we have to consider the implication of gapshellip
Insertions and deletions are other possible forms of mutations and they can really mess up our simple alignments
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA database above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA (each 'A' in the sequence is marked)
Expected number of matches for a 1-letter query = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Expected number of matches for a 2-letter query = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Expected number of matches for a 3-letter query = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
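The first-principles arithmetic above can be checked with a short Python sketch. It assumes, as the slides do, equally likely and independent letters and exact ungapped matches; the function name `naive_expected_matches` is hypothetical. Note that 4^60 is closer to 1.3 x 10^36 than to 10^36, so the unrounded result for the 60-mer comes out a little below the rounded figure quoted above.

```python
def naive_expected_matches(database_letters, query_length, alphabet_size=4):
    """Exact-match expectation assuming equally likely, independent letters."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size quoted in the workshop

for length in (1, 2, 3, 60):
    expected = naive_expected_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2}-letter query: ~{expected:.1e} chance matches")
```

This is not how BLAST itself computes E-values; it only reproduces the back-of-the-envelope reasoning.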
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt. Expect 'ACC' every 4 x 4 x 4 = 64 nt.
Expect 'T or G' every 2nd nt. Expect 'ACC[TG]' every 64 x 2 nt = 128 nt.
Expect 'TA' every 4 x 4 = 16 nt. Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4). We would expect ~5 of these promoter sites every 10k by chance.
If ACC[TG]TAA is also allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
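The motif calculation can be written as a small Python sketch. It assumes equal base frequencies, counts one strand only, and ignores the few positions lost at the end of the sequence; the function names are hypothetical.

```python
import re


def motif_probability(motif):
    """Chance that a random position starts a match, assuming equal base frequencies."""
    probability = 1.0
    # Each token is either a bracketed set such as [TG] or a single base.
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        allowed_bases = len(token.strip("[]"))
        probability *= allowed_bases / 4
    return probability


def expected_sites(motif, sequence_length):
    """Expected number of chance occurrences in a sequence of the given length."""
    return sequence_length * motif_probability(motif)


print(1 / motif_probability("ACC[TG]TA"))   # average spacing between chance sites, in nt
print(expected_sites("ACC[TG]TA", 10_000))  # chance sites in a 10k promoter
```

Allowing a second, equally likely motif as an alternative simply adds its expectation to the first.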
E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA (each 'A' in the sequence is marked)
Expected number of matches for a 1-letter query = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Expected number of matches for a 2-letter query = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Expected number of matches for a 3-letter query = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each database:
RefSeq: 5.0 x 10^8 letters; E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger); E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
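A short Python check makes the point concrete: both example alignments really do have the same percent identity, even though their lengths, and so their significance, are very different. The helper name `percent_identity` is hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A gap character never counts as an identical column.
    identical = sum(a == b and a != "-" for a, b in zip(row_a, row_b))
    return 100.0 * identical / len(row_a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Identity alone says nothing about how much sequence the match covers, which is why it cannot replace the E-value.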
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
BLAST: similarity + probability (match length, PI, size of database...)
ORTHOLOGY: biological knowledge, nature of query sequence, phylogenetic relationship
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-values, from larger/worse to smaller/better: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the same scale, in order: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI:
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
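Residue: residues that can substitute for it
A: S
C: (none)
F: L, W, Y
G: (none)
I: L, M, V
L: I, M, F, V
M: I, L, V
P: (none)
V: I, L, M
W: F, Y
N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W
H: N, Q, Y
K: R, Q, E
R: Q, K
D: N, E
E: D, Q, K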
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
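The protein arithmetic can be reproduced with the Python sketch below. It assumes the substitution slide reads as 'residue: acceptable substitutes' (that reading is an interpretation of the flattened table, and the dictionary name `SUBSTITUTES` is hypothetical), and it keeps the crude assumption of equally likely residues. Real BLAST scoring uses BLOSUM or PAM matrices instead.

```python
# Residue -> residues treated as acceptable substitutes (as read from the slide).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

REFSEQ_PROTEIN_LETTERS = 3.5e8  # rounded from 347,895,532
QUERY_LENGTH = 20

average_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)
print(f"average substitutes per residue: {average_substitutes:.1f}")

# Exact matches only: each position matches 1 residue in 20.
exact_only = REFSEQ_PROTEIN_LETTERS / 20 ** QUERY_LENGTH
# Allowing substitutions: roughly 1 chance in 7 per position, as in the text.
with_substitutions = REFSEQ_PROTEIN_LETTERS / 7 ** QUERY_LENGTH

print(f"exact matches only:       ~{exact_only:.1e}")
print(f"allowing substitutions:   ~{with_substitutions:.1e}")
```

The unrounded exact-match figure is slightly lower than the one quoted above because 20^20 is a little more than 10^26.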
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
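As an illustration, the Python sketch below assembles such a command line from the three basic options plus two optional ones (-e and -W, both options of the legacy blastall program). The file name `query.fasta` and database name `refseq_rna` are placeholders, the helper `blastall_command` is hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex


def blastall_command(program, query_file, database, extra_options=None):
    """Build the argument list for a legacy 'blastall' run (the command is not executed)."""
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        argv += [flag, str(value)]
    return argv


# Placeholder file and database names; substitute your own.
argv = blastall_command(
    "blastn", "query.fasta", "refseq_rna",
    extra_options={"-e": "1e-5", "-W": 11},
)
print(shlex.join(argv))
```

Building the command as a list keeps each option paired with its value and avoids quoting problems if a file name contains spaces.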
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page / Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B (panels: filter ON, filter OFF)
Results for Exercise 2C (panels: filter ON, filter OFF)
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
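Entrez limits are plain text, so they are easy to assemble programmatically. The Python sketch below builds organism limits using the [ORGN] field tag shown above; the helper names `any_organism` and `exclude` are hypothetical, and the species chosen are just one way of covering mouse and rat.

```python
def any_organism(*organisms):
    """Match sequences from any of the named organisms, using the [ORGN] field tag."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


def exclude(query, organism):
    """Drop one organism from an existing query with logical NOT."""
    return f"({query}) NOT {organism}[ORGN]"


rodents = any_organism("Mus musculus", "Rattus norvegicus")
print(rodents)
print(exclude(rodents, "Rattus norvegicus"))
```

The resulting string is what you would paste into the "Limit by entrez query" box.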
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page / SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Using Synteny is Better
We know that large regions of (say) vertebrate genomes have preserved their overall organisation from one organism to another.
And we find the same genes (i.e. orthologs) in more or less the same order in the syntenic sections. These of course represent chromosomal re-arrangements since these organisms diverged.
Human chromosome 5
Mouse chromosome 10
Mouse chromosome 2
Metazome
Fortunately someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches, and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
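The trade-off between matches and gaps can be made concrete with a small scoring sketch in Python. The scores used here (match +1, mismatch -3, gap open -5, gap extend -2) are illustrative assumptions, not values taken from this workshop or from any BLAST default, and the function name `score_alignment` is hypothetical. For simplicity a run of gap columns is treated as one gap regardless of which row it is in.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two equal-length aligned rows in which '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            # The first column of a gap is charged more than the following ones.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# No mismatches at all, but three separate gaps were needed to achieve that.
print(score_alignment("A-C-G-T", "ATCAGCT"))
# One mismatch and no gaps.
print(score_alignment("ACGT", "ACCT"))
# One longer gap costs less than several short ones.
print(score_alignment("AC--GT", "ACTTGT"))
```

With gap costs in place, an alignment that avoids every mismatch by scattering gaps scores worse than one that accepts a mismatch.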
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are ...
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
Metazome
Fortunately, someone has done all the hard work for us... Dan Rokhsar, http://www.metazome.net
Metazome Exercise
Go back to Entrez Gene and look for your favourite gene again.
Pick probable ortholog vertebrate genes from common organisms (human, mouse, rat, chicken, frog, fish) and paste their protein sequences into a temporary space.
Go to Metazome (http://www.metazome.net), find the blast window, open two versions of it, and blast your sequences against the Tetrapod or Jawed vertebrate node.
See if you get the same cluster ID as best top hit, and have a look at the Metazome alignment(s)...
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps...
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this, we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps...
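As a rough illustration of offsetting gap costs against matches, here is a minimal Python sketch. The function name, the penalty values and the convention (the first position of a gap costs gap_open, each further position costs gap_extend) are assumptions chosen for illustration; they are not BLAST's actual defaults or scoring code.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=-5, gap_extend=-2):
    """Score two aligned rows of equal length; '-' marks a gap position.

    Illustrative convention: the first position of a run of gaps costs
    gap_open, every further position in the same run costs gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


# An ungapped alignment with two mismatches...
ungapped = score_alignment("ACGTACGT", "ACCTAGGT")
# ...versus a 'perfect' alignment that only works by scattering gaps everywhere.
gappy = score_alignment("A-C-G-T", "ATCAGCT")
print(ungapped, gappy)
```

With these made-up penalties, the gap-riddled alignment scores far worse than the ungapped one even though it contains no mismatches, which is the behaviour we want.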
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily most widely used is BLAST, developed by Stephen Altschul and others, and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word-size of 11.
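The word-based shortcut can be sketched in a few lines of Python. This is a toy illustration only: the function names and the tiny two-sequence 'database' are hypothetical, exact word matching is assumed, and the real BLAST implementation differs in many details (for example, how protein words are handled).

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seed_hits(query, index, word_size):
    """Return (sequence name, query position, database position) for each shared word."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        word = query[qpos:qpos + word_size]
        for name, dpos in index.get(word, []):
            hits.append((name, qpos, dpos))
    return hits


# Hypothetical toy database, indexed once in advance.
toy_db = {"s1": "TTACGTACGTAA", "s2": "GGGGGGGG"}
toy_index = build_word_index(toy_db, word_size=4)

print(find_seed_hits("ACGT", toy_index, 4))       # shares a word with s1
print(find_seed_hits("ACGAACGA", toy_index, 4))   # similar to s1, but shares no exact 4-mer
```

The second query resembles s1 yet produces no seed at all, so a word-based search would never try to extend an alignment for it. That is the price of the speed-up.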
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a 'typical' weak alignment from BLASTp.
In fact, the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA - 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr - 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
A A A A AA A A A AA AA
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
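The same back-of-envelope calculation can be written as a small Python function. This is a sketch of the naive model only (exact, ungapped matches, equal base frequencies); the function name is made up for illustration, and it is not how BLAST actually computes its E-values.

```python
def naive_expected_matches(db_letters, alphabet_size, query_length):
    """Expected number of exact chance matches: db_letters / alphabet_size ** query_length."""
    return db_letters / alphabet_size ** query_length


REFSEQ_LETTERS = 5.0e8   # approximate RefSeq mRNA size quoted above
NR_LETTERS = 1.4e10      # approximate nr size quoted above

for length in (1, 2, 3, 60):
    print(length, naive_expected_matches(REFSEQ_LETTERS, 4, length))
```

Note that 4^60 is about 1.3 x 10^36, so the exact figure for the 60-mer comes out a little below the rounded 5.0 x 10^-28 used here. Feeding in NR_LETTERS instead of REFSEQ_LETTERS scales every result by the ratio of the database sizes (about 28x).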
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt. Expect 'ACC' every 4x4x4 = 64 nt.
Expect 'T or G' every 2nd nt. Expect 'ACC[TG]' every 64x2 nt = 128 nt.
Expect 'TA' every 4x4 = 16 nt. Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4).
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
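The motif arithmetic can be checked with a short Python sketch. The helper name is hypothetical, equal base frequencies are assumed, and the second motif is encoded on the assumption that the optional extra base may be any nucleotide (which is what makes the two motifs equally likely).

```python
def expected_motif_hits(allowed_per_position, sequence_length):
    """Expected chance occurrences of a motif in sequence_length nt.

    allowed_per_position lists, for each motif position, how many of the
    4 nucleotides are acceptable there (1 for a fixed base, 2 for [TG], 4 for any).
    """
    probability = 1.0
    for allowed in allowed_per_position:
        probability *= allowed / 4
    return sequence_length * probability


ACC_TG_TA = [1, 1, 1, 2, 1, 1]           # ACC[TG]TA
WITH_EXTRA_BASE = [1, 1, 1, 2, 4, 1, 1]  # assumption: inserted base can be any nucleotide

base_hits = expected_motif_hits(ACC_TG_TA, 10_000)
extra_hits = expected_motif_hits(WITH_EXTRA_BASE, 10_000)
print(base_hits, extra_hits, base_hits + extra_hits)
```

Under these assumptions each motif is expected just under 5 times in 10k, so allowing either gives just under 10.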
E-values Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
A A A A AA A A A AA AA
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (30x bigger), E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
BLAST a shorter part of the same sequence against the database: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
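A few lines of Python confirm that the two alignments above really do share the same percent identity, even though their E-values are worlds apart. The helper name is made up for illustration, and it assumes two ungapped rows of equal length.

```python
def percent_identity(row_a, row_b):
    """Percentage of aligned positions at which the two rows carry the same letter."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(a == b for a, b in zip(row_a, row_b))
    return 100.0 * matches / len(row_a)


QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
MATCH1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
MATCH2 = "ACGCACCTTCGTAGGT"

print(len(QUERY1), percent_identity(QUERY1, MATCH1))
print(len(QUERY2), percent_identity(QUERY2, MATCH2))
```

Identity alone says nothing about how long the match is, which is exactly the information the E-value adds.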
So what's the real problem
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
The difficulty is because BLAST gives similarity + probability, not orthology. Relevant factors: biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database...
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-values, from larger/worse to smaller/better:
10^-5    fantasy
10^-10   borderline
10^-40   encouraging
10^-100  pretty good
0.0      can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue: substitutable residues
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K
In fact, we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average, any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (roughly 3 acceptable residues out of 20) rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
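The two protein estimates can be reproduced with the same naive model, simply changing the per-position odds. This is a hedged sketch: the function name is illustrative, and 'effective choices' of 20 versus 7 is the crude approximation used above, not what the BLOSUM or PAM matrices actually do.

```python
def naive_protein_evalue(db_letters, query_length, effective_choices):
    """Naive chance-match estimate: db_letters / effective_choices ** query_length."""
    return db_letters / effective_choices ** query_length


PROTEIN_DB_LETTERS = 3.5e8  # approximate RefSeq protein size quoted above

exact_only = naive_protein_evalue(PROTEIN_DB_LETTERS, 20, 20)         # 1/20 chance per position
with_substitutions = naive_protein_evalue(PROTEIN_DB_LETTERS, 20, 7)  # ~1/7 chance per position
print(exact_only, with_substitutions)
```

Moving from 1-in-20 to 1-in-7 per position shifts the estimate by roughly nine orders of magnitude for a 20-residue query, which is why substitutability cannot be ignored.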
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
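To make the shape of such a command concrete, here is a small Python sketch that assembles a blastall-style argument list from the options described above. It only builds the list and does not run anything. The helper name, the query file name 'query.fa' and the choice of database are hypothetical placeholders.

```python
def build_blastall_command(program, query_file, database, **options):
    """Assemble a blastall-style argument list.

    -p, -i and -d are the three basic options described above; any further
    keyword (e.g. e=1e-10, W=11) is appended as '-<flag> <value>'.
    """
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


# Hypothetical example: blastn with a stricter E-value cut-off and explicit word size.
command = build_blastall_command("blastn", "query.fa", "nr", e=1e-10, W=11)
print(" ".join(command))
```

Anything not passed explicitly is simply left off the command line, so the program falls back to its own default for that flavour, as described above.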
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page / Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page (hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful).
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
BLAST Parameters Exercises
Results for Exercise 1
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
END
Part 3 Finding Sequence Similarities
We want computer programs which will compare sequences at all possible different alignments, looking for a degree of similarity greater than we would expect to find by chance.
But first we have to consider the implication of gaps…
Insertions and deletions are other possible forms of mutation, and they can really mess up our simple alignments.
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
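The trade-off between matches and gap costs can be sketched in a few lines of Python. This is only an illustration: the function name score_alignment and the reward and penalty values are assumptions chosen for the example, not the scoring parameters BLAST actually uses.

```python
def score_alignment(top, bottom, match=1, mismatch=1, gap=2):
    """Score two aligned sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= gap        # every gap position costs something
        elif a == b:
            score += match      # reward for an identical pair
        else:
            score -= mismatch   # penalty for a substitution
    return score


# One mismatch and no gaps: 7 matches - 1 mismatch
ungapped = score_alignment("ACGTACGT", "ACGAACGT")
# No mismatches at all, but four gap positions: 6 matches - 4 gaps x 2
gappy = score_alignment("AC--GT--AC", "ACTTGTCCAC")
print(ungapped, gappy)
```

With these (arbitrary) costs the gap-riddled alignment scores worse than the ungapped one, even though it contains no mismatches.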
There are many programs used to find similarities between sequences.
They range from relatively slow programs which find the exact best-matching alignment, through ones which take progressively more inexact shortcuts to speed things up. Of this latter class the best known, and easily the most widely used, is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
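The word-lookup idea can be sketched roughly as follows. This is a toy illustration, not how BLAST is really implemented: the function names, the two tiny 'database' sequences and the word sizes are all assumptions made up for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length word_size to the places it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (sequence name, query position, database position) for
    every query word that also occurs somewhere in the database."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], []):
            seeds.append((name, qpos, dpos))
    return seeds


# Hypothetical toy database
database = {
    "seqA": "TTACGTACGTTT",
    "seqB": "ACGAACGTTCGA",
}
query = "ACGTACGT"

short_word_hits = find_seeds(query, build_word_index(database, 4), 4)
long_word_hits = find_seeds(query, build_word_index(database, 8), 8)
print({name for name, _, _ in short_word_hits})
print(long_word_hits)
```

With a word size of 4 both toy sequences are picked up as candidates for extension; with a word size of 8 only seqA shares a whole word with the query, so seqB would never be looked at again.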
Here is a 'typical' weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005)
Database: NCBI RefSeq mRNA - 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr - 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(every 'A' in this example sequence is a chance match to a 1 nt query)
Query of 1 nt: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Query of 2 nt: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Query of 3 nt: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = 5.0 x 10^-28
E-value = 5.0 x 10^-28
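The same back-of-the-envelope arithmetic can be written as a short Python sketch. The function name is an assumption of mine, and this is the naive counting argument from the slide, not the statistics BLAST really uses. Note that 4 multiplied by itself 60 times is about 1.3 x 10^36, so the exact division gives roughly 3.8 x 10^-28; the 5.0 x 10^-28 above comes from rounding the divisor to 10^36 and is the same order of magnitude.

```python
def expected_chance_matches(database_letters, query_length, alphabet_size=4):
    """Naive expectation: one possible start per database letter, each
    matching the whole query with probability (1/alphabet_size)**length."""
    return database_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded figure used in the text

for length in (1, 2, 3, 60):
    expected = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(length, f"{expected:.1e}")
```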
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many occurrences of the motif ACC[TG]TA would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA with one additional base after the 4th position.
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value.
To allow either means we expect twice as many.
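The answer can be checked with a small Python sketch. The function name is mine, and I am assuming that the optional additional base may be any of the four nucleotides, which is what makes the two motifs equally likely; equal base frequencies are also assumed.

```python
from math import prod


def expected_motif_hits(sequence_length, allowed_bases_per_position):
    """Expected chance occurrences of a motif in random sequence; each
    entry says how many of the 4 bases are accepted at that position."""
    per_site_probability = prod(n / 4 for n in allowed_bases_per_position)
    return sequence_length * per_site_probability


# ACC[TG]TA: one allowed base everywhere except the [TG] position
core = expected_motif_hits(10_000, [1, 1, 1, 2, 1, 1])
# Assumed variant: an extra position after [TG] that accepts any base
variant = expected_motif_hits(10_000, [1, 1, 1, 2, 4, 1, 1])
print(round(core, 2), round(core + variant, 2))
```

The core motif comes out at 10,000 / 2048, i.e. just under 5 per 10k, and allowing either form doubles that.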
E-values Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
Query of 1 nt: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Query of 2 nt: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Query of 3 nt: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30x bigger), E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
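As a quick check of the scaling, here is a minimal Python sketch using the database sizes quoted earlier (23rd July 2005). The helper name rescale_e_value is my own, and it assumes the simple counting model above, in which the E-value is directly proportional to the number of letters searched.

```python
REFSEQ_MRNA_LETTERS = 503_566_580   # RefSeq mRNA total letters
NR_LETTERS = 14_601_814_750         # nr total letters

size_ratio = NR_LETTERS / REFSEQ_MRNA_LETTERS


def rescale_e_value(e_value, old_db_letters, new_db_letters):
    """Scale an E-value linearly with the size of the database."""
    return e_value * new_db_letters / old_db_letters


e_nr = rescale_e_value(5.0e-28, REFSEQ_MRNA_LETTERS, NR_LETTERS)
print(f"{size_ratio:.1f}x bigger, E-value {e_nr:.1e}")
```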
Why were the values different
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28.
But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
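To see that the two alignments really do share the same identity, here is a minimal Python sketch. The function name is my own; the sequences are the two aligned pairs shown above, treated as ungapped alignments.

```python
def percent_identity(query, subject):
    """Percentage of aligned positions that are identical (no gaps)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(query, subject) if a == b)
    return 100.0 * matches / len(query)


long_pair = (
    "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG",
    "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG",
)
short_pair = ("ACGTACGTACGTACGT", "ACGCACCTTCGTAGGT")

print(len(long_pair[0]), percent_identity(*long_pair))
print(len(short_pair[0]), percent_identity(*short_pair))
```

Both pairs come out at 75% (45 of 60 and 12 of 16 positions), which is exactly why identity on its own cannot separate the convincing match from the chance one.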
So what's the real problem
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because ORTHOLOGY is a matter of biological knowledge, the nature of the query sequence and phylogenetic relationship, whereas BLAST gives only similarity + probability, which depend on match length, percent identity (PI), size of database…
Are there any useful guidelines though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-values (larger/worse on the left, smaller/better on the right):
10^-5 | 10^-10 | 10^-40 | 10^-100 | 0.0
fantasy | borderline | encouraging | pretty good | can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
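Residue: acceptable substitutes
A: S; C: (none); F: L, W, Y; G: (none); I: L, M, V; L: I, M, F, V; M: I, L, V; P: (none); V: I, L, M; W: F, Y
N: D, H, S; Q: R, E, H, K; S: A, N, T; T: S; Y: H, F, W
H: N, Q, Y; K: R, Q, E; R: Q, K
D: N, E; E: D, Q, K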
In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (3 acceptable residues out of 20) rather than 1/20th.
So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
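The two estimates above can be reproduced with a minimal Python sketch. The function name is an assumption, and this is still only the naive counting argument (real BLAST statistics use the substitution matrix scores rather than a flat 1-in-7 chance).

```python
REFSEQ_PROTEIN_LETTERS = 347_895_532
QUERY_LENGTH = 20


def naive_e_value(database_letters, query_length, choices_per_position):
    """Database size divided by the number of equally likely
    alternatives at each position, raised to the query length."""
    return database_letters / choices_per_position ** query_length


exact_only = naive_e_value(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 20)
with_substitutions = naive_e_value(REFSEQ_PROTEIN_LETTERS, QUERY_LENGTH, 7)
print(f"{exact_only:.1e}  {with_substitutions:.1e}")
```

Allowing roughly one in seven residues to count as a match moves the estimate by about nine orders of magnitude, from around 10^-18 to around 10^-9.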
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g. blastall -p <flavour> -i <query file> -d <database>
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
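As a rough illustration, the simplest form could be assembled like this in Python. The helper name, the query file name 'query.fasta' and the database name 'refseq_rna' are hypothetical placeholders; the sketch only builds and prints the command line and does not run BLAST.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the argument list for the simplest blastall call."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# Hypothetical file and database names, for illustration only
cmd = blastall_command("blastn", "query.fasta", "refseq_rna")
print(shlex.join(cmd))
```

On a machine with the legacy blastall program and a suitably indexed database installed, a list like this could be handed to subprocess.run, but that depends entirely on the local setup.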
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the BLAST flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
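Extra parameters are simply appended to the basic command as flag and value pairs. A minimal Python sketch, in which the helper name, the file name and the particular values (E-value cut-off, word size, number of hits) are illustrative assumptions rather than recommendations:

```python
def with_options(base_command, options):
    """Append '-flag value' pairs to a command given as a list."""
    command = list(base_command)
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical query file; flags taken from the list above
base = ["blastall", "-p", "blastx", "-i", "query.fasta", "-d", "nr"]
tweaked = with_options(base, {"e": 1e-10, "W": 3, "b": 50})
print(" ".join(tweaked))
```

Anything not mentioned keeps its default, so the tweaked command differs from the basic one only in the three options given.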
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
2. Low complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B (filter ON vs OFF)
Results for Exercise 2C (filter ON vs OFF)
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
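Combining organism terms is just string building. A small Python sketch, where the helper name is my own invention and the second organism is only an example taken from elsewhere in this workshop:

```python
def organism_query(*organisms):
    """Join organism names into an Entrez limit using the [ORGN] field
    and logical OR, so that a hit in any of them is kept."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


single = organism_query("Drosophila melanogaster")
either = organism_query("Drosophila melanogaster", "Xenopus tropicalis")
print(single)
print(either)
```

The resulting text is what would be pasted into the Limit by entrez query box.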
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Clearly although the alignment has no mismatches it is obviously not biologically meaningful
To prevent this we assign a cost to adding gaps which is offset against the benefit of finding matches ndash and this is the essence of lsquofinding gapped alignmentsrsquo
We want to find the lsquoalignmentrsquo between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps hellip
There are many programs used to find similarities between sequencesThey range from relatively slow programs which find the exact best matching alignment through ones which take progressively inexact shortcuts to speed things up Of this latter class the best known and easily most widely used is BLAST developed by Stephen Altschul and others and continuously refined over the last 10-15 years
The essential idea is to compare your query sequence against a collection or lsquodatabasersquo of target sequences looking for the one(s) that match the query sequence the best
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability and, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' (roughly 3 acceptable residues out of 20) rather than 1/20th.
So now we get:
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
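The arithmetic on these two slides can be checked with a few lines of Python. This is only a sketch of the slides' back-of-envelope reasoning, not of how BLAST really computes E-values; the function name `naive_evalue` and the `SUBSTITUTES` table (my reading of the substitution slide) are assumptions made for illustration.

```python
# Back-of-envelope E-value for a 20-residue protein query, following the slides.
# This is NOT the statistic BLAST itself uses; it only reproduces the hand calculation.

DB_LETTERS = 347_895_532  # letters in the RefSeq protein database quoted on the slide
QUERY_LENGTH = 20

# Residue -> residues that can substitute for it, as read from the substitution slide.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}


def naive_evalue(db_letters, query_length, match_chance):
    """Expected chance matches if every position matches with the given probability."""
    return db_letters * match_chance ** query_length


average_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)
exact_only = naive_evalue(DB_LETTERS, QUERY_LENGTH, 1 / 20)
with_substitutions = naive_evalue(DB_LETTERS, QUERY_LENGTH, 1 / 7)

print(f"average substitutes per residue: {average_substitutes:.1f}")
print(f"exact matches only:              {exact_only:.1e}")
print(f"allowing substitutions (1/7):    {with_substitutions:.1e}")
```

The exact-match figure comes out slightly below the slide's ~3.5 x 10^-18 because the slide rounds 20^20 to 10^26; the 1/7 figure agrees with the slide's ~4.4 x 10^-9.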
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In its simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
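To make the command-line form concrete, here is a small Python sketch that only assembles a 'blastall' argument list from the three basic flags (-p, -i, -d) plus any extra flags; it does not run anything. The helper name `build_blastall_command`, the file name `my_query.fasta` and the use of `nr` as a local pre-indexed database name are hypothetical, chosen for illustration.

```python
import shlex


def build_blastall_command(program, query_file, database, options=None):
    """Assemble (but do not run) a blastall command line as a list of arguments.

    program    -> -p, which blast flavour to run (e.g. "blastn", "blastx")
    query_file -> -i, file with the query sequence in
    database   -> -d, pre-indexed database name
    options    -> optional mapping of further flags to values, e.g. {"-W": 3}
    """
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (options or {}).items():
        args += [flag, str(value)]
    return args


# Simplest form: every other parameter keeps its flavour-specific default.
simple = build_blastall_command("blastn", "my_query.fasta", "nr")

# Same idea with two parameters set explicitly: E-value cutoff and word size.
tweaked = build_blastall_command(
    "blastx", "my_query.fasta", "nr", {"-e": "1e-10", "-W": 3}
)

print(shlex.join(simple))
print(shlex.join(tweaked))
```

Keeping the command as a list rather than one long string avoids quoting problems if it is later passed to something like `subprocess.run`.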
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low-complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B: filtering ON vs OFF
Results for Exercise 2C: filtering ON vs OFF
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
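If you build these limits often, the query text can be assembled programmatically. The sketch below simply glues strings together in the form described above; the helper names (`organism`, `combine`, `exclude`) are made up for illustration and nothing here contacts NCBI.

```python
def organism(name):
    """Tag an organism name as in the slide, e.g. 'Drosophila melanogaster[ORGN]'."""
    return f"{name}[ORGN]"


def combine(operator, *terms):
    """Join terms with logical AND or OR, parenthesised so they can be nested safely."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return "(" + f" {operator} ".join(terms) + ")"


def exclude(term, unwanted):
    """Keep matches for `term` but drop those that also match `unwanted`."""
    return f"({term} NOT {unwanted})"


fly_only = organism("Drosophila melanogaster")
either_frog = combine(
    "OR", organism("Xenopus tropicalis"), organism("Xenopus laevis")
)
frogs_not_tropicalis = exclude(either_frog, organism("Xenopus tropicalis"))

print(fly_only)
print(either_frog)
print(frogs_not_tropicalis)
```

The resulting text is what you would paste into the 'Limit by entrez query' box.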
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Clearly, although the alignment has no mismatches, it is obviously not biologically meaningful.
To prevent this we assign a cost to adding gaps, which is offset against the benefit of finding matches - and this is the essence of 'finding gapped alignments'.
We want to find the 'alignment' between the two (or more) sequences which shows the greatest degree of similarity while introducing the fewest gaps…
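The trade-off between matches and gap costs can be sketched as a simple scoring function. The reward and penalty values below are arbitrary illustrative numbers, not BLAST's defaults, and the function name `alignment_score` is my own; the point is only that a gap-riddled alignment with no mismatches can still score worse than an ungapped one with a mismatch.

```python
def alignment_score(top, bottom, match=1, mismatch=3, gap_open=5, gap_extend=2):
    """Score two aligned strings of equal length, where '-' marks a gap.

    Each matching column adds `match`, each mismatching column subtracts
    `mismatch`. The first column of a run of gaps on one strand subtracts
    `gap_open`; each further column of the same run subtracts `gap_extend`.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which strand the current gap run is on, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap on both strands")
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            score -= gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            gap_side = None
            score += match if a == b else -mismatch
    return score


# Same two sequences aligned two ways.
ungapped = alignment_score("ACGTACGT", "ACGAACGT")    # 7 matches, 1 mismatch
gapped = alignment_score("ACGT-ACGT", "ACG-AACGT")    # 7 matches, 2 gaps, no mismatch

print("ungapped:", ungapped)
print("gapped:  ", gapped)
```

With these illustrative costs the ungapped alignment wins, which is the behaviour the gap penalties are there to produce.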
There are many programs used to find similarities between sequences. They range from relatively slow programs which find the exact best matching alignment, through to ones which take progressively inexact shortcuts to speed things up. Of this latter class, the best known and easily the most widely used is BLAST, developed by Stephen Altschul and others and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection or 'database' of target sequences, looking for the one(s) that match the query sequence the best.
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score for (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually for a tiny proportion of the sequences in the database - so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
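The word-matching idea, and the way it can miss a diverged sequence, can be illustrated with a toy index. This is a simplified sketch of the seeding step only (exact word lookup, no extension or scoring), not BLAST's actual implementation; the sequences and the names `build_word_index` and `find_seeds` are invented for the example.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Map every word of length `word_size` to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for pos in range(len(sequence) - word_size + 1):
            index[sequence[pos:pos + word_size]].append((name, pos))
    return index


def find_seeds(query, index, word_size):
    """Return (name, query position, database position) for every exact word hit."""
    seeds = []
    for qpos in range(len(query) - word_size + 1):
        for name, dpos in index.get(query[qpos:qpos + word_size], ()):
            seeds.append((name, qpos, dpos))
    return seeds


query = "ACGTTGCAAGGCTAGCTAGGATCC"
database = {
    # contains the first 14 nt of the query unchanged
    "close": "GGACGTTGCAAGGCTATT",
    # the query with a mismatch every 8th base: 21/24 identical, but no shared 11-mer
    "diverged": "ACGTTACAAGGCTCGCTAGGAGCC",
}

seeds_11 = find_seeds(query, build_word_index(database, 11), 11)
seeds_7 = find_seeds(query, build_word_index(database, 7), 7)

print("word size 11:", sorted({name for name, _, _ in seeds_11}))
print("word size 7: ", sorted({name for name, _, _ in seeds_7}))
```

With a word size of 11 only "close" is seeded, even though "diverged" is nearly 90% identical to the query; dropping the word size to 7 picks it up, at the cost of many more seeds to extend in a real database.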
Here is a 'typical' weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
 F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
     A   A     A   A       A           AA A     A A    AA    AA     A
Expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
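The same first-principles calculation can be written out in Python. This is a sketch of the slide's simplified model (exact, ungapped matches; every base equally likely), not the statistics BLAST actually uses; `naive_evalue` is a name invented here.

```python
def naive_evalue(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches of a query of the given length.

    Each extra query position divides the expectation by the alphabet size.
    """
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # rounded figure used on the slide

for length in (1, 2, 3, 60):
    expected = naive_evalue(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2}-mer: {expected:.1e}")
```

For the 60-mer this gives about 3.8 x 10^-28 rather than the slide's 5.0 x 10^-28, because 4^60 is really ~1.3 x 10^36 and the slide rounds it to 10^36; the order of magnitude is what matters here.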
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions?
I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt.
Expect 'ACC' every 4 x 4 x 4 = 64 nt.
Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt.
Expect 'TA' every 4 x 4 = 16 nt.
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4).
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
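The exercise answer can be reproduced with a short script. It is a sketch under stated assumptions: all four bases are equally likely, only one strand is counted, edge effects are ignored, and the optional extra base is modelled as 'N' (any base), which is my assumption and is what makes the two motifs equally likely. The function names are invented for illustration.

```python
from fractions import Fraction


def motif_match_probability(motif):
    """Chance that a random position matches a motif such as 'ACC[TG]TA'.

    A plain letter matches 1 base in 4, a bracketed class such as [TG] matches
    as many bases as it lists, and 'N' (an assumption of this sketch) matches any base.
    """
    probability = Fraction(1)
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            close = motif.index("]", i)
            choices = len(set(motif[i + 1:close]))
            i = close + 1
        elif motif[i] == "N":
            choices = 4
            i += 1
        else:
            choices = 1
            i += 1
        probability *= Fraction(choices, 4)
    return probability


def expected_sites(motifs, sequence_length):
    """Expected chance occurrences of any of the motifs in a sequence of this length."""
    return float(sequence_length * sum(motif_match_probability(m) for m in motifs))


one_motif = expected_sites(["ACC[TG]TA"], 10_000)
either_motif = expected_sites(["ACC[TG]TA", "ACC[TG]NTA"], 10_000)

print(f"ACC[TG]TA alone:          {one_motif:.2f} per 10k")
print(f"with optional extra base: {either_motif:.2f} per 10k")
```

Using `Fraction` keeps the 1-in-2048 figure exact until the final conversion to a float.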
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides - A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
     A   A     A   A       A           AA A     A A    AA    AA     A
Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size. BLAST the same sequence against each:

Database   Size                   E-value
RefSeq     5.0 x 10^8 letters     5.0e-28
nr         1.4 x 10^10 letters    1.4e-26

The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full length match with sequence XYZ at an E-value = 5.0e-160.
Then, with a shorter query, get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~5.0 x 10^-19.
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~ 30.
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
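A quick check that the two alignments above really do share the same identity, despite their very different E-values. The helper `percent_identity` is a minimal sketch written for this example, counting identical columns over the alignment length.

```python
def percent_identity(top, bottom):
    """Percentage of alignment columns in which both sequences carry the same base."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b and a != "-" for a, b in zip(top, bottom))
    return 100.0 * matches / len(top)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"

query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

identity1 = percent_identity(query1, match1)
identity2 = percent_identity(query2, match2)

print(f"Query1: {len(query1)} nt, {identity1:.0f}% identity")
print(f"Query2: {len(query2)} nt, {identity2:.0f}% identity")
```

Identity alone cannot tell these two cases apart; it is the match length (45 of 60 versus 12 of 16 positions) that makes the first one so much less likely to arise by chance.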
So what's the real problem?
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines, though, at least for biological meaningfulness?
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
How does it work?
The main task of any sequence comparison program is to test all possible mutual alignments of two sequences and see how good the match is.
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments that have no word-size stretch in common: consider BLASTn with a default word size of 11.
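The word-index-then-extend idea can be sketched in a few lines of Python. This is only an illustration of the principle, not BLAST's actual implementation: the function names, the toy database and the scoring values (match +1, mismatch -1, a simple drop-off cut-off) are all assumptions made for the example, and the extension step here does not handle gaps.

```python
from collections import defaultdict

def index_words(database, w):
    """Map every length-w word to the (sequence name, offset) places it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((name, i))
    return index

def seed_hits(query, index, w):
    """Return (query offset, sequence name, database offset) for each exact word match."""
    hits = []
    for q in range(len(query) - w + 1):
        for name, d in index.get(query[q:q + w], []):
            hits.append((q, name, d))
    return hits

def extend_ungapped(query, subject, q, d, w, match=1, mismatch=-1, xdrop=3):
    """Grow a seed right, then left, without gaps; stop when the score falls
    more than xdrop below the best seen. Returns (score, query start, query end)."""
    best = w * match
    ends = []
    for step in (1, -1):
        score, best_len, i = best, 0, 0
        while True:
            qi = q + w + i if step == 1 else q - 1 - i
            di = d + w + i if step == 1 else d - 1 - i
            if not (0 <= qi < len(query) and 0 <= di < len(subject)):
                break
            score += match if query[qi] == subject[di] else mismatch
            i += 1
            if score > best:
                best, best_len = score, i
            elif best - score > xdrop:
                break
        ends.append(best_len)
    return best, q - ends[1], q + w + ends[0]

# Toy database (hypothetical sequences, for illustration only).
database = {
    "hit": "TTTTACGTACGTACGTTTTT",   # contains the query exactly
    "near_miss": "ACGAACGAACGA",     # 9/12 identical, but no shared 11-mer
}
query = "ACGTACGTACGT"

hits_w11 = seed_hits(query, index_words(database, 11), 11)
hits_w3 = seed_hits(query, index_words(database, 3), 3)
print(sorted({name for _, name, _ in hits_w11}))
print(sorted({name for _, name, _ in hits_w3}))
print(extend_ungapped(query, database["hit"], 0, 4, 11))
```

With a word size of 11 only the exact hit is seeded; the 75% identical sequence is never examined, which is the 'missed alignment' risk described above. Dropping the word size to 3 finds it, at the cost of many more seeds to extend.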
Here is a 'typical' weak alignment from BLASTp:
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The E-value is the number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
Also called "expect value" or "expectation".
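One way to get a feel for what an E-value means is to convert it into the chance of seeing at least one such match at random. The sketch below assumes chance matches follow a Poisson distribution, which is the usual assumption in BLAST statistics but is not stated on these slides; the function name is made up for the example.

```python
import math

def chance_of_at_least_one(e_value):
    """P(one or more chance matches) = 1 - exp(-E), assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)  # expm1 keeps precision for very small E

for e in (10.0, 1.0, 1e-5, 1e-40):
    print(f"E = {e:g}  ->  P(at least one chance match) = {chance_of_at_least_one(e):.3g}")
```

For small E-values the two numbers are practically identical, so an E-value of 1e-40 can be read as 'a 1e-40 chance that a match this good is a coincidence'. At E = 1 the chance is about 63%, so such a match tells us very little.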
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA - 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr - 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with every 'A' marked as a chance match]
1-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = (5.0 x 10^8) / ~10^36 = ~5.0 x 10^-28
E-value = 5.0 x 10^-28
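The same arithmetic can be checked with a few lines of Python. This is a sketch of the first-principles estimate only (exact matches, equal base frequencies, no gaps), not of how BLAST itself computes E-values; the function name is invented for the example.

```python
def expected_exact_matches(db_letters, query_length, alphabet_size=4):
    """Chance matches expected for an exact query: database size / alphabet^length."""
    return db_letters / alphabet_size ** query_length

REFSEQ_MRNA_LETTERS = 5.0e8  # rounded figure used on the slide

for length in (1, 2, 3, 60):
    expected = expected_exact_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2}-mer query: {expected:.2g} expected matches")
```

Note that 4^60 is really about 1.3 x 10^36, so the unrounded answer for the 60-mer is nearer 3.8 x 10^-28; the slide rounds 4^60 down to 10^36, which is fine for an order-of-magnitude argument.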
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many occurrences of the motif ACC[TG]TA would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]TA
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4x4x4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64x2 nt = 128 nt
Expect 'TA' every 4x4 = 16 nt
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
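The answer above can be reproduced mechanically. The helper below is a hypothetical sketch: it assumes all four bases are equally likely, looks at one strand only, and understands just plain bases and bracketed alternatives such as [TG].

```python
import re

def motif_spacing(motif):
    """Average number of nt between chance occurrences of a motif in random DNA."""
    spacing = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGT]", motif):
        allowed = len(token) - 2 if token.startswith("[") else 1
        spacing *= 4 / allowed  # a position allowing n of the 4 bases matches n/4 of the time
    return spacing

promoter_length = 10_000
spacing = motif_spacing("ACC[TG]TA")
print(f"one chance occurrence every {spacing:.0f} nt")
print(f"about {promoter_length / spacing:.1f} chance occurrences in {promoter_length} nt")
```

The point of the exercise is that a short motif with a degenerate position will turn up several times in any 10k stretch purely by chance, so finding it is not, on its own, evidence of function.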
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
[Figure: the sequence CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA with every 'A' marked as a chance match]
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
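Because the first-principles E-value is directly proportional to the number of letters searched, a value obtained against one database can be rescaled to another. A minimal sketch (the function name is made up; real BLAST uses an 'effective' search space that also corrects for sequence ends, so treat this as an approximation):

```python
def rescale_e_value(e_value, old_db_letters, new_db_letters):
    """Scale an E-value linearly with database size."""
    return e_value * new_db_letters / old_db_letters

refseq_letters = 5.0e8
nr_letters = 1.4e10
e_nr = rescale_e_value(5.0e-28, refseq_letters, nr_letters)
print(f"database is {nr_letters / refseq_letters:.0f}x bigger, E-value becomes {e_nr:.1e}")
```

This is why E-values from searches against different databases cannot be compared directly without allowing for database size.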
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
[Figure: the query ACGTACGTACGT aligned with gaps]
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
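The effect of allowing gaps can be seen in a small simulation. This is a toy sketch under stated assumptions: the 'database' is random DNA, only a single one-base gap is allowed, and a shorter 8-mer query is used instead of the 12-mer above so that chance hits actually occur in a small random sequence. All names are invented for the example.

```python
import random

def one_gap_variants(query):
    """The query itself plus every version with one interior base skipped."""
    variants = {query}
    for i in range(1, len(query) - 1):
        variants.add(query[:i] + query[i + 1:])
    return variants

def count_hit_positions(sequence, patterns):
    """Count start positions at which at least one pattern matches exactly."""
    return sum(
        1 for i in range(len(sequence))
        if any(sequence.startswith(p, i) for p in patterns)
    )

random.seed(1)  # fixed seed so repeated runs give the same counts
random_db = "".join(random.choice("ACGT") for _ in range(50_000))
short_query = "ACGTACGT"

exact_hits = count_hit_positions(random_db, {short_query})
gapped_hits = count_hit_positions(random_db, one_gap_variants(short_query))
print(f"exact only: {exact_hits} chance hits; one gap allowed: {gapped_hits} chance hits")
```

Every extra pattern that counts as a 'match' adds to the number expected by chance, so the same alignment gets a larger E-value when gaps are permitted.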
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, over a shorter length, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
Query1: CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
        ||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
Sbjct:  CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
Query2: ACGTACGTACGTACGT
        ||| || | |||| ||
Sbjct:  ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
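That both alignments really are 75% identical is easy to confirm. The short function below is an illustrative helper (not part of BLAST) that compares two already-aligned, equal-length sequences position by position.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that are identical (no gaps expected)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, match1))
print(len(query2), percent_identity(query2, match2))
```

Identity alone cannot tell these two cases apart: 45 matches out of 60 and 12 out of 16 give the same percentage, yet only the longer one is unlikely to occur by chance.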
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
[Diagram: BLAST (similarity + probability) versus ORTHOLOGY; labels: biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database...]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-value scale (larger = worse, smaller = better):
E-values:  10^-5    10^-10      10^-40       10^-100      0.0
           fantasy  borderline  encouraging  pretty good  can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value by a factor of 20 (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 ... 20 times) = ~3.3 x 10^-18
But this is what we get if we run the BLAST at NCBI:
[Figure: NCBI BLAST result for the 20 amino acid query]
Really too big a discrepancy to easily explain with hand waving...
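Amino Acid Substitutions
Each residue is followed by the residues that can substitute for it:
A: S;  C: none;  F: L, W, Y;  G: none;  I: L, M, V;  L: I, M, F, V;  M: I, L, V;  P: none;  V: I, L, M;  W: F, Y
N: D, H, S;  Q: R, E, H, K;  S: A, N, T;  T: S;  Y: H, F, W
H: N, Q, Y;  K: R, Q, E;  R: Q, K
D: N, E;  E: D, Q, K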
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted by about 2 others, so about 3 of the 20 amino acids are acceptable at each position: each position has about a 1/7 chance of 'matching' rather than 1/20.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
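The 'about 2 others' figure and the two E-value estimates can be checked directly. The dictionary below is a reading of the substitution table from the Amino Acid Substitutions slide (the grouping of letters into residue and substitutes is an interpretation of the flattened slide text), and the calculation is the same back-of-envelope estimate as above, not BLAST's real scoring with BLOSUM or PAM matrices.

```python
# residue -> residues that can substitute for it (as read from the slide)
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

mean_substitutes = sum(len(subs) for subs in SUBSTITUTES.values()) / len(SUBSTITUTES)
match_chance = (1 + mean_substitutes) / 20  # the residue itself plus its substitutes
print(f"average substitutes per residue: {mean_substitutes:.1f}")
print(f"chance of an acceptable residue at one position: about 1 in {1 / match_chance:.0f}")

REFSEQ_PROTEIN_LETTERS = 347_895_532
query_length = 20
exact_only = REFSEQ_PROTEIN_LETTERS / 20 ** query_length
with_substitutions = REFSEQ_PROTEIN_LETTERS / 7 ** query_length  # the slide's 1-in-7 figure
print(f"exact matches only:     E ~ {exact_only:.1e}")
print(f"allowing substitutions: E ~ {with_substitutions:.1e}")
```

Allowing conservative substitutions raises the estimate by about nine orders of magnitude, which is why protein E-values cannot be guessed from a simple 1-in-20 argument.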
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
[Figure: example blastall command line]
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by '-x' and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if they are not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
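To see how these options fit together on the command line, here is a small Python sketch that assembles a blastall command from the flags listed above. It only builds and prints the command; nothing is run. The helper name, the query file 'query.fa' and the database name 'refseq_rna' are made-up placeholders, and the values chosen for -e, -W and -F are just examples.

```python
import shlex

def blastall_command(program, query_file, database, **options):
    """Build the argument list: required -p/-i/-d first, then any extra single-letter flags."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        args += [f"-{flag}", str(value)]
    return args

# Hypothetical example: blastn with a stricter E-value cut-off,
# an explicit word size, and low-complexity filtering switched off.
command = blastall_command("blastn", "query.fa", "refseq_rna", e="1e-5", W=11, F="F")
print(shlex.join(command))
```

Any parameter left out simply takes the default for the chosen BLAST flavour, so in practice only the handful of options being changed need to be given.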
BLAST Parameters Exercises
1 BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
[Figure panels: filter ON, filter OFF]
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
[Figure panels: filter ON, filter OFF]
BLAST Parameters Exercises
3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the 'Limit by entrez query' box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
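Entrez limits are just text, so they are easy to build up from pieces. The helpers below are hypothetical conveniences written for this example (they are not part of any NCBI tool); they only produce the string you would paste into the query box, using the [ORGN] tag and the AND/OR/NOT operators described above. The species names are arbitrary examples.

```python
def organism(name):
    """Tag a species name as an organism term, e.g. 'Danio rerio[ORGN]'."""
    return f"{name}[ORGN]"

def any_of(*terms):
    """Combine terms with OR, bracketed so they can be nested safely."""
    return "(" + " OR ".join(terms) + ")"

def excluding(query, term):
    """Keep matches for 'query' but drop anything matching 'term'."""
    return f"{query} NOT {term}"

fish_or_frog = any_of(organism("Danio rerio"), organism("Xenopus laevis"))
print(fish_or_frog)
print(excluding(fish_or_frog, organism("Drosophila melanogaster")))
```

The same pattern of OR-ing several organism terms together is what the exercise below needs.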
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
Theoretical/Predicted Sequences
Sequences for a model organism
So What's in the Databases Now?
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are ...
The Many Parameters of BLAST
BLAST Parameters Exercises
Results for Exercise 1
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
END
How does it work The main task of any sequence comparison program is to test all possible mutual alignments of two sequence and see how good the match is
Expand from Word Based Matches We lsquoinstantlyrsquo know which sequences in the database have at least a word length match with our query sequence and at what relative position
Next the potential alignments are expanded adding up a score for (total matches ndash mismatches ndash gap penalties) to make the best possible alignment But this is usually for a tiny proportion of the sequences in the database ndash so overall it is much quicker
The highest scoring alignments are reported
But we can potentially miss alignments with no word-size bits in common consider BLASTn with a default word-size of 11
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B (filtering ON and OFF)
Results for Exercise 2C (filtering ON and OFF)
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
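The way organism terms combine can be sketched in a few lines of Python. This only builds the text that would be typed into the limit box; the helper name organism_query is hypothetical, and the two organisms are arbitrary examples.

```python
def organism_query(*organisms, operator="OR"):
    """Join organism names into an Entrez limit such as 'A[ORGN] OR B[ORGN]'."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    # Each organism name gets the [ORGN] field tag used in the text above.
    terms = [f"{name}[ORGN]" for name in organisms]
    return f" {operator} ".join(terms)


print(organism_query("Drosophila melanogaster", "Xenopus tropicalis"))
```

With OR the search keeps hits from either organism; with NOT the second organism is excluded from the hits of the first.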
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL section, "Align two sequences".
There are several Xenopus tropicalis cyclins in the examples file. Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window. Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is usually done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest scoring alignments are reported.
But we can potentially miss alignments with no word-size bits in common: consider BLASTn with a default word size of 11.
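A toy Python sketch of the word-seeding idea makes the last point concrete. It is not BLAST's own code, only an exact-word lookup under simple assumptions: the query below is an invented sequence that contains no T, and the subject is the same sequence with every 10th base changed to T, so the two are 90% identical yet every 11-letter window of the subject contains a mismatch.

```python
def words(seq, size):
    """Return the set of all overlapping words of the given size in seq."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


def shares_word(query, subject, size):
    """True if the two sequences have at least one exact word in common."""
    return bool(words(query, size) & words(subject, size))


# Invented example sequence (no T), and a copy with a mismatch every 10th base.
query = "ACGGACCAGCAAGGCACGAGCCAGGACCAG"
subject = "".join("T" if (i + 1) % 10 == 0 else base for i, base in enumerate(query))

identity = sum(q == s for q, s in zip(query, subject)) / len(query)
print(f"identity = {identity:.0%}")
print("shares an 11-letter word:", shares_word(query, subject, 11))
print("shares a 7-letter word:", shares_word(query, subject, 7))
```

With a word size of 11 this pair would never be seeded, even though it is obviously similar; a smaller word size finds it, at the cost of many more chance seeds to expand.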
Here is a 'typical' weak alignment from BLASTp:
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment…
E-values
The E-value is the number of matches like the discovered match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value…
Also called "expect value" or "expectation".
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will consider first searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Every A in a stretch of database sequence is a chance match to the one-letter query 'A':
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
     A   A     A   A       A           AA A     A A    AA    AA     A
One-letter query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Two-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Three-letter query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28 (taking 4^60 as roughly 10^36)
E-value = 5.0 x 10^-28
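The same first-principles arithmetic can be written as a short Python sketch. It assumes, as the slide does, equal base frequencies, exact ungapped matches only, and one possible start position per database letter; the function name naive_expected_matches is made up for this illustration.

```python
def naive_expected_matches(database_letters, query_length, alphabet_size=4):
    """Chance matches expected for an exact query of the given length.

    Assumes equal letter frequencies and one possible start per database letter.
    """
    return database_letters / alphabet_size ** query_length


refseq_letters = 503_566_580  # RefSeq mRNA total letters quoted above

for length in (1, 2, 3, 60):
    expected = naive_expected_matches(refseq_letters, length)
    print(f"query length {length:2d}: {expected:.1e} expected matches")
```

Using the exact value of 4^60 (about 1.3 x 10^36) rather than the rounded 10^36 gives a slightly smaller figure for the 60-mer, but the same order of magnitude.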
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA OR ACC[TG]NTA (where N is any base)
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt.
Expect 'ACC' every 4x4x4 = 64 nt.
Expect 'T or G' every 2nd nt.
Expect 'ACC[TG]' every 64x2 nt = 128 nt.
Expect 'TA' every 4x4 = 16 nt.
Expect 'ACC[TG]TA' every 128x16 nt = 2048 nt (4x4x4x2x4x4).
We would expect ~5 of these promoter sites every 10k by chance.
If ACC[TG]NTA is also allowed:
The two motifs independently have the same E-value, because the additional base can be anything and so costs nothing. To allow either means we expect twice as many.
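The answer above can be checked with a short Python sketch. It assumes equal base frequencies, ignores edge effects at the ends of the 10k sequence, and writes the optional extra base as N (any base); that notation and the helper name motif_probability are choices made for this illustration.

```python
import re


def motif_probability(motif):
    """Chance that a random position starts a match to the motif.

    Supports plain bases, N (any base) and bracketed classes such as [TG].
    Assumes all four bases are equally likely.
    """
    probability = 1.0
    for token in re.findall(r"\[[ACGT]+\]|[ACGTN]", motif):
        if token == "N":
            choices = 4                      # any base matches
        elif token.startswith("["):
            choices = len(set(token[1:-1]))  # e.g. [TG] allows 2 bases
        else:
            choices = 1                      # a single fixed base
        probability *= choices / 4
    return probability


promoter_length = 10_000
core = promoter_length * motif_probability("ACC[TG]TA")
spaced = promoter_length * motif_probability("ACC[TG]NTA")
print(f"ACC[TG]TA : {core:.2f} expected per 10k")
print(f"ACC[TG]NTA: {spaced:.2f} expected per 10k")
print(f"either    : {core + spaced:.2f} expected per 10k")
```

Each motif comes out at one site per 2048 nt, so allowing both roughly doubles the expected number of chance sites.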
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
     A   A     A   A       A           AA A     A A    AA    AA     A
One-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Two-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Three-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
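Because the dependence is linear, an E-value from one database can be rescaled to another by the ratio of their sizes. The Python sketch below shows this with the two letter counts quoted earlier; the function name rescale_evalue is hypothetical, and the calculation is the slide's simple model rather than BLAST's own statistics.

```python
def rescale_evalue(evalue, old_db_letters, new_db_letters):
    """Scale an E-value linearly with database size (simple first-principles model)."""
    return evalue * new_db_letters / old_db_letters


refseq_letters = 503_566_580
nr_letters = 14_601_814_750

ratio = nr_letters / refseq_letters
print(f"nr is about {ratio:.0f} times larger than RefSeq mRNA")
print(f"5.0e-28 in RefSeq becomes {rescale_evalue(5.0e-28, refseq_letters, nr_letters):.1e} in nr")
```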
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full-length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
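To see that percent identity really is the same for both alignments, here is a small Python sketch that counts identical positions in the two ungapped pairs shown above. The helper name percent_identity is made up for this example; it handles only equal-length, gap-free alignments.

```python
def percent_identity(a, b):
    """Percentage of aligned positions that are identical (ungapped alignments only)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
hit1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
hit2 = "ACGCACCTTCGTAGGT"

for name, a, b in (("Query1", query1, hit1), ("Query2", query2, hit2)):
    print(f"{name}: length {len(a)} nt, identity {percent_identity(a, b):.0f}%")
```

Identity alone cannot separate the two cases; it says nothing about how long the match is, which is exactly what the E-value takes into account.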
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because:
BLAST gives similarity + probability (match length, PI, size of database…)
ORTHOLOGY depends on biological knowledge, the nature of the query sequence and phylogenetic relationship.
Are there any useful guidelines, though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
(larger = worse, smaller = better)
E-value   Rule of thumb
10^-5     fantasy
10^-10    borderline
10^-40    encouraging
10^-100   pretty good
0.0       can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = (~3.5 x 10^8) / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI:
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Residue: can be substituted by
A: S
C: -
F: L W Y
G: -
I: L M V
L: I M F V
M: I L V
P: -
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = (~3.5 x 10^8) / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
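The 'about 2 others' figure and the two E-value estimates can be reproduced with a short Python sketch. The substitution groups below are read from the flattened table on the slide, so the exact grouping is an assumption; the rest follows the slide's simple model (one chance in 20, or roughly one in 7, per position) and is not how BLOSUM or PAM scoring actually works.

```python
# Substitution groups as read from the slide's table (an assumption where the
# flattened text was ambiguous); an empty string means no listed substitute.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

average_substitutes = sum(len(v) for v in SUBSTITUTES.values()) / len(SUBSTITUTES)
print(f"average substitutes per residue: {average_substitutes:.1f}")

protein_letters = 347_895_532  # RefSeq protein letters quoted above
peptide_length = 20

exact_only = protein_letters / 20 ** peptide_length            # 1/20 chance per position
with_substitutions = protein_letters / 7 ** peptide_length     # ~1/7 chance per position
print(f"exact matches only: {exact_only:.1e}")
print(f"with substitutions: {with_substitutions:.1e}")
```

Allowing substitutions raises the expected number of chance matches by roughly nine orders of magnitude for a 20-residue query, which is the scale of the discrepancy the slide sets out to explain.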
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now?
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem?
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are hellip
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
Expand from Word Based Matches
We 'instantly' know which sequences in the database have at least a word-length match with our query sequence, and at what relative position.
Next, the potential alignments are expanded, adding up a score (total matches - mismatches - gap penalties) to make the best possible alignment. But this is done for only a tiny proportion of the sequences in the database, so overall it is much quicker.
The highest-scoring alignments are reported.
But we can potentially miss alignments with no word-sized stretches in common: consider BLASTn with a default word size of 11.
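The word lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how BLAST itself is implemented: the function names, the toy database and its sequence names are all made up for the example. It shows that a sequence sharing many bases with the query, but no stretch of 11 in a row, produces no seed at word size 11.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word of length word_size to the (name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for offset in range(len(seq) - word_size + 1):
            index[seq[offset:offset + word_size]].append((name, offset))
    return index

def find_seeds(query, index, word_size):
    """Return (query offset, sequence name, database offset) for every exact word hit."""
    seeds = []
    for q_offset in range(len(query) - word_size + 1):
        word = query[q_offset:q_offset + word_size]
        for name, db_offset in index.get(word, []):
            seeds.append((q_offset, name, db_offset))
    return seeds

# Toy database: names and sequences are hypothetical, chosen for illustration.
toy_database = {
    "seqA": "TTGACGTACGTACGTAGGCTA",  # contains a long exact stretch of the query
    "seqB": "ACGCACCTTCGTAGGT",       # similar to the query, but only in short stretches
}
query = "ACGTACGTACGTACGT"

# Word size 11: only seqA gives seeds, so seqB would never be extended.
seeds_11 = find_seeds(query, build_word_index(toy_database, 11), 11)

# Word size 4: seqB now shares at least one word with the query.
seeds_4 = find_seeds(query, build_word_index(toy_database, 4), 4)
```

Only the sequences that appear in the seed list go on to the (much slower) extension step, which is where the speed comes from.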
Here is a 'typical' weak alignment from BLASTp:
Query  RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
        F   K S+ +  C + + N Y  N   +C    P+      + +W    +P     +     D   I    N      M  ++
Sbjct  NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there is no biologically significant alignment...
E-values
The E-value is the number of matches like the discovered match that I would expect to find by chance. It is also called the "expect value" or "expectation".
An E-value of 0.0 implies that I would expect no matches like this to arise by chance, therefore...
An E-value of 1 implies I would expect 1 match like this to arise by chance, so if I have a match with such an E-value...
E-values From First Principles
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA; 272,619 sequences; 503,566,580 total letters (~5.0 x 10^8)
Database: NCBI nr; 3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 10^10)
Notation:
1.2e-35 = 1.2 x 10^-35
4.8 x 10^6 = 4,800,000
We will first consider searching a nucleotide sequence ('ACGTAGACGT') against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein) searches, which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example stretch of sequence (every 'A' in it is a chance match to a one-letter query):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
Query of 1 letter: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
Query of 2 letters: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
Query of 3 letters: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 ... 60 times) = ~(5.0 x 10^8) / 10^36 = ~5.0 x 10^-28
E-value = ~5.0 x 10^-28
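The arithmetic above is easy to check with a short Python sketch. It makes the same simplifying assumptions as the slide (exact matches only, all four bases equally likely); the function name is made up for the example. Note that 4^60 is closer to 1.3 x 10^36 than to 10^36, so the exact figure for the 60-mer comes out a little below the rounded value quoted above.

```python
def naive_expected_matches(database_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches, assuming equally likely, independent letters."""
    return database_letters / alphabet_size ** query_length

REFSEQ_MRNA_LETTERS = 5.0e8  # rounded database size quoted above

for length in (1, 2, 3, 60):
    expected = naive_expected_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"query length {length:2d}: expected chance matches = {expected:.1e}")
```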
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. ACC[TG]TA, OR ACC[TG]TA with one additional base inserted between the [TG] and the TA.
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If the motif with the additional base is also allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
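The same answer can be worked out in Python. This is a sketch under two stated assumptions: all four bases are equally likely, and the optional additional base may be any nucleotide (which is what makes the two motifs equally probable). The motif is written as a list of allowed bases per position; the names are made up for the example.

```python
from math import prod

def chance_per_position(motif):
    """Probability that a match starts at a given position; motif lists the allowed bases per position."""
    return prod(len(allowed) / 4 for allowed in motif)

MOTIF = ["A", "C", "C", "TG", "T", "A"]
# Assumption: the optional extra base between the 4th and 5th positions can be any base.
MOTIF_WITH_EXTRA_BASE = ["A", "C", "C", "TG", "ACGT", "T", "A"]

PROMOTER_LENGTH = 10_000

expected_plain = PROMOTER_LENGTH * chance_per_position(MOTIF)
expected_either = expected_plain + PROMOTER_LENGTH * chance_per_position(MOTIF_WITH_EXTRA_BASE)

print(f"one motif:     ~{expected_plain:.1f} chance sites per 10k")
print(f"either motif:  ~{expected_either:.1f} chance sites per 10k")
```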
E-values: Effect of Database Size
The nr database has ~1.4 x 10^10 letters (RefSeq mRNA had ~5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example stretch of sequence (every 'A' in it is a chance match to a one-letter query):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
Query of 1 letter: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Query of 2 letters: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Query of 3 letters: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA...CTGATTCG' (60-mer): expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 ... 60 times) = ~(1.4 x 10^10) / 10^36 = ~1.4 x 10^-26
E-value = ~1.4 x 10^-26
(was E-value = ~5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each:
RefSeq: 5.0 x 10^8 letters, E-value = 5.0e-28
nr: 1.4 x 10^10 letters (~30 x bigger), E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
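Because the expected number of chance matches is proportional to the number of letters searched, an E-value from one database can be rescaled to another by simple multiplication. The helper below is a hypothetical illustration of that proportionality using the rounded figures above; it is not a function from any BLAST package.

```python
def rescale_evalue(evalue, old_database_letters, new_database_letters):
    """Scale an E-value linearly with database size (same query, same match)."""
    return evalue * new_database_letters / old_database_letters

REFSEQ_LETTERS = 5.0e8   # rounded sizes quoted above
NR_LETTERS = 1.4e10

refseq_evalue = 5.0e-28
nr_evalue = rescale_evalue(refseq_evalue, REFSEQ_LETTERS, NR_LETTERS)
size_ratio = NR_LETTERS / REFSEQ_LETTERS

print(f"nr is {size_ratio:.0f}x bigger, so the E-value becomes {nr_evalue:.1e}")
```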
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches:
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn:
Get a full-length match with sequence XYZ at an E-value = 5.0e-160
Get a match with sequence XYZ again, but at an E-value = 5.0e-80
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~3.0
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance...
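To see that percent identity really is the same for both alignments, it can be counted directly from the two pairs of sequences shown above. The function below is a minimal sketch (the name is made up) that compares two already-aligned strings of equal length.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns where both sequences carry the same letter."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s and q != "-" for q, s in zip(aligned_query, aligned_subject))
    return 100.0 * matches / len(aligned_query)

QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
MATCH1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
MATCH2 = "ACGCACCTTCGTAGGT"

identity_60mer = percent_identity(QUERY1, MATCH1)
identity_16mer = percent_identity(QUERY2, MATCH2)
```

Both values come out at 75%, even though one alignment is far less likely to have arisen by chance than the other: identity alone says nothing about length.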
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because BLAST gives similarity + probability, whereas ORTHOLOGY also involves:
biological knowledge
nature of query sequence
phylogenetic relationship
match length, PI, size of database...
Are there any useful guidelines though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
larger = worse; smaller = better
E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Along the same scale: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind, in cases like this, that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function...
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~(3.5 x 10^8) / (20 x 20 x 20 ... 20 times) = ~3.3 x 10^-18
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving...
Amino Acid Substitutions
Residue: residues that can substitute for it
A: S    C: none    F: L W Y    G: none    I: L M V    L: I M F V    M: I L V    P: none    V: I L M    W: F Y
N: D H S    Q: R E H K    S: A N T    T: S    Y: H F W
H: N Q Y    K: R Q E    R: Q K
D: N E    E: D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~(3.5 x 10^8) / (7 x 7 x 7 ... 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
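The effect of allowing substitutions can be checked numerically. In the sketch below the substitution table is my reading of the residue list above (an assumption, since the original layout is flattened), and the '1 in 7' figure is the slide's own rounding. The code counts the average number of substitutes per residue and compares the exact-match estimate with the substitution-tolerant one.

```python
# Substitutes per residue, as read from the table above (assumed parse of a flattened layout).
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV", "L": "IMFV", "M": "ILV",
    "P": "", "V": "ILM", "W": "FY", "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S",
    "Y": "HFW", "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

average_substitutes = sum(len(subs) for subs in SUBSTITUTES.values()) / len(SUBSTITUTES)

REFSEQ_PROTEIN_LETTERS = 347_895_532  # database size quoted above
QUERY_LENGTH = 20

# Exact matches only: 1 in 20 chance per position.
exact_only = REFSEQ_PROTEIN_LETTERS / 20 ** QUERY_LENGTH
# Allowing substitutions: roughly 1 in 7 chance per position (the slide's rounding).
with_substitutions = REFSEQ_PROTEIN_LETTERS / 7 ** QUERY_LENGTH

print(f"average substitutes per residue: {average_substitutes:.1f}")
print(f"exact matches only:   {exact_only:.1e}")
print(f"with substitutions:   {with_substitutions:.1e}")
```

The two estimates differ by about nine orders of magnitude, which is why substitutability cannot be ignored when interpreting protein E-values.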
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In the simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
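As an illustration of how these options fit together, the Python sketch below assembles a 'blastall' command line from the flags listed above. It only builds and prints the command; it does not run anything. The query file name and database name are hypothetical placeholders, and passing 'F' to switch the low-complexity filter off is assumed to follow blastall's T/F convention.

```python
import shlex

def build_blastall_command(program, query_file, database, **options):
    """Assemble a blastall argument list; each keyword becomes a single-letter flag."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command

# Hypothetical file and database names, for illustration only.
command = build_blastall_command(
    "blastn", "query.fasta", "refseq_mrna",
    e=1e-10,  # -e: max expected value
    W=11,     # -W: word size
    F="F",    # -F: low-complexity filter off (assumed T/F convention)
)
print(shlex.join(command))
```

Keeping the options in one place like this makes it easy to rerun the same search with a single parameter changed, which is what the exercises below do by hand on the web page.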
BLAST Parameters Exercises
1 BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B [figure: results with the filter ON and OFF]
Results for Exercise 2C [figure: results with the filter ON and OFF]
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117) AGAAAAGAAGAAACATGGCAATGGATCAGAA
           |||||||||||||||| ||||||||||||||
Genomic    AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
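One way to put such a query together is to join organism terms with a logical operator. The small Python helper below is a hypothetical convenience (not part of any NCBI tool) that builds the text to paste into the query box; the species names for rat and mouse are my own choice of terms for this exercise.

```python
def entrez_organism_query(organisms, operator="OR"):
    """Join organism names into an Entrez query using the [ORGN] field tag."""
    return f" {operator} ".join(f"{name}[ORGN]" for name in organisms)

# Assumed species names for 'rat and mouse'.
rodent_query = entrez_organism_query(["Mus musculus", "Rattus norvegicus"])
print(rodent_query)
```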
BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
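To get a feel for what the mismatch penalty does in part (ii), the following Python sketch scores an ungapped nucleotide alignment as (matches x reward) + (mismatches x penalty). It is a simplified illustration, not BLAST's own scoring code; the reward of 1 and penalty of -3 are values chosen for the example, and the two short sequences are made up.

```python
def ungapped_score(query, subject, match_reward=1, mismatch_penalty=-3):
    """Score two equal-length sequences column by column, without gaps."""
    if len(query) != len(subject):
        raise ValueError("sequences must have the same length")
    return sum(match_reward if q == s else mismatch_penalty
               for q, s in zip(query, subject))

# Made-up pair of sequences with 12 matching and 4 mismatching positions.
QUERY = "ACGTACGTACGTACGT"
MATCH = "ACGCACCTTCGTAGGT"

strict_score = ungapped_score(QUERY, MATCH, mismatch_penalty=-3)
relaxed_score = ungapped_score(QUERY, MATCH, mismatch_penalty=-1)
print(f"penalty -3: score {strict_score}; penalty -1: score {relaxed_score}")
```

With the harsher penalty the mismatches cancel out the matches entirely, whereas the milder penalty leaves a positive score, so more divergent alignments survive and get reported.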
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV F K S+ + C + + N Y N +C P+ + +W +P + D I N M ++ NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
Here is a lsquotypicalrsquo weak alignment from BLASTp
In fact the sequences were randomly generated so there is no biologically significant alignmenthellip
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-value
The RefSeq mRNA database has ~5.0 x 10^8 letters. There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example stretch of database sequence (the slide marks chance matches to a single 'A'):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1 nt query: expected number of matches = (5.0 x 10^8) / 4 = ~1.2 x 10^8
2 nt query: expected number of matches = (5.0 x 10^8) / (4 x 4) = ~3.1 x 10^7
3 nt query: expected number of matches = (5.0 x 10^8) / (4 x 4 x 4) = ~7.8 x 10^6
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (5.0 x 10^8) / (4 x 4 x 4 x 4 … 60 times) = (5.0 x 10^8) / 10^36 = 5.0 x 10^-28 (taking 4^60 as roughly 10^36)
E-value = 5.0 x 10^-28
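The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the slide's back-of-envelope model (exact, ungapped matches and equal base frequencies), not the way BLAST itself computes E-values, and the function name `expected_chance_matches` is made up for this example. Because 4^60 is really about 1.3 x 10^36 rather than 10^36, the exact result for the 60-mer comes out a little below the rounded 5.0 x 10^-28.

```python
def expected_chance_matches(db_letters, query_length, alphabet_size=4):
    """Expected number of exact chance matches under the naive model."""
    # Each extra query position divides the expectation by the alphabet size.
    return db_letters / alphabet_size ** query_length


REFSEQ_MRNA_LETTERS = 5.0e8  # ~5.0 x 10^8 letters (July 2005 figure from the slide)

for length in (1, 2, 3, 60):
    e_value = expected_chance_matches(REFSEQ_MRNA_LETTERS, length)
    print(f"{length:>2} nt query: {e_value:.1e} expected chance matches")
```

Passing `alphabet_size=20` gives the equivalent naive estimate for an exact protein match.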
E-values In Practice
So if I take a 60 nt sequence
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT
>gi|27469838|gb|BC041710.1| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
E-value Exercise
How many occurrences of the promoter motif ACC[TG]TA would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions, i.e. either the motif as given or the same motif with one extra base inserted?
E-value Exercise Answer
Motif: ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If the variant with the additional base is also allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
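The motif answer can be checked with a short Python sketch. It assumes equal base frequencies, counts one strand only and ignores edge effects; the helper name `expected_motif_hits` and the simple bracket syntax it parses are assumptions for this illustration.

```python
import re


def expected_motif_hits(motif, sequence_length):
    """Expected chance occurrences of a simple motif in random sequence."""
    # Tokens are single bases or bracketed alternatives such as [TG].
    tokens = re.findall(r"\[[ACGT]+\]|[ACGT]", motif)
    probability = 1.0
    for token in tokens:
        allowed_bases = len(set(token.strip("[]")))
        probability *= allowed_bases / 4
    return sequence_length * probability


hits = expected_motif_hits("ACC[TG]TA", 10_000)
print(f"{hits:.2f} expected chance hits in 10 kb")
```

A position that accepts any base would be written `[ACGT]` and contributes a factor of 1, which is why a variant with one unconstrained extra base has the same expectation as the original motif.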
E-values Effect of Database Size
The nr database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1 nt query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2 nt query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3 nt query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG', a 60-mer:
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26 (was E-value = 5.0 x 10^-28)
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30 x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28, but our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~5.0 x 10^-19.
And Query2 was only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~3.0.
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
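That both alignments really do share the same identity can be confirmed with a small Python sketch. The function name `percent_identity` is chosen for this example, and it simply counts matching columns in two equal-length aligned strings; it says nothing about significance, which is the point of the comparison.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
HIT1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
HIT2 = "ACGCACCTTCGTAGGT"

print(len(QUERY1), percent_identity(QUERY1, HIT1))
print(len(QUERY2), percent_identity(QUERY2, HIT2))
```

Identity alone cannot separate the two cases; only the match length, and hence the E-value, differs.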
So what's the real problem?
Basically, you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because:
[Diagram: BLAST gives similarity + probability, whereas ORTHOLOGY also draws on biological knowledge, nature of query sequence, phylogenetic relationship, match length, PI, size of database…]
Are there any useful guidelines, though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
E-value scale, from larger (worse) to smaller (better): 10^-5, 10^-10, 10^-40, 10^-100, 0.0
Labels along the scale, in the same direction: fantasy, borderline, encouraging, pretty good, can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving…
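Amino Acid Substitutions
Residue: residues that can substitute for it
A: S
C: (none)
F: L W Y
G: (none)
I: L M V
L: I M F V
M: I L V
P: (none)
V: I L M
W: F Y
N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W
H: N Q Y
K: R Q E
R: Q K
D: N E
E: D Q K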
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching', rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
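As a rough check on this reasoning, the following Python sketch encodes the substitution table from the slide and recomputes both naive estimates. The dictionary layout and the names `SUBSTITUTES` and `naive_e_value` are assumptions for this illustration; real BLAST scoring uses BLOSUM or PAM scores rather than a yes/no table.

```python
# Substitution table as read from the slide: residue -> acceptable substitutes.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

mean_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)


def naive_e_value(db_letters, query_length, choices_per_position):
    """Expected chance matches if each position matches 1 time in choices_per_position."""
    return db_letters / choices_per_position ** query_length


REFSEQ_PROTEIN_LETTERS = 3.5e8  # ~347,895,532 letters, rounded as on the slide

exact_only = naive_e_value(REFSEQ_PROTEIN_LETTERS, 20, 20)
with_substitutions = naive_e_value(REFSEQ_PROTEIN_LETTERS, 20, 7)

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"exact matches only:           {exact_only:.1e}")
print(f"allowing substitutions:       {with_substitutions:.1e}")
```

The table averages a little over 2 substitutes per residue, which is where the "about 1/7th" chance per position comes from (roughly 3 acceptable residues out of 20).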
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options.
In its simplest form the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
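A minimal sketch of assembling such a command line from Python is shown here. The file name `query.fasta` and database name `refseq_rna` are placeholders invented for this example, and the helper `build_blastall_command` is not part of BLAST; only the -p, -i, -d and -e flags come from the workshop material.

```python
import shlex


def build_blastall_command(program, query_file, database, extra_options=None):
    """Return the blastall argument list for the three basic options plus any extras."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command


# Hypothetical inputs: a FASTA query file and a pre-indexed database name.
command = build_blastall_command("blastn", "query.fasta", "refseq_rna", {"-e": 10})
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run(command)` on a machine where blastall and the database are installed; nothing is executed here.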
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1 BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options, then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2 Low complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn, low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low complexity filtering ON
Results for Exercise 2B: filtering ON vs OFF
Results for Exercise 2C: filtering ON vs OFF
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117):
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence (lower row)
BLAST Parameters Exercises
3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
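Query strings of this kind are easy to assemble programmatically. The sketch below is purely illustrative: the helper name `organism_filter` is invented, and it only builds the text that would be pasted into the "Limit by entrez query" box, using the [ORGN] field tag and OR operator described above.

```python
def organism_filter(*organisms):
    """Join organism names into an Entrez query that accepts any of them."""
    terms = [f"{name}[ORGN]" for name in organisms]
    return " OR ".join(terms)


print(organism_filter("Drosophila melanogaster"))
print(organism_filter("Drosophila melanogaster", "Xenopus tropicalis"))
```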
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL, "Align two sequences" section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx. What does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST: Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
E-values
The number of matches like the discovered match that I would expect to find by chance
An E-value of 00 implies that I would expect no matches like this to arise by chance thereforehellip
An E-value of 1 implies I would expect 1 match like this to arise by chance so if I have a match with such an E-valuehellip
Also ldquoexpect valueldquo or ldquoexpectationrdquo
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So Whatrsquos in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
lsquoOtherrsquo-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST ndashTypical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So whatrsquos the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are hellip
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
E-values From First Principles
Some database statistics (23rd July 2005)
Database NCBI RefSeq mRNA 272619 sequences 503566580 total letters (~50 x 108)
Database NCBI nr 3329110 sequences 14601814750 total letters (~14 x 1010)
Notation
12e-35 = 12 x 10-35
48 x 106 = 4800000
We will consider first searching a nucleotide sequence (lsquoACGTAGACGTrsquo) against a nucleotide database eg the RefSeq mRNA above
Then we will consider the more complex case of amino acid sequence (protein) searches Which is of course what we mostly do
Calculating an E-valueThe RefSeq mRNA database has ~ 50 x 108 letters There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (50 x 108) 4 = ~12 x 108
Expected number of matches = (50 x 108) (4x 4) = ~31 x 107
Expected number of matches = (50 x 108) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (50 x 108) (4 x 4 x 4 x 4 hellip 60 times ) = (50 x 108) 1036 = 50 x 10-28
E-value = 50 x 10-28
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values: Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
(the slide marks the positions of 'A' in this sequence)
Query of 1 nt: Expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
Query of 2 nt: Expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
Query of 3 nt: Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' (60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
E-values: Effect of Database Size
The E-value is simply dependent on database size.
RefSeq: 5.0 x 10^8 letters
nr: 1.4 x 10^10 letters (~30x bigger)
BLAST the same sequence against each:
RefSeq: E-value = 5.0e-28
nr: E-value = 1.4e-26
The database was ~30 times bigger, and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database (BLASTn): get a full-length match with sequence XYZ at an E-value = 5.0e-160.
Get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
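To confirm that both alignments really are 75% identical despite their very different E-values, here is a minimal Python sketch. It assumes ungapped alignments of equal length, and the helper name `percent_identity` is just for illustration.

```python
def percent_identity(query, subject):
    """Percent identity of two aligned, ungapped sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


# The two alignments shown above.
query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
subject1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
subject2 = "ACGCACCTTCGTAGGT"

print(len(query1), percent_identity(query1, subject1))
print(len(query2), percent_identity(query2, subject2))
```

Identity alone cannot tell these two cases apart; the match length is what separates them.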
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
Are there any useful guidelines though, at least for biological meaningfulness?
The difficulty is because:
[Diagram: BLAST (similarity + probability; match length, PI, size of database…) versus ORTHOLOGY (biological knowledge, nature of query sequence, phylogenetic relationship)]
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
[Scale, from larger/worse to smaller/better. E-values: 10^-5, 10^-10, 10^-40, 10^-100, 0.0. Labels along the scale, from worse to better: fantasy, borderline, encouraging, pretty good, can't get better]
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI:
Really too big a discrepancy to easily explain with hand waving…
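Amino Acid Substitutions
Residue: can be substituted by
A: S
C: (none)
F: L, W, Y
G: (none)
I: L, M, V
L: I, M, F, V
M: I, L, V
P: (none)
V: I, L, M
W: F, Y
N: D, H, S
Q: R, E, H, K
S: A, N, T
T: S
Y: H, F, W
H: N, Q, Y
K: R, Q, E
R: Q, K
D: N, E
E: D, Q, K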
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching', rather than 1/20th.
So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
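The effect of substitutability on the naive estimate can be sketched in Python. The 'effective alphabet' of 7 is simply the rough 1/7th chance quoted above; real BLAST scores each residue pair from a substitution matrix, so this is a back-of-envelope illustration only, and `naive_protein_evalue` is a made-up name.

```python
def naive_protein_evalue(db_letters, match_len, effective_alphabet):
    """Naive expected chance matches: db size / effective_alphabet ** match_len."""
    return db_letters / (effective_alphabet ** match_len)


refseq_protein_letters = 347_895_532  # figure quoted above

# Exact matches only: 1/20 chance per position.
exact_only = naive_protein_evalue(refseq_protein_letters, 20, 20)

# Allowing about 2 substitutes per residue: roughly 1/7 chance per position.
with_substitutions = naive_protein_evalue(refseq_protein_letters, 20, 7)

print(f"exact matches only:   {exact_only:.1e}")
print(f"with substitutions:   {with_substitutions:.1e}")
print(f"ratio between the two: {with_substitutions / exact_only:.1e}")
```

A modest change in the per-position chance, compounded over 20 positions, shifts the estimate by about nine orders of magnitude.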
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
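As a rough illustration of that simplest form, the Python sketch below assembles a 'blastall' command line from the three options just described. The query file name and database name are placeholders invented for the example, and the sketch only builds and prints the command; it does not run BLAST.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the simplest blastall argument list: flavour, query file, database."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# "query.fasta" and "refseq_rna" are hypothetical names used for illustration.
cmd = blastall_command("blastn", "query.fasta", "refseq_rna")
print(shlex.join(cmd))
```

Building the command as a list keeps each option next to its value, which makes it easy to append further parameters later.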
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx
This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A
This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
[Panels: ON, OFF]
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)
AGAAAAGAAGAAACATGGCAATGGATCAGAA
|||||||||||||||| ||||||||||||||
AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
[Panels: ON, OFF]
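One way to see why the extra G in Exercise 2C matters to annotators is that a single inserted base shifts every downstream codon. The Python sketch below splits the two aligned fragments into triplets; starting the reading frame at the first base shown is an arbitrary assumption made purely for illustration, since the true frame of this fragment is not given.

```python
def codons(seq):
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Fragments from the alignment above (gap character removed from the genomic copy).
cdna = "AGAAAAGAAGAAACATGGCAATGGATCAGAA"
genomic = "AGAAAAGAAGAAACATGCAATGGATCAGAA"

cdna_codons = codons(cdna)
genomic_codons = codons(genomic)

# Count how many leading codons agree before the extra base disturbs the frame.
shared = 0
for c, g in zip(cdna_codons, genomic_codons):
    if c != g:
        break
    shared += 1

print("cDNA codons:   ", " ".join(cdna_codons))
print("genomic codons:", " ".join(genomic_codons))
print("codons in common before the frames diverge:", shared)
```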
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
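If you build such queries often, a tiny helper can assemble the text to paste into the box. This Python sketch only formats a string using the [ORGN] field and OR operator described above; the function name and the example organisms are illustrative assumptions.

```python
def organism_query(*organisms):
    """Join organism names into an Entrez query using the [ORGN] field and OR."""
    return " OR ".join(f"{name}[ORGN]" for name in organisms)


# A single organism, as in the fruit fly example above.
print(organism_query("Drosophila melanogaster"))

# Two organisms combined with OR (example species chosen for illustration).
print(organism_query("Xenopus tropicalis", "Danio rerio"))
```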
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
E-values In PracticeSo if I take a 60 nt sequencegtsequenceACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA and actually BLAST it against the RefSeq mRNA database I get
What do I get if I BLAST it against the larger nr databaseBLAST OUTPUTgtgi|27469838|gb|BC0417101| Homo sapiens Rap guanine nucleotide exchange factor (GEF) 1 transcript variant 2 mRNA (cDNA clone MGC49019 IMAGE6051007) complete cds
How many would you expect to find by chance in a 10k promoter sequence
How would this differ if there was an optional additional base between the 4th and 5th positionsIeACC[TG]TAOR ACC[TG]TA
E-value Exercise AnswerACC[TG]TA
Expect lsquoArsquo every 4 ntExpect lsquoACCrsquo every 4x4x4 = 64 nt
Expect lsquoT or Grsquo every 2nd ntExpect lsquoACC[TG]rsquo every 64x2 nt = 128 nt
Expect lsquoTArsquo every 4x4 = 16 ntExpect lsquoACC[TG]TArsquo every 128x16 nt = 2048 nt (4x4x4x2x4x4)We would expect ~5 of these promoter sites every 10k by chance
If also ACC[TG]TAA allowed
The two motifs independently have the same E-valueTo allow either means we expect twice as many
E-values Effect of Database SizeThe nr mRNA database has ~ 14 x 1010 letters (was RefSeq and 50 x108)There are 4 possible nucleotides - ACGT How many matches do we expect to find by chance
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA A A A A AA A A A AA AA
Expected number of matches = (14 x 1010) 4 = ~12 x 108
Expected number of matches = (14 x 1010) (4x 4) = ~31 x 107
Expected number of matches = (14 x 1010) (4 x 4 x 4) = ~81 x 106
Query = lsquoACGTCGAhellipCTGATTCGrsquo - 60-mer
Expected number of matches = (14 x 1010) (4 x 4 x 4 x 4 hellip 60 times ) = (14 x 1010) 1036 = 14 x 10-26
E-value = 14 x 10-26
(was E-value = 50 x 10-28)
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value by a factor of 20 (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the blast at NCBI:
[Figure: NCBI BLAST result]
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Residue    Can be substituted by
A          S
C          -
F          L W Y
G          -
I          L M V
L          I M F V
M          I L V
P          -
V          I L M
W          F Y
N          D H S
Q          R E H K
S          A N T
T          S
Y          H F W
H          N Q Y
K          R Q E
R          Q K
D          N E
E          D Q K
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average, any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
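The two back-of-envelope protein estimates can be reproduced with a few lines of Python. This is only a sketch of the slide arithmetic, not of what BLAST actually computes: the function name expected_matches is made up for illustration, and the 1/20 and 1/7 per-position chances are the rough figures used in the text.

```python
DB_LETTERS = 347_895_532  # letters in the RefSeq protein database, as quoted on the slide
QUERY_LENGTH = 20         # length of the amino acid query


def expected_matches(db_letters: int, query_length: int, match_chance: float) -> float:
    """Expected chance matches if every position matches independently."""
    return db_letters * match_chance ** query_length


# Exact matching: each position has a 1 in 20 chance.
exact = expected_matches(DB_LETTERS, QUERY_LENGTH, 1 / 20)

# Allowing substitutions: roughly 3 acceptable residues out of 20, i.e. about 1 in 7.
with_substitutions = expected_matches(DB_LETTERS, QUERY_LENGTH, 1 / 7)

print(f"exact: {exact:.1e}   with substitutions: {with_substitutions:.1e}")
```

The exact-match figure lands within rounding of the ~3.5 x 10^-18 above (the slide rounds 20^20 to 10^26), and the substitution-tolerant figure reproduces ~4.4 x 10^-9.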
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes, etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
[Figure: command-line example]
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure, even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e  max expected value
-m  output format (graphical or tabular/spreadsheet)
-F  filter query sequence for low complexity (default TRUE)
-U  use only upper case regions of query (default FALSE)
-G  gap opening cost
-E  gap extension cost
-q  nucleotide mismatch penalty (BLASTx uses matrices)
-r  nucleotide match reward
-b  number of matching sequences to report
-g  allow gaps (default TRUE)
-W  word size
-z  effective database size (removes effect of actual database size)
-S  query strands to search (default both directions)
-l  restrict database sequences to given list of 'gi' numbers
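To make the shape of a blastall command line concrete, here is a small Python sketch that assembles one from the three basic options plus any of the extra parameters listed above. It only builds and prints the command; it does not run BLAST. The helper name build_blastall_command and the file name query.fasta are hypothetical, and the chosen values for -e and -W are arbitrary examples.

```python
import shlex


def build_blastall_command(program: str, query_file: str, database: str, **options) -> list:
    """Assemble a blastall argument list: the three basic options, then any extras."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        # Each extra parameter is a single-letter flag followed by its value.
        command += [f"-{flag}", str(value)]
    return command


# Hypothetical example: blastn against nr with a stricter E-value limit
# and the default blastn word size written out explicitly.
cmd = build_blastall_command("blastn", "query.fasta", "nr", e="1e-10", W=11)
print(shlex.join(cmd))
```

Anything not passed in simply stays at blastall's default for the requested flavour, which is the behaviour described above.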
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises
2. Low-complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
[Figure: results with filtering ON and OFF]
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
[Figure: results with filtering ON and OFF]
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
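If you build these limits often, the combining rule is easy to script. The sketch below is only an illustration of how [ORGN] terms can be joined with OR and NOT; the helper name entrez_organism_filter is hypothetical, and the organisms are just the two species already mentioned in this workshop.

```python
def entrez_organism_filter(organisms, exclude=()):
    """Build an Entrez limit string from organism names using OR and NOT."""
    included = " OR ".join(f"{name}[ORGN]" for name in organisms)
    # Parenthesise multiple organisms so that a following NOT applies to the whole group.
    query = f"({included})" if len(organisms) > 1 else included
    for name in exclude:
        query += f" NOT {name}[ORGN]"
    return query


single = entrez_organism_filter(["Drosophila melanogaster"])
combined = entrez_organism_filter(["Drosophila melanogaster", "Xenopus tropicalis"])
print(single)
print(combined)
```

The resulting string is what you would paste into the "Limit by entrez query" box.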
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now?
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work?
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant?
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different?
E-values Effect of Query Length
Why not just use identity?
So what's the real problem?
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
E-value Exercise
Given a transcription factor binding site:
ACC[TG]TA
How many would you expect to find by chance in a 10k promoter sequence?
How would this differ if there was an optional additional base between the 4th and 5th positions? I.e. either ACC[TG]TA, or the same motif with one extra base inserted after the [TG] position.
E-value Exercise Answer
ACC[TG]TA
Expect 'A' every 4 nt
Expect 'ACC' every 4 x 4 x 4 = 64 nt
Expect 'T or G' every 2nd nt
Expect 'ACC[TG]' every 64 x 2 nt = 128 nt
Expect 'TA' every 4 x 4 = 16 nt
Expect 'ACC[TG]TA' every 128 x 16 nt = 2048 nt (4 x 4 x 4 x 2 x 4 x 4)
We would expect ~5 of these promoter sites every 10k by chance.
If also ACC[TG]TAA allowed:
The two motifs independently have the same E-value. To allow either means we expect twice as many.
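The same counting argument can be written as a short Python sketch. It assumes, as the slide does, that all four bases are equally likely and independent, that only one strand is scanned, and it ignores edge effects at the ends of the sequence. The function names are made up for illustration, and only the simple bracket notation used above is understood.

```python
def motif_probability(motif: str, alphabet_size: int = 4) -> float:
    """Chance that a random position matches a motif such as 'ACC[TG]TA'."""
    probability = 1.0
    i = 0
    while i < len(motif):
        if motif[i] == "[":
            # A bracketed group matches any one of the listed bases.
            close = motif.index("]", i)
            allowed = len(set(motif[i + 1:close]))
            i = close + 1
        else:
            allowed = 1
            i += 1
        probability *= allowed / alphabet_size
    return probability


def expected_hits(motif: str, sequence_length: int) -> float:
    """Expected number of chance occurrences in a random sequence."""
    return sequence_length * motif_probability(motif)


spacing = 1 / motif_probability("ACC[TG]TA")  # one site expected every 2048 nt
hits = expected_hits("ACC[TG]TA", 10_000)     # a little under 5 per 10k
print(spacing, round(hits, 1))
```

Allowing a second, equally likely motif simply adds a second expected_hits term, which is why the expectation doubles.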
E-values: Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (was RefSeq and 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
For a 1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
For a 2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
For a 3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' - 60-mer
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / 10^36 = 1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
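A minimal Python sketch of this first-principles estimate, for exact ungapped matches only. The function name expected_exact_matches is illustrative, and the database sizes are the approximate figures quoted above. Note that the slide rounds 4^60 to 10^36, so the values computed here come out slightly smaller than the slide's.

```python
def expected_exact_matches(db_letters: float, query_length: int, alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches of a query in a random database."""
    return db_letters / alphabet_size ** query_length


NR_LETTERS = 1.4e10     # approximate size of the nr mRNA database
REFSEQ_LETTERS = 5.0e8  # approximate size of the RefSeq mRNA database

nr_evalue = expected_exact_matches(NR_LETTERS, 60)
refseq_evalue = expected_exact_matches(REFSEQ_LETTERS, 60)
ratio = nr_evalue / refseq_evalue

print(f"nr: {nr_evalue:.1e}   RefSeq: {refseq_evalue:.1e}   ratio: {ratio:.0f}")
```

The ratio of the two estimates is just the ratio of the database sizes (about 28, or roughly 30x), whatever the query length.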
E-values: Effect of Database Size
The E-value is simply dependent on database size.
Database    Size                   E-value for the same sequence
RefSeq      5.0 x 10^8 letters     5.0e-28
nr          1.4 x 10^10 letters    1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160.
BLAST a shorter query against the same database: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
Theoretical/Predicted Sequences
Sequences for a model organism
So What's in the Databases Now?
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work?
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant?
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values: Effect of Database Size
Slide 58
Why were the values different?
E-values: Effect of Query Length
Why not just use identity?
So what's the real problem?
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
E-values: Effect of Database Size
The nr mRNA database has ~1.4 x 10^10 letters (it was RefSeq, with 5.0 x 10^8). There are 4 possible nucleotides: A, C, G, T. How many matches do we expect to find by chance?
Example database sequence (on the slide, each chance match to the single letter 'A' is marked beneath it):
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGA
1-letter query: expected number of matches = (1.4 x 10^10) / 4 = ~3.5 x 10^9
2-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4) = ~8.8 x 10^8
3-letter query: expected number of matches = (1.4 x 10^10) / (4 x 4 x 4) = ~2.2 x 10^8
Query = 'ACGTCGA…CTGATTCG' (a 60-mer)
Expected number of matches = (1.4 x 10^10) / (4 x 4 x 4 x 4 … 60 times) = (1.4 x 10^10) / ~10^36 = ~1.4 x 10^-26
E-value = 1.4 x 10^-26
(was E-value = 5.0 x 10^-28)
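As a rough check on this first-principles arithmetic, here is a minimal Python sketch of the same estimate. The function name expected_chance_matches is made up for illustration, and the model (all four letters equally likely, exact matches only, no gaps) is the simplification used in the slides, not what BLAST itself computes. Because the code uses 4^60 exactly rather than rounding it to 10^36, the 60-mer figure comes out a little below the hand-calculated one.

```python
def expected_chance_matches(db_letters: float, query_len: int, alphabet_size: int = 4) -> float:
    """Expected number of exact chance matches under the equal-frequency model."""
    return db_letters / (alphabet_size ** query_len)


NR_LETTERS = 1.4e10      # nr mRNA database size quoted in the workshop
REFSEQ_LETTERS = 5.0e8   # earlier RefSeq size quoted in the workshop

# Each extra query letter divides the expected count by 4.
for n in (1, 2, 3, 60):
    print(n, f"{expected_chance_matches(NR_LETTERS, n):.2e}")

# The query is the same, so the two estimates differ only by the database sizes.
ratio = expected_chance_matches(NR_LETTERS, 60) / expected_chance_matches(REFSEQ_LETTERS, 60)
print(f"nr / RefSeq ratio: {ratio:.0f}")
```

The ratio depends only on the two database sizes, which is the point of the next slide.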
E-values: Effect of Database Size
The E-value is simply dependent on database size.
BLAST the same sequence against each database:
Database    Size (letters)    E-value
RefSeq      5.0 x 10^8        5.0e-28
nr          1.4 x 10^10       1.4e-26
The nr database is ~30 times bigger, and so the E-value is ~30 times bigger.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was 5.0 x 10^-28. But our actual BLAST search at NCBI gave a value of 2.0 x 10^-26, about 40x larger. Why is this?
Gapped alignments
If we were expecting N matches for a query sequence 'ACGTACGTACGT', imagine what would happen to N if we allowed gaps in our matches.
ACGTACGTACGT
This would now give us additional possible alignments that would meet our 'match' criteria.
We will expect many more matches in a given database if we allow our alignments to have gaps. The E-value will be larger.
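To make the effect of gaps concrete, the sketch below counts how many extra target sequences would count as a 'match' if we allowed just one single-base gap in the query. The helper name single_gap_variants is hypothetical, and this models only the simplest case; real gapped BLAST allows gaps in either sequence and charges gap penalties rather than simply accepting them.

```python
def single_gap_variants(query: str) -> set:
    """All distinct sequences obtained by deleting exactly one base from the query."""
    return {query[:i] + query[i + 1:] for i in range(len(query))}


query = "ACGTACGTACGT"
variants = single_gap_variants(query)

# Without gaps there is one acceptable target; each variant is an additional one.
print(f"exact target: 1, extra targets with one single-base gap: {len(variants)}")
```

More acceptable targets means more chance matches in the same database, hence the larger E-value.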
E-values: Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full-length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query from the same sequence: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity?
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10^-19
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 3.0
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
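The two alignments can be checked with a few lines of Python. This is only a sketch: percent_identity is an illustrative helper, and it assumes the two sequences are already aligned, the same length, and free of gaps, as they are in the two examples here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which the two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


query1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
match1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
query2 = "ACGTACGTACGTACGT"
match2 = "ACGCACCTTCGTAGGT"

for name, q, m in (("Query1", query1, match1), ("Query2", query2, match2)):
    print(name, len(q), "nt,", percent_identity(q, m), "% identity")
```

Both alignments score the same identity, so identity alone cannot separate the convincing 60 nt match from the 16 nt match we would expect by chance.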
So what's the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?
The difficulty is because
[Diagram: ORTHOLOGY versus BLAST (Similarity + Probability). Labels: biological knowledge; nature of query sequence; phylogenetic relationship; match length, PI, size of database…]
Are there any useful guidelines, though, at least for biological meaningfulness?
Rules of Thumb
How good does an E-value have to be before we might even think we have an ortholog?
(larger = worse, smaller = better)
E-value    Rule of thumb
10^-5      fantasy
10^-10     borderline
10^-40     encouraging
10^-100    pretty good
0.0        can't get better
But note that in some gene families with closely related members you can get an E-value of 0.0 for several different matches, and then identity may be more sensitive. Also bear in mind in cases like this that ideas of 'functional' orthology may break down, with more than one locus producing identical proteins which share the same function…
Protein BLAST
It's (nearly) always better to make comparisons at the amino acid level, between protein sequences, than at the DNA level, because the amino acid sequence is more conserved than the underlying DNA sequence. Does this cause us to treat expected values any differently?
If we follow the argument as before, then for an exact match of a 20 amino acid sequence in the RefSeq protein database, each additional amino acid will reduce the E-value to 1/20th of its value (there are 20 different amino acids). And as there are 347,895,532 letters in that database:
E-value = ~3.5 x 10^8 / (20 x 20 x 20 … 20 times) = ~3.5 x 10^-18
But this is what we get if we run the BLAST at NCBI: [NCBI BLAST result shown on the slide]
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
Each residue is followed by the residues that can substitute for it ('-' means none):
A: S | C: - | F: LWY | G: - | I: LMV | L: IMFV | M: ILV | P: - | V: ILM | W: FY
N: DHS | Q: REHK | S: ANT | T: S | Y: HFW
H: NQY | K: RQE | R: QK
D: NE | E: DQK
In fact we need to take into account amino acid substitutability as well as, as before, allowing gapped alignments. On average any residue can be substituted for by about 2 others, so about 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
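The two protein estimates can be reproduced with the sketch below. The SUBSTITUTES dictionary is one reading of the substitution table on the slide (residues with nothing listed are given an empty string), and the variable names are illustrative. As in the slides, this is back-of-envelope reasoning, not the statistics BLAST actually uses.

```python
# Substitution table as read from the slide: residue -> residues that may replace it.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

DB_LETTERS = 347_895_532  # RefSeq protein database size quoted in the workshop
QUERY_LEN = 20

# Average number of acceptable substitutes per residue (the slide rounds this to 2).
mean_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)

exact_only = DB_LETTERS / 20 ** QUERY_LEN          # 1/20 chance per position
with_substitution = DB_LETTERS / 7 ** QUERY_LEN    # ~1/7 chance per position

print(f"mean substitutes per residue: {mean_substitutes:.1f}")
print(f"exact matches only:  {exact_only:.1e}")
print(f"with substitutions:  {with_substitution:.1e}")
```

Allowing substitutions moves the estimate by roughly nine orders of magnitude, which is why the exact-match calculation was so far from the real BLAST result.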
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA -> protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options.
In its simplest form the basic program 'blastall' takes a number of different options or parameters, each indicated by -x and followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
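To show how these options fit together on the command line, here is a small Python sketch that assembles (but does not run) a legacy 'blastall' command. The helper build_blastall_command, the query file name my_query.fasta and the chosen option values are all made up for illustration, and the T/F spelling for -F is an assumption about the legacy convention; check the help output of your own installation before relying on any of it.

```python
import shlex


def build_blastall_command(program, query_file, database, extra_options=None):
    """Assemble a blastall argument list: the three basic options plus any overrides."""
    command = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in (extra_options or {}).items():
        command += [flag, str(value)]
    return command


# Hypothetical example: BLASTx against nr with a stricter E-value limit
# and the low-complexity filter switched off.
cmd = build_blastall_command(
    "blastx", "my_query.fasta", "nr",
    extra_options={"-e": "1e-10", "-F": "F"},
)
print(shlex.join(cmd))
```

Anything not listed in extra_options is simply left out of the command, so the program falls back to its default for that parameter.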
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
2. Low complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case - what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn - low complexity filtering ON
Results for Exercise 2B (panels: filtering ON, filtering OFF)
Results for Exercise 2C (panels: filtering ON, filtering OFF)
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)  AGAAAAGAAGAAACATGGCAATGGATCAGAA
            |||||||||||||||| ||||||||||||||
Genomic     AGAAAAGAAGAAACAT-GCAATGGATCAGAA
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
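The way organism terms and logical operators combine into one query string can be sketched in Python. The helpers organism_term and combine are hypothetical, and the second organism is picked only as an example so as not to give away the exercise; the resulting text is what you would paste into the Limit by entrez query box.

```python
def organism_term(name: str) -> str:
    """Tag an organism name with the [ORGN] field used in the workshop example."""
    return f"{name}[ORGN]"


def combine(terms, operator="OR"):
    """Join Entrez terms with one of the logical operators AND, OR or NOT."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)


query = combine([
    organism_term("Drosophila melanogaster"),
    organism_term("Xenopus tropicalis"),
])
print(query)
```

Using OR widens the search to either organism, whereas AND between two different organisms would match nothing.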
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment.
Now run again using tBLASTx - what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs - or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping - start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
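As background for part (ii), the sketch below shows how the match reward and mismatch penalty turn an ungapped alignment into a raw score. The function ungapped_score is illustrative, and the default values of +1 and -3 are an assumption about the BLASTn defaults rather than something stated in the workshop. It scores the whole alignment end to end, whereas BLAST reports the best-scoring local segment, so treat it as intuition only. The sequences are the 16 nt, 75% identity pair from the identity slide.

```python
def ungapped_score(query: str, subject: str, reward: int = 1, penalty: int = -3) -> int:
    """Sum of match rewards and mismatch penalties over an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    return sum(reward if q == s else penalty for q, s in zip(query, subject))


query = "ACGTACGTACGTACGT"
subject = "ACGCACCTTCGTAGGT"

strict = ungapped_score(query, subject)                # assumed default penalty of -3
relaxed = ungapped_score(query, subject, penalty=-1)   # penalty reduced as in part (ii)
print("penalty -3:", strict, " penalty -1:", relaxed)
```

With a harsh mismatch penalty a 75% identical stretch scores nothing, while the milder penalty lets the same alignment score positively, which is why more divergent matches start to appear.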
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So Whatrsquos in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
lsquoOtherrsquo-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST ndashTypical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So whatrsquos the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are hellip
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
E-values Effect of Database Size
The E-value is simply dependent on database size
RefSeq
nr
14 x 1010 letters
50 x108 letters
30 x bigger
BLAST the same sequenceagainst each E-value = 14e-26
E-value = 50e-28
The database was ~30 times bigger and so the E-value was ~30 times bigger
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequenceshtmlCopy the sequence gtcyclin-D1-Xt and go to the NCBI BLAST Home Page TRANSLATED BLAST sectionBLASTx and paste the sequence
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1 At what E-value do we expect we are no longer looking at cyclins Try running the search again with that E-value as a limithellip
BLAST Parameters Exercises4 BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequenceshtml
Also open the NCBI BLAST Home PageSPECIAL ndash Align two sequences section
There are several Xenopus tropicalis cyclins in the examples fileCopy the sequence gtcyclin-A1-Xt to the Sequence 1 BLAST windowCopy the sequence gtcyclin-A2-Xt to the Sequence 2 BLAST window(i) Run the default comparison should be BLASTn Note the alignmentNow run again using tBLASTx ndash what does this do to our understanding of the relationship between these two sequences Are they homologs orthologs or paralogs ndash or none of these
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping ndash start by reducing the mismatch penalty to -1 Then try reducing the gap open and gap extension penaltieshellipWhat do we learn from this
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2hellip
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So Whatrsquos in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
lsquoOtherrsquo-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST ndashTypical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So whatrsquos the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are hellip
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
Why were the values differentOur calculated E-value for searching against the RefSeq mRNA database was 50 x 10-28But our actual BLAST search at NCBI gave a value of
20 x 10-26 - about 40x larger - why is this
Gapped alignments
If we were expecting N matches for a query sequence lsquoACGTACGTACGTrsquo imagine what would happen to N if we allowed gaps in our matches
ACGTACGTACGT
This would now give us additional possible alignments that would meet our lsquomatchrsquo criteria
We will expect many more matches in a given database if we allow our alignments to have gaps The E-value will be larger
E-values Effect of Query Length
Biologically itrsquos the same match Does it mean we are any less sure that this match didnrsquot occur by chance The E-value is simply dependent on match length
database
BLAST 500 nt sequence against a database
BLASTn Get a full length match with sequence XYZ at an E-value = 50e-160
Get a match with sequence XYZ again but at an E-value = 50e-80
Why not just use identityAt some levels this a good question
But consider two very different searches both of which give a 75 identity match
Query1 was 60 nt longCGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG||||||||||| || | || | || || |||| | | | |||||| | ||||||||||CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGWhich would have an E-value ~ 50 x 10-19
And Query2 only 16 nt longACGTACGTACGTACGT||| || | |||| ||ACGCACCTTCGTAGGTWhich would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in the database just by chancehellip
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so each position has about a 1/7th chance of 'matching' rather than 1/20th.
So now we get:
$E \approx \dfrac{3.5 \times 10^{8}}{7^{20}} \approx 4.4 \times 10^{-9}$
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
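The back-of-envelope arithmetic above is easy to check. The sketch below recomputes both estimates from the database size quoted on the slide and derives the "about 2 substitutes per residue" figure from the substitution groups listed above. The function and variable names are illustrative only, and this is the slide's simplified model, not the statistics BLAST really uses.

```python
# Database size quoted in the workshop for the RefSeq protein database.
DB_LETTERS = 347_895_532
MATCH_LENGTH = 20  # length of the exactly matching peptide


def naive_e_value(db_letters, choices_per_position, match_length):
    """Expected chance matches if every position is one of N equally likely choices."""
    return db_letters / choices_per_position ** match_length


# Substitution groups as listed on the 'Amino Acid Substitutions' slide.
SUBSTITUTES = {
    "A": "S", "C": "", "F": "LWY", "G": "", "I": "LMV",
    "L": "IMFV", "M": "ILV", "P": "", "V": "ILM", "W": "FY",
    "N": "DHS", "Q": "REHK", "S": "ANT", "T": "S", "Y": "HFW",
    "H": "NQY", "K": "RQE", "R": "QK", "D": "NE", "E": "DQK",
}

average_substitutes = sum(len(s) for s in SUBSTITUTES.values()) / len(SUBSTITUTES)
# A position 'matches' the residue itself or any of its substitutes.
effective_choices = 20 / (1 + average_substitutes)

print(f"exact matching (1/20 per position): {naive_e_value(DB_LETTERS, 20, MATCH_LENGTH):.1e}")
print(f"with substitutions (1/7 per position): {naive_e_value(DB_LETTERS, 7, MATCH_LENGTH):.1e}")
print(f"average substitutes per residue: {average_substitutes:.1f}")
print(f"effective choices per position: {effective_choices:.1f}")
```

The table gives an average a little above 2 substitutes per residue, so rounding the effective alphabet to 7 is generous but in the right region.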
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
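If you want more random sequences than the 20 supplied, something like the following would do. How the workshop's own file was generated is not stated, so the uniform base composition, the 300 nt length and the sequence name here are all assumptions.

```python
import random


def random_dna(length, seed=None):
    """Return a random nucleotide string with equal A/C/G/T probabilities."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def to_fasta(name, sequence, width=60):
    """Format a sequence as FASTA text, ready to paste into a BLAST query box."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


# Hypothetical name and length, chosen only for illustration.
print(to_fasta("random-1", random_dna(300, seed=1)))
```

Pasting the definition line along with the sequence means the BLAST results are labelled with the name, which helps when comparing several random queries.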
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
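As a sketch of what such a command looks like, the snippet below assembles a blastall command line from the three options just described. It only builds and prints the command, it does not run BLAST. The file name query.fa and the database name nr are placeholders, and blastall belongs to the older NCBI BLAST release; the later BLAST+ programs use different command and option names.

```python
import shlex


def blastall_command(program, query_file, database, **options):
    """Build a blastall argument list: -p flavour, -i query file, -d database."""
    argv = ["blastall", "-p", program, "-i", query_file, "-d", database]
    # Any further single-letter options (e.g. e=1e-10, W=3) are appended as -x value.
    for flag, value in options.items():
        argv += [f"-{flag}", str(value)]
    return argv


# Placeholder file and database names, for illustration only.
simplest = blastall_command("blastn", "query.fa", "nr")
print(shlex.join(simplest))

tweaked = blastall_command("blastx", "query.fa", "nr", e=1e-10, W=3)
print(shlex.join(tweaked))
```

Keeping the command as a list means it could later be handed to subprocess.run() unchanged on a machine where blastall and a formatted database are installed.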
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
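The tabular output format is the one that is easiest to post-process. The sketch below assumes the 12-column, tab-separated layout produced by blastall -m 8 (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The two result lines are invented purely to exercise the code.

```python
import csv
import io

# Column order assumed for blastall -m 8 tabular output.
FIELDS = [
    "query", "subject", "pct_identity", "aln_length", "mismatches", "gap_opens",
    "q_start", "q_end", "s_start", "s_end", "e_value", "bit_score",
]

# Invented example lines, NOT real BLAST results.
SAMPLE = (
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-105\t380\n"
    "query1\tsubjB\t75.00\t16\t4\t0\t1\t16\t7\t22\t3.0\t20.3\n"
)


def parse_tabular(text, max_e_value=10.0):
    """Parse tabular BLAST output and keep hits with E-value <= max_e_value."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["pct_identity"] = float(hit["pct_identity"])
        hit["e_value"] = float(hit["e_value"])
        if hit["e_value"] <= max_e_value:
            hits.append(hit)
    return hits


for hit in parse_tabular(SAMPLE, max_e_value=1e-5):
    print(hit["subject"], hit["pct_identity"], hit["e_value"])
```

Filtering after the search like this is equivalent in effect to setting -e for the run, but lets you try several cut-offs without re-running BLAST.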
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF)
BLASTn - low complexity filtering OFF
Results for Exercise 2A (ON)
BLASTn - low complexity filtering ON
Results for Exercise 2B
[Figure panels: low complexity filtering ON, OFF]
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
[Figure panels: low complexity filtering ON, OFF]
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
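Building these query strings by hand is error-prone once several terms are combined, so here is a small helper as a sketch. It only produces the text to paste into the "Limit by entrez query" box; the function name is made up, and the second organism is chosen purely as an example.

```python
def organism_query(*organisms, operator="OR"):
    """Join organism names into an Entrez query using the [ORGN] field tag."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(f"{name}[ORGN]" for name in organisms)


# Single organism, as in the fruit fly example above.
print(organism_query("Drosophila melanogaster"))

# Two organisms combined with OR (second organism is just an illustration).
print(organism_query("Drosophila melanogaster", "Xenopus tropicalis"))
```

OR widens the search to either organism, whereas NOT can be used to exclude one group from a broader query.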
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
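To see why the mismatch penalty matters so much in part (ii), it helps to score a short ungapped alignment by hand. The sketch below uses a match reward of 1 and a mismatch penalty of -3 as its starting point (assumed here to be the blastn defaults), then relaxes the penalty to -1 as the exercise suggests. The 16 nt pair is the 75% identity example used elsewhere in this workshop; gap costs are left out to keep the sketch small.

```python
def ungapped_score(seq_a, seq_b, reward=1, penalty=-3):
    """Raw score of two equal-length sequences aligned without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


QUERY = "ACGTACGTACGTACGT"
SUBJECT = "ACGCACCTTCGTAGGT"  # 12 matches, 4 mismatches

print("penalty -3:", ungapped_score(QUERY, SUBJECT, reward=1, penalty=-3))
print("penalty -1:", ungapped_score(QUERY, SUBJECT, reward=1, penalty=-1))
```

With the harsher penalty the matches and mismatches cancel out, so the alignment has nothing to recommend it; with the gentler penalty the same alignment scores positively, which is why diverged sequences start to align over longer stretches when the penalty is reduced.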
Bioinformatics Workshop 1 Sequences and Similarity Searches
The Basic Questions
Part 1 Structural Genomics
Chromosomes and Genes
Gene to Protein
Sequence Signals
Genomic Signals
Derivative Sequences
Gene Models
Sequences and Genes (Accession Numbers and Names)
Gene Symbols Names Etc
A Gene-Centric View
Sequences and Accession Numbers
mRNA Splicing Signals
Gene Predictions
Supporting Evidence
TheoreticalPredicted Sequences
Sequences for a model organism
So What's in the Databases Now
Part 2 Comparative Genomics
Speciation
Residual Similarity
Computers Can Detect Homology
Orthologs
Paralogs
'Other'-logs
The Essential Paradigm
Function Conserved Longer than Detectable Similarity
Redundancy in the Genetic Code
Protein Similarity Persists Longer
Always Compare Protein Sequences
Exercise 1 nucleotide vs amino acid search
Answers Exercise 1
The Essential Task
Functional Orthologs
Finding Orthologs
Using Synteny is Better
Metazome
Metazome Exercise
Part 3 Finding Sequence Similarities
Gaps in Alignments
The Downside of Gaps
BLAST
Flavours of BLAST
How does it work
BLAST WORDS and INDEXING
Analyse the Query Sequence
Expand from Word Based Matches
BLAST - Typical Output
When is a match significant
E-values
E-values From First Principles
Calculating an E-value
E-values In Practice
E-value Exercise
E-value Exercise Answer
E-values Effect of Database Size
Slide 58
Why were the values different
E-values Effect of Query Length
Why not just use identity
So what's the real problem
Rules of Thumb
Protein BLAST
Amino Acid Substitutions
Exercises
Part 4 Tweaking BLAST
Not All Parameters are …
The Many Parameters of BLAST
Slide 70
BLAST Parameters Exercises
Results for Exercise 1
Slide 73
Results for Exercise 2A (OFF)
Results for Exercise 2A (ON)
Results for Exercise 2B
Results for Exercise 2C
Slide 78
Slide 79
Results for Exercise 4 (i)
Results for Exercise 4 (ii)
Slide 82
Slide 83
END
E-values Effect of Query Length
BLAST a 500 nt sequence against a database with BLASTn: get a full length match with sequence XYZ at an E-value = 5.0e-160.
With a shorter query: get a match with sequence XYZ again, but at an E-value = 5.0e-80.
Biologically it's the same match. Does it mean we are any less sure that this match didn't occur by chance? The E-value is simply dependent on match length.
Why not just use identity
At some levels this is a good question.
But consider two very different searches, both of which give a 75% identity match.
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | |  | ||||||  | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
which would have an E-value ~ $5.0 \times 10^{-19}$
And Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
which would have an E-value ~ 3.0
And intuitively we feel we would expect to see that sort of number of matches in the database just by chance…
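It is worth confirming that the two alignments above really do have the same percentage identity, since that is the whole point of the comparison. A minimal sketch (the function name is illustrative):

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions in two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


QUERY1 = "CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG"
MATCH1 = "CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG"
QUERY2 = "ACGTACGTACGTACGT"
MATCH2 = "ACGCACCTTCGTAGGT"

print(len(QUERY1), "nt:", percent_identity(QUERY1, MATCH1), "% identity")
print(len(QUERY2), "nt:", percent_identity(QUERY2, MATCH2), "% identity")
```

Both come out at the same identity, yet one is 60 nt long and the other only 16 nt, which is exactly the information that identity alone throws away and the E-value keeps.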
So what's the real problem
Basically, you are usually trying to answer the question:
"Can I find the ortholog of my gene in some other species, so that I can work out what it might be doing in my organism?"
Are there any useful guidelines, though, at least for biological meaningfulness?
So whatrsquos the real problemBasically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
Are there any useful guidelines though at least for biological meaningfulness
Basically you are usually trying to answer the question
Can I find the ortholog of my gene in some other species so that I can work out what it might be doing in my organism
BLAST
The difficulty is because
ORTHOLOGY
BLAST Similarity + Probability
biological knowledge
nature of query sequence
phylogenetic relationship
match length PI size of databasehellip
Rules of ThumbHow good does an E-value have to be before we might even think we have an ortholog
largerworse smallerbetter
E-values 10-5 10-10 10-40 10-100 00
fantasy borderline encouraging
pretty good canrsquot get
better
But note that in some gene families with closely related members you can get an E-value of 00 for several different matches and then identity may be more sensitive Also bear in mind in cases like this that ideas of lsquofunctionalrsquo orthology may break down with more than one locus producing identical proteins which share the same functionhellip
Protein BLASTItrsquos (nearly) always better to make comparisons at the amino acid level between protein sequences than the DNA level because the amino acid sequence is more conserved than the underlying DNA sequenceDoes this cause us to treat expected values any differently
If we follow the argument as before then for an exact match of a 20 amino acid sequence in the RefSeq protein database each additional amino acid will reduce the E-value by 120th (there are 20 different amino acids) And as there are 347895532 letters in that databaseE-value = ~35 x 108 (20 x 20 x 20 hellip20 times) = ~35 x 10-18
But this is what we get if we run the blast at NCBI
Really too big a discrepancy to easily explain with hand wavinghellip
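Amino Acid Substitutions
(residue: residues that can substitute for it)

A: S C
F: L W Y
G: -
I: L M V
L: I M F V
M: I L V
P: -
V: I L M
W: F Y

N: D H S
Q: R E H K
S: A N T
T: S
Y: H F W

H: N Q Y
K: R Q E
R: Q K

D: N E
E: D Q K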
In fact we need to take into account both amino acid substitutability and, as before, gapped alignments. On average any residue can be substituted for by about 2 others, so roughly 3 of the 20 amino acids are acceptable at each position, and each position has about a 1/7th chance of ‘matching’ rather than 1/20th.
So now we get:
E-value = ~3.5 x 10^8 / (7 x 7 x 7 … 20 times) = ~4.4 x 10^-9
which is much closer to the actual BLAST value.
These substitutabilities are dealt with by the BLOSUM and PAM matrices.
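To see how much the substitutability argument changes the estimate, the same simplified model can be evaluated with 20 and with 7 choices per position. This is a rough sketch under the slide's own assumptions (the value 7 is the slide's rounded figure, and the helper name `expected_hits` is hypothetical). Real BLAST scoring uses the BLOSUM or PAM matrices and proper alignment statistics.

```python
DB_LETTERS = 347_895_532  # RefSeq protein letters quoted in the text


def expected_hits(db_letters: int, match_length: int, choices_per_position: float) -> float:
    """Simplified expectation: database size divided by the number of
    distinguishable sequences of the given length."""
    return db_letters / choices_per_position ** match_length


strict = expected_hits(DB_LETTERS, 20, 20)   # only identical residues count
tolerant = expected_hits(DB_LETTERS, 20, 7)  # about 1 in 7 residues 'match'

print(f"exact matching only : {strict:.1e}")
print(f"with substitutions  : {tolerant:.1e}")
print(f"ratio               : {tolerant / strict:.1e}")
```

Allowing substitutions raises the estimate by about nine orders of magnitude, which is the size of the discrepancy the slide sets out to explain.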
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any ‘significant’ hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
Part 4: Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command line program that can be run just by typing the appropriate command and options, e.g.
This is the simplest form, where the basic program ‘blastall’ takes a number of different options or parameters, each indicated by a -x style flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
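The three basic options can be put together as follows. This sketch only assembles the command line and does not run it; the file name `query.fasta` and the database name `nr` are placeholder assumptions, and `blastall_command` is a hypothetical helper, not part of BLAST.

```python
import shlex


def blastall_command(program: str, query_file: str, database: str) -> list[str]:
    """Build the simplest blastall invocation from the three options above:
    -p (BLAST flavour), -i (query file), -d (pre-indexed database)."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


# Placeholder inputs, purely for illustration.
cmd = blastall_command("blastn", "query.fasta", "nr")
print(shlex.join(cmd))
```

Keeping the command as a list of separate arguments means it could be handed to `subprocess.run` unchanged on a machine where blastall and a formatted database are installed.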
Not All Parameters are …
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 whereas blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI BLAST pages, where you can use Entrez queries or pick from an organism list to modify your search.
The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure, even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of ‘gi’ numbers
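A few of the options listed above can be appended to the basic command in the same way. Again this is a sketch that only builds the argument list: the helper `blastall_args`, the file name `query.fasta`, and the chosen values are illustrative assumptions, and only single-letter flags taken from the list above are used.

```python
def blastall_args(program: str, query_file: str, database: str, **options) -> list[str]:
    """Basic blastall arguments plus any extra single-letter options,
    e.g. e=1e-10 becomes '-e 1e-10'. Options that are not given are left
    out, so blastall falls back to its own defaults for them."""
    args = ["blastall", "-p", program, "-i", query_file, "-d", database]
    for flag, value in options.items():
        if len(flag) != 1:
            raise ValueError(f"expected a single-letter flag, got {flag!r}")
        args += [f"-{flag}", str(value)]
    return args


# Tighter E-value cut-off, explicit word size, and at most 20 reported matches.
args = blastall_args("blastx", "query.fasta", "nr", e=1e-10, W=3, b=20)
print(" ".join(args))
```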
BLAST Parameters Exercises
1. BLASTn vs BLASTx
Open the file example-sequences.html, copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page / Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page.
(Hint: if you paste the sequence definition line ‘>name’ into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
Open the file example-sequences.html, copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx. Paste your sequence into the box.
Carefully UNTICK the “Choose filter [ ] Low complexity” BOX in the second section. And then run BLASTx against the nr database.
What do you feel about these alignments?
Re-run, but leave the low-complexity filter ON this time.
Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case – what can we deduce about the cDNA sequence? Annotators beware!
Results for Exercise 2A (OFF): BLASTn – low complexity filtering OFF
Results for Exercise 2A (ON): BLASTn – low complexity filtering ON
Results for Exercise 2B: ON / OFF
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
(result panels: ON, OFF)
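The extra base in the Exercise 2C alignment can also be located programmatically by walking the two aligned rows column by column. This is a minimal sketch using the two rows exactly as shown on the slide; the helper name `alignment_differences` is hypothetical, and the reported index is the 0-based column within this excerpt, not a coordinate in the full cDNA.

```python
# The two aligned rows from the Exercise 2C result, '-' marking the gap.
cdna = "AGAAAAGAAGAAACATGGCAATGGATCAGAA"
genomic = "AGAAAAGAAGAAACAT-GCAATGGATCAGAA"


def alignment_differences(top: str, bottom: str) -> list[tuple[int, str, str]]:
    """Return (column, top_char, bottom_char) for every column that differs."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(top, bottom)) if a != b]


diffs = alignment_differences(cdna, genomic)
for column, cdna_base, genomic_base in diffs:
    print(f"column {column}: cDNA has {cdna_base!r}, genomic has {genomic_base!r}")
```

A single inserted base like this shifts the reading frame of everything downstream, which is why it matters so much for a translated search.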
BLAST Parameters Exercises
3. Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter ‘Drosophila melanogaster[ORGN]’ in the “Limit by entrez query” box in the second section (you can also select the organism from the adjacent drop-down list). To combine items use logical AND, OR or NOT.
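Combining terms with the logical operators can be sketched as simple string building. The helper names `organism_term` and `combine` are hypothetical, and the second organism is chosen only for illustration; the `[ORGN]` field tag and the AND/OR/NOT operators are the ones described above.

```python
def organism_term(name: str) -> str:
    """Tag an organism name with the [ORGN] field, as in the example above."""
    return f"{name}[ORGN]"


def combine(operator: str, *terms: str) -> str:
    """Join two or more Entrez terms with AND, OR or NOT, in parentheses."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    if len(terms) < 2:
        raise ValueError("need at least two terms to combine")
    return "(" + f" {operator} ".join(terms) + ")"


fly = organism_term("Drosophila melanogaster")
frog = organism_term("Xenopus tropicalis")
query = combine("OR", fly, frog)
print(query)
```

The resulting string is what would be pasted into the Entrez limit box.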
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt and go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit…
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page / SPECIAL – Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx – what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs – or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping – start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties… What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2…
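For part (ii), the effect of the mismatch penalty can be seen on a toy example before trying it on the cyclin sequences. This sketch scores an ungapped nucleotide alignment with a match reward and a mismatch penalty (the -r and -q options listed earlier). The two short sequences are invented, and the starting values of +1 and -3 are assumed here as the usual blastn defaults, so check them against the version you run.

```python
def ungapped_score(seq_a: str, seq_b: str, reward: int = 1, penalty: int = -3) -> int:
    """Sum of per-column scores for two equal-length, ungapped sequences:
    `reward` for each identical column, `penalty` for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


# Invented 10-base example with 8 matching and 2 mismatching columns.
a = "ATGGCCAAGT"
b = "ATGACCAGGT"

harsh = ungapped_score(a, b, reward=1, penalty=-3)
lenient = ungapped_score(a, b, reward=1, penalty=-1)
print(f"penalty -3: score {harsh}")
print(f"penalty -1: score {lenient}")
```

With the gentler penalty the same stretch scores higher, so alignments between more diverged sequences can be extended and reported. That is the behaviour to look for when repeating the comparison with the cyclin sequences.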
Amino Acid Substitutions
A SC F LWYG I LMVL IMFVM ILVP V ILMW FY
N DHSQ REHKS ANTT SY HFW
H NQYK RQER QK
D NEE DQK
In fact we need to take into account both amino acid substitutability as well as as before allowing gapped alignments On average any residue can be substituted for by about 2 others so each position has about 17th chance of lsquomatchingrsquo rather than 120th
So now we getE-value = ~35 x 108 (7 x 7 x 7 hellip20 times) = ~44 x 10-9which is much closer to the actual BLAST value
These substitutabilities are dealt with by the BLOSUM and PAM matrices
ExercisesGo to the file random-DNA-sequenceshtml select one of the 20 randomly generated nucleotide sequences and do a BLASTx (translated DNA-gtprotein) at NCBI against the nr protein database
Did you find any lsquosignificantrsquo hits
Repeat with a second sequence
What conclusions might you draw from this exercise
Try the same sequence(s) against the nr nucleotide database
Is there any general difference
Part 4 Tweaking BLASTAlthough you normally see BLAST as a web page with boxes to place data in and tick boxes etc it is actually a command line program that can be run just by typing the appropriate command and options eg
This is the simplest form where the basic program lsquoblastallrsquo takes a number of different options or parameters indicated by the ndashx and followed by its value -p ltwhich blast flavour to rungt-i ltfile with query sequence ingt-d ltpre-indexed database namegt
Not All Parameters are hellipThere are many other parameters and if not listed explicitly they will use a default value most appropriate to the blast flavour requested Eg for ndashW ltword sizegt blastn uses ndashW 11 where blastx uses ndashW 3
There are also some options that appear on the web pages that are not really parameters but manage the job in a similar way One of the most useful of these is on the NCBI blast pages where you can use Entrez queries or pick from an organism list to modify your search
The Many Parameters of BLASTThere are almost literally hundreds of parameters but most are way too obscure even for die-hard techies like me Very few of them are regularly useful in any but their default value but just occasionally they are very necessary
Here are some of the ones that I have used
-e max expected value -m output format (graphical or tabularspreadsheet)-F filter query sequence for low complexity (default TRUE)-U use only upper case regions of query (default FALSE)-G gap opening cost-E gap extension cost-q nucleotide mismatch penalty (BLASTx uses matrices)-r nucleotide match reward-b number of matching sequences to report-g allow gaps (default TRUE)-W word size-z effective database size (removes effect of actual database size)-S query strands to search (default both directions)-l restrict database sequences to given list of lsquogilsquo numbers
BLAST Parameters Exercises1 BLASTn vs BLASTx
Open the file example-sequenceshtml copy the sequence gtblastn-vs-blastxThis is a Xenopus tropicalis cDNA sequence
Go to the NCBI BLAST Home PageNucleotide-nucleotide BLAST (blastn) section Paste your sequence into the box
Run BLASTn against the nr nucleotide database using all default optionsThen hit [format] to wait for the results in a new page(hint if you paste the sequence definition line lsquogtnamersquo into the box as well your results will be labelled accordingly which can be useful)
Now repeat but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx
How might the different results help us view the presence of this gene in other vertebrates
Open the file example-sequenceshtml copy the sequence gtlow-complexity-filtering-A This sequence contains a long AT tandem repeat
Go to the NCBI BLAST Home PageTRANSLATED BLAST sectionBLASTx Paste your sequence into the box
Carefully UNTICK the ldquoChoose filter [ ] Low complexityrdquo BOX in the second section And then run BLASTx against the nr database
What do you feel about these alignmentsRe-run but leave the low-complexity filter ON this timeDoes this change our view of the protein matches
Now continue with gtlow-complexity-filtering-B and ndashCC is an especially interesting case ndash what can we deduce about the cDNA sequence Annotators beware
Results for Exercise 2A (OFF) BLASTn ndash low complexity filtering OFF
Results for Exercise 2A (ON) BLASTn ndash low complexity filtering ON
Results for Exercise 2B
ON OFF
Results for Exercise 2C
There is a sequence error an extra G at position 117 in the sequence cDNA (117)AGAAAAGAAGAAACATGGCAATGGATCAGAA|||||||||||||||| ||||||||||||||AGAAAAGAAGAAACAT-GCAATGGATCAGAA
Genomic sequence
ON
OFF
BLAST Parameters Exercises3 Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items For instance to find only matching sequences in fruit fly enter lsquoDrosophila melanogaster[ORGN]rsquo in the Limit by entrez query box in the second section (you can also select the organism from the adjacent drop-down list) To combine items use logical AND OR or NOT
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page / TRANSLATED BLAST section / BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
BLAST Parameters Exercises
4. BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page / SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison (should be BLASTn). Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping: start by reducing the mismatch penalty to -1. Then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
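To get a feel for part (ii) before trying it on the web page, here is a minimal sketch of how a nucleotide alignment without gaps is scored from a match reward and a mismatch penalty. The two short sequences are invented for illustration, and the default values (reward 1, penalty -3) are assumed here to mirror the -r and -q defaults of blastn; check the values shown on the BLAST page you are using.

```python
def ungapped_score(seq_a, seq_b, reward=1, penalty=-3):
    """Score two equal-length aligned nucleotide strings:
    add `reward` for each match and `penalty` for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(reward if a == b else penalty for a, b in zip(seq_a, seq_b))


# Invented 12-base fragments with 10 matches and 2 mismatches.
a = "ATGGCATTAGCC"
b = "ATGACATTGGCC"

strict = ungapped_score(a, b)               # 10 * 1 + 2 * (-3)
relaxed = ungapped_score(a, b, penalty=-1)  # 10 * 1 + 2 * (-1)
print(strict, relaxed)
```

The same alignment scores higher with the gentler penalty, so more divergent nucleotide matches can reach a reportable score, at the cost of also letting through more chance similarity.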
Exercises
Go to the file random-DNA-sequences.html, select one of the 20 randomly generated nucleotide sequences, and do a BLASTx (translated DNA->protein) at NCBI against the nr protein database.
Did you find any 'significant' hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?
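If you want to repeat this exercise with fresh material, random nucleotide sequences are easy to generate. The sketch below is one possible way to do it and is not how the workshop file was produced: the sequence length of 300, the seed and the '>random-N' names are all assumptions chosen for the example.

```python
import random


def random_dna(length, rng):
    """Return a random nucleotide string with equal base probabilities."""
    return "".join(rng.choice("ACGT") for _ in range(length))


rng = random.Random(42)  # fixed seed so the same set can be regenerated
sequences = [random_dna(300, rng) for _ in range(20)]

# Print the first sequence in FASTA format, ready to paste into BLAST.
print(">random-1")
print(sequences[0])
```

Equal base frequencies are a simplification; real genomes have biased composition, which is part of why low-complexity filtering matters in the later exercises.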
Part 4 Tweaking BLAST
Although you normally see BLAST as a web page with boxes to place data in, tick boxes etc., it is actually a command-line program that can be run just by typing the appropriate command and options.
In its simplest form, the basic program 'blastall' takes a number of different options or parameters, each indicated by a -x flag followed by its value:
-p <which blast flavour to run>
-i <file with query sequence in>
-d <pre-indexed database name>
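The three options above are enough to describe a complete run. As a hedged sketch, the snippet below assembles such a command line as an argument list without executing it; the query file name 'query.fa' and the database name 'nr' are placeholders chosen for the example, and the helper name `blastall_command` is hypothetical.

```python
import shlex


def blastall_command(program, query_file, database):
    """Build the argument list for a minimal 'blastall' run:
    -p flavour, -i query file, -d pre-indexed database."""
    return ["blastall", "-p", program, "-i", query_file, "-d", database]


cmd = blastall_command("blastx", "query.fa", "nr")
# shlex.join quotes arguments where needed, giving a line that could
# be typed at a shell prompt on a machine with blastall installed.
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to hand to `subprocess.run` later, on a system where blastall and a formatted database are actually available.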
Not All Parameters are ...
There are many other parameters, and if not listed explicitly they will use a default value most appropriate to the blast flavour requested. E.g. for -W <word size>, blastn uses -W 11 where blastx uses -W 3.
There are also some options that appear on the web pages that are not really parameters, but manage the job in a similar way. One of the most useful of these is on the NCBI blast pages, where you can use Entrez queries or pick from an organism list to modify your search.

The Many Parameters of BLAST
There are almost literally hundreds of parameters, but most are way too obscure even for die-hard techies like me. Very few of them are regularly useful in any but their default value, but just occasionally they are very necessary.
Here are some of the ones that I have used:
-e max expected value
-m output format (graphical or tabular/spreadsheet)
-F filter query sequence for low complexity (default TRUE)
-U use only upper case regions of query (default FALSE)
-G gap opening cost
-E gap extension cost
-q nucleotide mismatch penalty (BLASTx uses matrices)
-r nucleotide match reward
-b number of matching sequences to report
-g allow gaps (default TRUE)
-W word size
-z effective database size (removes effect of actual database size)
-S query strands to search (default both directions)
-l restrict database sequences to given list of 'gi' numbers
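The flags listed above are simply appended to the basic command, each followed by its value. The sketch below shows one way to do that while guarding against typos; the set of accepted flags is copied from the list on this slide, while the helper name `with_options`, the base command and the particular values (an E-value limit of 1e-5, filtering switched off with 'F', ten reported sequences) are illustrative assumptions.

```python
# Single-letter flags taken from the list on the slide.
SLIDE_FLAGS = {"e", "m", "F", "U", "G", "E", "q", "r", "b", "g", "W", "z", "S", "l"}


def with_options(base_cmd, options):
    """Return a copy of base_cmd with '-flag value' pairs appended,
    rejecting any flag that is not in the slide's list."""
    cmd = list(base_cmd)
    for flag, value in options.items():
        if flag not in SLIDE_FLAGS:
            raise ValueError(f"flag -{flag} is not in the list above")
        cmd += [f"-{flag}", str(value)]
    return cmd


# Placeholder base command; 'query.fa' and 'nr' are example names.
base = ["blastall", "-p", "blastn", "-i", "query.fa", "-d", "nr"]
cmd = with_options(base, {"e": "1e-5", "F": "F", "b": 10})
print(" ".join(cmd))
```

Anything not passed explicitly keeps its default, which is usually what you want.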
BLAST Parameters Exercises 1: BLASTn vs BLASTx
Open the file example-sequences.html and copy the sequence >blastn-vs-blastx. This is a Xenopus tropicalis cDNA sequence.
Go to the NCBI BLAST Home Page, Nucleotide-nucleotide BLAST (blastn) section. Paste your sequence into the box.
Run BLASTn against the nr nucleotide database using all default options. Then hit [format] to wait for the results in a new page. (Hint: if you paste the sequence definition line '>name' into the box as well, your results will be labelled accordingly, which can be useful.)
Now repeat, but go to the TRANSLATED BLAST section and BLAST against the nr protein database using BLASTx.
How might the different results help us view the presence of this gene in other vertebrates?
BLAST Parameters Exercises 2: Low-complexity filtering
Open the file example-sequences.html and copy the sequence >low-complexity-filtering-A. This sequence contains a long AT tandem repeat.
Go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx. Paste your sequence into the box.
Carefully UNTICK the "Choose filter [ ] Low complexity" box in the second section, and then run BLASTx against the nr database.
What do you feel about these alignments? Re-run, but leave the low-complexity filter ON this time. Does this change our view of the protein matches?
Now continue with >low-complexity-filtering-B and -C. C is an especially interesting case: what can we deduce about the cDNA sequence? Annotators beware!
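One rough way to see why a long AT tandem repeat counts as "low complexity" is to measure how evenly the residues are used. The sketch below computes the Shannon entropy of the base composition. It is only an illustration and is not the filtering algorithm that NCBI BLAST applies; the function name and both test sequences are made up for this example.

```python
import math
from collections import Counter

def composition_entropy(seq):
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

at_repeat = "AT" * 20          # stand-in for the long AT tandem repeat
mixed = "ATGGCTAAGGTCCTGAAC"   # hypothetical, more varied sequence

print(composition_entropy(at_repeat))  # only two bases, used equally often
print(composition_entropy(mixed))      # closer to the 2-bit maximum for DNA
```

A composition measure like this ignores the order of residues, so it is a crude stand-in; the point is only that a repeat carries little information, which is why matches to it are easy to find by chance.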
Results for Exercise 2A (OFF): BLASTn, low-complexity filtering OFF
Results for Exercise 2A (ON): BLASTn, low-complexity filtering ON
Results for Exercise 2B
[Figure: result panels with filtering ON and OFF]
Results for Exercise 2C
There is a sequence error: an extra G at position 117 in the cDNA sequence.
cDNA (117)        AGAAAAGAAGAAACATGGCAATGGATCAGAA
                  |||||||||||||||| ||||||||||||||
Genomic sequence  AGAAAAGAAGAAACAT-GCAATGGATCAGAA
[Figure: result panels with filtering ON and OFF]
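To see why a single extra base matters so much for a translated search, the two aligned fragments from the Exercise 2C result can be translated side by side. This is a minimal sketch: the reading frame chosen here (starting at the first base of the fragment) is an assumption for illustration only, and the helper name is hypothetical. Whatever the true frame, the two translations agree up to the extra G and diverge after it.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate seq from its first base, ignoring any trailing partial codon."""
    seq = seq.upper()
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

cdna = "AGAAAAGAAGAAACATGGCAATGGATCAGAA"     # fragment with the extra G
genomic = "AGAAAAGAAGAAACATGCAATGGATCAGAA"   # same fragment, gap removed

print(translate(cdna))     # RKEETWQWIR
print(translate(genomic))  # RKEETCNGSE
```

The first five residues match and everything downstream differs, which is the signature of a frameshift rather than a simple substitution.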
BLAST Parameters Exercises 3: Limit by Entrez query
Entrez queries can be used in the NCBI BLAST web page to restrict the search to more specific items. For instance, to find only matching sequences in fruit fly, enter 'Drosophila melanogaster[ORGN]' in the "Limit by entrez query" box in the second section (you can also select the organism from the adjacent drop-down list). To combine items, use logical AND, OR or NOT.
Open the file example-sequences.html. Copy the sequence >cyclin-D1-Xt, go to the NCBI BLAST Home Page, TRANSLATED BLAST section, BLASTx, and paste the sequence.
Use an Entrez query to find all rodent sequences (rat and mouse) with a good match to cyclin-D1. At what E-value do we expect we are no longer looking at cyclins? Try running the search again with that E-value as a limit...
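As a hint for the rodent search, organism terms can be combined with OR in the same way as the fruit fly example. The small sketch below just assembles such a query string for pasting into the "Limit by entrez query" box; the helper name is hypothetical and the species names are our choice of what "rat and mouse" means here.

```python
def organism_query(organisms, operator="OR"):
    """Join organism names into one Entrez expression using the [ORGN] field."""
    terms = [f"{name}[ORGN]" for name in organisms]
    if len(terms) == 1:
        return terms[0]
    return "(" + f" {operator} ".join(terms) + ")"

print(organism_query(["Drosophila melanogaster"]))
print(organism_query(["Mus musculus", "Rattus norvegicus"]))
```

The parentheses keep the OR group together if the expression is later combined with further AND or NOT terms.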
BLAST Parameters Exercises 4: BLASTn vs tBLASTx and nucleotide mismatch penalties
Open the file example-sequences.html.
Also open the NCBI BLAST Home Page, SPECIAL - Align two sequences section.
There are several Xenopus tropicalis cyclins in the examples file.
Copy the sequence >cyclin-A1-Xt to the Sequence 1 BLAST window.
Copy the sequence >cyclin-A2-Xt to the Sequence 2 BLAST window.
(i) Run the default comparison, which should be BLASTn. Note the alignment. Now run again using tBLASTx: what does this do to our understanding of the relationship between these two sequences? Are they homologs, orthologs or paralogs, or none of these?
(ii) Revert to BLASTn and try varying the values for mismatch penalties and gapping. Start by reducing the mismatch penalty to -1, then try reducing the gap open and gap extension penalties... What do we learn from this?
(iii) Now repeat the first parts of the exercise with cyclin-D1 in place of cyclin-A2...
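The effect of the mismatch penalty in part (ii) can be worked through by hand on a toy example. The sketch below scores an ungapped nucleotide alignment under two penalty settings. The two sequences are invented for illustration (they differ only at third codon positions, so they encode the same peptide), and the +1/-3 starting values are an assumption rather than a statement of the server defaults.

```python
def ungapped_score(a, b, match=1, mismatch=-3):
    """Sum match/mismatch scores over two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))

seq1 = "ATGGCTAAGGTCCTGAAC"   # hypothetical coding fragment
seq2 = "ATGGCAAAAGTGCTCAAT"   # same peptide, five synonymous changes

print(ungapped_score(seq1, seq2, mismatch=-3))  # 13 matches, 5 mismatches: -2
print(ungapped_score(seq1, seq2, mismatch=-1))  # same alignment now scores 8
```

With the harsher penalty the alignment scores below zero and would never be extended, while the milder penalty lets the same 72% identical stretch score positively. That is the trade-off to look for in the exercise: more distant matches become visible, at the cost of more chance matches.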