Pairwise sequence alignments & BLAST
The point of sequence alignment
• If you have two or more sequences, you may want to know
  – How similar are they? (A quantitative measure)
  – Which residues correspond to each other?
  – Is there a pattern to the conservation/variability of the sequences?
  – What are the evolutionary relationships of these sequences?
BLAST
• Basic Local Alignment Search Tool
• Altschul et al., 1990
• Has been cited over 61,000 times according to Google
• The most highly cited scientific paper in the entire decade of the 1990s
BLAST
• Compares a QUERY sequence to a DATABASE of sequences (also called SUBJECT sequences)
• Nucleotide or protein sequences
• Calculates statistical significance
• Available as an online web server, for example, at NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi)
BLAST programs
Program   Query                               Database
blastp    protein                             protein
blastn    nucleotide                          nucleotide
blastx    nucleotide translated to protein    protein
tblastn   protein                             nucleotide translated to protein
tblastx   nucleotide translated to protein    nucleotide translated to protein
Why would we want to use translated nucleotides?
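To make "nucleotide translated to protein" concrete, here is a minimal sketch of six-frame translation using the standard genetic code. The function names (`translate`, `six_frame_translations`) are illustrative only and are not part of BLAST; the sketch assumes the input contains only A, C, G, and T.

```python
from itertools import product

# Standard genetic code in the conventional compact layout (bases ordered T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon from position 0; trailing partial codons are dropped."""
    dna = dna.upper()
    # Unknown codons (for example ones containing N) become "X".
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Translate three offsets on the forward strand and three on the reverse strand."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(translate("ATGGCCTAA"))
print(six_frame_translations("ATGGCC"))
```

Note that `translate("CTT")` and `translate("CTG")` both give `L`: different nucleotide sequences can encode the same protein sequence.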
BLAST
• Also available as a command line tool (guess which one we'll be using???)
• Need to conquer some basic concepts
  – Alignment
  – Scoring an alignment
  – Substitution matrices
String A = a b c d e
String B = a c d e f
A (good) alignment would be:
String A = a b c d e -
           |   | | |
String B = a - c d e f
Alignment
Many alignments are possible; we want to find the best
g c t g a a c g
c t a t a a t c
Bad:
g c t g a a c g - - - - - - -
- - - - - - - c t a t a a t c
Many alignments are possible; we want to find the best
g c t g a a c g
c t a t a a t c
Better?
g c t g - a a - c g
  | |     | |   |
- c t a t a a t c -
To decide which alignment is best we need
- A way to examine all possible alignments
- A way to compute a score that gives the quality of the alignment
Scoring sequence similarity
• A simple scheme
  +1 for a match
  -1 for a mismatch
String A = a b c d e
           |   | | |
String B = a c c d e
+4 (four matches)
-1 (one mismatch)
Total Score: 3
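The simple scheme can be sketched in a few lines of Python. This is only an illustration: `simple_score` is a made-up name, and it assumes the two strings are already aligned, have equal length, and contain no gap characters.

```python
def simple_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score two already-aligned, equal-length strings column by column."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    # +1 for every identical column, -1 for every differing column.
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# The example from the slide: 4 matches and 1 mismatch.
print(simple_score("abcde", "accde"))  # 3
```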
Scoring based on Biology
• Nucleotides are not mutated randomly
• Transition mutations are more common
  – Purine (A/G) to purine (A/G)
  – Pyrimidine (C/T) to pyrimidine (C/T)
• Transversion mutations are less common
  – Purine to pyrimidine, or pyrimidine to purine
• Can build a scoring scheme to reflect this:
  – Residue is the same = +1
  – Residue undergoes transition = 0
  – Residue undergoes transversion = -1
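A minimal sketch of the transition/transversion scheme, assuming sequences that contain only A, C, G, and T and are already aligned without gaps. The names `nucleotide_score` and `score_alignment` are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x: str, y: str) -> int:
    """+1 for identity, 0 for a transition, -1 for a transversion."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 1
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 0 if same_class else -1


def score_alignment(a: str, b: str) -> int:
    """Sum the per-column scores of two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(nucleotide_score(x, y) for x, y in zip(a, b))


print(nucleotide_score("A", "G"))             # transition: 0
print(nucleotide_score("A", "C"))             # transversion: -1
print(score_alignment("GATTACA", "GACTATA"))  # 5 identities, 2 transitions: 5
```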
Scoring Based on Biology
• Amino acids are not mutated at random either
• Those of similar physicochemical types are more likely to replace each other
• Instead of guessing what these rates might be, can measure empirically
Scoring Based on Biology
• Margaret Dayhoff (1978)
  – Collected statistics on protein substitution frequencies
  – Built the first set of protein substitution matrices
  – Point accepted mutation (PAM) matrices
BLOSUM
• BLOSUM (BLOck SUbstitution Matrix) - Henikoff and Henikoff
• A new substitution matrix, preferred today
• Much better for more divergent species (constructed using divergent species alignments)
• BLOSUM62 is the matrix used by default in most recent alignment applications such as BLAST.
BLOSUM62
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
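To show how a substitution matrix is used, the sketch below scores two short ungapped peptide alignments with a handful of entries copied from the BLOSUM62 table. The dictionary is deliberately a small subset, not the full matrix, and `pair_score` and `score_ungapped` are illustrative names.

```python
# A few BLOSUM62 entries copied from the table (subset only, keys stored in sorted order).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("D", "D"): 6, ("I", "I"): 4, ("L", "L"): 4, ("W", "W"): 11,
    ("D", "E"): 2, ("I", "L"): 2, ("I", "V"): 3,
    ("D", "I"): -3, ("D", "L"): -4,
}


def pair_score(x: str, y: str) -> int:
    """Look up a residue pair; the matrix is symmetric, so the pair is sorted first."""
    key = tuple(sorted((x.upper(), y.upper())))
    return BLOSUM62_SUBSET[key]  # raises KeyError for pairs outside the subset


def score_ungapped(a: str, b: str) -> int:
    """Sum substitution scores over two equal-length peptides without gaps."""
    if len(a) != len(b):
        raise ValueError("aligned peptides must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


# I -> L is a conservative (hydrophobic to hydrophobic) change and scores positively;
# I -> D swaps a hydrophobic residue for an acidic one and is penalised.
print(score_ungapped("AIW", "ALW"))  # 4 + 2 + 11 = 17
print(score_ungapped("AIW", "ADW"))  # 4 - 3 + 11 = 12
```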
Scoring Gaps
• What about gaps?
• Usually, a gap opening is more of a penalty than a gap extension
• Why? A single mutational event may insert more than one base.
• Commonly used is the affine gap penalty:
  Gap opening penalty of 11
  Gap extension penalty of 1 for each additional residue
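The affine gap penalty can be sketched as follows. This follows the wording above, where the first gapped residue costs 11 and each additional residue costs 1. Tools differ on whether the first residue also pays the extension penalty, so treat the exact formula as an assumption. The function names are illustrative.

```python
import re


def affine_gap_penalty(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Penalty for one contiguous gap: open once, then extend per additional residue."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned: str, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Sum the penalties of every run of '-' characters in one aligned sequence."""
    return sum(
        affine_gap_penalty(len(run), gap_open, gap_extend)
        for run in re.findall(r"-+", aligned)
    )


# One gap of length 3 is much cheaper than three separate gaps of length 1.
print(affine_gap_penalty(3))           # 11 + 1 + 1 = 13
print(3 * affine_gap_penalty(1))       # 33
print(total_gap_penalty("AC---GT-A"))  # 13 + 11 = 24
```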
Scoring Wrap Up
• Now we have a good way to score a particular alignment
  1. Score substitutions appropriately reflecting biology
  2. Score gaps appropriately reflecting biology
• But how to generate all the possible alignments?
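One standard exact answer, not named on these slides, is dynamic programming in the style of Needleman-Wunsch: rather than listing every alignment, fill a table of best scores for all prefix pairs. The sketch below is an illustration under simplifying assumptions: it uses the simple +1/-1 scheme and a linear gap penalty of -1 per gap column instead of the affine penalty, and `global_align` is a made-up name.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # a[i-1] against a gap
                score[i][j - 1] + gap,            # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# The strings from the earlier slide: 4 matches and 2 gap columns.
print(global_align("abcde", "acdef"))  # (2, 'abcde-', 'a-cdef')
```

The table has (n+1) x (m+1) cells, so the work grows with the product of the sequence lengths, which becomes costly when searching a large database.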
Approximate Methods
• Need more speed!
• Approximate methods have been developed that are
  – Great at detecting close relationships
  – Inferior to exact methods for picking up distant relationships
  – Approximate! (i.e., no guarantee that the optimal match is found)
• Start with identical "words"
  – Called k-tuples or k-mers
  – Use these words to quickly find perfect matches
  – Then use the slower methods to grow the matches
• BLAST works this way
Heuristic – any approach that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals
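The "identical words" step can be sketched as a k-mer lookup. This is a simplified illustration of seeding only, not BLAST's actual implementation: it finds exact shared words and does not extend them. The names `kmer_index` and `find_seeds` are illustrative.

```python
from collections import defaultdict


def kmer_index(seq: str, k: int) -> dict:
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where an identical k-mer occurs in both."""
    index = kmer_index(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        # .get avoids adding empty entries to the defaultdict for absent words.
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


# The two sequences from the alignment slides, with words of length 2.
print(find_seeds("gctgaacg", "ctataatc", k=2))  # [(1, 0), (4, 4)]
```

Each seed marks a diagonal on which a slower, more careful alignment could then be grown; with `k=3` these two sequences share no word at all, which illustrates why longer words are faster but miss weaker similarity.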
Significance of Alignments
• Now we can find the best scoring alignment (or at least approximately if using BLAST)
• But is it significant in the statistical sense?
  – What is the likelihood that you are observing true biological similarity (evolution) vs random chance?
• E (expect) value = the number of hits with an equal or better score one can "expect" to see by chance when searching a database of a particular size
• Takes into account the size of the database but not the number of queries (beware of multiple testing!)
• Lower = less likely to be a chance match, so more likely to be biologically meaningful
E values
E Value   How many random alignments just as good?
1         1 in 1
0.2       1 in 5
1e-5      1 in 100,000
1e-9      1 in 1,000,000,000
0         0%
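The multiple-testing warning can be made concrete with a small sketch. It assumes chance hits follow a Poisson distribution with mean E, so the probability of at least one chance hit is 1 - exp(-E), and that queries are independent, so expected chance hits simply add up across queries. The function names are illustrative.

```python
import math


def prob_at_least_one_chance_hit(e_value: float) -> float:
    """P(at least one chance hit) = 1 - exp(-E), assuming a Poisson model."""
    # expm1 keeps precision when e_value is very small.
    return -math.expm1(-e_value)


def expected_chance_hits(e_value: float, n_queries: int) -> float:
    """Expected number of chance hits at this E value across n_queries searches."""
    return e_value * n_queries


# For small E the probability is almost equal to E itself.
print(prob_at_least_one_chance_hit(1e-5))
# For E = 1 a chance hit is likely, but not certain.
print(prob_at_least_one_chance_hit(1.0))
# A threshold of 1e-5 looks strict, but over 20,000 queries it allows
# about 0.2 chance hits in total.
print(expected_chance_hits(1e-5, 20_000))
```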