Page 1
DATABASE SEARCH & SEQUENCE COMPARISON
• “ NOTHING IN BIOLOGY MAKES SENSE
EXCEPT IN THE LIGHT OF EVOLUTION ”
Theodosius Dobzhansky, 1973
• NOTHING IN COMPUTATIONAL BIOLOGY MAKES SENSE
EXCEPT IN LIGHT OF SEQUENCE COMPARISON
© SIMR Bioinformatics
- more or less literal description of daily practices
Page 2
SEQUENCE COMPARISON
• “ IN BIOMOLECULAR SEQUENCES
HIGH SEQUENCE SIMILARITY USUALLY IMPLIES
SIGNIFICANT FUNCTIONAL OR STRUCTURAL SIMILARITY
”
- “ The 1ST fact of sequence analysis ”, D. Gusfield, 1997.
• IN BIOMOLECULAR SEQUENCES,
HIGH SEQUENCE SIMILARITY USUALLY IMPLIES
EVOLUTIONARY RELATIONSHIP.
• INFERENCE OF EVOLUTIONARY RELATIONSHIP
USUALLY IS REQUIRED
FOR INFERENCE OF COMMON STRUCTURE / FUNCTION
Page 3
Zuckerkandl and Pauling, J. Theor. Biol., 1965
• PROTEINS AND NUCLEIC ACIDS CONTAIN
INFORMATION ABOUT EVOLUTION:
what the ancestral molecule was,
when it existed,
how it changed
• MANY OBSERVED CHANGES ARE NOT STRONGLY
SELECTED
“cryptic polymorphism” ( 1970+, Lewontin & Harris: 30% genes
in population are polymorphic; 1970s, Kimura: neutral theory )
“dormancy” of the whole genes and copies
Page 4
SEQUENCE COMPARISON
• SEQUENCES OF BIOPOLYMERS CONTAIN INFORMATION
ABOUT THEIR STRUCTURE, FUNCTION,
AND EVOLUTIONARY HISTORY
• COMPARISON OF RELATED SEQUENCES
IS THE MAJOR WAY OF EXTRACTING THIS INFORMATION
• RELATED BUT DIFFERENT TASKS
• FIND SEQUENCES THAT NEED TO BE ALIGNED
• ALIGN THEM
• EVALUATE STAT. SIGNIFICANCE OF THE ALIGNMENTS
• also issues of algorithm efficiency
Page 5
SOURCES OF MUTATION AND POLYMORPHISM
• POINT SUBSTITUTIONS AND SMALL INDELS
• ERRORS OF DNA REPLICATION
• ERRORS OF DNA REPAIR
• DNA REARRANGEMENTS AT LONGER RANGE
• ALSO ERRORS OF REPAIR
• ERRORS OF RECOMBINATION
• LEGITIMATE RECOMBINATION PROCESSES
• GENE DUPLICATIONS
• PIECES OF GENES / PROTEINS MAY BE SHUFFLED
Page 6
HOMOLOGS AND THEIR SUBSETS
( diagram labels: ORTHOLOGS ; PARALOGS ; ORTHOLOGS AND PARALOGS )
Page 7
HOMOLOGY ≡ COMMON ANCESTRY
• IT IS EITHER THERE OR IT IS NOT ( NO DEGREES )
• OBJECTION 1: WHAT IF ONLY HALF OF THE MOLECULE IS
HOMOLOGOUS ? - JUST SAY SO
• OBJECTION 2: WE MAY MEAN THE DEGREE OF CERTAINTY
THAT THEY ARE HOMOLOGOUS
1. JUST SAY SO
2. SOME STATISTICIANS DO NOT LIKE IT EITHER
3. 60 % IDENTITY MAY CONFER 100 % BELIEF THAT
HOMOLOGY EXISTS
• ORTHOLOGY / PARALOGY IS ESTABLISHED AFTER HOMOLOGY IS ESTABLISHED
• “FUNCTIONAL HOMOLOGY” USUALLY DOES NOT MAKE SENSE
( CALL IT THE SAME FUNCTION )
Page 8
Gibbs and McIntyre, Eur.J.Biochem, 1970
Page 9
DIAGRAM OF IMMUNOGLOBULIN REPEATS
Page 10
gi|117687|sp|P27450|CX32_ARATH GAP JUNCTION CX32 PROTEIN (CON... 570 e-162
gi|15237062|ref|NP_195285.1| (NM_119725) protein kinase - lik... 187 1e-46
gi|18398350|ref|NP_565408.1| (NM_127276) putative protein kin... 107 1e-22
gi|15242204|ref|NP_197012.1| (NM_121512) serine/threonine spe... 102 3e-21
gi|15233058|ref|NP_189510.1| (NM_113789) protein kinase, puta... 102 4e-21
gi|15222437|ref|NP_172237.1| (NM_100631) protein kinase APK1A... 99 5e-20
gi|15239047|ref|NP_196702.1| (NM_121179) putative protein [Ar... 95 7e-19
gi|15241749|ref|NP_195849.1| (NM_120307) serine/threonine-spe... 91 9e-18
IS THERE CONNEXIN IN PLANTS ( 1992-1993) ?
Page 11
consensus 1 YELGEKLGSGAFGKVYKGKHKD-------TGEIVAIKILK----KRSLSEkk--krFLREIQILRRLS-HPNIVRLLGVFE--EDDHLYLVMEYMEGGDL 84
query 1 ------------------------------------------------------------------mlwHRNLVKLLGYCR--EDKALLLVYEFIPKEVL 32
1DAW_A 33 YEVVRKVGRGKYSEVFEGINVN-------NNEKCIIKILK----PVKKKK------IKREIKILQNLCgGPNIVKLLDIVRdqHSKTPSLIFEYVNNTDF 115
1FGI_A 23 LVLGKPLGEGAFGQVVLAEAIGldkdkpnRVTKVAVKMLKsdatEKDLSD------LISEMEMMKMIGkHKNIINLLGACTq--DGPLYVIVEYASKGNL 114
gi 6226547 311 IIMHNKLGGGQYGDVYEGYWKR-------HDCTIAVKALK----EDAMPLh----eFLAEAAIMKDLH-HKNLVRLLGVCT--HEAPFYIITEFMCNGNL 392
gi 125484 1078 VHFNEVIGRGHFGCVYHGTLLDnd----gKKIHCAVKSLN----RITDIGev--sqFLTEGIIMKDFS-HPNVLSLLGICLr-SEGSPLVVLPYMKHGDL 1165
gi 1730077 1289 LEFGQTIGKGFFGEVKRGYWR---------ETDVAIKIIY----RDQFKTksslvmFQNEVGILSKLR-HPNVVQFLGACTagGEDHHCIVTEWMGGGSL 1374
gi 125874 108 IQFIQKVGEGAFSEVWEGWWK---------GIHVAIKKLKiigdEEQFKEr-----FIREVQNLKKGN-HQNIVMFIGACY----KPACIITEYMAGGSL 188
gi 462606 3 LTLEEIIGIGGFGKVYRAFWI---------GDEVAVKAARhd-pDEDISQti--enVRQEAKLFAMLK-HPNIIALRGVCL--KEPNLCLVMEFARGGPL 87
gi 1346396 534 RKFKVELGRGESGTVYKGVLE--------DDRHVAVKKLEn---VRQGKEv-----FQAELSVIGRIN-HMNLVRIWGFCS--EGSHRLLVSEYVENGSL 614

consensus 85 FDYLRRNGLL---------------LSEKEAKKIALQILRG--LE-YLHSRG---IVHRDLKPENILLDEN-------------GTVKIADFG--LARK- 147
query 33 RVMFLRRNDP---------------FPWDLRIKIVICAARGpcVStQLTKRE---CIYRDLQVFHILLDLS--------------------YGavLSRVs 94
1DAW_A 116 KVLYPTLT-------------------DYDIRYYIYELLKA--LD-YCHSQG---IMHRDVKPHNVMIDHEl------------RKLRLIDWG--LAEF- 175
1FGI_A 115 REYLQARRppgleysynpshnpeeqlsSKDLVSCAYQVARG--ME-YLASKK---CIHRDLAARNVLVTED-------------NVMKIADFG--LARD- 192
gi 6226547 393 LEYLRRTDksl--------------lpPIILVQMASQIASG--MS-YLEARH---FIHRDLAARNCLVSEH-------------NIVKIADFG--LARF- 456
gi 125484 1166 RNFIRNEThn---------------ptVKDLIGFGLQVAKG--MK-YLASKK---FVHRDLAARNCMLDEK-------------FTVKVADFG--LARD- 1228
gi 1730077 1375 RQFLTDHFnll-------------eqnPHIRLKLALDIAKG--MN-YLHGWTp-pILHRDLSSRNILLDHNidpknpvvssrqdIKCKISDFG--LSRL- 1454
gi 125874 189 YNILHNPNsstpk----------vkysFPLVLKMATDMALG--LL-HLHSIT---IVHRDLTSQNILLDEL-------------GNIKISDFG--LSAE- 256
gi 462606 88 NRVLSGKRi-----------------pPDILVNWAVQIARG--MN-YLHDEAivpIIHRDLKSSNILILQKveng-----dlsnKILKITDFG--LARE- 159
gi 1346396 615 ANILFSEGgni-------------lldWEGRFNIALGVAKG--LA-YLHHEClewVIHCDVKPENILLDQA-------------FEPKITDFG--LVKL- 682

consensus 148 ---LESS--SYEKLTTFVGT----PEYM-APEVLE---G-RGYSSKVDVWSLGVILYELLTG----------------------KLPFPG------IDPL 205
query 95 gpwLVAM--EQQNREVHRGTakvhRRHI-KVMLLLeyiA-GHLYVKSVAFAFGVVLLEIMTGltahntkrprgqaenhlmrtyvmddkhtqtatpyythk 190
1DAW_A 176 ---YHP----GKEYNVRVAS----RYFK-GPELLV---DlQDYDYSLDMWSLGCMFAGMIFRkepffyghdnhdqlvkiakvlgTDGLNVylnkyrIELD 260
1FGI_A 193 ---IHHi--dYYKKTTNGRL----PVKWmAPEALF---D-RIYTHQSDVWSFGVLLWEIFTLg---------------------GSPYPG-------VPV 251
gi 6226547 457 ---MKEd--tYTAHAGAKFP----IKWT-APEGLA---F-NTFSSKSDVWAFGVLLWEIATYg---------------------MAPYPG-------VEL 514
gi 125484 1229 ---MYDkeyySVHNKTGAKL----PVKWmALESLQ---T-QKFTTKSDVWSFGVVLWELMTRg---------------------APPYPD------VNTF 1290
gi 1730077 1455 ---KKEq---ASQMTQSVGC----IPYM-APEVFK---G-DSNSEKSDVYSYGMVLFELLTS----------------------DEPQQD------MKPM 1511
gi 125874 257 ---KSReg-sMTMTNGGICN----PRWR-PPELTK---NlGHYSEKVDVYCFSLVVWEILTG----------------------EIPFSD------LDGS 316
gi 462606 160 ---WH-----RTTKMSAAGT----YAWM-APEVIR---A-SMFSKGSDVWSYGVLLWELLTG----------------------EVPFRG------IDGL 214
gi 1346396 683 ---LNRgg-sTQNVSHVRGT----LGYI-APEWVS---S-LPITAKVDVYSYGVVLLELLTGtrvse-------------lvggTDEVHSmlrklvRMLS 756

consensus 206 EELFRIKERP-------RLRLPLPPNCSEELKDLIKKCLNKDPEKRPTAKEILNHPWF 256
query 191 rteieeqnneikginkvnhnqrvagtrlqfalrhytlllviepdpknqtthegsrsks 248
1DAW_A 261 PQLEALVGRHsrkpwlkFMNADNQHLVSPEAIDFLDKLLRYDHQERLTALEAMTHPYF 318
1FGI_A 252 EELFKLLKEG--------HRMDKPSNCTNELYMMMRDCWHAVPSQRPTFKQLVEDLdr 301
gi 6226547 515 SNVYGLLENG--------FRMDGPQGCPPSVYRLMLQCWNWSPSDRPRFRDIHFNLen 564
gi 125484 1291 DITVYLLQG---------RRLLQPEYCPDPLYEVMLKCWHPKAEMRPSFSELVSRIsa 1339
gi 1730077 1512 KMAHLAAYES--------YRPPIPLTTSSKWKEILTQCWDSNPDSRPTFKQIIVHLke 1561
gi 125874 317 QRSAQVAYAG--------LRPPIPEYCDPELKLLLTQCWEADPNDRPPFTYIVNKLke 366
gi 462606 215 RVAYGVAMN--------KLALPIPSTCPEPFAKLMEDCWNPDPHSRPSFTNILDQLtt 264
gi 1346396 757 AKLEGEEQSWidgyldsKLNRPVNYVQARTLIKLAVSCLEEDRSKRPTMEHAVQTLls 814
Page 12
• ARBITRARY ALIGNMENTS ( esp. ARBITRARY GAPS )
FAIL TO RETRIEVE RIGHT SIGNALS
• DATABASE SEARCH IS MUCH LESS ARBITRARY
• COMPUTER ANALYSIS MAY BE VIEWED AS A FALSIFICATION
EXPERIMENT OF WET-LAB “RESULTS”
• CX32 IN PLANTS IS A PROTEIN KINASE, NOT A CONNEXIN
… in 1992 , this could all be figured out , but in 1970 ?
CONNEXIN IN PLANTS : CONCLUSIONS
Page 13
-BARNEY
BRITNEY

BA-RNEY
BRITNEY

BAR--NEY
B-RITNEY

BAR--NEY
-BRITNEY
BARNEY AND BRITNEY - A CONNECTION ?
• USEFUL FOR ANNOYING PARENTS
• HARD-WORKING
• SING A LOT
Page 14
DYNAMIC PROGRAMMING (aka DYN. PLANNING)

   1  1  1  1
1  2  3  4  5
1  3  6 10 15
1  4 10 20 35
1  5 15 35 70

   1  1  1  1
1  2  3  4  5
1  3  6 10 15
0  3  9 19 34
0  3 12 31 65

   1  1  1  1
1  2  3  1  2
1  3  6  7  9
0  3  9 16 25
0  3 12 28 53
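The three grids above read as a route-counting exercise: each cell holds the number of ways to reach it from the top-left corner by moving only right or down, and a closed street contributes nothing. A minimal sketch under that reading follows; the function name `count_paths` and the way closed edges are written are assumptions made for illustration, not part of the slides.

```python
def count_paths(size, blocked=frozenset()):
    """Count right/down routes from the top-left node to every node.

    size    -- number of nodes along each side of the square grid
    blocked -- closed edges, each written as ((r1, c1), (r2, c2))
    """
    ways = [[0] * size for _ in range(size)]
    ways[0][0] = 1  # exactly one way to stand at the start
    for r in range(size):
        for c in range(size):
            if r == 0 and c == 0:
                continue
            total = 0
            # arrive from the node above, unless that edge is closed
            if r > 0 and ((r - 1, c), (r, c)) not in blocked:
                total += ways[r - 1][c]
            # arrive from the node to the left, unless that edge is closed
            if c > 0 and ((r, c - 1), (r, c)) not in blocked:
                total += ways[r][c - 1]
            ways[r][c] = total
    return ways


open_grid = count_paths(5)
one_closed = count_paths(5, {((2, 0), (3, 0))})
two_closed = count_paths(5, {((2, 0), (3, 0)), ((1, 2), (1, 3))})
print(open_grid[4][4], one_closed[4][4], two_closed[4][4])
```

Each cell is filled once from its two already-computed neighbours, which is the whole point of dynamic programming: no route is ever enumerated explicitly.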
Page 15
DYNAMIC PROGRAMMING - Needleman and Wunsch ( 1970 )
    B    R    I    T    N
B   1    0/1  0/1  0/1  0/1
A   0/1  0/1  0/1  0/1  0/1
R   0/1  1/2  0/2  0/2  0/2
N   0/1  0/2  0/2  0/2  1/3
BAR--NEY
B-RITNEY
• CAN BE AUTOMATED
• ASKS EVERY AMINO ACID TO BE ALIGNED WITH SOMETHING
• DOES NOT TELL WHETHER SEQUENCES ARE RELATED
• OUTCOME IS DEPENDENT ON HOW MATCHES AND MISMATCHES ARE SCORED
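The points above can be tried out with a small global aligner. This is a minimal sketch of the Needleman and Wunsch idea with a linear gap cost, not a reconstruction of the 1970 formulation; the function name and the default scores (match 1, mismatch 0, gap -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # a prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace one optimal path back from the bottom-right corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


for gap_cost in (0, -1):
    print(gap_cost, needleman_wunsch("BARNEY", "BRITNEY", gap=gap_cost))
```

Changing only the gap cost changes which alignment wins, and the aligner returns an alignment for any pair of strings, related or not.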
Page 16
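-BARNEY    BA-RNEY    BAR--NEY    BAR--NEY
BRITNEY    BRITNEY    B-RITNEY    -BRITNEY
BARNEY AND BRITNEY - HOW TO QUANTIFY ?
( M = match, R = replacement, ID = indel, IDend = indel at the end )
M/R/ID/IDend           1/0/0/0   1/0/-1/0   1/0/-2/-1
-BARNEY / BRITNEY         3         3          3
BA-RNEY / BRITNEY         4         3          2
BAR--NEY / B-RITNEY       5         2          1
BAR--NEY / -BRITNEY       4         2          1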
Each type of match and each type of mismatch to be scored differently !
Other ideas ?
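One way to make the M/R/ID/IDend bookkeeping concrete is to score a finished alignment column by column. The sketch below assumes M is the match score, R the replacement score, ID the cost of a gap column inside the alignment, and IDend the cost of a gap column at either end; the function name is hypothetical. The slide does not spell out how gap runs are counted, so the totals for the 1/0/-2/-1 scheme may differ from the table.

```python
def score_alignment(top, bottom, match=1, mismatch=0, indel=0, indel_end=0):
    """Sum column scores of a pairwise alignment written with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    n = len(top)

    def end_gap_columns(row):
        # columns covered by a leading or trailing run of gaps in this row
        lead = len(row) - len(row.lstrip("-"))
        trail = len(row) - len(row.rstrip("-"))
        return set(range(lead)) | set(range(n - trail, n))

    end_cols = end_gap_columns(top) | end_gap_columns(bottom)
    total = 0
    for i, (x, y) in enumerate(zip(top, bottom)):
        if x == "-" or y == "-":
            total += indel_end if i in end_cols else indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


ALIGNMENTS = [("-BARNEY", "BRITNEY"), ("BA-RNEY", "BRITNEY"),
              ("BAR--NEY", "B-RITNEY"), ("BAR--NEY", "-BRITNEY")]
for top, bottom in ALIGNMENTS:
    print(top, bottom,
          score_alignment(top, bottom, 1, 0, 0, 0),
          score_alignment(top, bottom, 1, 0, -1, 0))
```

The ranking of the four alignments flips between the two schemes, which is the reason the scoring system has to be chosen before the alignment is computed.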
Page 17
SUBSTITUTION MATRICES
• SET OF VALUES IN THE FORM OF 20*20 (AMINO ACIDS)
OR 4*4 (NUCLEOTIDES) MATRIX
• “ SCORE OF CHANGING i TO j ”
• OR, MORE COMMONLY, ONE HALF OF SUCH MATRIX
• “ SCORE OF ALIGNING i TO j ”
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Page 18
HOW TO DERIVE SCORE VALUES
• “ FIRST PRINCIPLES ”
• HOW MANY MUTATIONS ARE REQUIRED TO CHANGE i TO j
• CAN CALCULATE FROM GENETIC CODE, BUT IMPLIES FUNNY
THINGS ABOUT EVOLUTION
• “CHEMICAL ISOFUNCTIONALITY”
STL IV M
DE
AGF YW
TS
Y D E
A G
L I V M F
W
• MUCH, MUCH BETTER - BASED ON THE OBSERVED
FREQUENCIES OF SUBSTITUTIONS
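The "first principles" option above, counting how many mutations are required to change i to j, can be sketched from the standard genetic code. The table layout (bases in TCAG order) and the function name `min_mutations` are choices made for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append(b1 + b2 + b3)


def min_mutations(aa1, aa2):
    """Fewest nucleotide changes turning some codon of aa1 into one of aa2."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in CODONS[aa1] for c2 in CODONS[aa2])


for pair in ("MI", "WC", "DE", "FK"):
    print(pair, min_mutations(*pair))
```

Such a score depends only on the code table, so it treats a chemically drastic change and a mild one alike whenever both are one nucleotide apart.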
Page 19
PAM, BLOSUM, ... - ALL SHOULD BE LOG-ODDS
• SCORE FOR ALIGNING i TO j : $S_{ij} = \log \left( q_{ij} / ( p_i p_j ) \right)$
p – background frequencies, q – target frequencies: how to get them?
• PAM (= point accepted mutations) - Dayhoff, 1968
• alignments of 85% identical proteins, 71 families, mostly animal
• used model of evolutionary change with many assumptions
• directly observed data are at short evolutionary distances
• for more distant relationships, multiply matrix by itself - e.g. PAM 120
• BLOSUM (= summary of BLOCKS) – the Henikoffs, 1992
• 500 families, more members, and they are more diverse
• but most importantly – conservation is of a different type!
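The log-odds definition above can be tried on a toy two-letter alphabet. All numbers below are invented for illustration and do not come from PAM or BLOSUM; `q` holds made-up target frequencies for aligned pairs and `p` the matching background frequencies.

```python
import math

# Made-up target frequencies of aligned pairs in "related" sequences.
q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
# Background frequencies (here simply the marginals of q).
p = {"A": 0.5, "B": 0.5}


def log_odds(i, j):
    """Score in bits for aligning i to j: log2( q_ij / (p_i * p_j) )."""
    return math.log2(q[(i, j)] / (p[i] * p[j]))


# Average score per column when letters are paired at random ...
expected_random = sum(p[i] * p[j] * log_odds(i, j) for i in p for j in p)
# ... and when pairs are drawn from the target distribution.
expected_related = sum(q[(i, j)] * log_odds(i, j) for i in p for j in p)

print(round(log_odds("A", "A"), 3), round(log_odds("A", "B"), 3))
print(round(expected_random, 3), round(expected_related, 3))
```

Pairs seen more often than chance get positive scores, pairs seen less often get negative ones, and the average over random pairings comes out below zero.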
Page 20
BLOCKS : THIS IS THE WAY PROTEINS LIVE
YLHSRG IVHRDLKPENILLDEN
QLTKRE CIYRDLQVFHILLDLS
YCHSQG IMHRDVKPHNVMIDHE
YLASKK CIHRDLAARNVLVTED
YLEARH FIHRDLAARNCLVSEH
YLASKK FVHRDLAARNCMLDEK
YLHGWT ILHRDLSSRNILLDHN
HLHSIT IVHRDLTSQNILLDEL
YLHDEA IIHRDLKSSNILILQK
YLHHEC VIHCDVKPENILLDQA
consensus 85 FDYLRRNGLL---------------LSEKEAKKIALQILRG--LE-YLHSRG---IVHRDLKPENILLDEN-------------GTVKIADFG--LARK- 147
query 33 RVMFLRRNDP---------------FPWDLRIKIVICAARGpcVStQLTKRE---CIYRDLQVFHILLDLS--------------------YGavLSRVs 94
1DAW_A 116 KVLYPTLT-------------------DYDIRYYIYELLKA--LD-YCHSQG---IMHRDVKPHNVMIDHEl------------RKLRLIDWG--LAEF- 175
1FGI_A 115 REYLQARRppgleysynpshnpeeqlsSKDLVSCAYQVARG--ME-YLASKK---CIHRDLAARNVLVTED-------------NVMKIADFG--LARD- 192
gi 6226547 393 LEYLRRTDksl--------------lpPIILVQMASQIASG--MS-YLEARH---FIHRDLAARNCLVSEH-------------NIVKIADFG--LARF- 456
gi 125484 1166 RNFIRNEThn---------------ptVKDLIGFGLQVAKG--MK-YLASKK---FVHRDLAARNCMLDEK-------------FTVKVADFG--LARD- 1228
gi 1730077 1375 RQFLTDHFnll-------------eqnPHIRLKLALDIAKG--MN-YLHGWTp-pILHRDLSSRNILLDHNidpknpvvssrqdIKCKISDFG--LSRL- 1454
gi 125874 189 YNILHNPNsstpk----------vkysFPLVLKMATDMALG--LL-HLHSIT---IVHRDLTSQNILLDEL-------------GNIKISDFG--LSAE- 256
gi 462606 88 NRVLSGKRi-----------------pPDILVNWAVQIARG--MN-YLHDEAivpIIHRDLKSSNILILQKveng-----dlsnKILKITDFG--LARE- 159
gi 1346396 615 ANILFSEGgni-------------lldWEGRFNIALGVAKG--LA-YLHHEClewVIHCDVKPENILLDQA-------------FEPKITDFG--LVKL- 682
• BLOCKS ARE REGIONS WITH HIGH STRUCTURAL, FUNCTIONAL, AND EVOLUTIONARY SIGNAL ( or signal-to-noise ratio )
Page 21
( SUBSTITUTION MATRIX AS ON PAGE 17 )
LOG-ODDS: RANDOM SCORES ARE NEGATIVE
Page 22
RANDOM SEQUENCES ( per S. Altschul ) :
ALIGN TWO SEQUENCES WITH SCORE S (this is called HSP)
- IS THIS SCORE SIGNIFICANTLY DIFFERENT FROM ALIGNING
TO A RANDOM SEQUENCE ? - where RANDOM may be
• COMPUTER – GENERATED ( perhaps with assumptions )
• THEMSELVES BUT SHUFFLED ( Z-scores in many programs )
• REAL BUT UNRELATED SEQUENCES ( e.g. all database )
- and SIGNIFICANT is …..
Page 23
BLAST STATISTICS
$E = K m n \, e^{-\lambda S}$
E – the expected number of HSPs with score S or higher observed by chance, given the size and complexity of the database
m and n – effective lengths of database and query
λ – parameter from the substitution matrix ( precomputed )
K – parameter from the search space ( length + complexity )
Raw Score S : sum of scores in all aligned positions ( matches and mismatches ) minus gap penalties
Bit Score S’ : get rid of λ , reset the log base : $S' = \lambda S / \ln 2 = \log_2 (K/E) + \log_2 (mn)$
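The relations on this slide can be checked numerically. The sketch assumes the Karlin-Altschul form E = K m n e^(-λS) and follows the slide in taking the score in bits to be λS / ln 2; NCBI BLAST reports a bit score that also absorbs ln K, so its numbers differ by a constant. The values of K and λ, the search-space sizes, and the function names are illustrative assumptions only.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of chance HSPs scoring at least raw_score."""
    return K * m * n * math.exp(-lam * raw_score)


def score_in_bits(raw_score, lam):
    """Raw score rescaled so that the log base is 2 (lambda removed)."""
    return lam * raw_score / math.log(2)


K, LAM = 0.134, 0.318   # illustrative parameter values, not from the slides
m, n = 10**8, 250       # assumed database and query lengths
S = 100                 # assumed raw score of one HSP

E = e_value(S, m, n, K, LAM)
bits = score_in_bits(S, LAM)
print(E, bits, math.log2(K / E) + math.log2(m * n))
```

Doubling m doubles E while leaving the score in bits untouched, which is the sense in which the score is independent of the database and the E-value is not.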
Page 25
WHAT IS IN THE BLAST SCORE ?
Bit Score : $S' = \log_2 (K/E) + \log_2 (mn)$
- usually dominated by log2 mn ,
i.e. score distinguishing chance from non-chance is the number of binary choices to map the HSP ( 40 – 45 )
SCORE : DIRECT MEASURE
INDEPENDENT OF THE DB ( IF BITS )
STAYS THE SAME WHEN SEQUENCES ARE FLIPPED
E and P VALUES : CALCULATED FROM THE SCORE
DEPENDENT ON THE DB SIZE
NON-SYMMETRICAL EVEN WITH UNGAPPED HSPs
HOMOLOGY STILL HAS TO BE INFERRED
THE OPPOSITE IS NOT TRUE: LOW S ≠ NO HOMOLOGY
Page 26
touches ATP-phosphate
touches Mg++
Page 29
YNIVAQARTGSGKTASFAIPL
Y
Page 30
POS   A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
1     0.1  0    0    0    0    0    0    0.1  0    0    0    0    0    0    0    0.3  0.1  0    0    0
2     0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0
3     0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0    0
4     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0.3  0.7  0    0    0
T-G-[GSAT]-G-K-[ST]
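A pattern written in this style maps directly onto a regular expression. The sketch below handles only the subset of the syntax used here (single letters and bracketed alternatives joined by hyphens); the helper name `pattern_to_regex` is hypothetical, and the sequence is the one shown on the earlier slide.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'T-G-[GSAT]-G-K-[ST]' into the regex 'TG[GSAT]GK[ST]'."""
    return "".join(pattern.split("-"))


PATTERN = "T-G-[GSAT]-G-K-[ST]"
SEQUENCE = "YNIVAQARTGSGKTASFAIPL"

regex = re.compile(pattern_to_regex(PATTERN))
hit = regex.search(SEQUENCE)
print(regex.pattern, hit.start(), hit.group())
```

A regular expression answers only yes or no at each position; it carries no information about how often each allowed residue actually occurs.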
Page 31
ALL ABOVE ARE PROBABILISTIC MODELS
• IN EACH POSITION, EXPECTED OR OBSERVED FREQUENCY
OF EACH AMINO ACID
• ALIGNMENTS , PROFILES , REGULAR EXPRESSIONS , PSSMs ,
HMMs , etc. ARE ALL INCARNATIONS OF THE SAME IDEA
• ALL OF THE ABOVE CAN BE MATCHED TO EACH OTHER OR
TO A SINGLE SEQUENCE , USING VARIATIONS OF A SCORING
FUNCTION
• AFFORD BETTER SENSITIVITY AND SELECTIVITY THAN ONE
SEQUENCE
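As one incarnation of the idea above, a position-specific frequency table can be slid along a sequence and each window scored against it. The sketch uses the four-position frequency table as read from the earlier slide; the uniform background of 0.05, the small pseudocount that avoids log of zero, and the function names are all assumptions made for illustration, and the columns are not renormalized.

```python
import math

# One dict per position; residues not listed have frequency 0.
PROFILE = [
    {"A": 0.1, "G": 0.1, "S": 0.3, "T": 0.1},
    {"G": 1.0},
    {"K": 1.0},
    {"S": 0.3, "T": 0.7},
]
BACKGROUND = 0.05  # assumed uniform background over 20 amino acids
PSEUDO = 0.001     # assumed pseudocount so unseen residues stay finite


def window_score(window):
    """Sum of per-position log-odds (bits) of the window under the profile."""
    return sum(math.log2((column.get(aa, 0.0) + PSEUDO) / BACKGROUND)
               for column, aa in zip(PROFILE, window))


def best_window(sequence):
    """Start index of the highest-scoring window of profile width."""
    width = len(PROFILE)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda s: window_score(sequence[s:s + width]))


SEQ = "YNIVAQARTGSGKTASFAIPL"
start = best_window(SEQ)
print(start, SEQ[start:start + len(PROFILE)])
```

Unlike a regular expression, the profile still produces a graded score for near-misses, which is where the gain in sensitivity over a single sequence comes from.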
Page 39
• HMM IS FOR HIDDEN MARKOV MODEL : WHAT IS HIDDEN ?
• “ OCCASIONALLY DISHONEST CASINO ”
142326546665562262143165
the state of the die is ‘hidden’, but can be revealed
142326546665562262143165
142326546665562262143165
• GIVEN: SEQUENCE; ALIGNMENT; PROBABILITIES OF CHANGES
DETERMINE: IS SEQUENCE PRODUCED BY EVOLUTION OF THE
FAMILY THAT MAKES UP THIS ALIGNMENT ??
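The casino example can be made concrete with Viterbi decoding, which recovers the most probable sequence of hidden states (fair or loaded die) behind a run of rolls. The transition and emission probabilities below are assumptions in the style of the usual textbook version of this example and are not given on the slide; `viterbi` is a name chosen here.

```python
import math

STATES = ("F", "L")  # fair die, loaded die
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05},   # assumed switching probabilities
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {str(k): 1 / 6 for k in range(1, 7)},
        "L": {**{str(k): 0.1 for k in range(1, 6)}, "6": 0.5}}


def viterbi(rolls):
    """Most probable state path for a string of die rolls, e.g. '1426'."""
    if not rolls:
        return ""
    # log-probability of the best path ending in each state
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]])
             for s in STATES}
    back = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][roll]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # follow the back-pointers from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


ROLLS = "142326546665562262143165"
print(ROLLS)
print(viterbi(ROLLS))
```

In a profile HMM the hidden states are match, insert, and delete positions of the alignment rather than two dice, but the decoding step has the same shape.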
Page 42
SOURCES AND ACKNOWLEDGEMENTS
• King Jordan’s class: http://jhunix.hcf.jhu.edu/~kjordan6/
• Sean Eddy’s: http://bio5495.wustl.edu/
• Steve Altschul’s tutorial:
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html