DATABASE SEARCH & SEQUENCE COMPARISON • “ NOTHING IN BIOLOGY MAKES SENSE EXCEPT IN LIGHT OF EVOLUTION ” Theodosius Dobzhansky, 1973 NOTHING IN COMPUTATIONAL BIOLOGY MAKES SENSE EXCEPT IN LIGHT OF SEQUENCE COMPARISON © SIMR Bioinformatics - more or less literal description of daily practices

Dec 23, 2015


Stuart Holmes
Page 1

DATABASE SEARCH & SEQUENCE COMPARISON

• “ NOTHING IN BIOLOGY MAKES SENSE

EXCEPT IN LIGHT OF EVOLUTION ”

Theodosius Dobzhansky, 1973

• NOTHING IN COMPUTATIONAL BIOLOGY MAKES SENSE

EXCEPT IN LIGHT OF SEQUENCE COMPARISON

© SIMR Bioinformatics

- more or less literal description of daily practices

Page 2

SEQUENCE COMPARISON

• “ IN BIOMOLECULAR SEQUENCES,

HIGH SEQUENCE SIMILARITY USUALLY IMPLIES

SIGNIFICANT FUNCTIONAL OR STRUCTURAL SIMILARITY ”

- “ The 1ST fact of sequence analysis ”, D. Gusfield, 1997.

• IN BIOMOLECULAR SEQUENCES,

HIGH SEQUENCE SIMILARITY USUALLY IMPLIES

EVOLUTIONARY RELATIONSHIP.

• INFERENCE OF EVOLUTIONARY RELATIONSHIP

USUALLY IS REQUIRED

FOR INFERENCE OF COMMON STRUCTURE / FUNCTION

Page 3

Zuckerkandl and Pauling, J. Theor. Biol., 1965

• PROTEINS AND NUCLEIC ACIDS CONTAIN

INFORMATION ABOUT EVOLUTION:

what the ancestral molecule was,

when it existed,

how it changed

• MANY OBSERVED CHANGES ARE NOT STRONGLY

SELECTED

“cryptic polymorphism” ( 1970+, Lewontin & Harris: 30% of genes in a population are polymorphic; 1970s, Kimura: neutral theory )

“dormancy” of the whole genes and copies

Page 4

SEQUENCE COMPARISON

• SEQUENCES OF BIOPOLYMERS CONTAIN INFORMATION

ABOUT THEIR STRUCTURE, FUNCTION,

AND EVOLUTIONARY HISTORY

• COMPARISON OF RELATED SEQUENCES

IS THE MAJOR WAY OF EXTRACTING THIS INFORMATION

• RELATED BUT DIFFERENT TASKS

• FIND SEQUENCES THAT NEED TO BE ALIGNED

• ALIGN THEM

• EVALUATE STAT. SIGNIFICANCE OF THE ALIGNMENTS

• also issues of algorithm efficiency

Page 5

SOURCES OF MUTATION AND POLYMORPHISM

• POINT SUBSTITUTIONS AND SMALL INDELS

• ERRORS OF DNA REPLICATION

• ERRORS OF DNA REPAIR

• DNA REARRANGEMENTS AT LONGER RANGE

• ALSO ERRORS OF REPAIR

• ERRORS OF RECOMBINATION

• LEGITIMATE RECOMBINATION PROCESSES

• GENE DUPLICATIONS

• PIECES OF GENES / PROTEINS MAY BE SHUFFLED

Page 6

HOMOLOGS AND THEIR SUBSETS

[Diagram labels: ORTHOLOGS ; ORTHOLOGS AND PARALOGS ; PARALOGS]

Page 7

HOMOLOGY ≡ COMMON ANCESTRY

• IT IS EITHER THERE OR IT IS NOT ( NO DEGREES )

• OBJECTION 1: WHAT IF ONLY HALF OF THE MOLECULE IS HOMOLOGOUS ?

- JUST SAY SO

• OBJECTION 2: WE MAY MEAN THE DEGREE OF CERTAINTY THAT THEY ARE HOMOLOGOUS

1. JUST SAY SO

2. SOME STATISTICIANS DO NOT LIKE IT EITHER

3. 60 % IDENTITY MAY CONFER 100 % BELIEF THAT HOMOLOGY EXISTS

• ORTHOLOGY / PARALOGY IS ESTABLISHED AFTER

• “FUNCTIONAL HOMOLOGY” USUALLY DOES NOT MAKE SENSE

( CALL IT THE SAME FUNCTION )

Page 8

Gibbs and McIntyre, Eur.J.Biochem, 1970

Page 9

DIAGRAM OF IMMUNOGLOBULIN REPEATS

Page 10

gi|117687|sp|P27450|CX32_ARATH GAP JUNCTION CX32 PROTEIN (CON...   570  e-162
gi|15237062|ref|NP_195285.1| (NM_119725) protein kinase - lik...   187  1e-46
gi|18398350|ref|NP_565408.1| (NM_127276) putative protein kin...   107  1e-22
gi|15242204|ref|NP_197012.1| (NM_121512) serine/threonine spe...   102  3e-21
gi|15233058|ref|NP_189510.1| (NM_113789) protein kinase, puta...   102  4e-21
gi|15222437|ref|NP_172237.1| (NM_100631) protein kinase APK1A...    99  5e-20
gi|15239047|ref|NP_196702.1| (NM_121179) putative protein [Ar...    95  7e-19
gi|15241749|ref|NP_195849.1| (NM_120307) serine/threonine-spe...    91  9e-18

IS THERE CONNEXIN IN PLANTS ( 1992-1993) ?

Page 11

consensus      1 YELGEKLGSGAFGKVYKGKHKD-------TGEIVAIKILK----KRSLSEkk--krFLREIQILRRLS-HPNIVRLLGVFE--EDDHLYLVMEYMEGGDL 84
query          1 ------------------------------------------------------------------mlwHRNLVKLLGYCR--EDKALLLVYEFIPKEVL 32
1DAW_A        33 YEVVRKVGRGKYSEVFEGINVN-------NNEKCIIKILK----PVKKKK------IKREIKILQNLCgGPNIVKLLDIVRdqHSKTPSLIFEYVNNTDF 115
1FGI_A        23 LVLGKPLGEGAFGQVVLAEAIGldkdkpnRVTKVAVKMLKsdatEKDLSD------LISEMEMMKMIGkHKNIINLLGACTq--DGPLYVIVEYASKGNL 114
gi 6226547   311 IIMHNKLGGGQYGDVYEGYWKR-------HDCTIAVKALK----EDAMPLh----eFLAEAAIMKDLH-HKNLVRLLGVCT--HEAPFYIITEFMCNGNL 392
gi 125484   1078 VHFNEVIGRGHFGCVYHGTLLDnd----gKKIHCAVKSLN----RITDIGev--sqFLTEGIIMKDFS-HPNVLSLLGICLr-SEGSPLVVLPYMKHGDL 1165
gi 1730077  1289 LEFGQTIGKGFFGEVKRGYWR---------ETDVAIKIIY----RDQFKTksslvmFQNEVGILSKLR-HPNVVQFLGACTagGEDHHCIVTEWMGGGSL 1374
gi 125874    108 IQFIQKVGEGAFSEVWEGWWK---------GIHVAIKKLKiigdEEQFKEr-----FIREVQNLKKGN-HQNIVMFIGACY----KPACIITEYMAGGSL 188
gi 462606      3 LTLEEIIGIGGFGKVYRAFWI---------GDEVAVKAARhd-pDEDISQti--enVRQEAKLFAMLK-HPNIIALRGVCL--KEPNLCLVMEFARGGPL 87
gi 1346396   534 RKFKVELGRGESGTVYKGVLE--------DDRHVAVKKLEn---VRQGKEv-----FQAELSVIGRIN-HMNLVRIWGFCS--EGSHRLLVSEYVENGSL 614

consensus     85 FDYLRRNGLL---------------LSEKEAKKIALQILRG--LE-YLHSRG---IVHRDLKPENILLDEN-------------GTVKIADFG--LARK- 147
query         33 RVMFLRRNDP---------------FPWDLRIKIVICAARGpcVStQLTKRE---CIYRDLQVFHILLDLS--------------------YGavLSRVs 94
1DAW_A       116 KVLYPTLT-------------------DYDIRYYIYELLKA--LD-YCHSQG---IMHRDVKPHNVMIDHEl------------RKLRLIDWG--LAEF- 175
1FGI_A       115 REYLQARRppgleysynpshnpeeqlsSKDLVSCAYQVARG--ME-YLASKK---CIHRDLAARNVLVTED-------------NVMKIADFG--LARD- 192
gi 6226547   393 LEYLRRTDksl--------------lpPIILVQMASQIASG--MS-YLEARH---FIHRDLAARNCLVSEH-------------NIVKIADFG--LARF- 456
gi 125484   1166 RNFIRNEThn---------------ptVKDLIGFGLQVAKG--MK-YLASKK---FVHRDLAARNCMLDEK-------------FTVKVADFG--LARD- 1228
gi 1730077  1375 RQFLTDHFnll-------------eqnPHIRLKLALDIAKG--MN-YLHGWTp-pILHRDLSSRNILLDHNidpknpvvssrqdIKCKISDFG--LSRL- 1454
gi 125874    189 YNILHNPNsstpk----------vkysFPLVLKMATDMALG--LL-HLHSIT---IVHRDLTSQNILLDEL-------------GNIKISDFG--LSAE- 256
gi 462606     88 NRVLSGKRi-----------------pPDILVNWAVQIARG--MN-YLHDEAivpIIHRDLKSSNILILQKveng-----dlsnKILKITDFG--LARE- 159
gi 1346396   615 ANILFSEGgni-------------lldWEGRFNIALGVAKG--LA-YLHHEClewVIHCDVKPENILLDQA-------------FEPKITDFG--LVKL- 682

consensus    148 ---LESS--SYEKLTTFVGT----PEYM-APEVLE---G-RGYSSKVDVWSLGVILYELLTG----------------------KLPFPG------IDPL 205
query         95 gpwLVAM--EQQNREVHRGTakvhRRHI-KVMLLLeyiA-GHLYVKSVAFAFGVVLLEIMTGltahntkrprgqaenhlmrtyvmddkhtqtatpyythk 190
1DAW_A       176 ---YHP----GKEYNVRVAS----RYFK-GPELLV---DlQDYDYSLDMWSLGCMFAGMIFRkepffyghdnhdqlvkiakvlgTDGLNVylnkyrIELD 260
1FGI_A       193 ---IHHi--dYYKKTTNGRL----PVKWmAPEALF---D-RIYTHQSDVWSFGVLLWEIFTLg---------------------GSPYPG-------VPV 251
gi 6226547   457 ---MKEd--tYTAHAGAKFP----IKWT-APEGLA---F-NTFSSKSDVWAFGVLLWEIATYg---------------------MAPYPG-------VEL 514
gi 125484   1229 ---MYDkeyySVHNKTGAKL----PVKWmALESLQ---T-QKFTTKSDVWSFGVVLWELMTRg---------------------APPYPD------VNTF 1290
gi 1730077  1455 ---KKEq---ASQMTQSVGC----IPYM-APEVFK---G-DSNSEKSDVYSYGMVLFELLTS----------------------DEPQQD------MKPM 1511
gi 125874    257 ---KSReg-sMTMTNGGICN----PRWR-PPELTK---NlGHYSEKVDVYCFSLVVWEILTG----------------------EIPFSD------LDGS 316
gi 462606    160 ---WH-----RTTKMSAAGT----YAWM-APEVIR---A-SMFSKGSDVWSYGVLLWELLTG----------------------EVPFRG------IDGL 214
gi 1346396   683 ---LNRgg-sTQNVSHVRGT----LGYI-APEWVS---S-LPITAKVDVYSYGVVLLELLTGtrvse-------------lvggTDEVHSmlrklvRMLS 756

consensus    206 EELFRIKERP-------RLRLPLPPNCSEELKDLIKKCLNKDPEKRPTAKEILNHPWF 256
query        191 rteieeqnneikginkvnhnqrvagtrlqfalrhytlllviepdpknqtthegsrsks 248
1DAW_A       261 PQLEALVGRHsrkpwlkFMNADNQHLVSPEAIDFLDKLLRYDHQERLTALEAMTHPYF 318
1FGI_A       252 EELFKLLKEG--------HRMDKPSNCTNELYMMMRDCWHAVPSQRPTFKQLVEDLdr 301
gi 6226547   515 SNVYGLLENG--------FRMDGPQGCPPSVYRLMLQCWNWSPSDRPRFRDIHFNLen 564
gi 125484   1291 DITVYLLQG---------RRLLQPEYCPDPLYEVMLKCWHPKAEMRPSFSELVSRIsa 1339
gi 1730077  1512 KMAHLAAYES--------YRPPIPLTTSSKWKEILTQCWDSNPDSRPTFKQIIVHLke 1561
gi 125874    317 QRSAQVAYAG--------LRPPIPEYCDPELKLLLTQCWEADPNDRPPFTYIVNKLke 366
gi 462606    215 RVAYGVAMN--------KLALPIPSTCPEPFAKLMEDCWNPDPHSRPSFTNILDQLtt 264
gi 1346396   757 AKLEGEEQSWidgyldsKLNRPVNYVQARTLIKLAVSCLEEDRSKRPTMEHAVQTLls 814

Page 12

• ARBITRARY ALIGNMENTS ( esp. ARBITRARY GAPS )

FAIL TO RETRIEVE RIGHT SIGNALS

• DATABASE SEARCH IS MUCH LESS ARBITRARY

• COMPUTER ANALYSIS MAY BE VIEWED AS A FALSIFICATION

EXPERIMENT OF WET-LAB “RESULTS”

• CX32 IN PLANTS IS PROTEIN KINASE NOT CONNEXIN

… in 1992 , this could all be figured out , but in 1970 ?

CONNEXIN IN PLANTS : CONCLUSIONS

Page 13

-BARNEY
BRITNEY

BA-RNEY
BRITNEY

BAR--NEY
B-RITNEY

BAR--NEY
-BRITNEY

BARNEY AND BRITNEY - A CONNECTION ?

• USEFUL FOR ANNOYING PARENTS

• HARD-WORKING

• SING A LOT

Page 14

DYNAMIC PROGRAMMING (aka DYN. PLANNING)

   1  1  1  1
1  2  3  4  5
1  3  6 10 15
1  4 10 20 35
1  5 15 35 70

   1  1  1  1
1  2  3  4  5
1  3  6 10 15
0  3  9 19 34
0  3 12 31 65

   1  1  1  1
1  2  3  1  2
1  3  6  7  9
0  3  9 16 25
0  3 12 28 53
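The three grids can be read as path counting by dynamic programming: each cell holds the number of ways to reach it from the top-left corner when moving only right or down, so it is the sum of the cell above and the cell to the left. The sketch below reproduces the bottom-right values 70, 65 and 53 under that reading. The function name `count_paths`, the zero-based (row, column) coordinates, and the interpretation of the zeros as blocked cells and of the drop from 3 to 1 in the third grid as one removed connection are assumptions made for illustration.

```python
def count_paths(rows, cols, blocked_nodes=(), blocked_edges=()):
    """Count right/down paths from the top-left cell to every cell of a grid.

    blocked_nodes: cells (row, col) that no path may pass through.
    blocked_edges: pairs of adjacent cells whose connection is removed.
    """
    blocked_nodes = set(blocked_nodes)
    blocked_edges = {frozenset(edge) for edge in blocked_edges}
    counts = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if (i, j) in blocked_nodes:
                continue  # a blocked cell contributes zero paths
            if i == 0 and j == 0:
                counts[i][j] = 1  # exactly one way to be at the start
                continue
            total = 0
            # Paths arriving from the cell above.
            if i > 0 and frozenset({(i - 1, j), (i, j)}) not in blocked_edges:
                total += counts[i - 1][j]
            # Paths arriving from the cell to the left.
            if j > 0 and frozenset({(i, j - 1), (i, j)}) not in blocked_edges:
                total += counts[i][j - 1]
            counts[i][j] = total
    return counts


if __name__ == "__main__":
    blocked = [(3, 0), (4, 0)]
    cut = [((1, 2), (1, 3))]
    print(count_paths(5, 5)[4][4])
    print(count_paths(5, 5, blocked_nodes=blocked)[4][4])
    print(count_paths(5, 5, blocked_nodes=blocked, blocked_edges=cut)[4][4])
```

Each cell is computed once from already-solved neighbours, which is the same idea that lets an alignment algorithm avoid enumerating every possible alignment.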

Page 15

DYNAMIC PROGRAMMING - Needleman and Wunsch ( 1970 )

     B    R    I    T    N
B    1   0/1  0/1  0/1  0/1
A   0/1  0/1  0/1  0/1  0/1
R   0/1  1/2  0/2  0/2  0/2
N   0/1  0/2  0/2  0/2  1/3

BAR--NEY
B-RITNEY

• CAN BE AUTOMATED

• ASKS EVERY AMINO ACID TO BE ALIGNED WITH SOMETHING

• DOES NOT TELL WHETHER SEQUENCES ARE RELATED

• OUTCOME IS DEPENDENT ON HOW MATCHES AND MISMATCHES ARE SCORED
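A minimal sketch of global alignment by dynamic programming, in the common textbook formulation with a linear gap penalty rather than the exact tabulation shown on the slide. The function name `needleman_wunsch` and the default scores are assumptions chosen for illustration. With match = 1 and no penalties, BARNEY and BRITNEY score 5; with a gap penalty of -1 the best score drops to 3, which illustrates the last bullet: the outcome depends on the scoring.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    for gap in (0, -1):
        total, top, bottom = needleman_wunsch("BARNEY", "BRITNEY", gap=gap)
        print(gap, total)
        print(top)
        print(bottom)
```

The algorithm always returns an alignment, related sequences or not, so the score still needs a separate significance test.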

Page 16

BARNEY AND BRITNEY - HOW TO QUANTIFY ?

M/R/ID/IDend            1/0/0/0   1/0/-1/0   1/0/-2/-1

-BARNEY / BRITNEY          3         3          3
BA-RNEY / BRITNEY          4         3          2
BAR--NEY / B-RITNEY        5         2          1
BAR--NEY / -BRITNEY        4         2          1

Each type of match and each type of mismatch to be scored differently !

Other ideas ?
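One way to make the M/R/ID/IDend scheme concrete is a function that scores an already-gapped alignment column by column. This is a sketch under assumptions: the name `score_alignment` is invented here, M is read as the match score, R as the replacement (mismatch) score, ID as the penalty per internal gap position and IDend as the penalty per gap position at either end. For the alignment BA-RNEY / BRITNEY the three schemes give 4, 3 and 2.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=0, end_gap=0):
    """Score two gapped strings of equal length under an M/R/ID/IDend scheme."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    n = len(top)

    def end_gap_columns(s):
        # Columns covered by the leading or trailing run of '-' in s.
        lead = len(s) - len(s.lstrip("-"))
        trail = len(s) - len(s.rstrip("-"))
        return set(range(lead)) | set(range(n - trail, n))

    ends = end_gap_columns(top) | end_gap_columns(bottom)
    total = 0
    for i, (x, y) in enumerate(zip(top, bottom)):
        if x == "-" or y == "-":
            total += end_gap if i in ends else gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    schemes = [(1, 0, 0, 0), (1, 0, -1, 0), (1, 0, -2, -1)]
    for m, r, indel, indel_end in schemes:
        print((m, r, indel, indel_end),
              score_alignment("BA-RNEY", "BRITNEY", m, r, indel, indel_end))
```

Replacing the single match and mismatch values with a lookup table indexed by the two residues gives a substitution matrix.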

Page 17

SUBSTITUTION MATRICES

• SET OF VALUES IN THE FORM OF 20*20 (AMINO ACIDS)

OR 4*4 (NUCLEOTIDES) MATRIX

• “ SCORE OF CHANGING i TO j ”

• OR, MORE COMMONLY, ONE HALF OF SUCH MATRIX

• “ SCORE OF ALIGNING i TO j ”

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Page 18

HOW TO DERIVE SCORE VALUES

• “ FIRST PRINCIPLES ”

• HOW MANY MUTATIONS ARE REQUIRED TO CHANGE i TO j

• CAN CALCULATE FROM GENETIC CODE, BUT IMPLIES FUNNY

THINGS ABOUT EVOLUTION

• “CHEMICAL ISOFUNCTIONALITY”

[Diagram: amino acids grouped by chemical similarity; extracted labels: STL IV M, DE, AGF YW, TS, Y D E, A G, L I V M F, W]

• MUCH, MUCH BETTER - BASED ON THE OBSERVED

FREQUENCIES OF SUBSTITUTIONS

Page 19

PAM, BLOSUM, ... - ALL SHOULD BE LOG-ODDS

• SCORE FOR ALIGNING i TO j : $S_{ij} = \log \frac{q_{ij}}{p_i p_j}$

p – background frequencies, q – target frequencies: how to get them?

• PAM (= point accepted mutations) - Dayhoff, 1968

• alignments of 85% identical proteins, 71 families, mostly animal

• used model of evolutionary change with many assumptions

• directly observed data are at short evolutionary distances

• for more distant relationships, multiply matrix by itself - e.g. PAM 120

• BLOSUM (= summary of BLOCKS) – the Henikoffs, 1992

• 500 families, more members, and they are more diverse

• but most importantly – conservation is of a different type!
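A sketch of how log-odds scores can be derived from a block of aligned sequences, loosely following the BLOSUM procedure: count residue pairs within each column, turn the counts into target frequencies q, derive background frequencies p from them, and take the log ratio. Several things here are assumptions made for illustration: the function name `block_log_odds`, the use of unordered pairs (so the expected frequency of a mixed pair is 2 p_i p_j), the half-bit scaling (2 × log2), and the omission of sequence clustering and rounding that the real procedure applies.

```python
import math
from collections import Counter
from itertools import combinations


def block_log_odds(block):
    """Return {(a, b): score} in half-bits for residue pairs seen in a block.

    block: list of equal-length, ungapped aligned sequences.
    """
    pair_counts = Counter()
    for column in zip(*block):
        # Every pair of sequences contributes one residue pair per column.
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}  # target freqs

    # Background frequency of a residue = its share of all pair members.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = 2 * math.log2(freq / expected)
    return scores


if __name__ == "__main__":
    # First six residues of a few kinase sequences, used only as toy input.
    toy_block = ["IVHRDL", "CIYRDL", "IMHRDV", "CIHRDL", "FIHRDL"]
    for pair, s in sorted(block_log_odds(toy_block).items()):
        print(pair, round(s, 2))
```

Pairs that never occur in the block get no score at all here; a real matrix needs far more data so that every one of the 210 pairs is observed.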

Page 20

BLOCKS : THIS IS THE WAY PROTEINS LIVE

YLHSRG   IVHRDLKPENILLDEN
QLTKRE   CIYRDLQVFHILLDLS
YCHSQG   IMHRDVKPHNVMIDHE
YLASKK   CIHRDLAARNVLVTED
YLEARH   FIHRDLAARNCLVSEH
YLASKK   FVHRDLAARNCMLDEK
YLHGWT   ILHRDLSSRNILLDHN
HLHSIT   IVHRDLTSQNILLDEL
YLHDEA   IIHRDLKSSNILILQK
YLHHEC   VIHCDVKPENILLDQA

consensus 85 FDYLRRNGLL---------------LSEKEAKKIALQILRG--LE-YLHSRG---IVHRDLKPENILLDEN-------------GTVKIADFG--LARK- 147

query 33 RVMFLRRNDP---------------FPWDLRIKIVICAARGpcVStQLTKRE---CIYRDLQVFHILLDLS--------------------YGavLSRVs 94

1DAW_A 116 KVLYPTLT-------------------DYDIRYYIYELLKA--LD-YCHSQG---IMHRDVKPHNVMIDHEl------------RKLRLIDWG--LAEF- 175

1FGI_A 115 REYLQARRppgleysynpshnpeeqlsSKDLVSCAYQVARG--ME-YLASKK---CIHRDLAARNVLVTED-------------NVMKIADFG--LARD- 192

gi 6226547 393 LEYLRRTDksl--------------lpPIILVQMASQIASG--MS-YLEARH---FIHRDLAARNCLVSEH-------------NIVKIADFG--LARF- 456

gi 125484 1166 RNFIRNEThn---------------ptVKDLIGFGLQVAKG--MK-YLASKK---FVHRDLAARNCMLDEK-------------FTVKVADFG--LARD- 1228

gi 1730077 1375 RQFLTDHFnll-------------eqnPHIRLKLALDIAKG--MN-YLHGWTp-pILHRDLSSRNILLDHNidpknpvvssrqdIKCKISDFG--LSRL- 1454

gi 125874 189 YNILHNPNsstpk----------vkysFPLVLKMATDMALG--LL-HLHSIT---IVHRDLTSQNILLDEL-------------GNIKISDFG--LSAE- 256

gi 462606 88 NRVLSGKRi-----------------pPDILVNWAVQIARG--MN-YLHDEAivpIIHRDLKSSNILILQKveng-----dlsnKILKITDFG--LARE- 159

gi 1346396 615 ANILFSEGgni-------------lldWEGRFNIALGVAKG--LA-YLHHEClewVIHCDVKPENILLDQA-------------FEPKITDFG--LVKL- 682

• BLOCKS ARE REGIONS WITH HIGH STRUCTURAL, FUNCTIONAL, AND EVOLUTIONARY SIGNAL ( or signal-to-noise ratio )

Page 21

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

LOG-ODDS: RANDOM SCORES ARE NEGATIVE
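The claim can be checked numerically on a toy example. For a log-odds matrix, the expected score of a randomly paired position (residues drawn independently from the background) is negative, while the expected score of a truly related position (pairs drawn from the target frequencies) is positive. The two-letter alphabet and all frequencies below are invented for illustration, as are the function names.

```python
import math

# Hypothetical background frequencies for a two-letter alphabet.
P = {"A": 0.5, "B": 0.5}
# Hypothetical target frequencies for ordered aligned pairs (sum to 1).
Q = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def log_odds(a, b):
    """Score in bits for aligning a with b."""
    return math.log2(Q[(a, b)] / (P[a] * P[b]))


def expected_random_score():
    """Mean score when the two residues are drawn independently."""
    return sum(P[a] * P[b] * log_odds(a, b) for a in P for b in P)


def expected_related_score():
    """Mean score when the pair is drawn from the target frequencies."""
    return sum(freq * log_odds(a, b) for (a, b), freq in Q.items())


if __name__ == "__main__":
    print("random pairs :", round(expected_random_score(), 3))
    print("related pairs:", round(expected_related_score(), 3))
```

A negative expected score for random pairs is what keeps local alignments from growing indefinitely across unrelated regions.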

Page 22

RANDOM SEQUENCES ( per S. Altschul ) :

ALIGN TWO SEQUENCES WITH SCORE S (this is called HSP)

- IS THIS SCORE SIGNIFICANTLY DIFFERENT FROM ALIGNING

TO A RANDOM SEQUENCE ? - where RANDOM may be

• COMPUTER – GENERATED ( perhaps with assumptions )

• THEMSELVES BUT SHUFFLED ( Z-scores in many programs )

• REAL BUT UNRELATED SEQUENCES ( e.g. all database )

- and SIGNIFICANT is …..
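The shuffling approach can be sketched as follows: score the real pair, score the query against many shuffled copies of the subject, and express the real score as a Z-score relative to the shuffled distribution. To keep the example short, the alignment score is simply the length of the longest common subsequence (match = 1, no penalties), which is an assumption and not what production programs use. The function names, the number of shuffles and the fixed seed are likewise illustrative choices.

```python
import random
import statistics


def lcs_score(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffle_z_score(query, subject, n_shuffles=200, seed=0):
    """Z-score of the real score against scores for shuffled subjects.

    Returns 0.0 if every shuffled score is identical (zero spread).
    """
    rng = random.Random(seed)
    observed = lcs_score(query, subject)
    letters = list(subject)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        null_scores.append(lcs_score(query, "".join(letters)))
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        return 0.0
    return (observed - statistics.mean(null_scores)) / spread


if __name__ == "__main__":
    print(shuffle_z_score("IVHRDLKPENILLDEN", "CIHRDLAARNVLVTED"))
    print(shuffle_z_score("IVHRDLKPENILLDEN", "IVHRDLKPENILLDEN"))
```

Shuffling preserves the composition of the subject, so a high Z-score reflects residue order and not merely shared amino-acid content.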

Page 23

BLAST STATISTICS

$E = K m n \, e^{-\lambda S}$

E - the expected number of HSPs with score S or higher observed by chance, given the size and complexity of database

m and n – effective lengths of database and query

λ – parameter from the substitution matrix ( precomputed )

K – parameter from the search space (length+complexity)

Raw Score S : sum of scores in all aligned positions ( matches and mismatches) minus gap penalties

Bit Score S’ : get rid of λ, reset the log base: $S' = \lambda S / \ln 2 = \log_2 (K/E) + \log_2 (mn)$
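A small numeric sketch of these relations, under stated assumptions. The parameter values (λ = 0.267, K = 0.041) and the lengths are illustrative placeholders, not values taken from the slides, and the function names are invented. The score in bits follows the slide's form, S’ = λS / ln 2; the bit score printed in NCBI BLAST reports additionally subtracts ln K before dividing by ln 2, so the two differ by the constant log2 K.

```python
import math


def expected_hits(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def score_in_bits(raw_score, lam):
    """The slide's S' = lambda * S / ln 2."""
    return lam * raw_score / math.log(2)


if __name__ == "__main__":
    lam, k = 0.267, 0.041        # illustrative values only
    m, n = 1_000_000_000, 300    # hypothetical database and query lengths
    raw = 100
    e_value = expected_hits(raw, m, n, lam, k)
    bits = score_in_bits(raw, lam)
    print("E  =", e_value)
    print("S' =", bits)
    # The two terms of the decomposition; log2(mn) is usually the larger one.
    print("log2(K/E) =", math.log2(k / e_value))
    print("log2(mn)  =", math.log2(m * n))
```

Doubling the database length m doubles E for the same raw score, which is why E-values from searches against different databases are not directly comparable.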

Page 24

[Plot: S’ versus E]

Page 25

WHAT IS IN THE BLAST SCORE ?

Bit Score : $S' = \log_2 (K/E) + \log_2 (mn)$

- usually dominated by $\log_2 (mn)$ ,

i.e. score distinguishing chance from non-chance is the number of binary choices to map the HSP ( 40 – 45 )

SCORE : DIRECT MEASURE INDEPENDENT OF THE DB ( IF BITS )

STAYS THE SAME WHEN SEQUENCES ARE FLIPPED

E and P VALUES : CALCULATED FROM THE SCORE

DEPENDENT ON THE DB SIZE

NON-SYMMETRICAL EVEN WITH UNGAPPED HSPs

HOMOLOGY STILL HAS TO BE INFERRED

THE OPPOSITE IS NOT TRUE: LOW S ≠ NO HOMOLOGY

Page 26

touches ATP-phosphate

touches Mg++

Page 27
Page 28
Page 29

YNIVAQARTGSGKTASFAIPL

Y

Page 30

A   0.1   0     0     0
R   0     0     0     0
N   0     0     0     0
D   0     0     0     0
C   0     0     0     0
Q   0     0     0     0
E   0     0     0     0
G   0.1   1     0     0
H   0     0     0     0
I   0     0     0     0
L   0     0     0     0
K   0     0     1     0
M   0     0     0     0
F   0     0     0     0
P   0     0     0     0
S   0.3   0     0     0.3
T   0.1   0     0     0.7
W   0     0     0     0
Y   0     0     0     0
V   0     0     0     0

T-G-[GSAT]-G-K-[ST]
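A pattern written this way maps directly onto a regular expression: hyphens only separate positions and bracketed sets are character classes. The helper name `prosite_to_regex` is invented for this sketch, and it handles only the elements that appear in this pattern (fixed residues and bracketed alternatives), not the full pattern syntax. Applied to the sequence YNIVAQARTGSGKTASFAIPL, the pattern matches TGSGKT.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple hyphen-separated pattern into a regular expression.

    Only fixed residues and [..] alternatives are supported in this sketch.
    """
    return "".join(pattern.split("-"))


if __name__ == "__main__":
    regex = prosite_to_regex("T-G-[GSAT]-G-K-[ST]")
    sequence = "YNIVAQARTGSGKTASFAIPL"
    hit = re.search(regex, sequence)
    print(regex)
    if hit:
        print(hit.start(), hit.group())
```

A regular expression answers only yes or no; it cannot express that one residue at a position is more likely than another, which is what a table of per-position frequencies adds.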

Page 31

ALL ABOVE ARE PROBABILISTIC MODELS

• IN EACH POSITION, EXPECTED OR OBSERVED FREQUENCY

OF EACH AMINO ACID

• ALIGNMENTS , PROFILES , REGULAR EXPRESSIONS , PSSMs ,

HMMs , etc. ARE ALL INCARNATIONS OF THE SAME IDEA

• ALL OF THE ABOVE CAN BE MATCHED TO EACH OTHER OR

TO A SINGLE SEQUENCE , USING VARIATIONS OF A SCORING

FUNCTION

• AFFORD BETTER SENSITIVITY AND SELECTIVITY THAN ONE

SEQUENCE
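Matching a position-specific model to a single sequence can be sketched as sliding the model along the sequence and summing per-position log-odds scores. The four columns below use per-position frequencies for a [GSAT]-G-K-[ST] motif; the uniform background of 0.05, the crude pseudocount of 0.01 and the function names are assumptions made for illustration. On the sequence YNIVAQARTGSGKTASFAIPL the best-scoring window is SGKT.

```python
import math

# Per-position residue frequencies; residues not listed have frequency 0.
FREQS = [
    {"A": 0.1, "G": 0.1, "S": 0.3, "T": 0.1},
    {"G": 1.0},
    {"K": 1.0},
    {"S": 0.3, "T": 0.7},
]
BACKGROUND = 0.05   # assumed uniform background over 20 amino acids
PSEUDOCOUNT = 0.01  # crude smoothing so unseen residues do not give log(0)


def window_score(window):
    """Sum of per-position log-odds scores (in bits) for one window."""
    return sum(
        math.log2((column.get(residue, 0.0) + PSEUDOCOUNT) / BACKGROUND)
        for column, residue in zip(FREQS, window)
    )


def best_window(sequence):
    """Start index of the highest-scoring window of the model's width."""
    width = len(FREQS)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: window_score(sequence[i:i + width]))


if __name__ == "__main__":
    seq = "YNIVAQARTGSGKTASFAIPL"
    start = best_window(seq)
    print(start, seq[start:start + len(FREQS)], round(window_score(seq[start:start + 4]), 2))
```

Unlike a yes-or-no pattern, the score degrades gradually, so a near-miss window still ranks above unrelated ones.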

Page 32
Page 33
Page 34
Page 35
Page 36
Page 37
Page 38
Page 39

• HMM IS FOR HIDDEN MARKOV MODEL : WHAT IS HIDDEN ?

• “ OCCASIONALLY DISHONEST CASINO ”

142326546665562262143165

the state of the die is ‘hidden’, but can be revealed

142326546665562262143165

142326546665562262143165

• GIVEN: SEQUENCE; ALIGNMENT; PROBABILITIES OF CHANGES

DETERMINE: IS SEQUENCE PRODUCED BY EVOLUTION OF THE

FAMILY THAT MAKES UP THIS ALIGNMENT ??
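The casino example can be made concrete with Viterbi decoding, which recovers the most probable sequence of hidden states (fair or loaded die) from the observed rolls. The slides do not give the model parameters, so the values below are assumptions in the style of the usual textbook version of this example: a fair die with all faces at 1/6, a loaded die that shows a six half of the time, and rare switches between the two. The function and constant names are also invented for this sketch.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}
# Assumed switching probabilities (not from the slides).
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}


def emission(state, roll):
    """Probability of a roll given the hidden state (assumed values)."""
    if state == "F":
        return 1 / 6
    return 0.5 if roll == "6" else 0.1


def viterbi(rolls):
    """Most probable state path for a non-empty string of rolls '1'..'6'."""
    best = {s: math.log(START[s]) + math.log(emission(s, rolls[0])) for s in STATES}
    back = []
    for roll in rolls[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            # Best predecessor for ending in state s at this roll.
            prev = max(STATES, key=lambda r: best[r] + math.log(TRANS[r][s]))
            pointers[s] = prev
            new_best[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(emission(s, roll))
        back.append(pointers)
        best = new_best
    # Trace the winning path backwards from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    rolls = "142326546665562262143165"
    print(rolls)
    print(viterbi(rolls))
```

In a profile HMM the hidden states are match, insert and delete positions of the family alignment, and the same decoding asks which path through the family model best explains a new sequence.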

Page 40
Page 41
Page 42

SOURCES AND ACKNOWLEDGEMENTS

• King Jordan’s class: http://jhunix.hcf.jhu.edu/~kjordan6/

• Sean Eddy’s: http://bio5495.wustl.edu/

• Steve Altschul’s tutorial:

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html