Tema 13. Sequence comparison. Concept of homology. Sequence alignment. Comparison strategies. BLAST, PSI-Blast. Multiple alignment, profiles. Families of proteins. Functional prediction based on sequence. Gabriel Pons, Departament de Ciències Fisiològiques II, Campus de Ciències de la salut. Bellvitge. Universitat de Barcelona


Dec 18, 2015



Sequence comparison


Goals

• To exploit functional or structural information by identifying homologies between sequences

• Differences between homology and identity

• Two sequences are homologous when:
  – They have the same evolutionary origin
  – They have similar function and structure


• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)

IMPORTANT:
• Sequence homology:
  – An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  – Homology is qualitative
• Sequence similarity:
  – The direct result of observation from a sequence alignment
  – Similarity is quantitative; it can be described using percentages


More definitions

• Orthologs: homologous sequences in different species that arose by speciation and correspond to exactly the same function/structure

• Paralogs: sequences produced by gene duplication within the same organism. Duplication usually leads to a change in function, although the copies often remain functionally related.


Homology


Homology and prediction

• Very divergent protein sequences may support similar structures

• Similar protein structures will probably have related or similar functions


3D STRUCTURE VERSUS SEQUENCE

Sequence alignment between human myoglobin and the globins from hemoglobin


myoglobin, α-globin, β-globin

Comparison of the 3D structures of human myoglobin and the globins from hemoglobin


Superposition of 3D structures of human myoglobin and globin from hemoglobin


Homology and prediction

• Sequence comparison is the simplest way to detect homology between sequences.

• Identity > 30% in proteins implies homology (> 65% for nucleic acids)

• Identity > 80-90% is usual between orthologs from closely related species

• Identity 10-30%: homology, if present, may not be detectable (“twilight zone”)


No me gusta la bioinformatica
Teme usted la ionosfera optica

Nomegusta-labioin-forma--tica
Teme-ustedla-ionosfer-aoptica

64% identity? But…

“I don't like bioinformatics”
“Do you fear the optical ionosphere?”
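The 64% figure can be checked by counting identical columns in the gapped alignment of the two Spanish sentences. The sketch below is only an illustration: the function name `identity_stats` is hypothetical, and the choice of denominator (letters of the first sentence versus total alignment columns) is an assumption, since the slide does not say which one it uses.

```python
def identity_stats(row_a, row_b, gap="-"):
    """Count identical columns in a gapped pairwise alignment.

    Returns (identical columns, total columns, ungapped letters in row_a).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    columns = len(row_a)
    letters_a = sum(1 for a in row_a if a != gap)
    return matches, columns, letters_a


# The two aligned rows from the slide, compared case-insensitively.
row_a = "Nomegusta-labioin-forma--tica".lower()
row_b = "Teme-ustedla-ionosfer-aoptica".lower()

matches, columns, letters_a = identity_stats(row_a, row_b)
print(f"identical columns: {matches}")
print(f"relative to the letters of the first sentence: {100 * matches / letters_a:.0f}%")
print(f"relative to all alignment columns: {100 * matches / columns:.0f}%")
```

The slide's 64% corresponds to dividing by the letters of the first sentence. Either way, the two sentences are unrelated in meaning, which is the point of the example: a percentage of identity alone does not prove a common origin.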


DNA or protein?

• Both give information about homology

• Protein: amino acids can be functionally equivalent


DNA: only identity is relevant

Mismatches do not have a variable cost: usually no substitution is better than another

Canonical base pairing (Watson-Crick)


• Genetic code

Pos 1   Position 2                     Pos 3
        U      C      A      G
U       Phe    Ser    Tyr    Cys       U
U       Phe    Ser    Tyr    Cys       C
U       Leu    Ser    Stop   Stop      A
U       Leu    Ser    Stop   Trp       G
C       Leu    Pro    His    Arg       U
C       Leu    Pro    His    Arg       C
C       Leu    Pro    Gln    Arg       A
C       Leu    Pro    Gln    Arg       G
A       Ile    Thr    Asn    Ser       U
A       Ile    Thr    Asn    Ser       C
A       Ile    Thr    Lys    Arg       A
A       Met    Thr    Lys    Arg       G
G       Val    Ala    Asp    Gly       U
G       Val    Ala    Asp    Gly       C
G       Val    Ala    Glu    Gly       A
G       Val    Ala    Glu    Gly       G

Number of codons per amino acid:
• Trp, Met (1)
• Leu, Ser, Arg (6)
• Ile (3)
• Val, Pro, Thr, Ala, Gly (4)
• Others (2)
• Initiation: AUG
• Stop (3)

Third-base degeneracy

XYC = XYU
XYA ≈ XYG
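The codon counts and the third-base rules can be verified directly from the standard genetic code. The following is a minimal sketch: the 64-letter string is the standard code written in U, C, A, G order (with `*` for stop), and the names `CODON_TABLE` and `same_meaning` are illustrative, not taken from the slides.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code for codons in the order UUU, UUC, UUA, UUG, UCU, ... GGG.
# One-letter amino acid codes; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid (and per stop signal).
degeneracy = Counter(CODON_TABLE.values())


def same_meaning(third_a, third_b):
    """Count the XY prefixes (out of 16) for which XY+third_a and XY+third_b code alike."""
    return sum(
        CODON_TABLE[x + y + third_a] == CODON_TABLE[x + y + third_b]
        for x, y in product(BASES, repeat=2)
    )


for aa, n in sorted(degeneracy.items(), key=lambda item: (item[1], item[0])):
    print(aa, n)
print("XYU and XYC agree for", same_meaning("U", "C"), "of 16 prefixes")
print("XYA and XYG agree for", same_meaning("A", "G"), "of 16 prefixes")
```

The two exceptions to XYA ≈ XYG in the standard code are AUA/AUG (Ile/Met) and UGA/UGG (Stop/Trp), which is why that rule is written as an approximation.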


“Equivalent amino acids”

• Hydrophobic
  – Ala (A), Val (V), Met (M), Leu (L), Ile (I), Phe (F), Trp (W), Tyr (Y)

• Small
  – Gly (G), Ala (A), Ser (S)

• Polar
  – Ser (S), Thr (T), Asn (N), Gln (Q), Tyr (Y)
  – On the protein surface, polar and charged residues are equivalent

• Charged
  – Asp (D), Glu (E) / Lys (K), Arg (R)

• Difficult to substitute
  – Gly (G), Pro (P), Cys (C), His (H)

• BE CAREFUL: amino acids do not always perform the same function in proteins


Histidine: for the heme coordination bonds

Proline in a turn

Two conserved glycines in two separate helices that cross each other

3D visualization of some conserved residues in the globin family (myoglobin structure)


• DNA sequence diverges faster than protein sequence
  – Mutation or recombination may alter the DNA, but the protein must maintain its function/structure

• Protein sequence comparison makes it possible to find and locate very distant homologous proteins


Sequence alignment

• Measuring the degree of similarity/identity, and thus inferring homology, requires an “alignment”

Strong identity/similarity:

AWTRRATVHDGLMEDEFAA
AWTRRATVHDGLCEDEFAA

Weak identity/similarity:

AWTKLATAVVVFEGLCEDEWGG
AWTRRAT---VHDGLMEDEFAA
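To make the difference between identity and similarity concrete, the columns of the two alignments above can be classified with the “equivalent amino acid” groups listed earlier in the slides. This is a rough sketch under stated assumptions: the function names are hypothetical, acidic (D, E) and basic (K, R) residues are treated as two separate groups, and residues from the “difficult to substitute” list get no group unless they also appear in another one.

```python
# Groups of "equivalent" amino acids from the slides (one-letter codes).
# Assumption: D/E and K/R are treated as two separate charged groups.
GROUPS = [
    set("AVMLIFWY"),  # hydrophobic
    set("GAS"),       # small
    set("STNQY"),     # polar
    set("DE"),        # negatively charged
    set("KR"),        # positively charged
]


def are_similar(a, b):
    """True if two different residues share at least one group."""
    return a != b and any(a in group and b in group for group in GROUPS)


def classify_columns(row_a, row_b, gap="-"):
    """Count identical, similar, different and gapped columns of an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"identical": 0, "similar": 0, "different": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gap"] += 1
        elif a == b:
            counts["identical"] += 1
        elif are_similar(a, b):
            counts["similar"] += 1
        else:
            counts["different"] += 1
    return counts


strong = classify_columns("AWTRRATVHDGLMEDEFAA", "AWTRRATVHDGLCEDEFAA")
weak = classify_columns("AWTKLATAVVVFEGLCEDEWGG", "AWTRRAT---VHDGLMEDEFAA")
print("strong:", strong)
print("weak:  ", weak)
```

Group membership is a crude stand-in for a real comparison matrix: it only says whether a substitution is conservative, not how much it should score.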


Alignments

• “Pairwise”
  – 2 sequences

• Multiple
  – More than 2 sequences

• Global
  – Whole sequence is considered

• Local
  – Only similar regions are aligned


Strategies

Depend on the goal

• Sequence comparison
  – Goal: establish homology, identify equivalent amino acids
  – Global, pairwise/multiple

• Database search
  – Goal: identify homologous proteins in a large set of sequences
  – Local, pairwise


Automatic Alignment

• Requires
  – An objective method to compare amino acids or bases in order to “score” the alignment (comparison matrix)
  – An algorithm to find the best alignment, i.e. the one with the maximal score

• Quick and easy to reproduce

• Does not, in general, allow additional information to be introduced
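The two requirements above (a scoring scheme and an algorithm that maximizes the score) can be sketched with the classic Needleman-Wunsch dynamic-programming algorithm for global alignment. The slides do not name a specific algorithm, so this choice is an assumption; the match/mismatch/gap values are also illustrative and stand in for a real comparison matrix.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a simple identity score and a linear gap penalty.

    Returns (optimal score, aligned seq_a, aligned seq_b).
    """
    n, m = len(seq_a), len(seq_b)

    def pair_score(a, b):
        return match if a == b else mismatch

    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1]),
                score[i - 1][j] + gap,  # gap in seq_b
                score[i][j - 1] + gap,  # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1])):
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("AWTKLATAVVVFEGLCEDEWGG", "AWTRRATVHDGLMEDEFAA")
print(best)
print(top)
print(bottom)
```

With an identity-only score the result may differ from the alignment drawn on the slides; a substitution matrix such as BLOSUM62 would replace `pair_score`, and that is where conservative substitutions get rewarded.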


Matrix types

• Identity

• Physico-chemical properties

• Genetics (codon substitution)

• Evolution


BLOSUM62

• Small positive score for substitutions between similar amino acids

• Small positive score for common amino acids

• Infrequent amino acids have a high score

• High penalty for very different amino acids

• Same score regardless of position!


Choice of a matrix!

Rat versus mouse protein (closely related): BLOSUM90 / PAM30
BLOSUM80 / PAM120
BLOSUM62 / PAM180
Rat versus bacterial protein (distantly related): BLOSUM45 / PAM240


Query length   Substitution matrix   Gap costs
<35            PAM-30                (9,1)
35-50          PAM-70                (10,1)
50-85          BLOSUM-80             (10,1)
>85            BLOSUM-62             (10,1)

PAM: Point Accepted Mutation
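The table above can be read as a simple lookup from query length to scoring parameters. The sketch below encodes it as written; the function name is hypothetical, the last row is read as "longer than 85", and the handling of the boundary lengths (35, 50 and 85) is an assumption, because the table's ranges overlap at those values.

```python
def suggest_scoring(query_length):
    """Return (substitution matrix, (gap open, gap extend)) for a query length.

    Assumption: boundary lengths 50 and 85 are assigned to the shorter range.
    """
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (10, 1)


for length in (20, 40, 70, 300):
    matrix, gap_costs = suggest_scoring(length)
    print(length, matrix, gap_costs)
```

Short queries get matrices tuned for close relationships because a short alignment has to be nearly identical to score above chance.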


Gaps (insertions/deletions)

• Usually located in loops

AWTKLATAVVVFEGLCEDEWGG
AWTRRAT---VHDGLMEDEFAA


Global versus local alignment

• Global alignment
  – Finds best possible alignment across entire length of 2 sequences
  – Aligned sequences assumed to be generally similar over entire length

• Local alignment
  – Finds local regions with highest similarity between 2 sequences
  – Aligns these without regard for rest of sequence
  – Sequences are not assumed to be similar over entire length
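A local alignment can be sketched with the Smith-Waterman algorithm, which differs from the global dynamic-programming scheme mainly in that cell scores are never allowed to drop below zero and the traceback starts at the highest-scoring cell. The scoring values and the two toy "proteins" below are made up for illustration: they share one block (WCHWCH) flanked by unrelated residues.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, aligned part of seq_a, aligned part of seq_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # The 0 lets a new local alignment start anywhere.
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_i, best_j = h[i][j], i, j

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i -= 1
            j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical toy sequences: one shared block, unrelated flanks.
protein_1 = "KKKKK" + "WCHWCH" + "PPPPP"
protein_2 = "GGGG" + "WCHWCH" + "TTTT"
local_score, part_1, part_2 = smith_waterman(protein_1, protein_2)
print(local_score, part_1, part_2)
```

Only the shared block is reported; the unrelated flanks, which would drag down a global score, are ignored.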


Comparing sequences against databases

Query (unknown) sequence: ATTVG...LMN

Sequence database:

AGLM...WTKRTCGGLMN..HICGWRKCPGL...

Requires very fast comparison algorithms


Alignments

• “Pairwise”
  – 2 sequences

• Multiple
  – More than 2 sequences

• Global
  – Whole sequence is considered

• Local
  – Only similar regions are aligned


Disadvantages of global alignment

• Slow

• Scores the whole sequence
  – Does not recognize multidomain proteins

A B C

A C’

B D

Global alignment server


Alpha-globin
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Beta-globin
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH


Alpha-actinin
MNQIEPGVQYNYVYDEDEYMIQEEEWDRDLLLDPAWEKQQRKTFTAWCNSHLRKAGTQIENIEEDFRNGLKLMLLLEVISGERLPKPDRGKMRFHKIANVNKALDYIASKGVKLVSIGAEEIVDGNVKMTLGMIWTIILRFAIQDISVEETSAKEGLLLWCQRKTAPYRNVNIQNFHTSWKDGLGLCALIHRHRPDLIDYSKLNKDDPIGNINLAMEIAEKHLDIPKMLDAEDIVNTPKPDERAIMTYVSCFYHAFAGAEQAETAANRICKVLAVNQENERLMEEYERLASELLEWIRRTIPWLENRTPEKTMQAMQKKLEDFRDYRRKHKPPKVQEKCQLEINFNTLQTKLRISNRPAFMPSEGKMVSDIAGAWQRLEQAEKGYEEWLLNEIRRLERLEHLAEKFRQKASTHETWAYGKEQILLQKDYESASLTEVRALLRKHEAFESDLAAHQDRVEQIAAIAQELNELDYHDAVNVNDRCQKICDQWDRLGTLTQKRREALERMEKLLETIDQLHLEFAKRAAPFNNWMEGAMEDLQDMFIVHSIEEIQSLITAHEQFKATLPEADGERQSIMAIQNEVEKVIQSYNIRISSSNPYSTVTMDELRTKWDKVKQLVPIRDQSLQEELARQHANERLRRQFAAQANAIGPWIQNKMEEIARSSIQITGALEDQMNQLKQYEHNIINYKNNIDKLEGDHQLIQEALVFDNKHTNYTMEHIRVGWELLLTTIARTINEVETQILTRDAKGITQEQMNEFRASFNHFDRRKNGLMDHEDFRACLISMGYDLGEAEFARIMTLVDPNGQGTVTFQSFIDFMTRETADTDTAEQVIASFRILASDKPYILAEELRRELPPDQAQYCIKRMPAYSGPGSVPGALDYAAFSSALYGESDL

Calmodulin
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK


Local alignment

• 10-100× faster

• Recognizes individual domains

• Does not necessarily give the best alignment!

• BLAST, FASTA


Basic Local Alignment Search Tool

BLAST (NCBI)


E value (Expect)

• E value:
• Expect: This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.

$E = K \cdot m \cdot n \cdot e^{-\lambda S}$

• S: score
• K, λ: normalization factors
• m: number of letters in the query
• n: number of letters in the database

• Warning: lowering the E threshold → false negatives
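The E-value formula can be evaluated directly to see how the score and the search-space size interact. In this sketch the values of K and λ are placeholders chosen for illustration (in practice they depend on the scoring matrix and gap costs), and the query and database lengths are hypothetical. The bit score S' = (λS − ln K) / ln 2 is the usual normalized form, for which E = m · n · 2^(−S').

```python
import math


def e_value(raw_score, query_len, db_len, k, lam):
    """Expected number of chance hits with at least this score: E = K*m*n*exp(-lambda*S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, k, lam):
    """Normalized score S' = (lambda*S - ln K) / ln 2, so that E = m*n*2**(-S')."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder normalization factors (assumed for illustration only).
K, LAMBDA = 0.041, 0.267
QUERY_LEN, DB_LEN = 250, 100_000_000  # hypothetical search-space sizes

for raw in (40, 60, 80, 100):
    e = e_value(raw, QUERY_LEN, DB_LEN, K, LAMBDA)
    print(f"S = {raw:3d}  bits = {bit_score(raw, K, LAMBDA):6.1f}  E = {e:.3g}")
```

E falls exponentially with the score but grows only linearly with the database size, so doubling the database doubles E for the same hit.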


E parameter (more)

• Expect

For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the more "significant" the match is. However, keep in mind that hits in searches with short sequences can be virtually identical and still have a relatively high E-value. This is because the calculation of the E-value also takes into account the length of the query sequence: shorter sequences have a high probability of occurring in the database purely by chance.


Exercise

• Find the mouse ortholog. Data

• Find the closest human paralog

• Find the most significant homolog in Drosophila