Entropy, Information contents & Logo plots By Thomas Nordahl Petersen

Dec 21, 2015

Page 1: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Entropy, Information contents & Logo plots

By Thomas Nordahl Petersen

Page 2: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

GTTCTTCGTGTTTATTTTTAGGAAATTGATGATTGTTTCTCCTTTTAAAATAGTACTGCTGTTTTTTACTAACGACACATTGAAGAAATCACTTTGGATACGCTTACCGTTATCCAGAGCTACAGCGCTACTAATATGTAATACTTCAGCTCCCCTTAATATTGAGATCTTTTTTAACTAGTTAGGTCTACCTTCTCCCCTTCTTCATTTTAGCCTGTTTGGACTAACATAACTTATTTACATAGTGCCATTGAACGATATTTCCCGTTGTGTTAAGGCTGAGAAGAATTTTCCCGACCATCAAGACAGGTGATTTATCATGCAAAAACTTTTTTTCACAGGGCTAACTTGCGTTTATTGTGTTTCCACTCAGTTAAAAAACGAAACGTACTTTAATATTTATAGTACTTCATTCGAACATGCTATTTTTCATACAGCAACCTCACATCTGCACTCATCATTAGATTAGAGGAACATGGATACTTTTCTTTATCTAAGCAGCTAACTCAACTATCAACATGCTATTGAACTAGAGATCCACCTATAACTAACATGACTTTAACAGGGCTAATTTACAGTACTAACTAATTAACTTAGAACATTAACATGATCACCGTCACATTTATTAGAATTTCAAACGCAGTGGAATTTTTTTTTCTAGAAATGGTATCGCTCTATGACCAATAAAAACAGACTGTACTTTCAAATGGTATTATTTATAACAGTTGAACATTTCATAAATATGCGATCAATATAGACCGTTGATATATTTTACTTTTTTTTTTTTAGGAGCTCCAAGAATTTATTTCCTTATAATACAGACACGGTTACATCGCAATTAATTTTCTAATAGTTTTTCATTTTGACCATCTTTCTTTTCCCCAGTGCTAAACACGAACCTTCTTTCTCATTCGTAGATTACTGTTGCAATTACTAACAGCTGTAATAGCCGACAAATTTCTCTCTGCGCGTCCAATTTAGCTATACTGTTGTTGTTTTGTTTTGTCGTACAGTGTTTGGAGAAAAACTTCCATTTCTTACATAGATCATCGCCATTCCTTTCCATAATTTATTCAGCGCTTTGGTATCGATTTACTATTTCCATTTAGACGTTGTTCAAAATTTACTAACAATACTTCAGTTTATAATGGATCCTATACTAACAATTTGTAGTTCATAAATAA

• Multiple alignment of acceptor sites from 268 yeast DNA sequences
– What is the biological signal around the site?
– What are the important positions?
– How can it be visualized?

Biological information

Sequence-logo

• Logo plot with Information Content

Exon Intron Exon

Page 3: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Entropy - Definition

• Entropy of a random variable is a measure of its uncertainty
• In thermodynamics, G = H - TS
– The entropy S of a system is the degree of disorder

Page 4: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Entropy - Definition

• Entropy of a distribution of amino acids
– The Shannon entropy:

$H(p) = -\sum_a p_a \log_2(p_a)$, where p is an amino acid distribution.

H(p) is measured in bits: log2(2) = 1, log2(4) = 2

Multiple alignment of 3 sequences
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R

High entropy - high disorder
Low entropy - low disorder

Page 5: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Entropy - example

$H(p) = -\sum_a p_a \log_2(p_a)$

Multiple alignment of 3 sequences
Seq1: A L R
Seq2: A V R
Seq3: A I K

Pos1: H(p) = -[1*log2(1)] = 0

Pos2: H(p) = -[1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] = log2(3) ≈ 1.58
Pos3: H(p) = -[2/3*log2(2/3) + 1/3*log2(1/3)] ≈ 0.92
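The per-position entropies of the three-sequence example can be checked with a short Python sketch. The function name `column_entropy` is an illustrative choice, not something from the slides. The sum is written as p*log2(1/p), which is the same quantity as -p*log2(p).

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    # p * log2(1/p) equals -p * log2(p); this form avoids printing "-0.00"
    return sum((n / total) * log2(total / n) for n in counts.values())

# The example alignment: Seq1 = ALR, Seq2 = AVR, Seq3 = AIK
alignment = ["ALR", "AVR", "AIK"]
columns = ["".join(col) for col in zip(*alignment)]  # ["AAA", "LVI", "RRK"]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H = {h:.2f} bits")
# Pos1: H = 0.00 bits
# Pos2: H = 1.58 bits
# Pos3: H = 0.92 bits
```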

Page 6: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Relative Entropy
The Kullback-Leibler distance D

How different is an amino acid distribution p_a from a background distribution q_a, i.e. what is the distance D between them?

$D(p||q) = \sum_a p_a \log_2(p_a/q_a)$

Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt.

Ala (A) 7.82    Gln (Q) 3.94    Leu (L) 9.62    Ser (S) 6.87
Arg (R) 5.32    Glu (E) 6.60    Lys (K) 5.93    Thr (T) 5.46
Asn (N) 4.20    Gly (G) 6.94    Met (M) 2.37    Trp (W) 1.16
Asp (D) 5.30    His (H) 2.27    Phe (F) 4.01    Tyr (Y) 3.07
Cys (C) 1.56    Ile (I) 5.90    Pro (P) 4.85    Val (V) 6.71
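As a rough illustration of how the background distribution enters the calculation, the Python sketch below computes D(p||q) for two fully conserved positions. It assumes the table values are percentages (they sum to roughly 100) and renormalizes them to sum to 1. The function name `relative_entropy` and the two example distributions are hypothetical.

```python
from math import log2

# Background frequencies in percent, copied from the table above.
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}

# Assumption: renormalize so the background sums to exactly 1.
_total = sum(BACKGROUND_PERCENT.values())
q = {aa: value / _total for aa, value in BACKGROUND_PERCENT.items()}

def relative_entropy(p, q):
    """Kullback-Leibler distance D(p||q) in bits; terms with p_a = 0 contribute 0."""
    return sum(pa * log2(pa / q[a]) for a, pa in p.items() if pa > 0)

# Hypothetical positions: one fully conserved Trp, one fully conserved Leu.
d_trp = relative_entropy({"W": 1.0}, q)
d_leu = relative_entropy({"L": 1.0}, q)
print(f"conserved W: {d_trp:.2f} bits, conserved L: {d_leu:.2f} bits")
```

With a non-uniform background, a conserved rare residue such as Trp scores higher than a conserved common residue such as Leu, and a position that simply matches the background scores zero.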

Page 7: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Information content

$D(p||q) = \sum_a p_a \log_2(p_a/q_a)$

Often the information content is used as a measure of the degree of conservation.

$I = \sum_a p_a \log_2(p_a/q_a)$

A special case is when all amino acids have the same background frequency: $q_a = 1/20$

Page 8: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Information content

• $I = \sum_a p_a \log_2(p_a/(1/20))$
• $= \sum_a p_a [\log_2 p_a - \log_2(1/20)]$
• $= -H(p) - \sum_a p_a \log_2(1/20)$
• $= -H(p) + \sum_a p_a \log_2(20)$
• $= -H(p) + \log_2(20)$, since $\sum_a p_a = 1$
• $= -H(p) + 4.32$

Page 9: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Information content

• $I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$

The information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.

Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K

Pos1: I = [1*log2(1)] + 4.32 = 4.32

Pos2: I = [1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] + 4.32 ≈ 2.74
Pos3: I = [2/3*log2(2/3) + 1/3*log2(1/3)] + 4.32 ≈ 3.40
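The information content of each position in this example, with a uniform background of 1/20, can be reproduced with a few lines of Python. This is a minimal sketch; `information_content` and its `alphabet_size` parameter are illustrative names.

```python
from collections import Counter
from math import log2

def information_content(column, alphabet_size=20):
    """I = log2(alphabet_size) - H(p) for one alignment column (uniform background)."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = sum((n / total) * log2(total / n) for n in counts.values())
    return log2(alphabet_size) - entropy

alignment = ["ALR", "AVR", "AIK"]
info = [information_content("".join(col)) for col in zip(*alignment)]

for pos, value in enumerate(info, start=1):
    print(f"Pos{pos}: I = {value:.2f} bits")
# Pos1: I = 4.32 bits
# Pos2: I = 2.74 bits
# Pos3: I = 3.40 bits
```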

Page 10: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.


Logo plots - HowTo

Count nucleotides at each position:

A 94 88 84 75 78 78 71 69 70 60 68 77 32 49 87 93 93 134 9 266 0 86 66 85 81 89 81 88 82
C 31 45 52 44 56 46 62 54 56 51 46 37 30 42 32 44 30 25 122 1 0 38 65 52 43 62 62 57 43
T 113 110 113 117 104 117 111 120 118 125 136 140 182 155 122 100 124 75 137 0 0 72 85 82 91 83 73 67 96
G 30 25 19 32 30 27 24 25 24 32 18 14 24 22 27 31 21 34 0 1 268 72 52 49 53 34 52 56 47

Convert to frequencies:

A 0.35 0.33 0.31 0.28 0.29 0.29 0.26 0.26 0.26 0.22 0.25 0.29 0.12 0.18 0.32 0.35 0.35 0.50 0.03 0.99 0.00 0.32 0.25 0.32 0.30 0.33 0.30 0.33 0.31
C 0.12 0.17 0.19 0.16 0.21 0.17 0.23 0.20 0.21 0.19 0.17 0.14 0.11 0.16 0.12 0.16 0.11 0.09 0.46 0.00 0.00 0.14 0.24 0.19 0.16 0.23 0.23 0.21 0.16
T 0.42 0.41 0.42 0.44 0.39 0.44 0.41 0.45 0.44 0.47 0.51 0.52 0.68 0.58 0.46 0.37 0.46 0.28 0.51 0.00 0.00 0.27 0.32 0.31 0.34 0.31 0.27 0.25 0.36
G 0.11 0.09 0.07 0.12 0.11 0.10 0.09 0.09 0.09 0.12 0.07 0.05 0.09 0.08 0.10 0.12 0.08 0.13 0.00 0.00 1.00 0.27 0.19 0.18 0.20 0.13 0.19 0.21 0.18

Frequency-logo: (figure not reproduced in this transcript)
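The count-to-frequency step is a per-column division by the column total. The Python sketch below uses only columns 19, 20 and 21 of the count table (the three positions ending in the conserved AG); the variable names are illustrative.

```python
# Counts for columns 19, 20 and 21 of the count table (1-based column numbers).
COUNTS = {
    "A": [9, 266, 0],
    "C": [122, 1, 0],
    "T": [137, 0, 0],
    "G": [0, 1, 268],
}

n_positions = 3
# Each column total should equal the number of aligned sequences (268).
totals = [sum(COUNTS[nt][i] for nt in COUNTS) for i in range(n_positions)]
freqs = {
    nt: [COUNTS[nt][i] / totals[i] for i in range(n_positions)]
    for nt in COUNTS
}

for nt in "ACTG":
    print(nt, " ".join(f"{f:.2f}" for f in freqs[nt]))
# A 0.03 0.99 0.00
# C 0.46 0.00 0.00
# T 0.51 0.00 0.00
# G 0.00 0.00 1.00
```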

Page 11: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Logo plots - Information Content

Sequence-logo

Calculate Information Content

$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits.

• Total height at a position is the ‘Information Content’ measured in bits.
• Height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.

~0.5 each

Completely conserved
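The two rules for drawing a sequence logo (stack height equals the information content, letter height is proportional to the letter's frequency) can be sketched in Python as follows. The function name `logo_heights` is hypothetical. The two example positions are taken from the acceptor-site counts: the fully conserved G, and the position two columns before it.

```python
from math import log2

def logo_heights(freqs):
    """Return (information content in bits, per-letter heights) for one DNA position."""
    entropy = sum(p * log2(1 / p) for p in freqs.values() if p > 0)
    info = log2(4) - entropy  # maximum is 2 bits for DNA
    heights = {nt: p * info for nt, p in freqs.items()}
    return info, heights

# Fully conserved G (column 21 of the count table).
info_g, heights_g = logo_heights({"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0})

# Column 19 of the count table: mostly C or T.
info_19, heights_19 = logo_heights(
    {"A": 9 / 268, "C": 122 / 268, "G": 0.0, "T": 137 / 268}
)

print(f"conserved G: I = {info_g:.2f} bits")
print(f"column 19:   I = {info_19:.2f} bits")
```

The letter heights at a position always add up to the total stack height, so a conserved position produces one tall letter while a variable position produces a short stack.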

Page 12: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Programs to make a Logo plot

• WebLogo
– Requires a multiple alignment as input
– Protein or DNA sequences
– More output formats

• Blast2Logo
– Requires a fasta file as input
– Only protein sequences
– Runs PSI-blast and makes a table of frequencies
– pdf logo plot

Page 13: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

WebLogo - http://weblogo.berkeley.edu/

Page 14: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

WebLogo - http://weblogo.berkeley.edu/

Page 15: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Find important positions

>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL

What is the next step ?

1. Find homologous sequences - how?

- Blast or PsiBlast
- Download sequences
- Make a multiple alignment
- ClustalW or others
- or use Blast2Logo program

Page 16: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Multiple alignment programs


Page 17: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/

Page 18: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Important positions

Important positions in proteins are conserved positions => high Information Content.

Conserved for a reason:
• Functionally important positions
– Catalytic residues
• Structurally important positions
– Maintain the correct fold of the protein

Page 19: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Blast2logo

Runs iterative blast, i.e. Psi-Blast

Searching for homologous sequences by use of Position Specific Scoring Matrices (PSSM).

1. Iteration - use Blosum62 scoring matrix
2. Iteration - make profile of sequences found in iteration 1
3. Iteration - make profile of sequences found in iteration 2
4. Iteration - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight sequences such that very similar sequences are down-weighted
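The slides do not say how Blast2logo corrects for low counts, so the Python sketch below shows only one common, simple option: adding a fixed pseudocount to every amino acid before normalizing. It is an assumed stand-in for illustration, not the program's actual method, and it leaves out sequence weighting. The names `smoothed_frequencies` and `pseudocount` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def smoothed_frequencies(counts, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Additive (Laplace-style) pseudocount correction for one alignment column.

    Assumption: a flat pseudocount per letter; real tools may use
    background- or substitution-matrix-based pseudocounts instead.
    """
    total = sum(counts.get(a, 0) for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}

# Hypothetical column seen in only three sequences, all Ala.
observed = {"A": 3}
smoothed = smoothed_frequencies(observed)

print(f"A: {smoothed['A']:.3f}, any unseen residue: {smoothed['W']:.3f}")
```

Without the correction, three identical sequences would make the position look fully conserved. With it, the unseen residues keep a small non-zero frequency, which lowers the apparent information content when few sequences are available.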

Page 20: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Important positions - counting

Page 21: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around

• S9, G42, N74, and H195
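One way to let a sequence profile "show you where to look" is to rank positions by information content. The Python sketch below does this for a small made-up profile. The data in `toy_profile` are hypothetical and are not taken from the Q00017 alignment; the function names are illustrative.

```python
from math import log2

def information_content(freqs, alphabet_size=20):
    """I = log2(alphabet_size) - H(p) for one position, uniform background."""
    entropy = sum(p * log2(1 / p) for p in freqs.values() if p > 0)
    return log2(alphabet_size) - entropy

def most_conserved(profile, top_n=2):
    """Return the top_n positions with the highest information content."""
    scored = [(information_content(f), pos) for pos, f in profile.items()]
    scored.sort(reverse=True)
    return [pos for _, pos in scored[:top_n]]

# Hypothetical profile: position -> amino acid frequencies at that position.
toy_profile = {
    1: {"M": 1.0},
    2: {"K": 0.5, "R": 0.5},
    3: {"A": 0.25, "S": 0.25, "T": 0.25, "G": 0.25},
    4: {"S": 0.9, "T": 0.1},
}

top = most_conserved(toy_profile, top_n=2)
print(top)  # [1, 4]
```

High information content marks a position as a candidate only. Whether it is catalytic or structural still has to be judged from the residue type and the protein structure.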

Page 22: Entropy, Information contents & Logo plots By Thomas Nordahl Petersen.

Exercise

1. Calculate nucleotide frequencies from a multiple alignment of human donor sites

2. Calculate Entropy and Information content

3. Draw (by hand) a Logo plot

4. Use 2 Logo plot programs

5. Learn to interpret Logo & frequency plots

6. Active site residues & structural residues