207945773 Molecular Evolution

7/24/2019 207945773 Molecular Evolution

1/66

Molecular evolution

Dr. Yougesh


2/66

The increasing availability of completely sequenced organisms, and the importance of the evolutionary processes that affect species history, have heightened interest in studying molecular evolution events at the sequence level.

Molecular evolution


3/66

Plan

Context

selection pressure (definitions)

Genetic code and inherent properties of codons and

amino-acids

Estimation of synonymous and nonsynonymous substitutions

Codons volatility

Applications


4/66

Ancestor species genome

Evolutionary processes include:

Phylogeny*

Expansion*: duplication, genesis

Exchange*: HGT

Deletion*: loss

and selection


5/66

Gene tree - Species tree

[Figure: a species tree and a gene tree for species A, B and C drawn against time, with duplication and speciation events marked]

Genomes, 2nd edition (2002). T.A. Brown


6/66

Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

Original version

Actual version


7/66

Homolog - Paralog - Ortholog

[Figure: an ancestral gene O duplicates into genes A and B; after speciation, Species-1 carries A1 and B1 and Species-2 carries A2 and B2]

Homologs: A1, B1, A2, B2

Paralogs: A1 vs B1 and A2 vs B2

Orthologs: A1 vs A2 and B1 vs B2

Sequence analysis


8/66

Molecular evolutionary analysis

Aims at understanding and modelling evolutionary events;

Evolutionary modelling extrapolates, from the divergence between sequences that are assumed homologous, the number of events which have occurred since the genes diverged;

If the rate of evolution is known, then the time since divergence can be estimated.


9/66

Applications:

Molecular evolution analysis has clarified:

the evolutionary relationships between humans and

other primates;

the origins of AIDS;

the origin of modern humans and population migration;

speciation events;

genetic material exchange between species.

the origin of some diseases (cancer, etc.)

.....

Molecular evolution


10/66

GACGACCATAGACCAGCATAG

GACTACCATAGA-CTGCAAAG

*** ******** * *** **


GACTACCATAGACT-GCAAAG

*** ********* *** **

Two possible positions for the indel

Molecular evolution


11/66

Molecular evolution

Mutations arise due to inheritable changes in

genomic DNA sequence;

Mechanisms which govern changes at the

protein level are most likely due to nucleotide

substitution or insertions/deletions;

Changes may give rise to new genes which

become fixed if they give the organism an

advantage in selection;


GACTACCATAGA-CTGCAAAG


12/66

Molecular evolution: Definitions

Purifying (negative) selection

A consequence of genetic drift through random mutations is that many mutations will have deleterious effects on fitness.

Purifying selection prevents the accumulation of mutations at important functional sites, resulting in sequence conservation.

-> Purifying selection is natural selection against deleterious mutations.

-> The term is used interchangeably with negative selection or selective constraints.


13/66

Neutral theory

The majority of evolution at the molecular level is caused by random genetic drift through mutations that are selectively neutral or nearly neutral.

Describes cases in which selection (purifying or positive) is not strong enough to outweigh random events.

Neutral mutation is an ongoing process which gives rise

to genetic polymorphisms; changes in environment can

select for certain of these alleles.


14/66

Positive selection

Positive selection is Darwinian selection fixing advantageous mutations.

The term is used interchangeably with molecular adaptation and adaptive molecular evolution.

Positive selection can be shown to play a role in some evolutionary events.

This is demonstrated at the molecular level if the rate of nonsynonymous mutation at a site is greater than the rate of synonymous mutation.

Most substitution rates are determined by either neutral evolution or purifying selection against deleterious mutations.


15/66

Molecular evolution

We observe and try to decode the process of

molecular evolution from the perspective of

accumulated differences among related genes

from one or diverse organisms.

The number of mutations that have occurred

can only be estimated.

Real individual events are blurred by a long

history of changes.


16/66

-GGAGCCATATTAGATAGA-

-GGAGCAATTTTTGATAGA-

Gly Ala Ile Leu Asp Arg

Gly Ala Ile Phe Asp Arg

DNA yields more phylogenetic information than proteins. The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence.

3 different DNA positions but only one different amino acid position: 2 of the nucleotide substitutions are therefore synonymous and one is nonsynonymous.

Nucleotide, amino-acid sequences

-> gene

-> protein
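The counts on this slide can be checked with a short script. This is a minimal sketch, assuming the standard genetic code and two aligned, gap-free, in-frame sequences; the names `CODON_TABLE`, `split_codons` and `count_differences` are illustrative and not part of the original material.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Cut an in-frame sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def count_differences(seq1, seq2):
    """Return (nucleotide differences, amino-acid differences)."""
    nucleotide = sum(a != b for a, b in zip(seq1, seq2))
    amino_acid = sum(
        CODON_TABLE[c1] != CODON_TABLE[c2]
        for c1, c2 in zip(split_codons(seq1), split_codons(seq2))
    )
    return nucleotide, amino_acid


# The two sequences shown on the slide.
nucleotide, amino_acid = count_differences("GGAGCCATATTAGATAGA", "GGAGCAATTTTTGATAGA")
print(nucleotide, amino_acid)  # 3 nucleotide differences, 1 amino-acid difference
```

Two of the three nucleotide differences leave the encoded amino acid unchanged, which is the extra information carried by the DNA sequence.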


17/66

Kinds of nucleotide substitutions

Given 2 nucleotide sequences, we can ask how their similarities and differences arose from a common ancestor.

[Figure: each kind of substitution illustrated on two lineages descending from an ancestral nucleotide]

Single substitution: 1 change, 1 difference

Multiple substitution: 2 changes, 1 difference

Coincidental substitution: 2 changes, 1 difference

Parallel substitution: 2 changes, no difference

Convergent substitution

Back substitution



18/66

Substitution: Transition - transversion

A transition changes one purine for another or one pyrimidine for another.

A transversion changes a purine for a pyrimidine or vice versa.

Nucleotides are either purines or pyrimidines:

G (Guanine) and A (Adenine) are called purines;

C (Cytosine) and T (Thymine) are called pyrimidines.

Transitions occur at least 2 times as frequently as transversions.

Transitions: A <-> G, C <-> T
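The definitions above translate directly into code. The following is a small sketch, assuming aligned sequences in which `-` marks a gap; the function names are illustrative only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify a pair of nucleotides as identical, transition or transversion."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_ts_tv(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences.

    Columns containing a gap or any character other than A, C, G, T are skipped.
    """
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        kind = substitution_type(a, b)
        if kind == "transition":
            transitions += 1
        elif kind == "transversion":
            transversions += 1
    return transitions, transversions


print(substitution_type("A", "G"))  # transition
print(substitution_type("A", "T"))  # transversion
```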


19/66

Standard genetic code

The genetic code specifies how a combination of any of the four bases (A, G, C, T) produces each of the 20 amino acids.

The triplets of bases are called codons and, with four bases, there are 4^3 = 64 possible codons that code for 20 amino acids (and stop signals).


20/66

                         Second position
 First |      T      |      C      |      A      |      G      | Third
-------+-------------+-------------+-------------+-------------+-------
       | TTT Phe (F) | TCT Ser (S) | TAT Tyr (Y) | TGT Cys (C) |   T
   T   | TTC  "      | TCC  "      | TAC  "      | TGC  "      |   C
       | TTA Leu (L) | TCA  "      | TAA Ter     | TGA Ter     |   A
       | TTG  "      | TCG  "      | TAG Ter     | TGG Trp (W) |   G
-------+-------------+-------------+-------------+-------------+-------
       | CTT Leu (L) | CCT Pro (P) | CAT His (H) | CGT Arg (R) |   T
   C   | CTC  "      | CCC  "      | CAC  "      | CGC  "      |   C
       | CTA  "      | CCA  "      | CAA Gln (Q) | CGA  "      |   A
       | CTG  "      | CCG  "      | CAG  "      | CGG  "      |   G
-------+-------------+-------------+-------------+-------------+-------
       | ATT Ile (I) | ACT Thr (T) | AAT Asn (N) | AGT Ser (S) |   T
   A   | ATC  "      | ACC  "      | AAC  "      | AGC  "      |   C
       | ATA  "      | ACA  "      | AAA Lys (K) | AGA Arg (R) |   A
       | ATG Met (M) | ACG  "      | AAG  "      | AGG  "      |   G
-------+-------------+-------------+-------------+-------------+-------
       | GTT Val (V) | GCT Ala (A) | GAT Asp (D) | GGT Gly (G) |   T
   G   | GTC  "      | GCC  "      | GAC  "      | GGC  "      |   C
       | GTA  "      | GCA  "      | GAA Glu (E) | GGA  "      |   A
       | GTG  "      | GCG  "      | GAG  "      | GGG  "      |   G
-------+-------------+-------------+-------------+-------------+-------


Legend: charged (basic), charged (acidic), hydrophilic, hydrophobic

A Ala Alanine GCT GCC GCA GCG

R Arg Arginine CGT CGC CGA CGG AGA AGG

N Asn Asparagine AAT AAC

D Asp Aspartic acid GAT GAC

C Cys Cysteine TGT TGC

Q Gln Glutamine CAA CAG

E Glu Glutamic acid GAG GAA

G Gly Glycine GGG GGA GGT GGC

H His Histidine CAT CAC

I Ile Isoleucine ATT ATC ATA

L Leu Leucine TTA TTG CTT CTC CTA CTG

K Lys Lysine AAA AAG

M Met Methionine ATG

F Phe Phenylalanine TTT TTC

P Pro Proline CCT CCC CCA CCG

S Ser Serine TCT TCC TCA TCG AGT AGC

T Thr Threonine ACT ACC ACA ACG

W Trp Tryptophan TGG

Y Tyr Tyrosine TAT TAC

V Val Valine GTT GTC GTA GTG

Because there are only 20 amino acids, but 64 possible codons, the same amino acid is often encoded by a number of different codons, which usually differ in the third base of the triplet.

Because of this repetition the genetic code is said to be degenerate, and codons which produce the same amino acid are called synonymous codons.


21/66

Important properties inherent to

the standard genetic code


22/66

Synonymous vs nonsynonymous substitutions

Nondegenerate sites: codon positions where mutations always result in amino acid substitutions
(e.g. the first position of TTT (phenylalanine), CTT (leucine), ATT (isoleucine) and GTT (valine)).

Twofold degenerate sites: codon positions where 2 different nucleotides result in the translation of the same amino acid, but the 2 others code for a different amino acid
(e.g. GAT and GAC code for aspartic acid (Asp, D), whereas GAA and GAG both code for glutamic acid (Glu, E)).

Threefold degenerate sites: codon positions where 3 of the 4 possible nucleotides give the same amino acid, while the fourth gives a different amino acid.

There is only 1 threefold degenerate site: the 3rd position of an isoleucine codon. ATT, ATC and ATA all encode isoleucine, but ATG encodes methionine.


23/66


Three amino acids: Arginine, Leucine and Serine are encoded by 6 different

codons:

R Arg Arginine CGT CGC CGA CGG AGA AGG

L Leu Leucine TTA TTG CTT CTC CTA CTG

S Ser Serine TCT TCC TCA TCG AGT AGC

Five amino acids are encoded by 4 codons which differ only in the third position. These sites are called fourfold degenerate sites.

A Ala Alanine GCT GCC GCA GCG

G Gly Glycine GGT GGC GGA GGG

P Pro Proline CCT CCC CCA CCG

T Thr Threonine ACT ACC ACA ACG

V Val Valine GTT GTC GTA GTG

Fourfold degenerate sites: codon positions where changing the nucleotide to any of the 3 alternatives has no effect on the amino acid
(e.g. GGT, GGC, GGA, GGG (glycine); CCT, CCC, CCA, CCG (proline)).
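The degeneracy classes defined on the last two slides can be computed for any codon position. This is a minimal sketch under the standard genetic code; `site_degeneracy` is an illustrative helper, and positions are numbered 0, 1, 2 here rather than 1st, 2nd, 3rd.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def site_degeneracy(codon, position):
    """Count how many of the 4 nucleotides at `position` keep the amino acid.

    1 = nondegenerate, 2 = twofold, 3 = threefold, 4 = fourfold degenerate.
    """
    amino_acid = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[codon[:position] + base + codon[position + 1:]] == amino_acid
        for base in BASES
    )


# All threefold degenerate sites among the sense codons.
threefold_sites = [
    (codon, position)
    for codon, aa in CODON_TABLE.items() if aa != "*"
    for position in range(3)
    if site_degeneracy(codon, position) == 3
]
print(threefold_sites)  # only the 3rd position of the isoleucine codons
```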


24/66

Standard genetic code

Nine amino acids are encoded by a pair of codons which differ by a transition substitution at the third position. These sites are called twofold degenerate sites.

Isoleucine is encoded by three different codons.

Methionine and tryptophan are each encoded by a single codon.

Three stop codons: TAA, TAG and TGA

N Asn Asparagine AAT AAC

D Asp Aspartic acid GAT GAC

C Cys Cysteine TGT TGC

Q Gln Glutamine CAA CAG

E Glu Glutamic acid GAG GAA

H His Histidine CAT CAC

K Lys Lysine AAA AAG

F Phe Phenylalanine TTT TTC

Y Tyr Tyrosine TAT TAC

I Ile Isoleucine ATT ATC ATA

M Met Methionine ATG

W Trp Tryptophan TGG

Transition:

A/G; C/T
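The grouping of amino acids by number of codons (6, 4, 3, 2 and 1) can be recovered from the code table itself. The sketch below assumes the standard genetic code; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid (or stop signal) they encode.
codons_by_amino_acid = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_by_amino_acid[aa].append(codon)

# Group amino acids by how many synonymous codons they have.
amino_acids_by_codon_count = defaultdict(list)
for aa, codons in codons_by_amino_acid.items():
    if aa != "*":
        amino_acids_by_codon_count[len(codons)].append(aa)

for n in sorted(amino_acids_by_codon_count, reverse=True):
    print(n, "codons:", " ".join(sorted(amino_acids_by_codon_count[n])))
```

The output lists three amino acids with 6 codons, five with 4, one with 3, nine with 2 and two with 1, matching the slides.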


25/66

Nucleotide substitutions in protein coding genes can be divided into:

synonymous (or silent) substitutions, i.e. nucleotide substitutions that do not result in amino acid changes;

nonsynonymous substitutions, i.e. nucleotide substitutions that change amino acids;

nonsense mutations, i.e. mutations that result in stop codons.

e.g. Gly: any change in the 3rd position of the codon still results in Gly; any change in the second position results in an amino acid change, and so does any change in the first position.

Standard genetic code, e.g. GAG = Glu, AGC = Ser


26/66

Estimation of synonymous and nonsynonymous substitution rates

is important in understanding the dynamics of molecular sequence

evolution.

As synonymous (silent) mutations are largely invisible to natural selection, while nonsynonymous (amino-acid replacing) mutations may be under strong selective pressure, comparison of the rates of fixation of those two types of mutations provides a powerful tool for understanding the mechanisms of DNA sequence evolution.

For example, variable nonsynonymous/synonymous rate ratios among lineages may indicate adaptive evolution or relaxed selective constraints along certain lineages.

Likewise, models of variable nonsynonymous/synonymous rate ratios among sites may provide important insights into functional constraints at different amino acid sites and may be used to detect sites under positive selection.

Nonsynonymous/synonymous substitutions


27/66

Codon usage

If nucleotide substitution occurs at random at each nucleotide site, every nucleotide site is expected to have one of the 4 nucleotides, A, T, C and G, with equal probability.

Therefore, if there is no selection and no mutation bias, one would expect the codons encoding the same amino acid to be, on average, in equal frequencies in protein coding regions of DNA.

In practice, the frequencies of different codons for the same amino acid are usually different, and some codons are used more often than others. This is codon usage bias.

Codon usage bias is controlled by both mutation pressure and purifying selection.

There are 64 (4^3) possible codons that code for 20 amino acids (and stop signals).


28/66

For a pair of homologous codons presenting only one nucleotide difference, the numbers of synonymous and nonsynonymous substitutions may be obtained by simple counting of silent versus non-silent amino acid changes;

For a pair of codons presenting more than one nucleotide difference, the distinction between synonymous and nonsynonymous substitutions is not easy to calculate and statistical estimation methods are needed;

For example, when there are 3 nucleotide differences between codons, there are 6 different possible pathways between these codons. In each pathway there are 3 mutational steps.

More generally there can be many possible pathways between codons that differ at all three positions; each pathway has its own probability.

Estimating synonymous and nonsynonymous differences


29/66

Observed nucleotide differences between 2 homologous sequences are classified into 4 categories: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions.

When the 2 compared codons differ at one position, the classification is obvious.

When they differ at 2 or 3 positions, there will be 2 or 6 parsimonious pathways along which one codon could change into the other, and all of them should be considered.

Estimating synonymous and nonsynonymous differences

Since different pathways may involve different numbers of

synonymous and nonsynonymous changes, they should be weighted

differently.


30/66

SEQ.1 GAA GTT TTT   Glu Val Phe

SEQ.2 GAC GTC GTA   Asp Val Val

Codon 1: GAA --> GAC; 1 nucleotide difference, 1 nonsynonymous difference;

Codon 2: GTT --> GTC; 1 nucleotide difference, 1 synonymous difference;

Codon 3: counting is less straightforward:

Path 1: TTT (F: Phe) --> GTT (V: Val) --> GTA (V: Val); implies 1 nonsynonymous and 1 synonymous substitution;

Path 2: TTT (F: Phe) --> TTA (L: Leu) --> GTA (V: Val); implies 2 nonsynonymous substitutions.

Example: 2 homologous sequences
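The pathway counting for codon 3 can be automated. The sketch below enumerates every shortest pathway between two codons and classifies each step. Two assumptions are mine and not from the slides: pathways passing through a stop codon are skipped, and all remaining pathways are given equal weight when averaging (the slides note that pathways should in general be weighted differently).

```python
from itertools import permutations, product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def pathways(codon1, codon2):
    """List every shortest mutational pathway from codon1 to codon2.

    Each entry is (codons visited, synonymous steps, nonsynonymous steps).
    Pathways that pass through a stop codon are skipped (assumption of this sketch).
    """
    differing = [i for i in range(3) if codon1[i] != codon2[i]]
    found = []
    for order in permutations(differing):
        visited, synonymous, nonsynonymous = [codon1], 0, 0
        for i in order:
            current = visited[-1]
            mutated = current[:i] + codon2[i] + current[i + 1:]
            if CODON_TABLE[mutated] == "*":
                break  # abandon this pathway
            if CODON_TABLE[mutated] == CODON_TABLE[current]:
                synonymous += 1
            else:
                nonsynonymous += 1
            visited.append(mutated)
        else:  # runs only when the pathway was completed without a stop codon
            found.append((visited, synonymous, nonsynonymous))
    return found


paths = pathways("TTT", "GTA")
for visited, syn, non in paths:
    print(" -> ".join(visited), "| synonymous:", syn, "| nonsynonymous:", non)

# Equal-weight average over pathways (a simplifying assumption).
mean_syn = sum(p[1] for p in paths) / len(paths)
mean_non = sum(p[2] for p in paths) / len(paths)
```

For TTT versus GTA this gives 0.5 synonymous and 1.5 nonsynonymous differences under equal weighting.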


31/66

Evolutionary distance estimation between 2 sequences

The simplest problem is the estimation of the number of synonymous (dS) and nonsynonymous (dN) substitutions per site between 2 sequences:

the numbers of synonymous (S) and nonsynonymous (N) sites in the sequences are counted;

the numbers of synonymous and nonsynonymous differences between the 2 sequences are counted;

a correction for multiple substitutions at the same site is applied to calculate the numbers of synonymous (dS) and nonsynonymous (dN) substitutions per site between the 2 sequences.

==> many estimation methods
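The slides do not say which correction for multiple substitutions is meant. One common choice, used here purely as an illustration, is the Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differences (synonymous differences per synonymous site, or nonsynonymous differences per nonsynonymous site). The function name is an assumption of this sketch.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The corrected distance is always at least as large as the observed proportion,
# and the gap widens as the sequences diverge.
for p in (0.05, 0.30, 0.60):
    print(p, round(jukes_cantor(p), 3))
```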


32/66

Evolutionary Distance estimation

In general the genetic code affords more opportunities for nonsynonymous changes than for synonymous changes (N >> S), yet the observed rate of synonymous substitution is usually much greater than the rate of nonsynonymous substitution.

Furthermore, the likelihood of either type of mutation is highly dependent on amino acid composition.

For example: a protein containing a large number of leucines will contain many more opportunities for synonymous change than will a protein with a high number of lysines.

L Leu Leucine TTA TTG CTT CTC CTA CTG

Fourfold and twofold degenerate sites: several possible substitutions will not change the amino acid leucine.

K Lys Lysine AAA AAG

Only one possible mutation, at the 3rd position, will not change lysine.


33/66

Evolutionary Distance estimation

Fundamental for the study of protein evolution and useful for

constructing phylogenetic trees and estimation of divergence time.


34/66


Ziheng Yang & Rasmus Nielsen (2000)

Estimating synonymous and nonsynonymous substitution rates under

realistic evolutionary models. Mol Biol Evol. 17:32-43.

Estimating synonymous and nonsynonymous substitution rates



35/66

Purifying selection:

Most of the time selection eliminates deleterious mutations, keeping

the protein as it is.

Positive selection:

In a few instances we find that dN (also denoted Ka) is much greater than dS (also denoted Ks), i.e. dN/dS >> 1 (Ka/Ks >> 1). This is strong evidence that selection has acted to change the protein.

Positive selection was tested for by comparing the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS). Because these numbers are normalized to the number of sites, if selection were neutral (i.e., as for a pseudogene) the dN/dS ratio would be equal to 1. An unequivocal sign of positive selection is a dN/dS ratio significantly exceeding 1, indicating a functional benefit to diversify the amino acid sequence.

dN/dS < 0.25 indicates purifying selection;

dN/dS = 1 suggests neutral evolution;

dN/dS >> 1 indicates positive selection.


36/66

Negative (purifying) selection eliminates disadvantageous mutations, i.e. it inhibits protein evolution
(this explains why dN < dS in most protein coding regions).

Positive selection is very important for the evolution of new functions, especially for duplicated genes
(it must occur early after duplication, otherwise null mutations will be fixed, producing pseudogenes).

dN/dS (or Ka/Ks) measures selection pressure


37/66

Mutational saturation

Mutational saturation in DNA and protein sequences occurs when sites have undergone multiple mutations, causing sequence dissimilarity (the observed differences) to no longer accurately reflect the true evolutionary distance, i.e. the number of substitutions that have actually occurred since the divergence of two sequences.

Correct estimation of the evolutionary distance is crucial.

Generally: sequences where dS > 2 are excluded to avoid

the saturation effect of nucleotide substitution.


38/66

PAML: Phylogenetic Analysis by Maximum Likelihood

http://abacus.gene.ucl.ac.uk/software/paml.html

YN00 - P13.4.C13.18.fa.paml

ns = 13 ls = 29

Estimation by the method of:

Yang & Nielsen (2000):

seq. seq. S N t kappa omega dN +- SE dS +- SE

YALI0A08195g YALI0A17963g 15.1 71.9 0.37 1.31 0.20 0.07 +- 0.03 0.36 +- 0.22

YALI0E25443g YALI0A17963g 17.3 69.7 1.8 1.31 0.05 0.13 +- 0.05 2.55 +- 13.95

YALI0E25443g YALI0A08195g 17.6 69.4 1.00 1.31 0.06 0.08 +- 0.03 1.35 +- 0.70

YALI0C21230g YALI0A17963g 24.1 62.9 5.35 1.31 0.75 1.63 +- 1.06 2.19 +- 1.70

YALI0C21230g YALI0A08195g 24.5 62.5 6.58 1.31 0.57 1.81 +- 1.43 3.19 +- 6.21

YALI0C21230g YALI0E25443g 24.9 62.1 4.76 1.31 1.27 1.69 +- 0.57 1.33 +- 0.59

YALI0C21230g YALI0A02783g 24.6 62.4 4.71 1.31 3.58 1.97 +- 0.81 0.55 +- 0.21

YALI0C21230g YALI0C21252g 25.4 61.6 6.64 1.31 3.22 2.77 +- 2.27 0.86 +- 0.32

YALI0C21230g YALI0C21274g 25.3 61.7 6.54 1.31 3.46 2.75 +- 2.21 0.79 +- 0.34

YALI0C21230g YALI0F09944g 24.3 62.7 7.51 1.31 2.31 2.97 +- 2.93 1.29 +- 1.09

YALI0C21230g YALI0A13497g 28.2 58.8 7.13 1.31 3.20 3.06 +- 3.38 0.95 +- 0.34

...

YALI0C21230g YALI0B06160g 27.1 59.9 7.34 1.31 1.66 2.79 +- 2.37 1.68 +- 0.86

YALI0D11638g YALI0C21230g 27.3 59.7 8.04 1.31 1.68 3.07 +- 3.40 1.83 +- 1.39

YALI0E19140g YALI0C21230g 25.2 61.8 7.67 1.31 2.48 3.09 +- 3.46 1.25 +- 0.54

YALI0E19140g YALI0D11638g 22.4 64.6 4.12 1.31 0.45 1.04 +- 0.29 2.33 +- 2.13

-> yn00 gives results similar to ML (Yang & Nielsen (2000));

-> advantage : easy automation for large scale comparisons;
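Large-scale comparisons mean reading many rows like those above. The following is a minimal sketch of a parser for rows laid out exactly as on this slide (two names, then S, N, t, kappa, omega, dN +- SE, dS +- SE); it is not part of PAML, and the names `PairEstimate` and `parse_row` are hypothetical. It also applies the dS > 2 saturation cut-off and the omega >= 3 candidate threshold mentioned elsewhere in these slides.

```python
from typing import NamedTuple, Optional


class PairEstimate(NamedTuple):
    seq1: str
    seq2: str
    S: float
    N: float
    t: float
    kappa: float
    omega: float
    dN: float
    dN_se: float
    dS: float
    dS_se: float


def parse_row(line: str) -> Optional[PairEstimate]:
    """Parse one pairwise row; return None for headers or malformed lines."""
    fields = line.split()
    if len(fields) != 13 or fields[8] != "+-" or fields[11] != "+-":
        return None
    try:
        numbers = [float(fields[i]) for i in (2, 3, 4, 5, 6, 7, 9, 10, 12)]
    except ValueError:
        return None  # e.g. the column header line
    return PairEstimate(fields[0], fields[1], *numbers)


ROWS = """\
seq. seq. S N t kappa omega dN +- SE dS +- SE
YALI0A08195g YALI0A17963g 15.1 71.9 0.37 1.31 0.20 0.07 +- 0.03 0.36 +- 0.22
YALI0E25443g YALI0A17963g 17.3 69.7 1.8 1.31 0.05 0.13 +- 0.05 2.55 +- 13.95
YALI0C21230g YALI0A02783g 24.6 62.4 4.71 1.31 3.58 1.97 +- 0.81 0.55 +- 0.21
"""

estimates = [e for e in map(parse_row, ROWS.splitlines()) if e is not None]
unsaturated = [e for e in estimates if e.dS <= 2.0]      # drop saturated pairs
candidates = [e for e in unsaturated if e.omega >= 3.0]  # possible positive selection
for e in candidates:
    print(e.seq1, e.seq2, e.omega)
```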


39/66

Relative Rate Test

1 2 3

A

To determine the relative rates of substitution in species 1 and 2, we need an outgroup (species 3).

The point in time when 1 and 2 diverged is marked A (the common ancestor of 1 and 2).

The number of substitutions between any two species is assumed to be the sum of the numbers of substitutions along the branches of the tree connecting them:

d13 = dA1 + dA3

d23 = dA2 + dA3

d12 = dA1 + dA2

d13, d23 and d12 are measures of the differences between 1 and 3, 2 and 3, and 1 and 2 respectively.

dA1 = (d12 + d13 - d23)/2

dA2 = (d12 + d23 - d13)/2

If both lineages evolve at the same rate, dA1 and dA2 should be the same (A is the common ancestor of 1 and 2).
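The branch-length formulas above can be applied directly to three pairwise distances. In this sketch the input distances are made-up illustrative numbers, and the third branch dA3 is included for completeness; it follows from the same three equations but is not given on the slide.

```python
def branch_lengths(d12, d13, d23):
    """Solve d12 = dA1 + dA2, d13 = dA1 + dA3, d23 = dA2 + dA3 for the branches."""
    dA1 = (d12 + d13 - d23) / 2
    dA2 = (d12 + d23 - d13) / 2
    dA3 = (d13 + d23 - d12) / 2  # outgroup branch, derived the same way
    return dA1, dA2, dA3


# Hypothetical pairwise distances: species 1 is further from the outgroup than species 2.
dA1, dA2, dA3 = branch_lengths(d12=0.25, d13=0.5, d23=0.375)
print(dA1, dA2, dA3)
print("rate difference dA1 - dA2 =", dA1 - dA2)  # equals d13 - d23
```

A difference dA1 - dA2 close to zero is what equal rates in lineages 1 and 2 would predict; judging whether a non-zero difference is significant requires a variance estimate, which this sketch does not provide.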


40/66

Evolution of functionally important regions over time. Immediately after a speciation event, the two copies of the genomic region are 100% identical (see graph on left). Over time, regions under little or no selective pressure, such as introns, are saturated with mutations, whereas regions under negative selection, such as most exons, retain a higher percent identity (see graph on right). Many sequences involved in regulating gene expression also maintain a higher percent identity than do sequences with no function.

COMPARATIVE GENOMICS

Webb Miller, Kateryna D. Makova, Anton Nekrutenko, and Ross C. Hardison

Annual Review of Genomics and Human Genetics

Vol. 5: 15-56 (2004)


41/66

Yang & Nielsen,

Estimating Synonymous and Nonsynonymous Substitution Rates Under

Realistic Evolutionary Models

Mol. Biol. Evol. 2000, 17:32-43

=>Other estimation Models

Reference



42/66

Evolutionary distance estimation between 2 sequences

Under certain conditions, however, nonsynonymous substitution may be accelerated by positive Darwinian selection. It is therefore interesting to examine the number of synonymous differences per synonymous site and the number of nonsynonymous differences per nonsynonymous site.

p-distance:

ps = Sd/S, the proportion of synonymous differences;

var(ps) = ps(1-ps)/S.

pn = Nd/N, the proportion of nonsynonymous differences;

var(pn) = pn(1-pn)/N.

Sd and Nd are respectively the total numbers of synonymous and nonsynonymous differences calculated over all codons. S and N are the numbers of synonymous and nonsynonymous sites.

S+N = n, the total number of nucleotides compared, and N >> S.
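The p-distance and its binomial variance are simple to compute once the differences and sites have been counted. In this sketch the counts are hypothetical values chosen only to show the arithmetic; `p_distance` is an illustrative name.

```python
import math


def p_distance(differences, sites):
    """Return (proportion of differences, variance of that proportion)."""
    if sites <= 0:
        raise ValueError("the number of sites must be positive")
    p = differences / sites
    variance = p * (1 - p) / sites
    return p, variance


# Hypothetical counts: Sd = 5 synonymous differences over S = 20 synonymous sites,
# Nd = 6 nonsynonymous differences over N = 70 nonsynonymous sites.
ps, var_ps = p_distance(5, 20)
pn, var_pn = p_distance(6, 70)
print("ps =", ps, "SE =", round(math.sqrt(var_ps), 4))
print("pn =", round(pn, 4), "SE =", round(math.sqrt(var_pn), 4))
```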


43/66

Substitutions between protein sequences

p = nd/n

V(p) = p(1-p)/n

nd and n are the number of amino acid differences and the total number of amino acids compared.

However, refining estimates of the number of substitutions that have occurred between the amino acid sequences of 2 or more proteins is generally more difficult than the equivalent task for coding sequences (see paths above).

One solution is to weight each amino acid substitution differently, using empirical data from a variety of different protein comparisons to generate a matrix, such as the PAM matrix.


44/66

Other distance models


45/66

Jukes-Cantor model:

       A    T    C    G
  A    -    λ    λ    λ
  T    λ    -    λ    λ
  C    λ    λ    -    λ
  G    λ    λ    λ    -

λ is the rate of substitution.

Tajima-Nei model:

       A    T    C    G
  A    -    β    γ    δ
  T    α    -    γ    δ
  C    α    β    -    δ
  G    α    β    γ    -

α, β, γ and δ are the rates of substitution.

Kimura 2-parameter model:

       A    T    C    G
  A    -    β    β    α
  T    β    -    α    β
  C    β    α    -    β
  G    α    β    β    -

α and β are the rates of transitional and transversional substitutions.

Tamura model:

       A         T         C     G
  A    -         β(1-θ)    βθ    αθ
  T    β(1-θ)    -         αθ    βθ
  C    β(1-θ)    α(1-θ)    -     βθ
  G    α(1-θ)    β(1-θ)    βθ    -

α and β are the rates of transitional and transversional substitutions, and θ is the G+C content.

Hasegawa et al. model:

       A      T      C      G
  A    -      βgT    βgC    αgG
  T    βgA    -      αgC    βgG
  C    βgA    αgT    -      βgG
  G    αgA    βgT    βgC    -

α and β are the rates of transitional and transversional substitutions, and gi are the nucleotide frequencies (i = A, T, C, G).

Tamura-Nei model:

       A       T       C       G
  A    -       βgT     βgC     α1gG
  T    βgA     -       α2gC    βgG
  C    βgA     α2gT    -       βgG
  G    α1gA    βgT     βgC     -

α1 and α2 are the rates of transitional substitutions between purines and between pyrimidines; β is the rate of transversional substitutions; and gi are the nucleotide frequencies (i = A, T, C, G).
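Each of these rate matrices leads to its own distance formula. The slides give only the matrices, so as an illustration this sketch uses the standard Kimura 2-parameter distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitional and transversional differences. The function name and the example sequences are assumptions of the sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences.

    Columns containing a gap or any character other than A, C, G, T are ignored.
    """
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    P = transitions / compared
    Q = transversions / compared
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Toy example: 10 sites, 1 transition (A/G) and 1 transversion (A/T).
print(round(k2p_distance("AAAAAAAAAA", "GTAAAAAAAA"), 4))
```

As with any such correction, the estimate exceeds the raw proportion of differences (0.2 in the toy example) because it allows for unseen multiple substitutions.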

Other distance models


46/66

Example: yn00 in PAML.

Protein sequences in a family

and corresponding DNA sequences

Procedure


47/66

1. Align the protein sequences of a family using ClustalW.

2. Align the corresponding DNA sequences, using as template the amino acid alignment obtained in step 1.

3. Format the DNA alignment in yn00 format.

4. Run the yn00 program (PAML package) on the resulting DNA alignment.

5. Clean the yn00 output to get the YN (Yang & Nielsen) estimates in a file. Estimates with large standard errors are eliminated.

6. From the YN estimates, extract gene pairs with w = dN/dS >= 3; these gene pairs are considered as candidate genes on which positive selection may operate. Whereas genes with w ...


48/66

[Figure: plot of dN against dS]

            mean   std    n      min   max
dN          0.90   0.6    5085   0.0   4.98
dS          2.96   1.3    5085   0.0   6.84
w=dN/dS     0.34   0.32   5085   0.0   4.45
w=dN/dS>=3  3.6    0.57   10     3.0   4.45

Most of the genes are under purifying selection.

Only a few genes might be under positive selection.


49/66

Codon volatility


50/66

A new concept: codon volatility (Plotkin et al. 2004. Nature 428, p. 942-945).

A recently introduced method, the utility of which is still under debate;

it has interesting consequences for the study of codon variability.


51/66

Detecting Selection

If a protein coding region of a nucleotide sequence has undergone an excess number of amino-acid substitutions, then the region will on average contain an overabundance of volatile codons, compared with the genome as a whole.

Plotkin et al. Nature 428; 942-945

Using the concept of codon volatility, we can scan an entire genome to find genes that show significantly more, or less, pressure for amino-acid substitutions than the genome as a whole.

If a gene contains many residues under pressure for amino-acid replacements, then the resulting codons in that gene will on average exhibit elevated volatility.

If a gene is under purifying selection not to change its amino acids, then the resulting sequence will on average exhibit lower volatility.


52/66

Codons volatility

The codon CGA, encoding arginine (R), has 8 potential ancestor codons (i.e. non-stop codons) that differ from CGA by one substitution.

The volatility of a codon is defined as the proportion of nonsynonymous codons among all the neighbouring sense codons obtained by a single substitution.

The volatility of CGA = 4/8.

The volatility of AGA, which also encodes arginine, = 6/8.

[Figure: the 8 numbered single-substitution sense neighbours of each of the two codons]

Plotkin et al. 2004. Nature 428, p. 942-945
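The two volatility values on this slide can be reproduced by enumerating the single-substitution neighbours of a codon. This is a minimal sketch under the standard genetic code and the simple substitution model (all substitutions equally likely); `codon_volatility` is an illustrative name.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_volatility(codon):
    """Return (nonsynonymous sense neighbours, total sense neighbours) of a codon."""
    amino_acid = CODON_TABLE[codon]
    neighbours = [
        codon[:i] + base + codon[i + 1:]
        for i in range(3)
        for base in BASES
        if base != codon[i]
    ]
    sense = [n for n in neighbours if CODON_TABLE[n] != "*"]  # stop codons excluded
    changed = sum(CODON_TABLE[n] != amino_acid for n in sense)
    return changed, len(sense)


for codon in ("CGA", "AGA"):
    changed, total = codon_volatility(codon)
    print(codon, f"{changed}/{total}")
```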


53/66

Codons volatility

22 codons have at least one synonymous codon with a different volatility.

Volatility of a codon c:

v(c) = (1/n) * sum over i = 1..n of D[aacid(c), aacid(ci)]

n is the number of neighbours ci (stop codons excluded) that can mutate to c by a single substitution.

D is the Hamming distance: D = 0 if the 2 amino acids are identical, D = 1 otherwise.

Volatility of a gene G:

v(G) = sum over k = 1..l of v(ck)

l is the number of codons in the gene G.


54/66

Codons volatility

Volatility is used to quantify the probability that the most recent substitution of a site caused an amino-acid change.

Each gene's observed volatility is compared with a bootstrap distribution of alternative synonymous sequences, drawn according to the background codon usage in the genome, and its significance is statistically assessed.

The randomization procedure controls for the gene's length and amino-acid composition.

The volatility of a gene G is defined as the sum of the volatilities of its codons.


55/66

Codons volatility

Volatility p-value of G:

The observed v(G) is compared with a bootstrap distribution of 10^6 synonymous versions of the gene G.

In each randomization sample, a nucleotide sequence G' is constructed so that it has the same translation as G but its codons are drawn randomly according to the relative frequencies of synonymous codons in the whole genome.

p-value for G = proportion of randomized samples G' such that v(G') > v(G).

1-p is a p-value that tests whether a gene is significantly less volatile than the genome as a whole.
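The randomization procedure can be sketched as follows. This is an illustrative implementation written for these notes, not the authors' code: the function names, the default of 1000 samples (the slide uses 10^6) and the uniform codon usage in the example are all assumptions. The coding sequence is assumed to be in frame, a multiple of 3 long and free of stop codons, and `codon_counts` must hold a positive count for every sense codon.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_volatility(codon):
    """Fraction of single-substitution sense neighbours encoding another amino acid."""
    amino_acid = CODON_TABLE[codon]
    neighbours = [
        codon[:i] + base + codon[i + 1:]
        for i in range(3) for base in BASES if base != codon[i]
    ]
    sense = [n for n in neighbours if CODON_TABLE[n] != "*"]
    return sum(CODON_TABLE[n] != amino_acid for n in sense) / len(sense)


VOLATILITY = {c: codon_volatility(c) for c, aa in CODON_TABLE.items() if aa != "*"}
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        SYNONYMS[_aa].append(_codon)


def gene_volatility(codons):
    """v(G): the sum of the volatilities of the codons of a gene."""
    return sum(VOLATILITY[c] for c in codons)


def volatility_p_value(cds, codon_counts, samples=1000, seed=0):
    """Proportion of synonymous resamples G' with v(G') > v(G)."""
    rng = random.Random(seed)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    observed = gene_volatility(codons)
    higher = 0
    for _ in range(samples):
        resampled = []
        for c in codons:
            choices = SYNONYMS[CODON_TABLE[c]]
            weights = [codon_counts[s] for s in choices]  # genome-wide usage
            resampled.append(rng.choices(choices, weights=weights)[0])
        if gene_volatility(resampled) > observed:
            higher += 1
    return higher / samples


# Hypothetical background: every sense codon used equally often.
UNIFORM_USAGE = {codon: 1 for codon in VOLATILITY}
print(volatility_p_value("CGACGACGA", UNIFORM_USAGE))  # least volatile Arg codons
print(volatility_p_value("AGGAGGAGG", UNIFORM_USAGE))  # most volatile Arg codons
```

With three copies of CGA, almost every resample is more volatile, so p is close to 1 (depressed volatility); with three copies of AGG, no resample can be more volatile, so p is 0 (elevated volatility). Because the amino-acid sequence is held fixed, the comparison controls for gene length and amino-acid composition, as the slides describe.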


56/66

Detecting Selection

A p-value near zero indicates significantly elevated volatility, whereas a p-value near one indicates significantly depressed volatility.

The probability that a site's most recent substitution caused a nonsynonymous change is:

- greater for a site under positive selection;

- smaller for a site under negative (purifying) selection.

http://www.cgr.harvard.edu/volatility

57/66

1) Paul M. Sharp
Gene "volatility" is Most Unlikely to Reveal Adaptation
MBE Advance Access published on December 22, 2004.
doi:10.1093/molbev/msi073

2) Tal Dagan and Dan Graur
The Comparative Method Rules! Codon Volatility Cannot Detect Positive Darwinian Selection Using a Single Genome Sequence
MBE Advance Access published on November 3, 2004.

3) Robert Friedman and Austin L. Hughes
Codon Volatility as an Indicator of Positive Selection: Data from Eukaryotic Genome Comparisons
MBE Advance Access originally published on November 3, 2004. This version published November 8, 2004.

4) Hahn MW, Mezey JG, Begun DJ, Gillespie JH, Kern AD, Langley CH, Moyle LC.
Evolutionary genomics: Codon bias and selection on single genomes.
Nature. 2005 Jan 20;433(7023):E5-6.

5) Nielsen R, Hubisz MJ.
Evolutionary genomics: Detecting selection needs comparative data.
Nature. 2005 Jan 20;433(7023):E6.

6) Chen Y, Emerson JJ, Martin TM
Evolutionary genomics: Codon volatility does not detect selection.
Nature. 2005 Jan 20;433(7023):E6-7.

7) Zhang J, 2005.
On the evolution of codon volatility
Genetics 169: 495-501.

8) Plotkin JB, Dushoff J, Fraser HB.
Evolutionary genomics: Codon volatility does not detect selection (reply).
Nature. 2005 Jan 20;433(7023):E7-8.

9) Plotkin JB, Dushoff J, Desai MM and Fraser HB
Synonymous codon and selection on proteins

-> Volatility is not adequate for predicting selection;

-> Extreme volatility classes have interesting properties, in terms of amino-acid composition or codon bias;

-> Volatility may be another measure of codon bias;

-> Authors: some genes are under more positive, or less negative, selection than others.


58/66

Codon Volatility (simple substitution model):

Codons and volatility under simple substitution model

aa A R N D C Q E G H I L K M F P S T W Y V taa daa Vol G+C A+T

A GCT 3 1 1 1 1 1 1 9 6 0.67 2 1

A GCC 3 1 1 1 1 1 1 9 6 0.67 3 0

A GCA 3 1 1 1 1 1 1 9 6 0.67 2 1

A GCG 3 1 1 1 1 1 1 9 6 0.67 3 0


59/66


R CGT 3 1 1 1 1 1 1 9 6 0.67 2 1

R CGC 3 1 1 1 1 1 1 9 6 0.67 3 0

R CGA 4 1 1 1 1 8 4 0.5 2 1

R CGG 4 1 1 1 1 1 9 5 0.56 3 0

R AGA 2 1 1 1 2 1 8 6 0.75 1 2

R AGG 2 1 1 1 2 1 1 9 7 0.78 2 1

N AAT 1 1 1 1 2 1 1 1 9 8 0.89 0 3

N AAC 1 1 1 1 2 1 1 1 9 8 0.89 1 2

D GAT 1 1 1 2 1 1 1 1 9 8 0.89 1 2

D GAC 1 1 1 2 1 1 1 1 9 8 0.89 2 1

C TGT 1 1 1 1 2 1 1 8 7 0.88 1 2

C TGC 1 1 1 1 2 1 1 8 7 0.88 2 1

Q CAA 1 1 1 2 1 1 1 8 7 0.88 1 2

Q CAG 1 1 1 2 1 1 1 8 7 0.88 2 1

E GAA 1 2 1 1 1 1 1 8 7 0.88 1 2

E GAG 1 2 1 1 1 1 1 8 7 0.88 2 1

G GGT 1 1 1 1 3 1 1 9 6 0.67 2 1

G GGC 1 1 1 1 3 1 1 9 6 0.67 3 0

G GGA 1 2 1 3 1 8 5 0.63 2 1

G GGG 1 2 1 3 1 1 9 6 0.67 3 0

H CAT 1 1 1 2 1 1 1 1 9 8 0.89 1 2

H CAC 1 1 1 2 1 1 1 1 9 8 0.89 2 1

I ATT 1 2 1 1 1 1 1 1 9 7 0.78 0 3

I ATC 1 2 1 1 1 1 1 1 9 7 0.78 1 2

I ATA 1 2 2 1 1 1 1 9 7 0.78 0 3

L TTA 1 2 2 1 1 7 5 0.71 0 3

L TTG 2 1 2 1 1 1 8 6 0.75 1 2

L CTT 1 1 1 3 1 1 1 9 6 0.67 1 2

L CTC 1 1 1 3 1 1 1 9 6 0.67 2 1

L CTA 1 1 1 4 1 1 9 5 0.56 1 2

L CTG 1 1 4 1 1 1 9 5 0.56 2 1

K AAA 1 2 1 1 1 1 1 8 7 0.88 0 3

K AAG 1 2 1 1 1 1 1 8 7 0.88 1 2

M ATG 1 3 2 1 1 1 9 9 1. 1 2

F TTT 1 1 3 1 1 1 1 9 8 0.89 0 3

F TTC 1 1 3 1 1 1 1 9 8 0.89 1 2

P CCT 1 1 1 1 3 1 1 9 6 0.67 2 1

P CCC 1 1 1 1 3 1 1 9 6 0.67 3 0

P CCA 1 1 1 1 3 1 1 9 6 0.67 2 1

P CCG 1 1 1 1 3 1 1 9 6 0.67 3 0

S TCT 1 1 1 1 3 1 1 9 6 0.67 1 2

S TCC 1 1 1 1 3 1 1 9 6 0.67 2 1

S TCA 1 1 1 3 1 7 4 0.57 1 2

S TCG 1 1 1 3 1 1 8 5 0.63 2 1

S AGT 3 1 1 1 1 1 1 9 8 0.89 1 2

S AGC 3 1 1 1 1 1 1 9 8 0.89 2 1

T ACT 1 1 1 1 2 3 9 6 0.67 1 2

T ACC 1 1 1 1 2 3 9 6 0.67 2 1

T ACA 1 1 1 1 1 1 3 9 6 0.67 1 2

T ACG 1 1 1 1 1 1 3 9 6 0.67 2 1

W TGG 2 2 1 1 1 7 7 1. 2 1

Y TAT 1 1 1 1 1 1 1 7 6 0.86 0 3

Y TAC 1 1 1 1 1 1 1 7 6 0.86 1 2

V GTT 1 1 1 1 1 1 3 9 6 0.67 1 2

V GTC 1 1 1 1 1 1 3 9 6 0.67 2 1

V GTA 1 1 1 1 2 3 9 6 0.67 1 2

V GTG 1 1 1 2 1 3 9 6 0.67 2 1

Tot 36 54 18 18 18 18 18 36 18 27 54 18 9 18 36 54 36 9 18 36



60/66

Codons Volatility: Standard Genetic Code

[Figure: volatility (axis from 0.4 to 1) of each sense codon, labelled by amino acid and codon, ordered from the least volatile (R CGA) to the most volatile (M ATG, W TGG)]

12 distinct volatility values.

Only 4 amino acids (Arg, Gly, Leu, Ser) contain synonymous codons (22) of different volatilities.


61/66


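[Figure: codon volatility (0.4 to 1.0) plotted against the number of G+C nucleotides in the codon (0 to 3)]

Spearman r = 0.4312

p < 0.0005

Standard genetic code: number of codons by volatility (Vol) and G+C count

Vol     0    1    2    3
0.5     -    -    1    -
0.56    -    1    1    1
0.57    -    1    -    -
0.63    -    -    2    -
0.67    -    6   12    7
0.71    1    -    -    -
0.75    -    2    -    -
0.78    2    1    1    -
0.86    1    1    -    -
0.88    1    4    3    -
0.89    2    5    3    -
1.      -    1    1    -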


62/66


[Figure: codon volatility (0.4 to 1.0) plotted against the number of A+T nucleotides in the codon (0 to 3)]

Spearman r = 0.4283

p < 0.0006

Standard genetic code: number of codons by volatility (Vol) and A+T count

Vol     0    1    2    3
0.5     -    1    -    -
0.56    1    1    1    -
0.57    -    -    1    -
0.63    -    2    -    -
0.67    7   12    6    -
0.71    -    -    -    1
0.75    -    -    2    -
0.78    -    1    1    2
0.86    -    -    1    1
0.88    -    3    4    1
0.89    -    3    5    2
1.      -    1    1    -


63/66



64/66


References:


65/66

References:

Ziheng Yang and Rasmus Nielsen (2000)
Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17:32-43.

Yang Z. and Bielawski J.P. (2000)
Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 15:496-503.

Phylogenetic Analysis by Maximum Likelihood (PAML)
http://abacus.gene.ucl.ac.uk/software/paml.html

Plotkin JB, Dushoff J, Fraser HB (2004)
Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428:942-5.

Molecular Evolution: A Phylogenetic Approach
Page, RDM and Holmes, EC (Blackwell Science, 2004)

Sharp, PM & Li WH (1987). NAR 15: 1281-1295.


66/66

References

MEGA: http://www.megasoftware.net/

PAML: http://abacus.gene.ucl.ac.uk/software/paml.html

Fundamental concepts of Bioinformatics.

Dan E. Krane and Michael L. Raymer

Genomes, 2nd edition. T.A. Brown

Phylogeny programs :

http://evolution.genetics.washington.edu/phylip/sftware.html

Books:

Molecular Evolution; A phylogenetic Approach

Page, RDM and Holmes, EC
