207945773 Molecular Evolution

Feb 24, 2018

Nahrul Ney
  • 7/24/2019 207945773 Molecular Evolution

    1/66

    Molecular evolution

    Dr. Yougesh

  • 7/24/2019 207945773 Molecular Evolution

    2/66

    The increasing number of completely sequenced organisms, and the importance of the evolutionary processes that shape species history, have heightened interest in studying molecular evolution events at the sequence level.

    Molecular evolution

  • 7/24/2019 207945773 Molecular Evolution

    3/66

    Plan

    Context

    Selection pressure (definitions)

    Genetic code and inherent properties of codons and amino acids

    Estimation of synonymous and nonsynonymous substitutions

    Codon volatility

    Applications

  • 7/24/2019 207945773 Molecular Evolution

    4/66

    [Figure: from an ancestor genome to a species genome.]

    Evolutionary processes include:

    Phylogeny*

    Expansion* (duplication, genesis)

    Exchange* (HGT)

    Deletion* (loss)

    and selection

  • 7/24/2019 207945773 Molecular Evolution

    5/66

    Gene tree - Species tree

    [Figure: a species tree and a gene tree for species A, B and C, drawn along a time axis and marking duplication and speciation events.]

    Genomes, 2nd edition, 2002. T.A. Brown

  • 7/24/2019 207945773 Molecular Evolution

    6/66

    Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.

    Original version

    Actual version

  • 7/24/2019 207945773 Molecular Evolution

    7/66

    Homolog - Paralog - Ortholog

    [Figure: genes A and B derive from a common ancestor O; Species-1 (S1) carries A1 and B1, Species-2 (S2) carries A2 and B2.]

    Homologs: A1, B1, A2, B2

    Paralogs: A1 vs B1 and A2 vs B2

    Orthologs: A1 vs A2 and B1 vs B2

    Sequence analysis

  • 7/24/2019 207945773 Molecular Evolution

    8/66

    Molecular evolutionary analysis

    Aims at understanding and modeling evolutionary events;

    Evolutionary modeling extrapolates, from the divergence between sequences that are assumed homologous, the number of events which have occurred since the genes diverged;

    If the rate of evolution is known, then the time since divergence can be estimated.
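The last point can be made concrete with the usual molecular-clock relation T = d / (2r). The sketch below is only an illustration: the function name and the numbers are hypothetical, and it assumes a constant rate r per lineage, which the slides do not state.

```python
def divergence_time(distance: float, rate: float) -> float:
    """Time since divergence, T = d / (2 * r).

    distance: substitutions per site separating the two sequences.
    rate: substitutions per site per year along ONE lineage
          (assumed constant; this is an assumption of the sketch).
    The factor 2 accounts for both lineages accumulating changes.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical numbers, for illustration only.
t = divergence_time(distance=0.2, rate=1e-9)
print(f"{t:.3g} years")
```

If the rate varies between lineages, this estimate no longer holds, which is one reason the relative rate test appears later in the slides.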

  • 7/24/2019 207945773 Molecular Evolution

    9/66

    Applications:

    Molecular evolution analysis has clarified:

    the evolutionary relationships between humans and

    other primates;

    the origins of AIDS;

    the origin of modern humans and population migration;

    speciation events;

    genetic material exchange between species;

    the origin of some diseases (cancer, etc.).

    Molecular evolution

  • 7/24/2019 207945773 Molecular Evolution

    10/66

    GACGACCATAGACCAGCATAG

    GACTACCATAGA-CTGCAAAG

    *** ******** * *** **

    GACGACCATAGACCAGCATAG

    GACTACCATAGACT-GCAAAG

    *** *********  *** **

    Two possible positions for the indel

    Molecular evolution

  • 7/24/2019 207945773 Molecular Evolution

    11/66

    Molecular evolution

    Mutations arise due to inheritable changes in

    genomic DNA sequence;

    Changes at the protein level are most likely due to nucleotide substitutions or insertions/deletions;

    Changes may give rise to new genes which

    become fixed if they give the organism an

    advantage in selection;

    GACGACCATAGACCAGCATAG

    GACTACCATAGA-CTGCAAAG

  • 7/24/2019 207945773 Molecular Evolution

    12/66

    Molecular evolution: Definitions

    Purifying (negative) selection

    A consequence of genetic drift through random mutations is that many mutations will have deleterious effects on fitness.

    The purifying selective force prevents accumulation of mutations at important functional sites, resulting in sequence conservation.

    -> Purifying selection is natural selection against deleterious mutations.

    -> The term is used interchangeably with negative selection or selective constraints.

  • 7/24/2019 207945773 Molecular Evolution

    13/66

    Neutral theory

    The majority of evolution at the molecular level is caused by random genetic drift through mutations that are selectively neutral or nearly neutral.

    Describes cases in which selection (purifying or positive) is not strong enough to outweigh random events.

    Neutral mutation is an ongoing process which gives rise

    to genetic polymorphisms; changes in environment can

    select for certain of these alleles.

  • 7/24/2019 207945773 Molecular Evolution

    14/66

    Positive selection

    Positive selection is Darwinian selection fixing advantageous mutations.

    The term is used interchangeably with molecular adaptation and adaptive molecular evolution.

    Positive selection can be shown to play a role in some evolutionary events.

    This is demonstrated at the molecular level if the rate of nonsynonymous mutation at a site is greater than the rate of synonymous mutation.

    Most substitution rates are determined by either neutral evolution or purifying selection against deleterious mutations.

  • 7/24/2019 207945773 Molecular Evolution

    15/66

    Molecular evolution

    We observe and try to decode the process of

    molecular evolution from the perspective of

    accumulated differences among related genes

    from one or diverse organisms.

    The number of mutations that have occurred

    can only be estimated.

    Real individual events are blurred by a long

    history of changes.

  • 7/24/2019 207945773 Molecular Evolution

    16/66

    -GGAGCCATATTAGATAGA-

    -GGAGCAATTTTTGATAGA-

    Gly Ala Ile Leu Asp Arg

    Gly Ala Ile Phe Asp Arg

    DNA yields more phylogenetic information than proteins. The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence.

    3 different DNA positions but only one different amino acid position: 2 of the nucleotide substitutions are therefore synonymous and one is non-synonymous.

    Nucleotide, amino-acid sequences

    -> gene

    -> protein
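The counts on this slide can be checked with a few lines of Python. This is a minimal sketch: the compact codon-table construction and the helper names are choices made here for illustration, not part of the slides.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


gene1 = "GGAGCCATATTAGATAGA"
gene2 = "GGAGCAATTTTTGATAGA"
prot1, prot2 = translate(gene1), translate(gene2)
nuc_diff = count_differences(gene1, gene2)
aa_diff = count_differences(prot1, prot2)
print(prot1, prot2, nuc_diff, aa_diff)
```

The two genes differ at three nucleotide positions while the proteins differ at one residue (Leu versus Phe), matching the slide.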

  • 7/24/2019 207945773 Molecular Evolution

    17/66

    Kinds of nucleotide substitutions

    Given 2 nucleotide sequences, we can ask how their similarities and differences arose from a common ancestor.

    Single substitution: 1 change, 1 difference

    Multiple substitution: 2 changes, 1 difference

    Coincidental substitution: 2 changes, 1 difference

    Parallel substitution: 2 changes, no difference

    Convergent substitution: 3 changes, no difference

    Back substitution: 2 changes, no difference

  • 7/24/2019 207945773 Molecular Evolution

    18/66

    Substitution: Transition - transversion

    A transition changes one purine for another or one pyrimidine for another.

    A transversion changes a purine for a pyrimidine or vice versa.

    Nucleotides are either purines or pyrimidines:

    G (Guanine) and A (Adenine) are purines;

    C (Cytosine) and T (Thymine) are pyrimidines.

    Transitions occur at least 2 times as frequently as transversions.

    Transitions: A <-> G; C <-> T
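The definitions above translate directly into a small classifier. The following is a sketch with hypothetical function names; it assumes aligned sequences over A, C, G, T and skips columns that contain a gap.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Classify a pair of aligned nucleotides."""
    if a == b:
        return "identical"
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1: str, seq2: str) -> tuple[int, int]:
    """Count (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if "-" in (a, b) or a == b:
            continue  # gap columns and identical sites carry no substitution
        if classify(a, b) == "transition":
            ts += 1
        else:
            tv += 1
    return ts, tv


print(classify("A", "G"), classify("A", "T"))
print(count_ts_tv("ACGT", "GCTT"))
```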

  • 7/24/2019 207945773 Molecular Evolution

    19/66

    Standard genetic code

    The genetic code specifies how combinations of the four bases (A, G, C, T) encode each of the 20 amino acids.

    The triplets of bases are called codons; with four bases, there are 4^3 = 64 possible codons, which code for 20 amino acids (and stop signals).

  • 7/24/2019 207945773 Molecular Evolution

    20/66

                              Second position
            |      T       |      C       |      A       |      G       |
        ----+--------------+--------------+--------------+--------------+----
            | TTT Phe (F)  | TCT Ser (S)  | TAT Tyr (Y)  | TGT Cys (C)  | T
          T | TTC  "       | TCC  "       | TAC  "       | TGC  "       | C
            | TTA Leu (L)  | TCA  "       | TAA Ter      | TGA Ter      | A
            | TTG  "       | TCG  "       | TAG Ter      | TGG Trp (W)  | G
        ----+--------------+--------------+--------------+--------------+----
            | CTT Leu (L)  | CCT Pro (P)  | CAT His (H)  | CGT Arg (R)  | T
          C | CTC  "       | CCC  "       | CAC  "       | CGC  "       | C
            | CTA  "       | CCA  "       | CAA Gln (Q)  | CGA  "       | A
            | CTG  "       | CCG  "       | CAG  "       | CGG  "       | G
        ----+--------------+--------------+--------------+--------------+----
            | ATT Ile (I)  | ACT Thr (T)  | AAT Asn (N)  | AGT Ser (S)  | T
          A | ATC  "       | ACC  "       | AAC  "       | AGC  "       | C
            | ATA  "       | ACA  "       | AAA Lys (K)  | AGA Arg (R)  | A
            | ATG Met (M)  | ACG  "       | AAG  "       | AGG  "       | G
        ----+--------------+--------------+--------------+--------------+----
            | GTT Val (V)  | GCT Ala (A)  | GAT Asp (D)  | GGT Gly (G)  | T
          G | GTC  "       | GCC  "       | GAC  "       | GGC  "       | C
            | GTA  "       | GCA  "       | GAA Glu (E)  | GGA  "       | A
            | GTG  "       | GCG  "       | GAG  "       | GGG  "       | G
        ----+--------------+--------------+--------------+--------------+----

    The first position is given on the left, the third position on the right.

    Standard genetic code

    charged (basic), charged (acidic), hydrophilic, hydrophobic

    A Ala Alanine GCT GCC GCA GCG

    R Arg Arginine CGT CGC CGA CGG AGA AGG

    N Asn Asparagine AAT AAC

    D Asp Aspartic acid GAT GAC

    C Cys Cysteine TGT TGC

    Q Gln Glutamine CAA CAG

    E Glu Glutamic acid GAG GAA

    G Gly Glycine GGG GGA GGT GGC

    H His Histidine CAT CAC

    I Ile Isoleucine ATT ATC ATA

    L Leu Leucine TTA TTG CTT CTC CTA CTG

    K Lys Lysine AAA AAG

    M Met Methionine ATG

    F Phe Phenylalanine TTT TTC

    P Pro Proline CCT CCC CCA CCG

    S Ser Serine TCT TCC TCA TCG AGT AGC

    T Thr Threonine ACT ACC ACA ACG

    W Trp Tryptophan TGG

    Y Tyr Tyrosine TAT TAC

    V Val Valine GTT GTC GTA GTG

    Because there are only 20 amino acids, but 64 possible codons, the same amino acid is often encoded by a number of different codons, which usually differ in the third base of the triplet.

    Because of this repetition the genetic code is said to be degenerate, and codons which produce the same amino acid are called synonymous codons.
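The degeneracy described above can be verified by enumerating the code. This sketch builds the standard table from a compact string (a representation chosen here for brevity) and groups synonymous codons by amino acid.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Group synonymous codons under the amino acid they encode.
synonyms = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    synonyms[aa].append(codon)

n_amino_acids = len(synonyms) - 1          # exclude the stop signal "*"
sense_codons = [c for c, a in CODON_TABLE.items() if a != "*"]

print(len(CODON_TABLE), "codons,", n_amino_acids, "amino acids")
for aa in sorted(synonyms, key=lambda a: -len(synonyms[a])):
    print(aa, len(synonyms[aa]), " ".join(synonyms[aa]))
```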

  • 7/24/2019 207945773 Molecular Evolution

    21/66

    Important properties inherent to

    the standard genetic code

  • 7/24/2019 207945773 Molecular Evolution

    22/66

    Synonymous vs nonsynonymous substitutions

    Nondegenerate sites: codon positions where mutations always result in amino acid substitutions.

    (e.g. the first position of TTT (phenylalanine): CTT (leucine), ATT (isoleucine) and GTT (valine)).

    Twofold degenerate sites: codon positions where 2 different nucleotides result in the translation of the same aa, but the 2 others code for a different aa.

    (e.g. GAT and GAC code for aspartic acid (Asp, D), whereas GAA and GAG both code for glutamic acid (Glu, E)).

    Threefold degenerate sites: codon positions where 3 of the 4 nucleotides give the same aa, while the fourth possible nucleotide results in a different aa.

    There is only 1 threefold degenerate site: the 3rd position of an isoleucine codon. ATT, ATC and ATA all encode isoleucine, but ATG encodes methionine.

  • 7/24/2019 207945773 Molecular Evolution

    23/66

    Standard genetic code

    Three amino acids: Arginine, Leucine and Serine are encoded by 6 different

    codons:

    R Arg Arginine CGT CGC CGA CGG AGA AGG

    L Leu Leucine TTA TTG CTT CTC CTA CTG

    S Ser Serine TCT TCC TCA TCG AGT AGC

    Five amino-acids are encoded by 4 codons which differ only in the third position.

    These sites are called fourfold degenerate sites

    A Ala Alanine GCT GCC GCA GCG

    G Gly Glycine GGG GGA GGT GGC

    P Pro Proline CCT CCC CCA CCG

    T Thr Threonine ACT ACC ACA ACG

    V Val Valine GTT GTC GTA GTG

    Fourfold degenerate sites: codon positions where changing the nucleotide to any of the 3 alternatives has no effect on the aa.

    e.g. GGT, GGC, GGA, GGG (glycine);

    CCT, CCC, CCA, CCG (proline)
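The site classes defined on the last two slides can be computed from the code table. In this sketch, `degeneracy` is a hypothetical helper that returns how many of the four nucleotides at a given position yield the same amino acid: 1 means nondegenerate, 2 twofold, 3 threefold and 4 fourfold.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def degeneracy(codon: str, pos: int) -> int:
    """Number of nucleotides at position `pos` (0, 1 or 2) that keep the
    amino acid encoded by `codon` unchanged, the current one included."""
    aa = CODON_TABLE[codon]
    return sum(
        CODON_TABLE[codon[:pos] + base + codon[pos + 1:]] == aa
        for base in BASES
    )


for codon, pos in [("TTT", 0), ("GAT", 2), ("ATT", 2), ("GGT", 2)]:
    print(codon, "position", pos + 1, "->", degeneracy(codon, pos))
```

Positions are 0-based in the code, so `pos=2` is the third codon position discussed in the slides.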

  • 7/24/2019 207945773 Molecular Evolution

    24/66

    Standard genetic code

    Nine amino acids are encoded by a pair of codons which differ by a transition substitution at the third position. These sites are called twofold degenerate sites.

    Isoleucine is encoded by three different codons.

    Methionine and tryptophan are each encoded by a single codon.

    Three stop codons: TAA, TAG and TGA

    N Asn Asparagine AAT AAC

    D Asp Aspartic acid GAT GAC

    C Cys Cysteine TGT TGC

    Q Gln Glutamine CAA CAG

    E Glu Glutamic acid GAG GAA

    H His Histidine CAT CAC

    K Lys Lysine AAA AAG

    F Phe Phenylalanine TTT TTC

    Y Tyr Tyrosine TAT TAC

    I Ile Isoleucine ATT ATC ATA

    M Met Methionine ATG

    W Trp Tryptophan TGG

    Transition:

    A/G; C/T

  • 7/24/2019 207945773 Molecular Evolution

    25/66

    Nucleotide substitutions in protein coding genes can be divided into:

    synonymous (or silent) substitutions, i.e. nucleotide substitutions that do not result in amino acid changes;

    nonsynonymous substitutions, i.e. nucleotide substitutions that change amino acids;

    nonsense mutations, i.e. mutations that result in stop codons.

    e.g. Gly: any change at the 3rd position of the codon still gives Gly; any change at the second position results in an amino acid change, and so does any change at the first position.

    Standard Genetic Code

    G Gly Glycine GGG GGA GGT GGC

    e.g. GGG (Gly) -> GAG (Glu); GGC (Gly) -> AGC (Ser)

  • 7/24/2019 207945773 Molecular Evolution

    26/66

    Estimation of synonymous and nonsynonymous substitution rates

    is important in understanding the dynamics of molecular sequence

    evolution.

    As synonymous (silent) mutations are largely invisible to natural selection, while nonsynonymous (amino-acid replacing) mutations may be under strong selective pressure, comparison of the rates of fixation of those two types of mutations provides a powerful tool for understanding the mechanisms of DNA sequence evolution.

    For example, variable nonsynonymous/synonymous rate ratios among lineages may indicate adaptive evolution or relaxed selective constraints along certain lineages.

    Likewise, models of variable nonsynonymous/synonymous rate ratios among sites may provide important insights into functional constraints at different amino acid sites and may be used to detect sites under positive selection.

    Nonsynonymous/synonymous substitutions

  • 7/24/2019 207945773 Molecular Evolution

    27/66

    Codon usage

    If nucleotide substitution occurs at random at each nucleotide site, every nucleotide site is expected to have one of the 4 nucleotides, A, T, C and G, with equal probability.

    Therefore, if there is no selection and no mutation bias, one would expect the codons encoding the same amino acid to occur on average at equal frequencies in protein coding regions of DNA.

    In practice, the frequencies of different codons for the same amino acid are usually different, and some codons are used more often than others. This is known as codon usage bias.

    Codon usage bias is controlled by both mutation pressure and purifying selection.

    There are 64 (4^3) possible codons that code for 20 amino acids (and stop signals).

  • 7/24/2019 207945773 Molecular Evolution

    28/66

    For a pair of homologous codons presenting only one nucleotide difference, the number of synonymous and nonsynonymous substitutions may be obtained by simple counting of silent versus non-silent amino acid changes;

    For a pair of codons presenting more than one nucleotide difference, the distinction between synonymous and nonsynonymous substitutions is not easy to calculate and statistical estimation methods are needed;

    For example, when there are 3 nucleotide differences between codons, there are 6 different possible pathways between these codons. In each path there are 3 mutational steps.

    More generally there can be many possible pathways between codons that differ at all three positions; each pathway has its own probability.

    Estimating synonymous and nonsynonymous differences

  • 7/24/2019 207945773 Molecular Evolution

    29/66

    Observed nucleotide differences between 2 homologous sequences are classified into 4 categories: synonymous transitions, synonymous transversions, nonsynonymous transitions and nonsynonymous transversions.

    When the 2 compared codons differ at one position, the classification is obvious.

    When they differ at 2 or 3 positions, there will be 2 or 6 parsimonious pathways along which one codon could change into the other, and all of them should be considered.

    Estimating synonymous and nonsynonymous differences

    Since different pathways may involve different numbers of

    synonymous and nonsynonymous changes, they should be weighted

    differently.

  • 7/24/2019 207945773 Molecular Evolution

    30/66

    SEQ.1 GAAGTTTTT

    SEQ.2 GACGTCGTA

    Glu Val Phe

    Asp Val Val

    Codon 1: GAA --> GAC; 1 nuc. diff., 1 nonsynonymous difference;

    Codon 2: GTT --> GTC; 1 nuc. diff., 1 synonymous difference;

    Codon 3: TTT --> GTA; counting is less straightforward:

    Path 1: TTT (F: Phe) -> GTT (V: Val) -> GTA (V: Val); implies 1 nonsynonymous and 1 synonymous substitution;

    Path 2: TTT (F: Phe) -> TTA (L: Leu) -> GTA (V: Val); implies 2 nonsynonymous substitutions.

    Example: 2 homologous sequences
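The pathway enumeration in this example can be reproduced programmatically. The sketch below lists every shortest path between two codons (one per ordering of the differing positions) and counts synonymous and nonsynonymous steps on each. Function names are hypothetical, and paths passing through stop codons are not excluded, which some estimation methods would do.

```python
from itertools import permutations, product

# Standard genetic code, bases ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def pathways(c1: str, c2: str) -> list[list[str]]:
    """All shortest mutational paths from codon c1 to codon c2."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    paths = []
    for order in permutations(diff):
        codon, path = c1, [c1]
        for i in order:
            codon = codon[:i] + c2[i] + codon[i + 1:]
            path.append(codon)
        paths.append(path)
    return paths


def classify_path(path: list[str]) -> tuple[int, int]:
    """Return (synonymous steps, nonsynonymous steps) along one path."""
    syn = non = 0
    for a, b in zip(path, path[1:]):
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            non += 1
    return syn, non


for path in pathways("TTT", "GTA"):
    print(" -> ".join(path), classify_path(path))
```

For TTT and GTA this prints the two paths of the slide; codons differing at all three positions give 3! = 6 paths.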

  • 7/24/2019 207945773 Molecular Evolution

    31/66

    Evolutionary distance estimation between 2 sequences

    The simplest problem is the estimation of the number of synonymous (dS) and nonsynonymous (dN) substitutions per site between 2 sequences:

    the numbers of synonymous (S) and nonsynonymous (N) sites in the sequences are counted;

    the numbers of synonymous and nonsynonymous differences between the 2 sequences are counted;

    a correction for multiple substitutions at the same site is applied to calculate the numbers of synonymous (dS) and nonsynonymous (dN) substitutions per site between the 2 sequences.

    ==> many estimation methods
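The first two counting steps can be sketched as follows. This is a simplified counting scheme in the spirit of the classic unweighted-pathway approach, not the Yang & Nielsen method used later in the slides. It assumes gap-free, in-frame sequences, treats changes to stop codons as nonsynonymous, weights all shortest pathways equally, and omits the multiple-hit correction. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

# Standard genetic code, bases ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon: str) -> Fraction:
    """Synonymous sites in a codon: at each position, the fraction of the
    3 possible changes that leave the amino acid unchanged."""
    aa = CODON_TABLE[codon]
    s = Fraction(0)
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODON_TABLE[mutant] == aa:
                s += Fraction(1, 3)
    return s


def syn_nonsyn_diffs(c1: str, c2: str) -> tuple[Fraction, Fraction]:
    """(synonymous, nonsynonymous) differences averaged over shortest paths."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    if not diff:
        return Fraction(0), Fraction(0)
    orders = list(permutations(diff))
    syn = non = 0
    for order in orders:
        codon = c1
        for i in order:
            nxt = codon[:i] + c2[i] + codon[i + 1:]
            if CODON_TABLE[codon] == CODON_TABLE[nxt]:
                syn += 1
            else:
                non += 1
            codon = nxt
    return Fraction(syn, len(orders)), Fraction(non, len(orders))


def count_sites_and_diffs(seq1: str, seq2: str):
    """Return (S, N, Sd, Nd) for two aligned coding sequences."""
    codons1 = [seq1[i:i + 3] for i in range(0, len(seq1), 3)]
    codons2 = [seq2[i:i + 3] for i in range(0, len(seq2), 3)]
    S = sum(syn_sites(c) for c in codons1 + codons2) / 2  # mean of both sequences
    N = len(seq1) - S
    Sd = Nd = Fraction(0)
    for c1, c2 in zip(codons1, codons2):
        s, n = syn_nonsyn_diffs(c1, c2)
        Sd, Nd = Sd + s, Nd + n
    return S, N, Sd, Nd


S, N, Sd, Nd = count_sites_and_diffs("GAAGTTTTT", "GACGTCGTA")
print("S =", S, "N =", N, "Sd =", Sd, "Nd =", Nd)
```

Applied to the two 9-nucleotide sequences of the earlier example, the proportion of synonymous differences Sd/S comes out at 3/4. That is too high for a simple multiple-hit correction to be meaningful, which is expected for so short a toy example.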

  • 7/24/2019 207945773 Molecular Evolution

    32/66

    Evolutionary Distance estimation

    In general the genetic code affords fewer opportunities for synonymous changes than for nonsynonymous changes.

    Nevertheless, the observed rate of synonymous substitutions >> rate of nonsynonymous substitutions.

    Furthermore, the likelihood of either type of mutation is highly dependent on amino acid composition.

    For example: a protein containing a large number of leucines will contain many more opportunities for synonymous change than will a protein with a high number of lysines.

    L Leu Leucine TTA TTG CTT CTC CTA CTG

    (fourfold degenerate site; twofold degenerate site)

    Several possible substitutions will not change the amino acid leucine.

    K Lys Lysine AAA AAG

    Only one possible mutation, at the 3rd position, will not change lysine.

  • 7/24/2019 207945773 Molecular Evolution

    33/66

    Evolutionary Distance estimation

    Fundamental for the study of protein evolution and useful for

    constructing phylogenetic trees and estimation of divergence time.

  • 7/24/2019 207945773 Molecular Evolution

    34/66


    Ziheng Yang & Rasmus Nielsen (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17:32-43.

    Estimating synonymous and nonsynonymous substitution rates


  • 7/24/2019 207945773 Molecular Evolution

    35/66

    Purifying selection:

    Most of the time selection eliminates deleterious mutations, keeping

    the protein as it is.

    Positive selection:

    In a few instances we find that dN (also denoted Ka) is much greater than dS (also denoted Ks), i.e. dN/dS >> 1 (Ka/Ks >> 1). This is strong evidence that selection has acted to change the protein.

    Positive selection was tested for by comparing the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS). Because these numbers are normalized to the number of sites, if selection were neutral (i.e., as for a pseudogene) the dN/dS ratio would be equal to 1. An unequivocal sign of positive selection is a dN/dS ratio significantly exceeding 1, indicating a functional benefit to diversify the amino acid sequence.

    dN/dS < 0.25 indicates purifying selection;

    dN/dS = 1 suggests neutral evolution;

    dN/dS >> 1 indicates positive selection.
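The rule of thumb above can be written as a small helper. This is a sketch with hypothetical names. The 0.25 cut-off comes from the slide, ratios between 0.25 and 1 are left unclassified because the slide does not label them, and no significance test is performed for ratios above 1.

```python
def interpret_omega(dn: float, ds: float, purifying_below: float = 0.25):
    """Return (omega, label) for a pair of dN and dS estimates."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio dN/dS")
    omega = dn / ds
    if omega > 1:
        label = "possible positive selection (significance not tested)"
    elif omega < purifying_below:
        label = "purifying selection"
    else:
        label = "intermediate (not classified by the rule of thumb)"
    return omega, label


# Hypothetical dN and dS values, for illustration only.
for dn, ds in [(0.07, 0.36), (0.30, 0.50), (1.97, 0.55)]:
    omega, label = interpret_omega(dn, ds)
    print(f"dN={dn} dS={ds} omega={omega:.2f}: {label}")
```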

  • 7/24/2019 207945773 Molecular Evolution

    36/66

    Negative (purifying) selection eliminates disadvantageous mutations, i.e. inhibits protein evolution.

    (explains why dN < dS in most protein coding regions)

    Positive selection is very important for the evolution of new functions, especially for duplicated genes.

    (it must occur early after duplication, otherwise null mutations will be fixed, producing pseudogenes).

    dN/dS (or Ka/Ks) measures selection pressure

  • 7/24/2019 207945773 Molecular Evolution

    37/66

    Mutational saturation

    Mutational saturation in DNA and protein sequences occurs when sites have undergone multiple mutations, causing sequence dissimilarity (the observed differences) to no longer accurately reflect the true evolutionary distance, i.e. the number of substitutions that have actually occurred since the divergence of two sequences.

    Correct estimation of the evolutionary distance is crucial.

    Generally: sequences where dS > 2 are excluded to avoid the saturation effect of nucleotide substitution.
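Saturation can be illustrated numerically. The sketch below assumes the Jukes-Cantor model (a single substitution rate for all changes, listed among the distance models later in these slides), under which the expected observed proportion of differences is p = 3/4 (1 - exp(-4d/3)) for a true distance d. The function names are hypothetical.

```python
import math


def expected_p(d: float) -> float:
    """Expected proportion of differing sites for a true distance d
    (substitutions per site) under the Jukes-Cantor model."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc_distance(p: float) -> float:
    """Jukes-Cantor correction: estimated distance from an observed
    proportion of differences p. Undefined (infinite) once p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true d = {d:4.1f}   observed p = {expected_p(d):.3f}")
```

The observed proportion flattens towards 0.75 as d grows, so beyond a distance of about 2 a small error in p produces a very large error in the corrected distance.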

  • 7/24/2019 207945773 Molecular Evolution

    38/66

    PAML: Phylogenetic Analysis by Maximum Likelihood

    http://abacus.gene.ucl.ac.uk/software/paml.html

    YN00 - P13.4.C13.18.fa.paml

    ns = 13 ls = 29

    Estimation by the method of:

    Yang & Nielsen (2000):

    seq.          seq.             S     N     t   kappa  omega   dN +- SE       dS +- SE

    YALI0A08195g  YALI0A17963g   15.1  71.9  0.37  1.31   0.20   0.07 +- 0.03   0.36 +- 0.22
    YALI0E25443g  YALI0A17963g   17.3  69.7  1.8   1.31   0.05   0.13 +- 0.05   2.55 +- 13.95
    YALI0E25443g  YALI0A08195g   17.6  69.4  1.00  1.31   0.06   0.08 +- 0.03   1.35 +- 0.70
    YALI0C21230g  YALI0A17963g   24.1  62.9  5.35  1.31   0.75   1.63 +- 1.06   2.19 +- 1.70
    YALI0C21230g  YALI0A08195g   24.5  62.5  6.58  1.31   0.57   1.81 +- 1.43   3.19 +- 6.21
    YALI0C21230g  YALI0E25443g   24.9  62.1  4.76  1.31   1.27   1.69 +- 0.57   1.33 +- 0.59
    YALI0C21230g  YALI0A02783g   24.6  62.4  4.71  1.31   3.58   1.97 +- 0.81   0.55 +- 0.21
    YALI0C21230g  YALI0C21252g   25.4  61.6  6.64  1.31   3.22   2.77 +- 2.27   0.86 +- 0.32
    YALI0C21230g  YALI0C21274g   25.3  61.7  6.54  1.31   3.46   2.75 +- 2.21   0.79 +- 0.34
    YALI0C21230g  YALI0F09944g   24.3  62.7  7.51  1.31   2.31   2.97 +- 2.93   1.29 +- 1.09
    YALI0C21230g  YALI0A13497g   28.2  58.8  7.13  1.31   3.20   3.06 +- 3.38   0.95 +- 0.34
    ...
    YALI0C21230g  YALI0B06160g   27.1  59.9  7.34  1.31   1.66   2.79 +- 2.37   1.68 +- 0.86
    YALI0D11638g  YALI0C21230g   27.3  59.7  8.04  1.31   1.68   3.07 +- 3.40   1.83 +- 1.39
    YALI0E19140g  YALI0C21230g   25.2  61.8  7.67  1.31   2.48   3.09 +- 3.46   1.25 +- 0.54
    YALI0E19140g  YALI0D11638g   22.4  64.6  4.12  1.31   0.45   1.04 +- 0.29   2.33 +- 2.13

    -> yn00 gives results similar to ML (Yang & Nielsen (2000));

    -> advantage: easy automation for large-scale comparisons.

  • 7/24/2019 207945773 Molecular Evolution

    39/66

    Relative Rate Test

    [Figure: tree of species 1, 2 and 3, with A the common ancestor of 1 and 2.]

    To determine the relative rate of substitution in species 1 and 2, we need an outgroup (species 3).

    The point in time when 1 and 2 diverged is marked A (common ancestor of 1 and 2).

    The number of substitutions between any two species is assumed to be the sum of the number of substitutions along the branches of the tree connecting them:

    d13 = dA1 + dA3

    d23 = dA2 + dA3

    d12 = dA1 + dA2

    d13, d23 and d12 are measures of the differences between 1 and 3, 2 and 3, and 1 and 2 respectively.

    dA1 = (d12 + d13 - d23)/2

    dA2 = (d12 + d23 - d13)/2

    dA1 and dA2 should be the same (A is the common ancestor of 1 and 2).
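The two branch-length formulas can be evaluated directly. The sketch below uses hypothetical distances; the function name is an illustration, not part of any package mentioned in the slides.

```python
def branch_lengths(d12: float, d13: float, d23: float) -> tuple[float, float]:
    """Branch lengths from the ancestor A to species 1 and 2,
    given the three pairwise distances (species 3 is the outgroup)."""
    dA1 = (d12 + d13 - d23) / 2.0
    dA2 = (d12 + d23 - d13) / 2.0
    return dA1, dA2


# Hypothetical distances built from dA1 = 0.1, dA2 = 0.2, dA3 = 0.5.
dA1, dA2 = branch_lengths(d12=0.3, d13=0.6, d23=0.7)
print(f"dA1 = {dA1:.2f}, dA2 = {dA2:.2f}, difference = {dA1 - dA2:+.2f}")
```

Note that dA1 - dA2 = d13 - d23, so the test amounts to asking whether the two ingroup species are equally distant from the outgroup. In this made-up example lineage 2 has accumulated twice as many substitutions as lineage 1.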

  • 7/24/2019 207945773 Molecular Evolution

    40/66

    Evolution of functionally important regions over time. Immediately after a speciation event, the two copies of the genomic region are 100% identical (see graph on left). Over time, regions under little or no selective pressure, such as introns, are saturated with mutations, whereas regions under negative selection, such as most exons, retain a higher percent identity (see graph on right). Many sequences involved in regulating gene expression also maintain a higher percent identity than do sequences with no function.

    COMPARATIVE GENOMICS

    Webb Miller, Kateryna D. Makova, Anton Nekrutenko, and Ross C. Hardison

    Annual Review of Genomics and Human Genetics

    Vol. 5: 15-56 (2004)

  • 7/24/2019 207945773 Molecular Evolution

    41/66

    Yang & Nielsen,

    Estimating Synonymous and Nonsynonymous Substitution Rates Under

    Realistic Evolutionary Models

    Mol. Biol. Evol. 2000, 17:32-43

    =>Other estimation Models

    Reference


  • 7/24/2019 207945773 Molecular Evolution

    42/66

    Evolutionary distance estimation between 2 sequences

    Under certain conditions, however, nonsynonymous substitution may be accelerated by positive Darwinian selection. It is therefore interesting to examine the number of synonymous differences per synonymous site and the number of nonsynonymous differences per nonsynonymous site.

    p-distance:

    ps = Sd/S, proportion of synonymous differences;

    var(ps) = ps(1-ps)/S.

    pn = Nd/N, proportion of nonsynonymous differences;

    var(pn) = pn(1-pn)/N.

    Sd and Nd are respectively the total numbers of synonymous and nonsynonymous differences calculated over all codons. S and N are the numbers of synonymous and nonsynonymous sites.

    S+N = n, the total number of nucleotides, and N >> S.
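The p-distance and its variance are simple enough to compute by hand, but a short sketch makes the roles of S and N explicit. The counts below are hypothetical, and the helper name is an illustration.

```python
import math


def p_distance(differences: float, sites: float) -> tuple[float, float]:
    """Proportion of differences p and its binomial variance p(1 - p)/sites."""
    p = differences / sites
    return p, p * (1.0 - p) / sites


# Hypothetical counts: Sd = 12 over S = 60 sites, Nd = 15 over N = 240 sites.
ps, var_ps = p_distance(12, 60)
pn, var_pn = p_distance(15, 240)
print(f"ps = {ps:.3f} +- {math.sqrt(var_ps):.3f}")
print(f"pn = {pn:.4f} +- {math.sqrt(var_pn):.4f}")
```

The same function applies unchanged to amino acid sequences, with the number of amino acid differences and the number of residues compared as arguments.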

  • 7/24/2019 207945773 Molecular Evolution

    43/66

    Substitutions between protein sequences

    p = nd/n

    V(p)=p(1-p)/n

    nd and n are the number of amino acid differences and the total number of amino acids compared.

    However, refining estimates of the number of substitutions that have occurred between the amino acid sequences of 2 or more proteins is generally more difficult than the equivalent task for coding sequences (see paths above).

    One solution is to weight each amino acid substitution differently, using empirical data from a variety of different protein comparisons to generate a matrix such as the PAM matrix.

  • 7/24/2019 207945773 Molecular Evolution

    44/66

    Other distance models

  • 7/24/2019 207945773 Molecular Evolution

    45/66

    Each matrix gives the rate of substitution from the row nucleotide to the column nucleotide.

    Jukes-Cantor model:
            A     T     C     G
        A   -     λ     λ     λ
        T   λ     -     λ     λ
        C   λ     λ     -     λ
        G   λ     λ     λ     -
    λ is the rate of substitution.

    Tajima-Nei model:
            A     T     C     G
        A   -     β     γ     δ
        T   α     -     γ     δ
        C   α     β     -     δ
        G   α     β     γ     -
    α, β, γ and δ are the rates of substitution.

    Kimura 2-parameter model:
            A     T     C     G
        A   -     β     β     α
        T   β     -     α     β
        C   β     α     -     β
        G   α     β     β     -
    α and β are the rates of transitional and transversional substitutions.

    Tamura model:
            A         T         C      G
        A   -         (1-θ)β    θβ     θα
        T   (1-θ)β    -         θα     θβ
        C   (1-θ)β    (1-θ)α    -      θβ
        G   (1-θ)α    (1-θ)β    θβ     -
    α and β are the rates of transitional and transversional substitutions and θ is the G+C content.

    Hasegawa et al. model:
            A      T      C      G
        A   -      gTβ    gCβ    gGα
        T   gAβ    -      gCα    gGβ
        C   gAβ    gTα    -      gGβ
        G   gAα    gTβ    gCβ    -
    α and β are the rates of transitional and transversional substitutions and gi are the nucleotide frequencies (i = A, T, C, G).

    Tamura-Nei model:
            A       T       C       G
        A   -       gTβ     gCβ     gGα1
        T   gAβ     -       gCα2    gGβ
        C   gAβ     gTα2    -       gGβ
        G   gAα1    gTβ     gCβ     -
    α1 and α2 are the rates of transitional substitutions between purines and between pyrimidines; β is the rate of transversional substitutions; and gi are the nucleotide frequencies (i = A, T, C, G).
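As an illustration of how such a model turns observed differences into a distance, here is a sketch of the standard Kimura 2-parameter estimate, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. The function name is hypothetical; gap columns are skipped, and the formula is undefined when the arguments of the logarithms are not positive.

```python
import math

PURINES = {"A", "G"}


def k80_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    ts = tv = n = 0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue  # ignore gap columns
        n += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1
    P, Q = ts / n, tv / n
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# The gapped alignment shown earlier in the slides.
d = k80_distance("GACGACCATAGACCAGCATAG", "GACTACCATAGA-CTGCAAAG")
print(f"K80 distance = {d:.4f}")
```

In this alignment all three differences are transversions (P = 0, Q = 3/20), so the estimate is slightly larger than the raw proportion of differences, 0.15.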

    Other distance models

  • 7/24/2019 207945773 Molecular Evolution

    46/66

    Example: yn00 in PAML.

    Protein sequences in a family

    and corresponding DNA sequences

    Procedure

  • 7/24/2019 207945773 Molecular Evolution

    47/66

    1. Align the protein sequences of a family using ClustalW.

    2. Align the corresponding DNA sequences, using as template the amino acid alignment obtained in step 1.

    3. Convert the DNA alignment to yn00 input format.

    4. Run the yn00 program (PAML package) on the resulting DNA alignment.

    5. Clean the yn00 output to collect the YN (Yang & Nielsen) estimates in a file. Estimates with large standard errors were eliminated.

    6. From the YN estimates, extract gene pairs with w = dN/dS >= 3 and gene pairs with [...]. Gene pairs with w >= 3 are considered candidate genes on which positive selection may operate, whereas genes with w [...]
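Steps 5 and 6 can be sketched as a simple filter over already-parsed estimates. Everything here is an assumption made for illustration: the record layout, the helper names, and in particular the reliability criterion (standard error no larger than the estimate), since the slides do not say what counts as a large standard error. The three records are copied from the yn00 output shown earlier.

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    seq1: str
    seq2: str
    omega: float
    dn: float
    dn_se: float
    ds: float
    ds_se: float


ROWS = [
    Estimate("YALI0A08195g", "YALI0A17963g", 0.20, 0.07, 0.03, 0.36, 0.22),
    Estimate("YALI0E25443g", "YALI0A17963g", 0.05, 0.13, 0.05, 2.55, 13.95),
    Estimate("YALI0C21230g", "YALI0A02783g", 3.58, 1.97, 0.81, 0.55, 0.21),
]


def reliable(e: Estimate, max_relative_se: float = 1.0) -> bool:
    """Hypothetical criterion: keep a pair only if both standard errors
    are no larger than max_relative_se times the estimate."""
    return (e.dn_se <= max_relative_se * e.dn
            and e.ds_se <= max_relative_se * e.ds)


def candidates(rows, omega_min: float = 3.0):
    """Reliable gene pairs with omega >= omega_min."""
    return [e for e in rows if reliable(e) and e.omega >= omega_min]


for e in candidates(ROWS):
    print(e.seq1, e.seq2, e.omega)
```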

  • 7/24/2019 207945773 Molecular Evolution

    48/66

    [Figure: plot of the dN and dS estimates.]

                  mean   std    n      min   max
    dN            0.90   0.6    5085   0.0   4.98
    dS            2.96   1.3    5085   0.0   6.84
    w=dN/dS       0.34   0.32   5085   0.0   4.45
    w=dN/dS>=3    3.6    0.57   10     3.0   4.45

    Most of the genes are under purifying selection.

    Only a few genes might be under positive selection.

  • 7/24/2019 207945773 Molecular Evolution

    49/66

    Codon volatility

  • 7/24/2019 207945773 Molecular Evolution

    50/66

    A new concept: codon volatility (Plotkin et al. 2004. Nature 428, p. 942-945).

    A recently introduced method, the utility of which is still under debate;

    it has interesting consequences for the study of codon variability.

    Detecting Selection

  • 7/24/2019 207945773 Molecular Evolution

    51/66

    Detecting Selection

    If a protein coding region of a nucleotide sequence has undergone an excess number of amino-acid substitutions, then the region will on average contain an overabundance of volatile codons, compared with the genome as a whole.

    Plotkin et al. Nature 428; 942-945

    Using the concept of codon volatility, we can scan an entire genome to find genes that show significantly more, or less, pressure for amino-acid substitutions than the genome as a whole.

    If a gene contains many residues under pressure for aa replacements, then the resulting codons in that gene will on average exhibit elevated volatility.

    If a gene is under purifying selection not to change its aa, then the resulting sequence will on average exhibit lower volatility.

    Codon volatility

  • 7/24/2019 207945773 Molecular Evolution

    52/66

    Codons volatility

    The codon CGA, encoding arginine (R), has 8 potential ancestor codons (i.e. non-stop codons) that differ from CGA by one substitution.

    The volatility of a codon is defined as the proportion of nonsynonymous codons among all neighbouring sense codons reachable by a single substitution.

    The volatility of CGA = 4/8.

    The volatility of AGA, which also encodes arginine, = 6/8.

    [Figure: the 8 single-substitution sense neighbours of CGA and of AGA, numbered 1 to 8.]

    Plotkin et al. 2004. Nature 428, p. 942-945

    Codon volatility
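The two volatility values can be recomputed from the genetic code. This is a minimal sketch following the definition on the slide (neighbours that are stop codons are excluded from the count). The compact table construction and the function name are choices made here.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each position ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def volatility(codon: str) -> float:
    """Fraction of single-substitution sense neighbours of `codon`
    that encode a different amino acid."""
    aa = CODON_TABLE[codon]
    neighbours = [
        codon[:i] + base + codon[i + 1:]
        for i in range(3)
        for base in BASES
        if base != codon[i]
    ]
    sense = [n for n in neighbours if CODON_TABLE[n] != "*"]
    changed = sum(CODON_TABLE[n] != aa for n in sense)
    return changed / len(sense)


for codon in ("CGA", "AGA"):
    print(codon, CODON_TABLE[codon], volatility(codon))
```

Both codons lose one of their nine neighbours (TGA, a stop codon), leaving the 8 sense neighbours mentioned on the slide.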

  • 7/24/2019 207945773 Molecular Evolution

    53/66

    Codons volatility

    22 codons have at least one synonymous codon with a different volatility.

    Volatility of a codon c:

    $$v(c) = \frac{1}{n} \sum_{i=1}^{n} D[\mathrm{aacid}(c), \mathrm{aacid}(c_i)]$$

    n is the number of neighbours c_i (other than stop codons) that can mutate to c by a single substitution.

    D is the Hamming distance: D = 0 if the 2 aa are identical, D = 1 otherwise.

    Volatility of a gene G:

    $$v(G) = \sum_{k=1}^{l} v(c_k)$$

    l is the number of codons in the gene G.
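    The two definitions above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code and a sequence free of internal stop codons; the function names (`sense_neighbours`, `codon_volatility`, `gene_volatility`) are hypothetical and not from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def sense_neighbours(codon):
    """Codons one substitution away from `codon`, stop codons excluded."""
    neighbours = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                candidate = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[candidate] != "*":
                    neighbours.append(candidate)
    return neighbours


def codon_volatility(codon):
    """v(c): fraction of sense neighbours encoding a different amino acid."""
    aa = CODON_TABLE[codon]
    neighbours = sense_neighbours(codon)
    return sum(CODON_TABLE[n] != aa for n in neighbours) / len(neighbours)


def gene_volatility(sequence):
    """v(G): sum of codon volatilities (assumes no stop codons in `sequence`)."""
    usable = len(sequence) - len(sequence) % 3
    return sum(codon_volatility(sequence[i:i + 3]) for i in range(0, usable, 3))


print(codon_volatility("CGA"), codon_volatility("AGA"))
```

    Because v(G) is a plain sum, it grows with gene length; it only becomes comparable across genes once it is set against resampled sequences of the same length and translation.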


    Codons volatility

    Volatility is used to quantify the probability that the most recent substitution at a site caused an amino-acid change.

    Each gene's observed volatility is compared with a bootstrap distribution of alternative synonymous sequences, drawn according to the background codon usage in the genome, and its significance is assessed statistically.

    The randomization procedure controls for the gene's length and amino-acid composition.

    The volatility of a gene G is defined as the sum of the volatilities of its codons.


    Codons volatility

    Volatility p-value of G:

    The observed v(G) is compared with a bootstrap distribution of 10^6 synonymous versions of the gene G.

    In each randomization sample, a nucleotide sequence G' is constructed so that it has the same translation as G, but its codons are drawn randomly according to the relative frequencies of synonymous codons in the whole genome.

    p-value for G = proportion of randomized samples G' such that v(G') > v(G).

    1-p is a p-value that tests whether a gene is significantly less volatile than the genome as a whole.
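    The randomization procedure described above can be sketched as follows. This is an illustrative outline, not the authors' implementation: the function name, its arguments (`codon_aa`, `codon_vol`, `genome_usage`) and the small default number of samples are assumptions made here, and the caller is expected to supply a codon-to-amino-acid map, per-codon volatilities and genome-wide codon counts.

```python
import random


def volatility_p_value(gene_codons, codon_aa, codon_vol, genome_usage,
                       n_samples=1000, seed=0):
    """Proportion of synonymous resamples G' with v(G') > v(G).

    gene_codons  : list of codons of the gene G
    codon_aa     : dict codon -> amino acid (sense codons only)
    codon_vol    : dict codon -> volatility v(c)
    genome_usage : dict codon -> genome-wide count (hypothetical input)
    """
    rng = random.Random(seed)

    # Group synonymous codons by amino acid, with genome-wide weights.
    synonyms = {}
    for codon, aa in codon_aa.items():
        synonyms.setdefault(aa, []).append(codon)
    weights = {aa: [genome_usage[c] for c in codons]
               for aa, codons in synonyms.items()}

    observed = sum(codon_vol[c] for c in gene_codons)
    exceed = 0
    for _ in range(n_samples):
        # Build G': same translation as G, codons drawn by genome usage.
        resampled = 0.0
        for codon in gene_codons:
            aa = codon_aa[codon]
            drawn = rng.choices(synonyms[aa], weights=weights[aa])[0]
            resampled += codon_vol[drawn]
        if resampled > observed:
            exceed += 1
    return exceed / n_samples


# Toy example with only the two arginine codons discussed earlier.
toy_aa = {"CGA": "R", "AGA": "R"}
toy_vol = {"CGA": 0.5, "AGA": 0.75}
toy_usage = {"CGA": 1, "AGA": 1}
p_high_vol = volatility_p_value(["AGA"] * 5, toy_aa, toy_vol, toy_usage)
p_low_vol = volatility_p_value(["CGA"] * 5, toy_aa, toy_vol, toy_usage)
```

    Resampling codon by codon keeps the length and amino-acid composition of G fixed, which is why the procedure controls for both. The slides quote 10^6 samples; the default of 1000 here is only to keep the sketch fast.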


    Detecting Selection

    A p-value near zero indicates significantly elevated volatility, whereas a p-value near one indicates significantly depressed volatility.

    The probability that a site's most recent substitution caused a non-synonymous change is:

    - greater for a site under positive selection;

    - smaller for a site under negative (purifying) selection.

    http://www.cgr.harvard.edu/volatility

    1) Paul M. Sharp

    Gene "volatility" is Most Unlikely to Reveal Adaptation

    MBE Advance Access published on December 22, 2004.

    doi:10.1093/molbev/msi073

    2) Tal Dagan and Dan Graur

    The Comparative Method Rules! Codon Volatility Cannot Detect Positive Darwinian Selection Using a Single Genome Sequence

    MBE Advance Access published on November 3, 2004.

    doi:10.1093/molbev/msi033

    3) Robert Friedman and Austin L. Hughes

    Codon Volatility as an Indicator of Positive Selection: Data from Eukaryotic Genome Comparisons

    MBE Advance Access originally published on November 3, 2004. This version published November 8, 2004.

    doi:10.1093/molbev/msi038

    4) Hahn MW, Mezey JG, Begun DJ, Gillespie JH, Kern AD, Langley CH, Moyle LC.

    Evolutionary genomics: Codon bias and selection on single genomes.

    Nature. 2005 Jan 20;433(7023):E5-6.

    5) Nielsen R, Hubisz MJ.

    Evolutionary genomics: Detecting selection needs comparative data.

    Nature. 2005 Jan 20;433(7023):E6.

    6) Chen Y, Emerson JJ, Martin TM

    Evolutionary genomics: Codon volatility does not detect selection.

    Nature. 2005 Jan 20;433(7023):E6-7.

    7) Zhang J, 2005.

    On the evolution of codon volatility

    Genetics 169: 495-501.

    8) Plotkin JB, Dushoff J, Fraser HB.

    Evolutionary genomics: Codon volatility does not detect selection (reply).

    Nature. 2005 Jan 20;433(7023):E7-8.

    9) Plotkin JB, Dushoff J, Desai MM and Fraser HB

    Synonymous codon and selection on proteins

    -> Volatility is not adequate for predicting selection;

    -> Extreme volatility classes have interesting properties, in terms of aa composition or codon bias;

    -> Volatility may be another measure of codon bias;

    -> Authors: some genes are under more positive, or less negative, selection than others.


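    Codon Volatility (simple substitution model):

    Codons and volatility under simple substitution model

    Columns: amino acid (aa) and codon; counts of single-substitution neighbours by encoded amino acid (A to V; empty cells omitted); taa = total number of sense neighbours; daa = number of neighbours encoding a different aa; Vol = daa/taa; G+C and A+T = number of G or C bases, and of A or T bases, in the codon.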

    aa A R N D C Q E G H I L K M F P S T W Y V taa daa Vol G+C A+T

    A GCT 3 1 1 1 1 1 1 9 6 0.67 2 1

    A GCC 3 1 1 1 1 1 1 9 6 0.67 3 0

    A GCA 3 1 1 1 1 1 1 9 6 0.67 2 1

    A GCG 3 1 1 1 1 1 1 9 6 0.67 3 0


    R CGT 3 1 1 1 1 1 1 9 6 0.67 2 1

    R CGC 3 1 1 1 1 1 1 9 6 0.67 3 0

    R CGA 4 1 1 1 1 8 4 0.5 2 1

    R CGG 4 1 1 1 1 1 9 5 0.56 3 0

    R AGA 2 1 1 1 2 1 8 6 0.75 1 2

    R AGG 2 1 1 1 2 1 1 9 7 0.78 2 1

    N AAT 1 1 1 1 2 1 1 1 9 8 0.89 0 3

    N AAC 1 1 1 1 2 1 1 1 9 8 0.89 1 2

    D GAT 1 1 1 2 1 1 1 1 9 8 0.89 1 2

    D GAC 1 1 1 2 1 1 1 1 9 8 0.89 2 1

    C TGT 1 1 1 1 2 1 1 8 7 0.88 1 2

    C TGC 1 1 1 1 2 1 1 8 7 0.88 2 1

    Q CAA 1 1 1 2 1 1 1 8 7 0.88 1 2

    Q CAG 1 1 1 2 1 1 1 8 7 0.88 2 1

    E GAA 1 2 1 1 1 1 1 8 7 0.88 1 2

    E GAG 1 2 1 1 1 1 1 8 7 0.88 2 1

    G GGT 1 1 1 1 3 1 1 9 6 0.67 2 1

    G GGC 1 1 1 1 3 1 1 9 6 0.67 3 0

    G GGA 1 2 1 3 1 8 5 0.63 2 1

    G GGG 1 2 1 3 1 1 9 6 0.67 3 0

    H CAT 1 1 1 2 1 1 1 1 9 8 0.89 1 2

    H CAC 1 1 1 2 1 1 1 1 9 8 0.89 2 1

    I ATT 1 2 1 1 1 1 1 1 9 7 0.78 0 3

    I ATC 1 2 1 1 1 1 1 1 9 7 0.78 1 2

    I ATA 1 2 2 1 1 1 1 9 7 0.78 0 3

    L TTA 1 2 2 1 1 7 5 0.71 0 3

    L TTG 2 1 2 1 1 1 8 6 0.75 1 2

    L CTT 1 1 1 3 1 1 1 9 6 0.67 1 2

    L CTC 1 1 1 3 1 1 1 9 6 0.67 2 1

    L CTA 1 1 1 4 1 1 9 5 0.56 1 2

    L CTG 1 1 4 1 1 1 9 5 0.56 2 1

    K AAA 1 2 1 1 1 1 1 8 7 0.88 0 3

    K AAG 1 2 1 1 1 1 1 8 7 0.88 1 2

    M ATG 1 3 2 1 1 1 9 9 1. 1 2

    F TTT 1 1 3 1 1 1 1 9 8 0.89 0 3

    F TTC 1 1 3 1 1 1 1 9 8 0.89 1 2

    P CCT 1 1 1 1 3 1 1 9 6 0.67 2 1

    P CCC 1 1 1 1 3 1 1 9 6 0.67 3 0

    P CCA 1 1 1 1 3 1 1 9 6 0.67 2 1

    P CCG 1 1 1 1 3 1 1 9 6 0.67 3 0

    S TCT 1 1 1 1 3 1 1 9 6 0.67 1 2

    S TCC 1 1 1 1 3 1 1 9 6 0.67 2 1

    S TCA 1 1 1 3 1 7 4 0.57 1 2

    S TCG 1 1 1 3 1 1 8 5 0.63 2 1

    S AGT 3 1 1 1 1 1 1 9 8 0.89 1 2

    S AGC 3 1 1 1 1 1 1 9 8 0.89 2 1

    T ACT 1 1 1 1 2 3 9 6 0.67 1 2

    T ACC 1 1 1 1 2 3 9 6 0.67 2 1

    T ACA 1 1 1 1 1 1 3 9 6 0.67 1 2

    T ACG 1 1 1 1 1 1 3 9 6 0.67 2 1

    W TGG 2 2 1 1 1 7 7 1. 2 1

    Y TAT 1 1 1 1 1 1 1 7 6 0.86 0 3

    Y TAC 1 1 1 1 1 1 1 7 6 0.86 1 2

    V GTT 1 1 1 1 1 1 3 9 6 0.67 1 2

    V GTC 1 1 1 1 1 1 3 9 6 0.67 2 1

    V GTA 1 1 1 1 2 3 9 6 0.67 1 2

    V GTG 1 1 1 2 1 3 9 6 0.67 2 1

    Tot 36 54 18 18 18 18 18 36 18 27 54 18 9 18 36 54 36 9 18 36


    Codons Volatility: Standard Genetic Code

    [Figure: volatility (y-axis, 0.4 to 1) of the 61 sense codons (x-axis, labelled by amino acid and codon), ordered by increasing volatility from R CGA to M ATG and W TGG. Legend: Arg, Gly, Leu, Ser.]

    12 distinct volatility values

    only 4 aa contain synonymous codons (22) of different volatilities
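    The two counts above can be checked directly from the standard genetic code. The sketch below is an illustration with hypothetical names (`volatility`, `mixed`, `distinct_values`); it uses exact fractions so that values such as 7/8 and 8/9, which look similar once rounded to two decimals, are kept apart.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def volatility(codon):
    """Exact volatility: nonsynonymous sense neighbours / all sense neighbours."""
    aa = CODON_TABLE[codon]
    neighbours = [codon[:pos] + base + codon[pos + 1:]
                  for pos in range(3) for base in BASES if base != codon[pos]]
    sense = [n for n in neighbours if CODON_TABLE[n] != "*"]
    return Fraction(sum(CODON_TABLE[n] != aa for n in sense), len(sense))


# Volatility of every sense codon, grouped by amino acid.
by_aa = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        by_aa.setdefault(aa, {})[codon] = volatility(codon)

# Amino acids whose synonymous codons do not all share one volatility.
mixed = {aa: vols for aa, vols in by_aa.items()
         if len(set(vols.values())) > 1}
distinct_values = {v for vols in by_aa.values() for v in vols.values()}

print(sorted(mixed), sum(len(v) for v in mixed.values()), len(distinct_values))
```

    Only codons belonging to the amino acids in `mixed` can shift a gene's volatility without changing its translation, which is why these are the codons that carry the signal in the resampling test.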


    Standard Genetic Code

    [Figure: codon volatility (y-axis) against the number of G+C bases in the codon (x-axis, 0 to 3).]

    G+C: Spearman r = 0.4312, p < 0.0005

    Number of codons by volatility (Vol) and G+C count:

    Vol     0    1    2    3
    0.50    0    0    1    0
    0.56    0    1    1    1
    0.57    0    1    0    0
    0.63    0    0    2    0
    0.67    0    6   12    7
    0.71    1    0    0    0
    0.75    0    2    0    0
    0.78    2    1    1    0
    0.86    1    1    0    0
    0.88    1    4    3    0
    0.89    2    5    3    0
    1.00    0    1    1    0

    Standard Genetic Code

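    [Figure: codon volatility (y-axis) against the number of A+T bases in the codon (x-axis, 0 to 3).]

    A+T: Spearman r = 0.4283, p < 0.0006

    Number of codons by volatility (Vol) and A+T count:

    Vol     0    1    2    3
    0.50    0    1    0    0
    0.56    1    1    1    0
    0.57    0    0    1    0
    0.63    0    2    0    0
    0.67    7   12    6    0
    0.71    0    0    0    1
    0.75    0    0    2    0
    0.78    0    1    1    2
    0.86    0    0    1    1
    0.88    0    3    4    1
    0.89    0    3    5    2
    1.00    0    1    1    0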

    http://../SELECTION_VOLATILEcodons/STATS/Vol_tab.doc

    References:

    Ziheng Yang and Rasmus Nielsen (2000)

    Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol. 17:32-43.

    Yang Z. and Bielawski J.P. (2000)

    Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 15:496-503.

    Phylogenetic Analysis by Maximum Likelihood (PAML)

    http://abacus.gene.ucl.ac.uk/software/paml.html

    Plotkin JB, Dushoff J, Fraser HB (2004)

    Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428:942-5.

    Molecular Evolution: A Phylogenetic Approach

    Page, RDM and Holmes, EC (Blackwell Science, 2004)

    Sharp, PM & Li WH (1987). NAR 15: p. 1281-1295.


    References

    MEGA: http://www.megasoftware.net/

    PAML: http://abacus.gene.ucl.ac.uk/software/paml.html

    Fundamental concepts of Bioinformatics.

    Dan E. Krane and Michael L. Raymer

    Genomes, 2nd edition. T.A. Brown

    Phylogeny programs :

    http://evolution.genetics.washington.edu/phylip/sftware.html

    Books:

    Molecular Evolution; A phylogenetic Approach

    Page, RDM and Holmes, EC

    S i