TREES
Dec 20, 2015
[Figure: trees on Human, Chimp and Gorilla. Drawings that differ only in the left-to-right order of the branches are the same tree (=); drawings that group different species together are different trees (≠).]
Same thing…
[Figure: two drawings of one tree on s1, s2, s3, s4, s5, marked as equal (=)]
The maximum parsimony principle
Evaluation of the tree topology
Genes: 0 = absent, 1 = present
species  g1  g2  g3  g4  g5  g6
s1        1   0   0   1   1   0
s2        0   0   1   0   0   0
s3        1   1   0   0   0   0
s4        1   1   0   1   1   1
s5        0   0   1   1   1   0
Evaluate this tree…
[Figure: tree with the leaves drawn in the order s1 s4 s3 s2 s5]
Gene number 1
Leaf states (s1 s4 s3 s2 s5): 1 1 1 0 0
[Figure: Option number 1 and Option number 2, two assignments of 0/1 states to the internal nodes]
Number of changes for g1 = 1
Gene number 2
Leaf states (s1 s4 s3 s2 s5): 0 1 1 0 0
[Figure: Option numbers 1, 2 and 3, three assignments of 0/1 states to the internal nodes]
Number of changes for g2 = 2
Gene number 3
Leaf states (s1 s4 s3 s2 s5): 0 0 0 1 1
[Figure: Option number 1 and Option number 2, two assignments of 0/1 states to the internal nodes]
Number of changes for g3 = 1
Gene number 4
Leaf states (s1 s4 s3 s2 s5): 1 1 0 0 1
[Figure: Option number 1 and Option number 2, two assignments of 0/1 states to the internal nodes]
Number of changes for g4 = 2
Gene number 5 is the same as Gene number 4
Number of changes for g5 = 2
Gene number 6, one option only:
Leaf states (s1 s4 s3 s2 s5): 0 1 0 0 0
[Figure: all internal nodes are assigned state 0]
Number of changes for g6 = 1
Sum of changes
Number of changes for g1 = 1
Number of changes for g2 = 2
Number of changes for g3 = 1
Number of changes for g4 = 2
Number of changes for g5 = 2
Number of changes for g6 = 1
Sum of changes for this tree topology = 1 + 2 + 1 + 2 + 2 + 1 = 9
Can we do better?
The MP (most parsimonious) tree:
[Figure: tree with the leaves drawn in the order s1 s4 s3 s2 s5]
Sum of changes for this tree topology = 8
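The option-by-option evaluation above can be sketched in a few lines of Python. The tree drawings did not survive in this text, so the two topologies below are assumptions: `TREE_9` and `TREE_8` are simply topologies that keep the leaf order s1 s4 s3 s2 s5 and reproduce the per-gene counts and the totals 9 and 8 given on the slides. The function names are illustrative only.

```python
from itertools import product

# Presence/absence table from the slides (columns g1..g6).
GENES = {
    "s1": "100110",
    "s2": "001000",
    "s3": "110000",
    "s4": "110111",
    "s5": "001110",
}

# ASSUMED topologies (the figures are lost), written as nested tuples.
TREE_9 = ((("s1", "s4"), "s3"), ("s2", "s5"))
TREE_8 = (("s1", ("s4", "s3")), ("s2", "s5"))


def edges_and_internal(tree):
    """Return (list of (parent, child) edges, list of internal nodes)."""
    edges, internal = [], []

    def walk(node):
        if isinstance(node, tuple):          # internal node
            internal.append(node)
            for child in node:
                edges.append((node, child))
                walk(child)

    walk(tree)
    return edges, internal


def min_changes(tree, gene_index):
    """Try every 0/1 labelling of the internal nodes (the "options")
    and return the smallest number of changes along the branches."""
    edges, internal = edges_and_internal(tree)
    best = None
    for states in product("01", repeat=len(internal)):
        label = dict(zip(internal, states))
        for leaf, row in GENES.items():
            label[leaf] = row[gene_index]
        changes = sum(label[parent] != label[child] for parent, child in edges)
        best = changes if best is None else min(best, changes)
    return best


def tree_score(tree):
    """Sum of the per-gene minimum number of changes."""
    return sum(min_changes(tree, g) for g in range(6))


print([min_changes(TREE_9, g) for g in range(6)], tree_score(TREE_9))
print([min_changes(TREE_8, g) for g in range(6)], tree_score(TREE_8))
```

Enumerating all labellings is only practical for tiny trees; it is used here because it mirrors the "options" shown on the slides.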
How many rooted trees?
TR = “TREE ROOTED”
N=2, TR(2) = 1
[Figure: the single rooted tree on a, b]
N=3, TR(3) = 3
[Figure: the three rooted trees on a, b, c]
N=4, TR(4) = 15
[Figure: the 15 rooted trees on a, b, c, d]
How many rooted trees
2 sequences: 1 tree
3 sequences: 3 trees
4 sequences: 3*5 = 15 trees
5 sequences: 3*5*7 = 105 trees
…
TR(n) = 1*3*5*7*…*(2n-3)
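As a quick check of the formula, here is a minimal Python sketch that multiplies the odd factors up to 2n-3. The function name `rooted_trees` is illustrative.

```python
def rooted_trees(n):
    """TR(n) = 1*3*5*...*(2n-3): number of rooted binary trees on n labelled leaves."""
    if n < 2:
        raise ValueError("n must be at least 2")
    count = 1
    for factor in range(3, 2 * n - 2, 2):   # odd factors 3, 5, ..., 2n-3
        count *= factor
    return count


for n in (2, 3, 4, 5, 10):
    print(n, rooted_trees(n))
```

Already at n = 10 the count is in the tens of millions, which is why evaluating every tree quickly becomes infeasible.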
Rooting the tree
Rooted vs. unrooted trees
[Figure: a rooted tree and an unrooted tree on the leaves 1, 2, 3]
The position of the root does not affect the MP score.
Rooted vs. Unrooted:
[Figure: the tree for gene number 1, option number 1, drawn without a root; leaf states (s1 s4 s3 s2 s5): 1 1 1 0 0]
Intuition why rooting doesn’t change the score
The change will always be on the same branch, no matter where the root is positioned…
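The intuition can be checked with a small Python sketch. The unrooted tree below is an assumption (the figure is lost): it uses hypothetical internal node names `x`, `y`, `z` and the gene 1 states. A root is placed in the middle of each branch in turn and the number of changes is recounted.

```python
# ASSUMED unrooted tree, given as undirected branches.
# "x", "y", "z" are hypothetical names for the internal nodes.
EDGES = [
    ("s1", "x"), ("s4", "x"), ("x", "y"), ("s3", "y"),
    ("y", "z"), ("s2", "z"), ("s5", "z"),
]

# Gene 1 states at the leaves, plus one labelling of the internal nodes.
STATE = {"s1": 1, "s4": 1, "s3": 1, "s2": 0, "s5": 0, "x": 1, "y": 1, "z": 0}


def changes(edges, state):
    """Number of branches whose two ends carry different states."""
    return sum(state[a] != state[b] for a, b in edges)


def root_on(edges, state, edge):
    """Place a root in the middle of `edge`; the root copies the state of one end."""
    a, b = edge
    new_edges = [e for e in edges if e != edge] + [(a, "root"), ("root", b)]
    new_state = dict(state, root=state[a])
    return new_edges, new_state


print("unrooted:", changes(EDGES, STATE))
for edge in EDGES:
    print("root on", edge, "->", changes(*root_on(EDGES, STATE, edge)))
```

Splitting a branch in two never adds a change: one half has equal states at both ends, and the other half carries the change exactly when the original branch did.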
How can we root the tree?
We want rooted trees!
Gorilla gorilla (gorilla)
Homo sapiens (human)
Pan troglodytes (chimpanzee)
Gallus gallus (chicken)
Evaluate all 3 possible UNROOTED trees:
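[Figure: the three unrooted trees on the four species, pairing (Human, Chimp | Chicken, Gorilla), (Human, Gorilla | Chimp, Chicken) and (Human, Chicken | Chimp, Gorilla); one of them is marked as the MP tree]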
Rooting based on a priori knowledge:
[Figure: the unrooted MP tree on Human, Chimp, Chicken, Gorilla, and the same tree after rooting]
Ingroup / Outgroup:
[Figure: rooted tree on Human, Chimp, Chicken, Gorilla]
INGROUP: Human, Chimp, Gorilla
OUTGROUP: Chicken
Monophyletic groups
[Figure: rooted tree on Human, Chimp, Chicken, Gorilla]
Gorilla + Human + Chimp form a monophyletic group.
How to efficiently compute the MP score of a tree
The Fitch algorithm (1971):
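[Figure: tree with leaves Human = A, Chimp = G, Chicken = C, Gorilla = C, Duck = A]
Sets assigned to the internal nodes: {A,G}, {A,C,G}, {A,C}, and {A,C} at the root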
Post-order tree scan. At each internal node, if the intersection of the child nodes' sets is empty, take their union; otherwise, take their intersection.
Number of changes
Total number of changes = number of union operators.
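A minimal Python sketch of the post-order pass follows. The topology is an assumption inferred from the node sets listed on the slide ({A,G}, {A,C,G}, {A,C}, {A,C}); it is a toy tree for the algorithm, not a claim about the real species phylogeny. Trees are written as nested tuples and the function name `fitch` is illustrative.

```python
def fitch(tree, leaf_state):
    """Post-order Fitch pass.

    Returns (set of candidate states at this node, number of union operations).
    """
    if isinstance(tree, str):                       # leaf: its observed state
        return {leaf_state[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_state)
    right_set, right_cost = fitch(right, leaf_state)
    common = left_set & right_set
    if common:                                      # non-empty intersection: keep it
        return common, left_cost + right_cost
    # empty intersection: take the union and count one change
    return left_set | right_set, left_cost + right_cost + 1


# ASSUMED topology, chosen to reproduce the node sets shown on the slide.
TREE = ((("Human", "Chimp"), "Chicken"), ("Gorilla", "Duck"))
LEAVES = {"Human": "A", "Chimp": "G", "Chicken": "C", "Gorilla": "C", "Duck": "A"}

root_set, n_changes = fitch(TREE, LEAVES)
print(sorted(root_set), n_changes)
```

On this tree three of the four internal nodes need a union, so the position requires three changes.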
Parsimony has many shortcomings. To name a few:
(1) All changes are counted the same, which is not true for biological systems (Leu->Ile is much more likely than Leu->His).
(2) Cannot take biological context into account (secondary structures, dependencies among sites, evolutionary distances between the analyzed organisms, etc).
(3) Statistical basis questionable.
Alternative:
MAXIMUM-LIKELIHOOD METHOD
Maximum likelihood uses a probabilistic model of evolution
Each amino acid has a certain probability to change and this probability depends on the evolutionary distance.
Evolutionary distances are inferred from the entire set of sequences.
Evolutionary distances
Positions in an alignment can be conserved for two reasons: either because of functional constraints, or because only a short evolutionary time has elapsed since the organisms diverged.
5 replacements in 10 positions between 2 chimps is considered very variable. 5 replacements in 10 positions between human and cucumber is not considered too variable…
Maximum likelihood takes this information into account.
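To make the dependence on distance concrete, here is a small Python sketch. It assumes the simplest possible model, in which all 20 amino acids are equally likely to replace one another; this model and the two distances used (0.01 and 1.0 expected replacements per site) are illustrative assumptions, not the model or values behind the slides.

```python
import math

N_STATES = 20   # amino acids


def p_change(t):
    """Probability that a site shows a different amino acid after distance t
    (t = expected replacements per site) under an equal-rates model (assumption)."""
    return (1 - 1 / N_STATES) * (1 - math.exp(-N_STATES * t / (N_STATES - 1)))


def prob_k_differences(k, n_sites, t):
    """Binomial probability of seeing exactly k differing sites out of n_sites."""
    p = p_change(t)
    return math.comb(n_sites, k) * p**k * (1 - p) ** (n_sites - k)


close, distant = 0.01, 1.0      # hypothetical distances: "two chimps" vs "human and cucumber"
print(prob_k_differences(5, 10, close))     # 5 of 10 sites differ, close relatives
print(prob_k_differences(5, 10, distant))   # 5 of 10 sites differ, distant relatives
```

The same observation (5 differences in 10 positions) is extremely improbable at a short distance and unremarkable at a long one, which is the information a likelihood model uses.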
Maximum Parsimony                      Maximum Likelihood
All changes are considered the same    Different probabilities to different types of substitutions
Statistically questionable             Statistically robust
Ignores biological context             Accounts for biological context
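The likelihood computations
$$
\begin{aligned}
P(\text{Data}) ={} & \Big[\sum_{X}\sum_{Y}\sum_{Z} P(X)\,P_{X \to K}(t_1)\,P_{X \to Y}(t_2)\,P_{Y \to C}(t_3)\,P_{Y \to Z}(t_4)\,P_{Z \to M}(t_5)\,P_{Z \to A}(t_6)\Big] \\
\times{} & \Big[\sum_{X}\sum_{Y}\sum_{Z} P(X)\,P_{X \to L}(t_1)\,P_{X \to Y}(t_2)\,P_{Y \to C}(t_3)\,P_{Y \to Z}(t_4)\,P_{Z \to T}(t_5)\,P_{Z \to A}(t_6)\Big] \\
\times{} & \Big[\sum_{X}\sum_{Y}\sum_{Z} P(X)\,P_{X \to G}(t_1)\,P_{X \to Y}(t_2)\,P_{Y \to C}(t_3)\,P_{Y \to Z}(t_4)\,P_{Z \to E}(t_5)\,P_{Z \to F}(t_6)\Big]
\end{aligned}
$$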
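[Figure: tree for one alignment column. Root X has children K (branch t1) and Y (branch t2); Y has children C (branch t3) and Z (branch t4); Z has children M (branch t5) and A (branch t6).]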
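$$
\text{Rate} = \arg\max_{r_i} \sum_{X}\sum_{Y}\sum_{Z} P(X)\,P_{X \to K}(r_i t_1)\,P_{X \to Y}(r_i t_2)\,P_{Y \to C}(r_i t_3)\,P_{Y \to Z}(r_i t_4)\,P_{Z \to M}(r_i t_5)\,P_{Z \to A}(r_i t_6)
$$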
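The two computations can be sketched in Python as follows. Everything numeric here is an assumption made for illustration: the equal-rates substitution model in `p`, the uniform root distribution P(X) = 1/20, the branch lengths in `BRANCHES`, and the grid of candidate rates. The tree shape (X above K and Y, Y above C and Z, Z above the last two leaves) follows the subscripts in the formulas.

```python
import math
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"                    # the 20 amino acids
BRANCHES = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1)         # t1..t6, hypothetical values
RATE_GRID = (0.1, 0.5, 1.0, 2.0, 5.0, 10.0)       # hypothetical candidate rates


def p(a, b, t):
    """P_{a->b}(t) under an equal-rates 20-state model (assumption)."""
    decay = math.exp(-20.0 * t / 19.0)
    return 1 / 20 + (19 / 20) * decay if a == b else 1 / 20 - (1 / 20) * decay


def site_likelihood(column, branches, rate=1.0):
    """Sum over the ancestral states X, Y, Z for one alignment column.

    column = (leaf on t1, leaf on t3, leaf on t5, leaf on t6).
    """
    leaf1, leaf3, leaf5, leaf6 = column
    t1, t2, t3, t4, t5, t6 = (rate * t for t in branches)
    total = 0.0
    for X, Y, Z in product(AMINO, repeat=3):
        total += ((1 / 20) * p(X, leaf1, t1) * p(X, Y, t2) * p(Y, leaf3, t3)
                  * p(Y, Z, t4) * p(Z, leaf5, t5) * p(Z, leaf6, t6))
    return total


def best_rate(column):
    """Grid-search version of the arg max over the site rate."""
    return max(RATE_GRID, key=lambda r: site_likelihood(column, BRANCHES, r))


# The three columns that appear in the formula for P(Data).
COLUMNS = [("K", "C", "M", "A"), ("L", "C", "T", "A"), ("G", "C", "E", "F")]
data_likelihood = math.prod(site_likelihood(col, BRANCHES) for col in COLUMNS)

print(data_likelihood)
print(best_rate(("C", "C", "C", "C")), best_rate(("K", "C", "M", "A")))
```

A fully conserved column is best explained by the lowest rate on the grid and a fully variable column by a higher one, which is how per-site rates serve as a conservation measure.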
With likelihood models we can:
1. Infer the most likely phylogenetic tree
2. Compute conservation for each site
Maximum likelihood tree reconstruction
This is computationally very hard, but efficient algorithms that find approximate solutions have been developed.
Tree reconstruction using distance-based methods
Two steps:
1. Compute a distance D(i,j) between any two sequences i and j.
2. Find the tree that agrees most with the distance table.
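Step 1 can be sketched in Python as below. As an illustration only, the "sequences" are the presence/absence rows of the gene table from the parsimony example, and D(i,j) is the number of positions at which two rows differ (Hamming distance). Real analyses would use aligned sequences and a model-based distance.

```python
from itertools import combinations

# Rows of the presence/absence table (g1..g6), reused as toy sequences.
ROWS = {
    "s1": "100110",
    "s2": "001000",
    "s3": "110000",
    "s4": "110111",
    "s5": "001110",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Distance table: one entry per unordered pair of sequences.
D = {frozenset((i, j)): hamming(ROWS[i], ROWS[j]) for i, j in combinations(ROWS, 2)}

for pair, dist in sorted(D.items(), key=lambda item: sorted(item[0])):
    print(sorted(pair), dist)
```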
Neighbor-joining is based on Star decomposition
[Figure: star decomposition of the taxa A, B, C, D, E. Red marks the best pair to group together at each step: first (C,B), then ((C,B),E).]
In each step we cluster a pair so that the sum of branches is minimal
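A compact Python sketch of neighbor-joining is given below. It uses the standard criterion Q(i,j) = (n-2)·d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other nodes; picking the pair with the smallest Q corresponds to picking the pair that minimizes the total branch length. The distance values in `DIST` are hypothetical, chosen only so that B and C are joined first, as in the figure.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """names: list of taxa; dist: dict frozenset({a, b}) -> distance.

    Returns (joins, remaining), where joins is a list of
    (node_i, node_j, branch_to_i, branch_to_j) in the order they were made
    and remaining holds the last two nodes, which are connected directly.
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # r[x]: sum of distances from x to every other current node
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}
        # pair with the smallest Q(i, j) = (n - 2) * d(i, j) - r_i - r_j
        i, j = min(combinations(nodes, 2),
                   key=lambda pr: (n - 2) * d[frozenset(pr)] - r[pr[0]] - r[pr[1]])
        dij = d[frozenset((i, j))]
        branch_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        branch_j = dij - branch_i
        new = f"({i},{j})"
        joins.append((i, j, branch_i, branch_j))
        # distance from the new internal node to every remaining node
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return joins, nodes


# Hypothetical distance table for the taxa A..E.
DIST = {frozenset(pair): value for pair, value in {
    ("B", "C"): 5, ("B", "E"): 9, ("B", "A"): 9, ("B", "D"): 8,
    ("C", "E"): 10, ("C", "A"): 10, ("C", "D"): 9,
    ("E", "A"): 8, ("E", "D"): 7, ("A", "D"): 3,
}.items()}

joins, remaining = neighbor_joining(["A", "B", "C", "D", "E"], DIST)
for step in joins:
    print(step)
print(remaining)
```

When two pairs tie for the smallest Q, this sketch takes the first one it encounters, so the order of later joins can differ from the figure while describing the same unrooted tree.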
A few words on Human Immunodeficiency Virus (HIV)
The virus = HIV
The disease/syndrome = Acquired Immunodeficiency Syndrome (AIDS)
First recognized clinically in 1981.
By 1992, it had become the major cause of death in individuals of 25-44 years of age in the U.S.
HIV
By Dec 2002, 20 million people had died of AIDS.
Infected in 2002: 5 million.
Number of currently infected: ~42 million.
1 out of every 100 adults aged 15-49 in the world population.
HIV
HIV is the leading cause of death in sub-Saharan Africa. In some parts of this region 25-30% of the population is infected.
1 out of 3 children in these areas has lost at least one parent.
Sub-Saharan Africa refers to the territories south of the Sahara. In the past the term ‘Black Africa’ was also used for the same region; today it is obsolete because it is considered politically incorrect.
Tropical Africa is sometimes used as an alternative label for the same region; however, it excludes South Africa, which lies outside the tropics.
HIV is a lentivirus
Species = HIV
Genus = Lentivirus
Family = Retroviridae
Lentiviruses have a long incubation time, and are thus called “slow viruses”.
HIV-1 and HIV-2
In 1986, a distinct type of HIV prevalent in certain regions of West Africa was discovered and was termed HIV type 2.
Individuals infected with type 2 also had AIDS, but had a longer incubation time and lower morbidity.
Morbidity vs. Mortality
• Morbidity: the prevalence of a disease (the morbidity rate)
The probability that a randomly selected person out of the entire population is ill at time t:
Morbidity(t) = (# cases in the population at time t) / (population size)
Morbidity vs. Mortality
• Mortality: deaths from a disease or in general
• Mortality rate = Death rate
Mortality rate = (# deaths in the population during time interval t) / (population size)
Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes
Nature, Vol. 397, pages 436-441, 1999.
Five lines of evidence have been used to substantiate zoonotic transmission of primate lentivirus:
1. Similarities in viral genome organization;
2. Phylogenetic relatedness;
3. Geographic coincidence;
4. Plausible routes of transmission;
5. Prevalence in the natural host.
For HIV-2, a virus (SIVsm) that is genomically indistinguishable and phylogenetically closely related was found in substantial numbers of wild-living sooty mangabeys whose natural habitat coincides with the epicenter of the HIV-2 epidemic.
Mangabey: a long-tailed monkey of the genus Cercocebus, found in the forest regions of Africa.
Close contact between sooty mangabeys and humans is common because these monkeys are hunted for food and kept as pets.
No fewer than six independent transmissions of SIVsm to humans have been proposed.
The origin of HIV-1 is much less certain.
HIV-1 is most similar in sequence and genomic organization to viruses found in chimpanzees (SIVcpz).
BUT several observations cast doubt on the theory that chimpanzees are the natural host and reservoir for HIV-1:
1. There is a wide spectrum of diversity between HIV-1 and SIVcpz.
2. An apparent low prevalence of SIVcpz infection in wild-living animals.
3. The presence of chimpanzees in geographic regions of Africa where AIDS was not initially recognized.
Rather, it has been suggested that another, yet unidentified, primate species could be the natural host for SIVcpz and HIV-1.
“We recently identified a fourth chimpanzee with natural SIVcpz infection…”
This animal (Marilyn) was wild-caught in Africa (country of origin unknown), exported to the United States as an infant, and used as a breeding female in a primate facility until her death at age 26.
Marilyn
During a serosurvey in 1985, Marilyn was the only chimpanzee of 98 tested who had antibodies strongly reactive against HIV-1 by enzyme-linked immunosorbent assay (ELISA) and western immunoblot.
HOW was the SIV found
Maybe Marilyn was infected with HIV during her stay in the U.S.?
“She has never been used in AIDS research and had not received human blood products after 1969. She died in 1985 after giving birth to still-born twins.”
Endometritis: inflammation of the lining of the uterus
Sepsis: blood poisoning
“An autopsy revealed endometritis, retained placental elements and sepsis as the final cause of death. Depletion of lymphoid tissues was not noted.”
This is stated to convince the reader that she did not have AIDS…
“PCR was used to amplify HIV- or SIV-related DNA sequences directly from uncultured (frozen) spleen and lymph-node tissue obtained at the autopsy in order to characterize the infection responsible for Marilyn’s HIV-1 seropositivity.”
Amplification and sequence analysis of subgenomic gag (508 base pairs (bp)) and pol (766 bp) fragments revealed the presence of a virus related to, but distinct from, known SIVcpz and HIV-1 strains.
PCR was used to amplify and sequence four overlapping subgenomic fragments that together comprised a complete proviral genome.
The genome was termed SIVcpzUS.
Provirus
The "provirus" is the form of the virus which is capable of being integrated into the host genome.
In the case of HIV it means the DNA "copy" of the HIV genome (HIV normally carries its genes around in RNA form).
Provirus
As far as the host cell's machinery is concerned, this extra DNA is no different from the cell's own DNA.
Only three other SIVcpz strains have been reported:
Two from animals wild-caught in Gabon (SIVcpzGAB1 and SIVcpzGAB2)
One from a chimpanzee exported to Belgium from Zaire (SIVcpzANT).
SIVcpzGAB1 and SIVcpzANT have been sequenced completely, but only 280 bp of the pol sequence are available for SIVcpzGAB2.
To determine the evolutionary relationships of SIVcpzUS to these and other HIV and SIV sequences:
1. Sequences from the HIV sequence database (http://hiv-web.lanl.gov/HTML/compendium.html) were downloaded.
2. Neighbour-joining was used to construct the tree, based on the full-length Pol sequences.
3. Maximum likelihood was also used and “yielded very similar topologies”.
The neighbour-joining method was applied to protein-sequence distances calculated by the method of Kimura.
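For orientation, here is a minimal Python sketch of a Kimura-style protein distance. It uses the empirical correction d = -ln(1 - p - 0.2·p²), where p is the fraction of differing positions; this is the form commonly attributed to Kimura and used in the CLUSTAL programs, and whether the paper's computation matches it in every detail is an assumption. The two sequences are hypothetical.

```python
import math


def kimura_protein_distance(seq_a, seq_b):
    """Kimura-style corrected distance between two aligned protein sequences.

    Positions with a gap ("-") in either sequence are skipped.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)      # observed fraction of differences
    inside = 1 - p - 0.2 * p * p
    if inside <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inside)


# Hypothetical aligned fragments, 10 positions, 1 difference (p = 0.1).
print(kimura_protein_distance("MKTAYIAKQR", "MKTAYIAKQK"))
```

The corrected distance is slightly larger than p because some positions are expected to have changed more than once.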
Clade support values were computed with 1,000 bootstrap replicates.
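The bootstrap idea can be sketched in Python as follows: alignment columns are resampled with replacement, a tree is rebuilt from each replicate, and the support of a clade is the fraction of replicates in which it appears. The alignment below is hypothetical, and `closest_pair` is a deliberately crude stand-in for a real tree-building step (it reports only the pair of sequences with the fewest differences).

```python
import random
from collections import Counter
from itertools import combinations


def bootstrap_columns(alignment, rng):
    """Resample the alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def clade_support(alignment, clades_of, n_replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each clade is recovered.

    clades_of: function mapping an alignment to an iterable of clades (frozensets).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(clades_of(bootstrap_columns(alignment, rng)))
    return {clade: hits / n_replicates for clade, hits in counts.items()}


def closest_pair(alignment):
    """Toy stand-in for tree building: the single pair with the fewest differences."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    a, b = min(combinations(sorted(alignment), 2), key=lambda pr: hamming(*pr))
    return [frozenset((a, b))]


# Hypothetical alignment of four short sequences.
ALIGNMENT = {
    "seq1": "MKTAYIAKQR",
    "seq2": "MKTAYIAKQK",
    "seq3": "MRTGYLAKQK",
    "seq4": "MRSGYLARQK",
}

support = clade_support(ALIGNMENT, closest_pair, n_replicates=1000)
for clade, value in support.items():
    print(sorted(clade), value)
```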
NJ computations were performed using the CLUSTAL_X program.
These analyses identified SIVcpzUS unambiguously as a new member of the HIV-1/SIVcpz group of viruses.