1
Molecular Phylogenetics, Computing Evolution and Artificial Life
Stuart M. Brown, New York University School of Medicine
With adaptations by H. Geller (GMU)
2
Topics
• Life’s Levels of Organization
• Emergent Properties
• Molecular Evolution
• Calculating Distances
• Clustering Algorithms
• Cladistic Methods
• Computer Software
3
Recall Properties of Life
Living organisms:
– are composed of cells
– are complex and ordered
– respond to their environment
– can grow and reproduce
– obtain and use energy
– maintain internal balance
– allow for evolutionary adaptation
4
Levels of Organization
Cellular Organization: cells, organelles, molecules, atoms
The cell is the basic unit of life.
5
Levels of Organization
Organismal Level: organism, organ systems, organs, tissues
6
Levels of Organization
Population Level: ecosystem, community, species, population
7
Levels of Organization
Each level of organization builds on the level below it but often demonstrates new features.
Emergent properties: new properties present at one level that are not seen in the previous level.
8
Evolution
• The theory of evolution is the foundation upon which all of modern biology is built.
• From anatomy to behavior to genomics, the scientific method requires an appreciation of changes in organisms over time.
• It is impossible to evaluate relationships among gene sequences without taking into consideration the way these sequences have been modified over time.
9
“Nothing in biology makes sense except in the light of evolution.”
– Theodosius Dobzhansky, 1973
10
Relationships
Similarity searches and multiple alignments of sequences naturally lead to the question:
“How are these sequences related?”
and more generally:
“How are the organisms from which these sequences come related?”
11
The purpose of a phylogenetic tree is to illustrate how a group of objects (usually genes or organisms) are related to one another.
12
Taxonomy
• The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology.
• Taxonomy is the art of classifying things into groups, a quintessential human behavior, established as a mainstream scientific field by Carolus Linnaeus (1707-1778).
13
14
iClicker Question
• When a plant or animal dies, the remains are usually lost.
– A True
– B False
15
iClicker Question
• It is not possible to document the transition from one species to another with the fossil record.
– A True
– B False
16
iClicker Question
• The fossil record is very complete.
– A True
– B False
17
iClicker Question
• What fraction of the species of early life forms is estimated to be in the fossil record?
– A 1 out of every 10
– B 1 out of every 1000
– C 1 out of every 10,000
– D 1 out of every 100,000
18
iClicker Question
• Most species that have lived on Earth have died out and are now extinct.
– A True
– B False
19
iClicker Question
• Vestigial organs are:
– A Internal features that serve no useful function
– B Organs attached to the vestigial bone
– C Internal organs with an evolutionary link to the gills of fish
– D A musical instrument produced in Vestig, Italy
20
Charles Darwin
Served as naturalist on a mapping expedition around coastal South America.
Used many observations to develop his ideas
Proposed that evolution occurs by natural selection
21
Voyage of the Beagle
22
Charles Darwin
Evolution: modification of a species over generations (“descent with modification”)
Natural Selection: individuals with superior physical or behavioral characteristics are more likely to survive and reproduce than those without such characteristics
23
Darwin’s Evidence
Similarity of related species
- Darwin noticed variations in related species living in different locations
24
Darwin’s Evidence
Population growth vs. availability of resources
- population growth is geometric
- increase in food supply is arithmetic
25
Darwin’s Evidence
Population growth vs. availability of resources
- Darwin realized that not all members of a population survive and reproduce.
- Darwin based these ideas on the writings of Thomas Malthus.
26
Post-Darwin Evolution Evidence
Fossil record
- New fossils are found all the time
- Earth is older than previously believed
Mechanisms of heredity
- Early criticisms of Darwin’s ideas were resolved by Mendel’s theories of genetic inheritance.
27
Post-Darwin Evolution Evidence
Comparative anatomy
- Homologous structures have the same evolutionary origin, but different structure and function.
- Analogous structures have similar structure and function, but different evolutionary origin.
28
Homologous Structures
29
Post-Darwin Evolution Evidence
Molecular Evidence
- Our increased understanding of DNA and protein structures has led to the development of more accurate phylogenetic trees.
30
Time’s Story of Life
• First cell
– Natural selection
  • mutations
• Mutations
– Most not beneficial
• Environment
– Impacts evolution
• Eukaryotes
• Colonies
• Hard Shell
– Cambrian explosion
31
Geological Time
32
Mass Extinctions and the Rate of Evolution
• Rate of extinction
– 10%-20% extinct in 5-6 million years
• Mass extinctions
– 30%-90% extinct
• Mechanisms
– asteroid
• Evolution
– Gradualism
– Punctuated equilibrium
33
The Evolution of Human Beings
34
iClicker Question
• Approximately how many “major” mass extinctions do biogeologists recognize since the Cambrian period?
– A 5
– B 50
– C 5000
35
iClicker Question
• A structure, process, or behavior that helps an organism survive and pass on its genes is called
– A an adaptation
– B evolution
– C survival of the fittest
36
iClicker Question
• The concept of natural selection depends on which fact(s)?
– A Life evolved from simple cells and the biggest ones were most likely to survive.
– B Better camouflaged animals are less likely to be eaten and they are more likely to produce offspring.
– C Every population contains some genetic diversity and many more individuals are born than can possibly survive.
– D A and B
– E B and C
37
iClicker Question
• Human beings and the great apes had a common ancestor about:
– A 7 to 8 thousand years ago
– B 1 to 2 million years ago
– C 7 to 8 million years ago
– D 1 to 2 billion years ago
38
Phylogenetics
• Evolutionary theory states that groups of similar organisms are descended from a common ancestor.
• Phylogenetic systematics (cladistics) is a method of taxonomic classification that groups organisms according to their evolutionary history.
• It was developed by Willi Hennig, a German entomologist, in 1950.
39
Cladistics and Phenetics
• Cladistic approach: Trees are drawn based on the conserved characters
• Phenetic approach: Trees are based on some measure of distance between the leaves
• Molecular phylogenies are inferred from molecular (usually sequence) data
– either cladistic (e.g. gene order) or phenetic
40
Cladistic Methods
• Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
• Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.
• A phylogram is a tree with branches that are proportional to evolutionary distances.
41
42
Algorithm classes used to infer phylogeny from sequence
• Distance methods
• Parsimony
• Likelihood
• Probabilistic methods
43
Molecular Evolution
• Phylogenetics often makes use of numerical data (numerical taxonomy), which can be scores for various “character states” such as the size of a visible structure, or can be DNA sequences.
• Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states.
• In an alignment of DNA sequences, each position is a separate character, with four possible character states, the four nucleotides.
44
DNA is a good tool for taxonomy
DNA sequences have many advantages over classical types of taxonomic characters:
– Character states can be scored unambiguously
– Large numbers of characters can be scored for each individual
– Information on both the extent and the nature of divergence between sequences is available (nucleotide substitutions, insertion/deletions, or genome rearrangements)
45
A  aat tcg ctt cta gga atc tgc cta atc ctg
B  ... ..a ..g ..a .t. ... ... t.. ... ..a
C  ... ..a ..c ..c ... ..t ... ... ... t.a
D  ... ..a ..a ..g ..g ..t ... t.t ..t t..
Each nucleotide difference is a character
46
Sequences Reflect Relationships
• After working with sequences for a while, one develops an intuitive understanding that for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences. These differences can be quantified.
• Given a set of gene sequences, it should be possible to reconstruct the evolutionary relationships among genes and among organisms.
47
48
What Sequences to Study?
• Different sequences accumulate changes at different rates: choose a level of variation that is appropriate to the group of organisms being studied.
– Proteins (or protein-coding DNAs) are constrained by natural selection, so they are better for very distant relationships
– Some sequences are highly variable (rRNA spacer regions, immunoglobulin genes), while others are highly conserved (actin, rRNA coding regions)
– Different regions within a single gene can evolve at different rates (conserved vs. variable domains)
49
Orthologs vs. Paralogs
• When comparing gene sequences, it is important to distinguish between identical vs. merely similar genes in different organisms.
• Orthologs are homologous genes in different species that diverged through speciation; they usually have analogous functions.
• Paralogs are similar genes that are the result of a gene duplication.
– A phylogeny that includes both orthologs and paralogs is likely to be incorrect.
– Sometimes phylogenetic analysis is the best way to determine if a new gene is an ortholog or paralog to other known genes.
50
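[Figure: an ancestral gene (globin) undergoes a duplication into genes A and B (hemoglobin and myoglobin); a later speciation (mouse, human) leaves each species with its own copies, labelled A1, B1 and A2, B2.]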
51
Disclaimers
Before describing any theoretical or practical aspects of phylogenetics, it is necessary to give some disclaimers. This area of computational biology is an intellectual minefield!
Neither the theory nor the practical applications of any algorithms are universally accepted throughout the scientific community.
The application of different software packages to a data set is very likely to give different answers; minor changes to a data set are also likely to profoundly change the result.
52
53
A modern revision of the seals and sea lions
54
Genes vs. Species
• Relationships calculated from sequence data represent the relationships between genes; these are not necessarily the same as the relationships between species.
• Your sequence data may not have the same phylogenetic history as the species from which they were isolated.
• Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA).
55
Cladistic
vs. Phenetic
Within the field of taxonomy there are two different methods and philosophies of building phylogenetic trees: cladistic and phenetic
– Phenetic methods construct trees (phenograms) by considering the current states of characters without regard to the evolutionary history that brought the species to their current phenotypes.
• Remember that phenotype is the outward, physical manifestation of the organism, and genotype is the internally coded, inheritable information.
– Cladistic methods rely on assumptions about ancestral relationships as well as on current data.
• A clade (or clad) is a branch of a phylogenetic tree.
56
Darwin was a Cladist
“The natural system based on descent with modification …
the characters that naturalists consider as showing true affinity are those which have been inherited from a common parent, and in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking.”
- Charles Darwin, Origin of Species, 1859
57
Phenetic Methods
• Computer algorithms based on the phenetic model rely on Distance Methods to build trees from sequence data.
• Phenetic methods count each base of sequence difference equally, so a single event that creates a large change in sequence (insertion/deletion or recombination) will move two sequences far apart on the final tree.
• Phenetic approaches generally lead to faster algorithms and they often have nicer statistical properties for molecular data.
• The phenetic approach is popular with molecular evolutionists because it relies heavily on objective character data (such as sequences) and it requires relatively few assumptions.
58
Distance Measurements
• It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals.
• The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances.
• In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree, thus creating a distance matrix.
• Distance methods give a single measurement of the amount of evolutionary change between two sequences since divergence from a common ancestor.
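The step from aligned sequences to a distance matrix can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `p_distance` and `distance_matrix` and the three toy sequences are hypothetical (they are not data from the slides), the sequences are assumed to be already aligned and gap-free, and the distance used is the uncorrected proportion of differing sites.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def distance_matrix(seqs: dict) -> dict:
    """Return the p-distance for every unordered pair of named sequences."""
    return {
        (m, n): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


# Hypothetical toy alignment (not taken from the slides).
toy = {"s1": "ATCGATCG", "s2": "ATCGATCC", "s3": "TTCGAACC"}
matrix = distance_matrix(toy)
for pair, d in matrix.items():
    print(pair, d)
```

Each pair of sequences contributes exactly one number, which is the "single measurement" that distance methods work from.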
59
Distance methods
Calculate the distance, CORRECTING FOR MULTIPLE HITS
The Distance Matrix (7 taxa)
          Rat     Mouse   Rabbit  Human   Opossum Chicken Frog
Rat       0.0000  0.0646  0.1434  0.1456  0.3213  0.3213  0.7018
Mouse     0.0646  0.0000  0.1716  0.1743  0.3253  0.3743  0.7673
Rabbit    0.1434  0.1716  0.0000  0.0649  0.3582  0.3385  0.7522
Human     0.1456  0.1743  0.0649  0.0000  0.3299  0.2915  0.7116
Opossum   0.3213  0.3253  0.3582  0.3299  0.0000  0.3279  0.6653
Chicken   0.3213  0.3743  0.3385  0.2915  0.3279  0.0000  0.5721
Frog      0.7018  0.7673  0.7522  0.7116  0.6653  0.5721  0.0000
60
Computing a Distance Matrix
Reading sequences...
  gtr1_human: 548 total, 548 read
  gtr2_human: 548 total, 548 read
  gtr3_human: 548 total, 548 read
  gtr4_human: 548 total, 548 read
  gtr5_human: 548 total, 548 read
Computing distances using Kimura method...
  1 x 2:  48.61    1 x 3:  45.50
  1 x 4:  65.74    1 x 5: 107.70
  2 x 3:  61.53    2 x 4:  74.57
  2 x 5: 113.82    3 x 4:  68.93
  3 x 5: 104.43    4 x 5: 110.86
Matrix 1
           1        2        3        4        5
| 1 |     0.00    48.61    45.50    65.74   107.70
| 2 |              0.00    61.53    74.57   113.82
| 3 |                       0.00    68.93   104.43
| 4 |                                0.00   110.86
| 5 |                                         0.00
61
DNA Distances
• Distances between pairs of DNA sequences are relatively simple to compute as the sum of all base pair differences between the two sequences.
– this type of algorithm can only work for pairs of sequences that are similar enough to be aligned
• Generally all base changes are considered equal
• Insertion/deletions are generally given a larger weight than replacements (gap penalties).
• It is also possible to correct for multiple substitutions at a single site, which is common in distant relationships and for rapidly evolving sites.
62
63
Correction for multiple hits
• Only differences can be observed directly, not distances
• All distance methods rely (crucially) on this correction
• A great many models are used for nucleotide sequences (e.g. JC, K2P, HKY, Rev, Maximum Likelihood)
• Amino acid sequences are far more complicated!
• Can take account of different rates of evolution at sites (e.g. gamma distribution)
• Accuracy falls off drastically for highly divergent sequences
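Two of the models named above have simple closed-form corrections, sketched here in Python. The function names are hypothetical. `jukes_cantor` applies the JC formula d = -(3/4) ln(1 - 4p/3) to an observed proportion of differences p, and `kimura_2p` applies the K2P formula d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)) to the observed proportions of transitions P and transversions Q. Both assume a gap-free pairwise alignment and equal rates across sites.

```python
import math


def jukes_cantor(p: float) -> float:
    """JC-corrected distance from the observed proportion of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(transitions: float, transversions: float) -> float:
    """K2P-corrected distance from observed transition/transversion proportions."""
    a = 1.0 - 2.0 * transitions - transversions
    b = 1.0 - 2.0 * transversions
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a * math.sqrt(b))


# The correction grows with divergence: observed differences under-count
# the real number of substitutions more and more severely.
for p in (0.05, 0.10, 0.30, 0.60):
    print(f"observed {p:.2f} -> corrected {jukes_cantor(p):.4f}")
```

As p approaches 0.75 the logarithm diverges, which is one way to see why accuracy falls off for highly divergent sequences.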
64
Amino Acid Distances
• Distances between amino acid sequences are a bit more complicated to calculate.
• Some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be functionally devastating.
• From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence.
• In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, e.g. PAM and BLOSUM
65
The PAM 250 scoring matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.
66
Clustering Algorithms
Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.
– Start with a matrix of pairwise distances
– Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.
67
Minimum Evolution
• The total length of all branches in the tree should be a minimum
• It has been shown that the minimum evolution tree is expected to be the true tree, provided branch lengths are corrected for multiple hits
68
UPGMA
• The simplest of the distance methods is the UPGMA (Unweighted Pair Group Method using Arithmetic averages)
• The PHYLIP programs DNADIST and PROTDIST calculate absolute pairwise distances between a group of sequences. Then the GCG program GROWTREE uses UPGMA to build a tree.
• Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences which is then used to guide the multiple alignment algorithm.
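A minimal UPGMA sketch in Python, applied to the seven-taxon distance matrix shown on the earlier "Distance methods" slide. The function name `upgma` and the nested-tuple tree representation are choices made for this illustration only; this is not the GROWTREE or PILEUP implementation. At each step the two closest clusters are joined, and the distance from the new cluster to every other cluster is the size-weighted average of the two old distances.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping (a, b) label pairs to distances.
    Returns (tree, merges): a nested tuple and a list of (a, b, height).
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(sizes) > 1:
        # Join the pair of clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merges.append((a, b, d[frozenset((a, b))] / 2))
        na, nb = sizes.pop(a), sizes.pop(b)
        new = (a, b)
        # Average the distances, weighting by cluster size.
        for c in sizes:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[new] = na + nb
    (tree,) = sizes
    return tree, merges


names = ["Rat", "Mouse", "Rabbit", "Human", "Opossum", "Chicken", "Frog"]
rows = [
    [0.0000, 0.0646, 0.1434, 0.1456, 0.3213, 0.3213, 0.7018],
    [0.0646, 0.0000, 0.1716, 0.1743, 0.3253, 0.3743, 0.7673],
    [0.1434, 0.1716, 0.0000, 0.0649, 0.3582, 0.3385, 0.7522],
    [0.1456, 0.1743, 0.0649, 0.0000, 0.3299, 0.2915, 0.7116],
    [0.3213, 0.3253, 0.3582, 0.3299, 0.0000, 0.3279, 0.6653],
    [0.3213, 0.3743, 0.3385, 0.2915, 0.3279, 0.0000, 0.5721],
    [0.7018, 0.7673, 0.7522, 0.7116, 0.6653, 0.5721, 0.0000],
]
dist = {
    (names[i], names[j]): rows[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}
tree, merges = upgma(names, dist)
print(tree)
```

The first two joins are Rat with Mouse and Rabbit with Human, the two smallest entries in the matrix. The rest of the tree depends on the averaging rule, which implicitly assumes that all lineages evolve at the same rate.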
69
Neighbor Joining
• The Neighbor Joining method is the most popular way to build trees from distance measurements (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
– Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.
– The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
– Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))
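The rate adjustment can be illustrated with a single Neighbor Joining step. The sketch below uses the standard criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and joins the pair with the smallest Q. The function name `nj_pick_pair` and the five-taxon distances are hypothetical illustration values, and only the first join is computed, not a full tree.

```python
from itertools import combinations


def nj_pick_pair(names, d):
    """Choose the first pair to join under the Neighbor Joining criterion.

    d maps frozenset({a, b}) to the distance between a and b.
    Returns (pair, q_value, (branch_to_a, branch_to_b)).
    """
    n = len(names)
    # Total distance from each taxon to all the others.
    total = {a: sum(d[frozenset((a, b))] for b in names if b != a) for a in names}

    def q(pair):
        a, b = pair
        return (n - 2) * d[frozenset(pair)] - total[a] - total[b]

    a, b = min(combinations(names, 2), key=q)
    dab = d[frozenset((a, b))]
    # Branch lengths from a and b to the new internal node.
    len_a = dab / 2 + (total[a] - total[b]) / (2 * (n - 2))
    return (a, b), q((a, b)), (len_a, dab - len_a)


# Hypothetical distances for five taxa.
names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
d = {frozenset(k): v for k, v in raw.items()}
pair, q_value, branches = nj_pick_pair(names, d)
print(pair, q_value, branches)
```

In this example the smallest raw distance is between d and e, yet the criterion joins a and b first, because a and b are far from everything else. UPGMA would have joined d and e; the difference is the rate adjustment at work.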
70
Neighbour Joining
[Figure: tree diagrams with taxa labelled 1 to 8]
71
Cladistic Methods
• For character data about the physical traits of organisms (such as morphology of organs etc.) and for deeper levels of taxonomy, the cladistic approach is almost certainly superior.
• Cladistic methods are often difficult to implement with molecular data because all of the assumptions are generally not satisfied.
72
Cladistic methods
• Cladistic methods are based on the assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers).
• These methods work best if a specific tree, or at least an ancestral sequence, is already known so that comparisons can be made between a finite number of alternate trees rather than calculating all possible trees for a given set of sequences.
73
Parsimony
• Parsimony is the most popular method for reconstructing ancestral relationships.
– The name comes from “parsimonious” (stingy): prefer the explanation that needs the least number of changes
• Parsimony allows the use of all known evolutionary information in building a tree
– In contrast, distance methods compress all of the differences between pairs of sequences into a single number
74
Building Trees with Parsimony
• Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.
• The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.
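"All possible trees" grows very quickly with the number of sequences. For n taxa there are (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5) distinct unrooted bifurcating trees; the short Python sketch below (the function name is chosen for this illustration) evaluates that product.

```python
def count_unrooted_trees(n: int) -> int:
    """Number of distinct unrooted bifurcating trees for n labelled taxa."""
    if n < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Four sequences give 3 possible trees and ten give 2,027,025, which is why practical programs search heuristically rather than scoring every tree once the data set is more than a handful of sequences.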
75
Building Trees with Parsimony
• Check each topology
• Count the minimum number of changes required to explain the data
• Choose the tree with the smallest number of changes
• Usually performs well with closely related sequences, but often performs badly with very distantly related sequences
• With distantly related sequences homoplasy (similarity due to convergent evolution, but independent origins) becomes a major problem
76
Parsimony Example
• Consider four sequences: ATCG, TTCG, ATCC, and TCCG
• Imagine a tree that branches at the first position, grouping ATCG and ATCC on one branch, TTCG and TCCG on the other branch.
• Then each branch splits, for a total of 3 nodes on the tree (Tree #1)
77
Tree #1
Tree #2
Compare Tree #1 with one that first divides ATCC on its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree #2).
Trees #1 and #2 both have three nodes, but when all of the distances back to the root (# of nodes crossed) are summed, the total is equal to 8 for Tree #1 and 9 for Tree #2.
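For comparison, the usual parsimony score counts the minimum number of base changes a tree requires, which is a different quantity from the node-depth sum used on this slide. The sketch below applies Fitch's small-parsimony count to the same four sequences. It is an assumption of this illustration that trees are written as nested tuples whose leaves are the sequences themselves, and the third tree (`tree3`) is an extra topology added here for contrast, not one from the slides.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, changes) for one alignment column."""
    if isinstance(tree, str):  # a leaf is the sequence itself
        return {tree[site]}, 0
    left_states, left_cost = fitch_site(tree[0], site)
    right_states, right_cost = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:
        # The children agree on at least one state: no change needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is required at this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Sum the Fitch change count over all columns of the alignment."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


tree1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
tree2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))
tree3 = (("ATCG", "TTCG"), ("ATCC", "TCCG"))  # extra topology for contrast

for name, tree in (("Tree #1", tree1), ("Tree #2", tree2), ("Tree #3", tree3)):
    print(name, parsimony_score(tree, 4))
```

Under this count Tree #1 and Tree #2 both need 3 changes, because they differ only in where the root is placed, while the alternative grouping needs 4. Change-count parsimony therefore separates the groupings but cannot, by itself, choose between the two rootings.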
78
Maximum Likelihood
• Requires a model of evolution
• Each substitution has an associated likelihood given a branch of a certain length
• A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters
• The function is maximized (equivalently, the negative log-likelihood is minimized)
79
Maximum Likelihood
• The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.
• This method works best when it is used to test (or improve) an existing tree.
• Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.
80
Models can be made more parameter rich to increase their realism
• The most common additional parameters are:
– A correction to allow different substitution rates for each type of nucleotide change
– A correction for the proportion of sites which are unable to change
– A correction for variable site rates at those sites which can change
• The values of the additional parameters will be estimated in the process
81
Ancestral Sequences
• Maximum likelihood predicts ancestral sequences
– at branch points in the tree (nodes)
• can provide information about the timing of the acquisition of a novel trait or mutation
• PAML (Phylogenetic Analysis using Maximum Likelihood)
– Confidence intervals provided
– Selection can be inferred
82
Assumptions for Maximum Likelihood
• A model of the relative frequencies of DNA transitions (C<->T, A<->G) and transversions (C or T<->A or G) must be assumed.
• The assumptions for protein sequence changes are taken from the PAM matrix, and are quite likely to be violated in “real” data.
• Since each nucleotide site is assumed to evolve independently, the likelihood is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data.
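The smallest possible case, two sequences joined by one branch, shows how the per-site likelihoods combine. The Python sketch below assumes the Jukes-Cantor model, under which a site is unchanged with probability 1/4 + (3/4)e^(-4d/3) and has changed to one specific other base with probability 1/4 - (1/4)e^(-4d/3). The function names and the grid search are choices made for this illustration; real programs optimize over whole trees with better numerical methods.

```python
import math


def jc69_log_likelihood(d: float, n_sites: int, n_diff: int) -> float:
    """Log-likelihood of a pairwise alignment at branch length d under JC."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * e   # site shows one specific different base
    n_same = n_sites - n_diff
    # Sites are independent, so the per-site log-likelihoods simply add up
    # (the factor 0.25 is the equilibrium frequency of the first base).
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(n_sites: int, n_diff: int, step: float = 0.0005, max_d: float = 3.0) -> float:
    """Branch length on a grid that maximizes the log-likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


best = ml_distance(n_sites=100, n_diff=10)
closed_form = -0.75 * math.log(1.0 - 4.0 * 0.10 / 3.0)
print(best, closed_form)
```

For this one-branch case the maximum lands on the Jukes-Cantor corrected distance, so the grid search and the closed-form correction agree to within the grid spacing.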
83
The Molecular Clock
For a given protein the rate of sequence evolution is approximately constant across lineages
Zuckerkandl and Pauling (1965)
This would allow speciation and duplication events to be dated accurately based on molecular data
Local and approximate molecular clocks are more reasonable
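Under a strict clock, dating reduces to simple arithmetic: two lineages that split T years ago each accumulate r x T substitutions per site, so the distance between them is d = 2rT. The Python sketch below calibrates r from one pair with a known divergence time and applies it to another pair. All numbers and function names are hypothetical, chosen only to show the calculation.

```python
def calibrate_rate(distance: float, divergence_time: float) -> float:
    """Substitution rate per site per unit time, from a dated pair."""
    return distance / (2.0 * divergence_time)


def estimate_divergence_time(distance: float, rate: float) -> float:
    """Divergence time implied by a distance under a strict clock."""
    return distance / (2.0 * rate)


# Hypothetical calibration: a pair at distance 0.30 that split 150 My ago.
rate = calibrate_rate(0.30, 150.0)
# Hypothetical query: a pair at distance 0.08 in the same protein.
age = estimate_divergence_time(0.08, rate)
print(rate, age)
```

The estimate is only as good as the assumption that the calibration rate applies to the lineages being dated, which is why local and approximate clocks are the more defensible choice.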
84
Rooting the Tree
• In an unrooted tree the direction of evolution is unknown
• The root is the hypothesized ancestor of the sequences in the tree
• The root can either be placed on a branch or at a node
• You should start by viewing an unrooted tree
85
86
87
Rooting Using an Outgroup
• The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other
• It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition
• The root must be somewhere between the outgroup and the rest (either on the node or in a branch)
88
Are there Correct trees??
• Despite all of these caveats, it is actually quite simple to use computer programs to calculate phylogenetic trees for data sets.
• Provided the data are clean, outgroups are correctly specified, appropriate algorithms are chosen, no assumptions are violated, etc., can the true, correct tree be found and proven to be scientifically valid?
• Unfortunately, it is impossible to ever conclusively state what is the "true" tree for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data are gathered.
89
Is my tree correct? Bootstrap values
Bootstrapping is a statistical technique that uses random re-sampling of data to estimate sampling error for tree topologies
• Leave-one-out methods
– (leave out a row, not a species)
• Agreement among the resulting trees is summarized with a majority-rule consensus tree
• Each branch of the tree is labelled with the % of bootstrap trees where it occurred.
• 80% is good, less than 50% is bad
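The resampling idea can be sketched in Python. Alignment columns are drawn at random with replacement to make a pseudo-replicate of the same length, the analysis is repeated, and the fraction of replicates that recover a grouping becomes its support value. To keep the example short, a full tree search is replaced by a stand-in question, "which two sequences are closest?". That substitution, the function names, and the toy alignment are all assumptions of this illustration.

```python
import random
from itertools import combinations


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (same length as the input)."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Pair of names with the fewest differences (ties go to the first pair)."""
    def diffs(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return min(combinations(sorted(alignment), 2), key=diffs)


def bootstrap_support(alignment, pair, replicates=200, seed=0):
    """Fraction of pseudo-replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


# Hypothetical toy alignment.
toy = {"a": "ATCGATCGAT", "b": "ATCGATCGTT", "c": "TTCGAACCAT", "d": "TACGAACCTA"}
print(bootstrap_support(toy, ("a", "b")))
```

With a real tree-building method in place of `closest_pair`, the same loop yields the percentage written on each branch.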
90
Non-Synonymous Substitutions
• There is MORE information hidden in alignments
• For each DNA substitution, we can observe if it changes the corresponding amino acid
• due to the redundancy of the genetic code, a SYNONYMOUS (Ks) substitution does not change the AA
• a NON-SYNONYMOUS (Ka) substitution changes the AA at that codon
• [Need to correct the # of observed Ka and Ks for the possible number of each kind of changes that could occur in each codon]
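Classifying observed substitutions is a codon-table lookup, sketched below in Python with the standard genetic code. This is a deliberately simplified illustration: the function name is hypothetical, only codons that differ at exactly one base are counted, and the raw counts are not normalized by the number of possible synonymous and non-synonymous sites, so the result is not yet a Ka/Ks estimate.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_substitutions(seq1: str, seq2: str):
    """Return (synonymous, non_synonymous) counts for two aligned coding DNAs."""
    syn = nonsyn = 0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if sum(x != y for x, y in zip(c1, c2)) != 1:
            continue  # identical codons or multiple changes: skipped in this sketch
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# CTT -> CTC keeps leucine; GAA -> GTA changes glutamate to valine.
print(count_substitutions("CTTGAAATG", "CTCGTAATG"))
```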
91
Ka/Ks
• Neutral mutations will change all bases at an equal rate, so Ka/Ks = 1
• Conserved sequences will have Ka/Ks < 1 [this is true for the vast majority of protein coding sequences]
• Ka/Ks > 1 is a signature for selection (AA changes occur at a faster rate than expected by chance)
– discovery of a gene under positive selection by Ka/Ks > 1 is a very big deal
[The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Nekrutenko A, Makova KD, Li WH. Genome Res. 2002 Jan;12(1):198-202.]
92
Ka/Ks varies within a gene
93
Computer Software for Phylogenetics
Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.
– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data. Command line only, hard to use. (Several free web servers provide a functional user interface)
– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining. Very easy to use, but NJ may not always be the best method to handle your data.
94
Other useful software
• Mega - (free, Windows only) alignment, build trees, estimate rates of evolution
• Mesquite - (free, Mac & Win) advanced analysis of trees created by other programs
• Phylowin - (free, Mac & Win) builds trees from a distance matrix (NJ, parsimony, max likelihood)
• PAUP - (Commercial, Mac & Win)
– sophisticated, but fairly easy to use
– Includes NJ, Parsimony, and Max. Likelihood
– Also does bootstrapping
• Phylodendron - (web) redraw trees
95
Other Web Resources
• Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at:
http://evolution.genetics.washington.edu/phylip/software.html
• Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
http://www.science.uts.edu.au/sasb/WestonCrisp.html
• University of California, Berkeley Museum of Paleontology (UCMP)
http://www.ucmp.berkeley.edu/clad/clad4.html
96
Software Hazards
• There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)
• Moving sequences into different programs can be a major hassle due to incompatible file formats.
• Just because a program can perform a given computation on a set of data does not mean that that is the appropriate algorithm for that type of data.
97
Molecular Phylogeny Conclusions
Given the huge variety of methods for computing phylogenies, how can the biologist determine what is the best method for analyzing a given data set?
– Published papers that address phylogenetic issues generally make use of several different algorithms and data sets in order to support their conclusions.
– In some cases different methods of analysis can work synergistically
• Neighbor Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method
– Using several alternate methods can give an indication of the robustness of a given conclusion.
98
Recall What is Life?
• State of functional activity and continual change, before death (defined complementarily as end-of-life).
• Characterized by the capability to:
• Reproduce itself,
• Adapt to an environment in a quest for survival, and
• Take Actions independent of exterior agents.
99
Nature as a special case of Life
• The Biology of Nature has so far been the scientific study of life on Earth based on Carbon-chain chemistry.
• However, nothing restricts the study of properties of life to carbon-chain chemistry; it is merely the only form of life so far available for study.
• Further motivation to study life as a generic concept comes from the hypothesis that we are perhaps just one possible combination of atoms that makes life possible. We haven’t met other examples (Aliens).
100
…which brings us to Artificial-Life
• Lack of any available non-carbon based life-forms motivates us to create an artificial environment and a set of rules for life to evolve.
• Artificial Life, or ALife or AL, is the study of non-organic organisms, beyond the creations of nature, that possess the essential properties of life as we understand it, and whose environment is artificially created in an alternative medium, which very often is a logical device like the computer.
101
ALife as a Synthesis approach
• Rather than being an analytical study of “natural” life, A-Life is a Synthesis approach to studying any form of Life.
• We have:
– an artificially-created environment (usually) within computers,
– a fairly universal set of rules and properties of life, derived from the one example we have of life: Natural life.
102
So what is the motivation?
• A-Life could have been dubbed yet another approach to studying intelligent life, had it not been for the Emergent properties in life that motivate scientists to explore the possibility of artificially creating life and expecting the unexpected.
• Recall that an emergent property is created when something becomes more than the sum of its parts. For example, half a human is not capable of working without the other half, but together the halves are capable of very complex behavior (not a representative example).
103
So where does A-Life fit in?
• The A-Life concept helps to:
• Study existing natural life forms by trying to simulate the generic rules they follow, the environmental parameters like entropy/chaos, and the seed, i.e. the initial set of elements on which the rules of life apply under the given environmental condition, in order to understand evolution in nature.
• Create new life within the digital world by creating a new set of external parameters, seeds, and rules of evolution, and letting life find a way.
104
So is A-Life = AI ??
Artificial Life | Artificial Intelligence
Concept: Late 1980s | Concept: 1960s
Grounded in Biology, Physics, Chemistry, Mathematics. | Pursued primarily in Comp. Sci, Engineering & Psychology.
Studies Intelligence as part of Life itself | Studies Intelligent behavior in isolation
Bottom-Up approach - study synthesis | Top-Down approach - focus is on results
Views life-as-it-could-be | Views life-as-it-is
Both seem to approach similar problems, but…
105
A-Life : Emergence
• What you get when something is more than the sum of its parts.
• Human thoughts rely on nearly all cells that make up the brain - single cells are incapable of thought - thought is the emergent property of these cells coming together and interacting to give complex results - motivation behind CA, NN.
• Extreme example: Earth as one living thing, consisting of the whole of nature being in dynamic equilibrium, each part having a bearing on the other.
106
A-Life: Entropy
• Second Law of Thermodynamics: the entropy (loosely, disorder) of an isolated system never decreases. When two systems are joined together, the entropy of the combined system is at least the sum of the entropies of the individual systems.
• This roughly applies to all systems, including those that exchange information.
• Life is all about fighting against entropy: while other systems lose information to their surroundings, life not only keeps hold of its information but also increases it. It can do so because it is an open system that raises the entropy of its surroundings.
107
A-Life: Complexity
• Life is a complex system: a dynamic system that can keep changing and evolving over a long period of time without dying.
• If the amount of information exchange in a system is varied from low to high, it yields fixed, periodic, and chaotic systems, in that order. Somewhere in between, a system exhibits complex behavior.
• Accordingly, each unit in a system either freezes, pulsates, dies, or behaves in a complex manner.
Fixed: no change, no death
Periodic: change, no evolution, no death
Chaotic: change, evolution, death
Complex: change, evolution, no death
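The progression from fixed to periodic to chaotic behavior can be illustrated with a toy system. The sketch below uses the logistic map x -> r*x*(1-x), which is not mentioned in the slides and is chosen here only as an analogy: its parameter r stands in loosely for the "amount of information exchange". The function names, the tolerance, and the sample values of r are illustrative assumptions, and the classifier is a rough heuristic, not a formal test for chaos.

```python
def logistic_orbit(r, x0=0.2, burn_in=1000, keep=64):
    """Iterate x -> r*x*(1-x), discard the transient, return the next `keep` values."""
    x = x0
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(keep):
        x = r * x * (1.0 - x)
        orbit.append(x)
    return orbit


def classify(orbit, tol=1e-6, max_period=16):
    """Roughly label an orbit as 'fixed', 'periodic', or 'chaotic'."""
    # Fixed: the state no longer changes at all.
    if max(orbit) - min(orbit) < tol:
        return "fixed"
    # Periodic: the orbit repeats itself after p steps for some small p.
    for p in range(2, max_period + 1):
        if all(abs(orbit[i] - orbit[i + p]) < tol for i in range(len(orbit) - p)):
            return "periodic"
    # Otherwise no short repeating pattern was found.
    return "chaotic"


# Sample parameter values (assumed for illustration), from low to high.
for r in (2.8, 3.2, 3.5, 3.9):
    print(r, classify(logistic_orbit(r)))
```

This toy map only shows the first three regimes. The fourth row of the table (complex: change and evolution without death) is usually associated with the narrow transition zone between periodic and chaotic behavior, which this simple classifier does not attempt to detect.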
108
A-Life: Chaos Theory
• Chaos theory explains apparent randomness. Many apparently random events are not truly random: they are just the iteration of simple rules on existing states (and possibly previous states), generating complex behavior. They live on the edge of total chaos.
• Most natural processes are chaotic: the sea, the wind.
• Some man-made processes are chaotic: financial markets.
• Lack of knowledge of all the rules, inputs, and the seed prevents us from determining the exact state of such a system at a given point, but knowledge of some of the dominant rules and inputs makes it possible to predict the general behavior of the system.
• This lack of knowledge of all the parameters leads us to conclude that the system behaves randomly.
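A minimal sketch of why an imperfectly known seed defeats exact prediction. It iterates one simple deterministic rule, the logistic map x -> 4*x*(1-x), from two seeds that differ by one part in ten billion. The map, the seeds, and the step count are assumptions chosen for illustration; none of them come from the slides.

```python
def logistic_trajectory(x0, r=4.0, steps=60):
    """Return [x0, x1, ..., x_steps] for the rule x -> r*x*(1-x)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two seeds that agree to about ten decimal places.
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Gap between the two trajectories at every step.
gaps = [abs(p - q) for p, q in zip(a, b)]

print("gap after  0 steps:", gaps[0])
print("gap after 10 steps:", gaps[10])
print("largest gap in 60 steps:", max(gaps))
```

The rule is fully deterministic, yet the tiny uncertainty in the seed grows until the two runs bear no resemblance to each other, so the exact state cannot be predicted far ahead. The general behavior stays predictable: mathematically, every value of this map remains between 0 and 1.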
109
A-Life: Current research areas
• Mathematical, philosophical, and biological foundations; social and ethical implications of A-Life
• Cellular automata
• Neural networks
• Genetic algorithms
• Origin, self-organization, repair, and replication
• Evolutionary / adaptive dynamics
• Autonomous, adaptive, and evolving robots
• Software agents (good/evil)
• Emergent collective behaviors, swarms
• Synthetic/artificial chemistry, biology, and materials
• Applications: finance, economics, gaming, MEMS, etc.
110
A-Life: Foundations and Implications
• Research on foundations tries to answer questions about the motivation behind such a ground-breaking concept, using our existing knowledge of mathematics, chemistry, biology, the philosophy of life, and so on. The question is: “How, why, and where can the A-Life approach succeed (or fail)?”
• Research on implications tries to understand and explain how extending life into a generic concept affects our understanding of the very basics of natural life, shattering (or possibly leaving untouched) many a belief about God, creation, and destruction. The question here is: “How does A-Life fit (if at all) into the present-day social framework of morals and ethics, often laid out by religious texts?”
111
A-Life: Cellular Automata
• Inspired by the way natural biological cells behave and interact with their neighboring cells, following rules set out by their DNA.
• A cellular automaton (CA) is an N-dimensional array of ‘cells’ that interact with their neighboring cells according to a predetermined set of rules. Each interaction generates actions, which in turn may trigger a new series of reactions in the cell itself or in its neighbors.
• The best-known example is Conway’s Life, a 2-state, 2-D CA whose simple rules (see the rules listed below) are applied to all cells simultaneously to create successive generations of cells from an initial pattern.
• Different initial patterns generate different behavioral patterns: some die away (unstable), some blink (periodic), and the rest show complex behavior by continuing to live and evolve.
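The simultaneous update of all cells can be sketched in a few lines. This is a minimal illustration, assuming the standard rules of Conway's Life: a living cell survives with 2 or 3 living neighbors, and a dead cell with exactly 3 living neighbors comes alive. The set-of-coordinates representation, the function name `step`, and the sample patterns are choices made for this sketch.

```python
from collections import Counter


def step(live):
    """Advance one generation. `live` is a set of (row, col) living cells."""
    # Count, for every cell, how many living neighbors it has.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth with exactly 3 neighbors; survival with 2 or 3 neighbors.
    return {
        cell
        for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


pair = {(0, 0), (0, 1)}                             # dies away
block = {(0, 0), (0, 1), (1, 0), (1, 1)}            # never changes
blinker = {(1, 0), (1, 1), (1, 2)}                  # blinks with period 2
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}   # keeps living and moving

print(step(pair))                      # no living cells remain
print(step(block) == block)            # the block is stable
print(step(step(blinker)) == blinker)  # the blinker returns after 2 steps
```

Because the new generation is built from neighbor counts taken on the old one, every cell is effectively updated at the same time. The four sample patterns correspond to the behaviors named above: dying away, staying fixed, blinking, and continuing to live and move.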
Conway’s Life: Rules
A living cell with 0-1 neighbors dies of isolation
A living cell with 4+ neighbors dies from overcrowding
All other cells are unaffected