
1

Molecular Phylogenetics Computing Evolution and Artificial Life

presented by

Stuart M. Brown, New York University School of Medicine
With adaptations by H. Geller (GMU)

2

Topics

• Life’s Levels of Organization
• Emergent Properties
• Molecular Evolution
• Calculating Distances
• Clustering Algorithms
• Cladistic Methods
• Computer Software

3

Recall Properties of Life

Living organisms:
– are composed of cells
– are complex and ordered
– respond to their environment
– can grow and reproduce
– obtain and use energy
– maintain internal balance
– allow for evolutionary adaptation

4

Levels of Organization

Cellular Organization: cells, organelles, molecules, atoms

The cell is the basic unit of life.

5

Levels of Organization

Organismal Level: organism, organ systems, organs, tissues

6

Levels of Organization

Population Level: ecosystem, community, species, population

7

Levels of Organization

Each level of organization builds on the level below it but often demonstrates new features.

Emergent properties: new properties present at one level that are not seen in the previous level

8

Evolution

• The theory of evolution is the foundation upon which all of modern biology is built.
• From anatomy to behavior to genomics, the scientific method requires an appreciation of changes in organisms over time.
• It is impossible to evaluate relationships among gene sequences without taking into consideration the way these sequences have been modified over time.

9

“Nothing in biology makes sense except in the light of evolution.”
– Theodosius Dobzhansky, 1973

10

Relationships

Similarity searches and multiple alignments of sequences naturally lead to the question:

“How are these sequences related?”

and more generally:

“How are the organisms from which these sequences come related?”

11

The purpose of a phylogenetic tree is to illustrate how a group of objects (usually genes or organisms) are related to one another.

12

Taxonomy

• The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology.
• Taxonomy is the art of classifying things into groups (a quintessential human behavior), established as a mainstream scientific field by Carolus Linnaeus (1707-1778).

13

14

iClicker Question

• When a plant or animal dies, the remains are usually lost.
– A True
– B False

15

iClicker Question

• It is not possible to document the transition from one species to another with the fossil record.
– A True
– B False

16

iClicker Question

• The fossil record is very complete.
– A True
– B False

17

iClicker Question

• How many species of early life forms are estimated to be in the fossil record?
– A 1 out of every 10
– B 1 out of every 1000
– C 1 out of every 10,000
– D 1 out of every 100,000

18

iClicker Question

• Most species that have lived on Earth have died out and are now extinct.
– A True
– B False

19

iClicker Question

• Vestigial organs are:
– A Internal features that serve no useful function
– B Organs attached to the vestigial bone
– C Internal organs with an evolutionary link to the gills of fish
– D A musical instrument produced in Vestig, Italy

20

Charles Darwin

Served as naturalist on mapping expedition around coastal South America.

Used many observations to develop his ideas

Proposed that evolution occurs by natural selection

21

Voyage of the Beagle

22

Charles Darwin

Evolution: modification of a species over generations, “descent with modification”

Natural Selection: individuals with superior physical or behavioral characteristics are more likely to survive and reproduce than those without such characteristics

23

Darwin’s Evidence

Similarity of related species
- Darwin noticed variations in related species living in different locations

24

Darwin’s Evidence

Population growth vs. availability of resources
- population growth is geometric
- increase in food supply is arithmetic

25

Darwin’s Evidence

Population growth vs. availability of resources
- Darwin realized that not all members of a population survive and reproduce.
- Darwin based these ideas on the writings of Thomas Malthus.

26

Post-Darwin Evolution Evidence

Fossil record
- New fossils are found all the time
- Earth is older than previously believed

Mechanisms of heredity
- Early criticisms of Darwin’s ideas were resolved by Mendel’s theories of genetic inheritance.

27

Post-Darwin Evolution Evidence

Comparative anatomy
- Homologous structures have the same evolutionary origin, but different structure and function.
- Analogous structures have similar structure and function, but different evolutionary origin.

28

Homologous Structures

29

Post-Darwin Evolution Evidence

Molecular Evidence
- Our increased understanding of DNA and protein structures has led to the development of more accurate phylogenetic trees.

30

Time’s Story of Life

• First cell
– Natural selection
– mutations
• Mutations
– Most not beneficial
• Environment
– Impacts evolution
• Eukaryotes
• Colonies
• Hard Shell
– Cambrian explosion

31

Geological Time

32

Mass Extinctions and the Rate of Evolution

• Rate of extinction
– 10%-20% extinct in 5-6 million years
• Mass extinctions
– 30%-90% extinct
• Mechanisms
– asteroid
• Evolution
– Gradualism
– Punctuated equilibrium

33

The Evolution of Human Beings

34

iClicker Question

• Approximately how many “major” mass extinctions do biogeologists recognize since the Cambrian period?
– A 5
– B 50
– C 5000

35

iClicker Question

• A structure, process, or behavior that helps an organism survive and pass on its genes is called
– A an adaptation
– B evolution
– C survival of the fittest

36

iClicker Question

• The concept of natural selection depends on which fact(s)?
– A Life evolved from simple cells and the biggest ones were most likely to survive.
– B Better camouflaged animals are less likely to be eaten and they are more likely to produce offspring.
– C Every population contains some genetic diversity and many more individuals are born than can possibly survive.
– D A and B
– E B and C

37

iClicker Question

• Human beings and the great apes had a common ancestor about:
– A 7 to 8 thousand years ago
– B 1 to 2 million years ago
– C 7 to 8 million years ago
– D 1 to 2 billion years ago

38

Phylogenetics

• Evolutionary theory states that groups of similar organisms are descended from a common ancestor.
• Phylogenetic systematics (cladistics) is a method of taxonomic classification that groups organisms based on their evolutionary history.
• It was developed by Willi Hennig, a German entomologist, in 1950.

39

Cladistics and Phenetics

• Cladistic approach: Trees are drawn based on the conserved characters
• Phenetic approach: Trees are based on some measure of distance between the leaves
• Molecular phylogenies are inferred from molecular (usually sequence) data
– either cladistic (e.g. gene order) or phenetic

40

Cladistic Methods

• Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
• Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.
• A phylogram is a tree with branch lengths that are proportional to evolutionary distances.

41

42

Algorithm classes used to infer phylogeny from sequence

• Distance methods
• Parsimony
• Likelihood
• Probabilistic methods

43

Molecular Evolution

• Phylogenetics often makes use of numerical data (numerical taxonomy), which can be scores for various “character states”, such as the size of a visible structure, or DNA sequences.
• Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states.
• In an alignment of DNA sequences, each position is a separate character, with four possible character states, the four nucleotides.

44

DNA is a good tool for taxonomy

DNA sequences have many advantages over classical types of taxonomic characters:
– Character states can be scored unambiguously
– Large numbers of characters can be scored for each individual
– Information on both the extent and the nature of divergence between sequences is available (nucleotide substitutions, insertions/deletions, or genome rearrangements)

45

A aat tcg ctt cta gga atc tgc cta atc ctg
B ... ..a ..g ..a .t. ... ... t.. ... ..a
C ... ..a ..c ..c ... ..t ... ... ... t.a
D ... ..a ..a ..g ..g ..t ... t.t ..t t..

Each nucleotide difference is a character
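As a rough illustration of treating each alignment position as a character, the sketch below expands the dot notation of the alignment above (a dot is taken to mean "same base as sequence A", which is the usual convention and an assumption here) and counts the positions at which each sequence differs from A. The function names are illustrative only.

```python
# The alignment from the slide; "." is assumed to mean "same base as sequence A".
ALIGNMENT = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "... ..a ..g ..a .t. ... ... t.. ... ..a",
    "C": "... ..a ..c ..c ... ..t ... ... ... t.a",
    "D": "... ..a ..a ..g ..g ..t ... t.t ..t t..",
}


def expand(reference: str, dotted: str) -> str:
    """Replace each '.' with the reference base at the same position."""
    ref = reference.replace(" ", "")
    seq = dotted.replace(" ", "")
    return "".join(r if s == "." else s for r, s in zip(ref, seq))


def count_differences(seq1: str, seq2: str) -> int:
    """Number of aligned positions (characters) at which two sequences differ."""
    return sum(1 for a, b in zip(seq1, seq2) if a != b)


# Fully written-out sequences, keyed by name.
full = {name: expand(ALIGNMENT["A"], row) for name, row in ALIGNMENT.items()}

for name in "BCD":
    print(name, full[name], count_differences(full["A"], full[name]))
```

The counts are taken on the expanded sequences, so a position only counts as a difference when the two bases really differ.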

46

Sequences Reflect Relationships

• After working with sequences for a while, one develops an intuitive understanding that for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences. These differences can be quantified.
• Given a set of gene sequences, it should be possible to reconstruct the evolutionary relationships among genes and among organisms.

47

48

What Sequences to Study?

• Different sequences accumulate changes at different rates - choose a level of variation that is appropriate to the group of organisms being studied.
– Proteins (or protein coding DNAs) are constrained by natural selection - better for very distant relationships
– Some sequences are highly variable (rRNA spacer regions, immunoglobulin genes), while others are highly conserved (actin, rRNA coding regions)
– Different regions within a single gene can evolve at different rates (conserved vs. variable domains)

49

Orthologs vs. Paralogs

• When comparing gene sequences, it is important to distinguish between identical vs. merely similar genes in different organisms.
• Orthologs are homologous genes in different species that arose by speciation and typically have analogous functions.
• Paralogs are similar genes that are the result of a gene duplication.
– A phylogeny that includes both orthologs and paralogs is likely to be incorrect.
– Sometimes phylogenetic analysis is the best way to determine if a new gene is an ortholog or paralog to other known genes.

50

[Figure: gene tree. An ancestral gene A (globin) undergoes duplication into A and B (hemoglobin, myoglobin), followed by speciation (mouse, human) into A1, B1 and A2, B2.]

51

Disclaimers

Before describing any theoretical or practical aspects of phylogenetics, it is necessary to give some disclaimers. This area of computational biology is an intellectual minefield!

Neither the theory nor the practical applications of any algorithms are universally accepted throughout the scientific community.

The application of different software packages to a data set is very likely to give different answers; minor changes to a data set are also likely to profoundly change the result.

52

53

A modern revision of the seals and sea lions

54

Genes vs. Species

• Relationships calculated from sequence data represent the relationships between genes; these are not necessarily the same as the relationships between species.
• Your sequence data may not have the same phylogenetic history as the species from which they were isolated
• Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA).

55

Cladistic vs. Phenetic

Within the field of taxonomy there are two different methods and philosophies of building phylogenetic trees: cladistic and phenetic
– Phenetic methods construct trees (phenograms) by considering the current states of characters without regard to the evolutionary history that brought the species to their current phenotypes.
• Remember that phenotype is the outward, physical manifestation of the organism, and genotype is the internally coded inheritable information.
– Cladistic methods rely on assumptions about ancestral relationships as well as on current data.
• A clade is a branch of a phylogenetic tree.

56

Darwin was a Cladist

“The natural system based on descent with modification … the characters that naturalists consider as showing true affinity are those which have been inherited from a common parent, and in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking.”
- Charles Darwin, Origin of Species, 1859

57

Phenetic Methods

• Computer algorithms based on the phenetic model rely on Distance Methods to build trees from sequence data.
• Phenetic methods count each base of sequence difference equally, so a single event that creates a large change in sequence (insertion/deletion or recombination) will move two sequences far apart on the final tree.
• Phenetic approaches generally lead to faster algorithms and they often have nicer statistical properties for molecular data.
• The phenetic approach is popular with molecular evolutionists because it relies heavily on objective character data (such as sequences) and it requires relatively few assumptions.

58

Distance Measurements

• It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals.
• The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances.
• In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree, thus creating a distance matrix.
• Distance methods give a single measurement of the amount of evolutionary change between two sequences since divergence from a common ancestor.
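A minimal sketch of building a distance matrix is shown here, assuming the sequences are already aligned and using the simplest possible measure, the proportion of differing sites (often called the p-distance). The toy sequences and function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def distance_matrix(seqs: dict) -> dict:
    """Symmetric matrix of pairwise p-distances, stored as nested dicts."""
    names = list(seqs)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(seqs[a], seqs[b])
    return matrix


# Hypothetical aligned toy sequences (not from the slides).
toy = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",
    "s3": "ACGAACGTTA",
}
matrix = distance_matrix(toy)
for name, row in matrix.items():
    print(name, ["%.1f" % row[other] for other in toy])
```

Every pair is compared exactly once and the value is mirrored, which is why the matrix is symmetric with zeros on the diagonal.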

59

Distance methods

Calculate the distance CORRECTING FOR MULTIPLE HITS

The Distance Matrix (7 taxa)

         Rat     Mouse   Rabbit  Human   Opossum Chicken Frog
Rat      0.0000  0.0646  0.1434  0.1456  0.3213  0.3213  0.7018
Mouse    0.0646  0.0000  0.1716  0.1743  0.3253  0.3743  0.7673
Rabbit   0.1434  0.1716  0.0000  0.0649  0.3582  0.3385  0.7522
Human    0.1456  0.1743  0.0649  0.0000  0.3299  0.2915  0.7116
Opossum  0.3213  0.3253  0.3582  0.3299  0.0000  0.3279  0.6653
Chicken  0.3213  0.3743  0.3385  0.2915  0.3279  0.0000  0.5721
Frog     0.7018  0.7673  0.7522  0.7116  0.6653  0.5721  0.0000

60

Computing a Distance Matrix

Reading sequences...
gtr1_human: 548 total, 548 read
gtr2_human: 548 total, 548 read
gtr3_human: 548 total, 548 read
gtr4_human: 548 total, 548 read
gtr5_human: 548 total, 548 read

Computing distances using Kimura method...
1 x 2: 48.61     1 x 3: 45.50
1 x 4: 65.74     1 x 5: 107.70
2 x 3: 61.53     2 x 4: 74.57
2 x 5: 113.82    3 x 4: 68.93
3 x 5: 104.43    4 x 5: 110.86

Matrix 1
             1        2        3        4        5
__________________________________________________
| 1 |     0.00    48.61    45.50    65.74   107.70
| 2 |              0.00    61.53    74.57   113.82
| 3 |                       0.00    68.93   104.43
| 4 |                                0.00   110.86
| 5 |                                         0.00

61

DNA Distances

• Distances between pairs of DNA sequences are relatively simple to compute as the sum of all base pair differences between the two sequences.
– this type of algorithm can only work for pairs of sequences that are similar enough to be aligned
• Generally all base changes are considered equal
• Insertions/deletions are generally given a larger weight than replacements (gap penalties).
• It is also possible to correct for multiple substitutions at a single site, which is common in distant relationships and for rapidly evolving sites.

62

63

Correction for multiple hits

• Only differences can be observed directly, not distances
• All distance methods rely (crucially) on this
• A great many models are used for nucleotide sequences (e.g. JC, K2P, HKY, Rev, Maximum Likelihood)
• Amino acid sequences are far more complicated!
• Can take account of different rates of evolution at sites (e.g. gamma distribution)
• Accuracy falls off drastically for highly divergent sequences
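To make the idea of a multiple-hit correction concrete, here is a small sketch of the simplest model in the list above, Jukes-Cantor (JC), which converts an observed proportion of differing sites p into an estimated distance d = -(3/4) ln(1 - 4p/3). The function name is illustrative, and the other models listed use different formulas.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion of differing sites p.

    The correction is only defined for p < 0.75, the value expected
    for two unrelated random sequences under this model.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.01, 0.10, 0.30, 0.50, 0.70):
    print(f"observed p = {p:.2f}  corrected d = {jukes_cantor(p):.4f}")
```

The corrected distance is always at least as large as the observed difference, and it grows steeply as p approaches 0.75, which is one way to see why accuracy falls off for highly divergent sequences.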

64

Amino Acid Distances

• Distances between amino acid sequences are a bit more complicated to calculate.
• Some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be functionally devastating.
• From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence.
• In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, e.g. PAM and BLOSUM

65

The PAM 250 scoring matrix

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4   4  -5   4
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Dayhoff, M., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff, ed., National Biomedical Research Foundation, Silver Spring, MD.

66

Clustering Algorithms

Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.
– Start with a matrix of pairwise distances
– Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.

67

Minimum Evolution

• The total length of all branches in the tree should be a minimum
• It has been shown that the minimum evolution tree is expected to be the true tree provided branch lengths are corrected for multiple hits

68

UPGMA

• The simplest of the distance methods is UPGMA (Unweighted Pair Group Method using Arithmetic averages)
• The PHYLIP programs DNADIST and PROTDIST calculate absolute pairwise distances between a group of sequences. Then the GCG program GROWTREE uses UPGMA to build a tree.
• Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences, which is then used to guide the multiple alignment algorithm.
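The clustering idea behind UPGMA can be sketched in a few lines. This is not the GROWTREE or PHYLIP implementation, only a minimal illustration: it repeatedly joins the two clusters with the smallest average pairwise distance, using the seven-taxon distance matrix shown earlier in these slides, and records the order of the joins rather than computing branch lengths.

```python
from itertools import combinations

NAMES = ["Rat", "Mouse", "Rabbit", "Human", "Opossum", "Chicken", "Frog"]
ROWS = [
    [0.0000, 0.0646, 0.1434, 0.1456, 0.3213, 0.3213, 0.7018],
    [0.0646, 0.0000, 0.1716, 0.1743, 0.3253, 0.3743, 0.7673],
    [0.1434, 0.1716, 0.0000, 0.0649, 0.3582, 0.3385, 0.7522],
    [0.1456, 0.1743, 0.0649, 0.0000, 0.3299, 0.2915, 0.7116],
    [0.3213, 0.3253, 0.3582, 0.3299, 0.0000, 0.3279, 0.6653],
    [0.3213, 0.3743, 0.3385, 0.2915, 0.3279, 0.0000, 0.5721],
    [0.7018, 0.7673, 0.7522, 0.7116, 0.6653, 0.5721, 0.0000],
]


def upgma(names, rows):
    """Return (nested-tuple tree, list of merges) from a full distance matrix."""
    # Each cluster is (subtree, indices of the original taxa it contains).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    merges = []

    def avg_dist(c1, c2):
        # Unweighted average over all pairs of original taxa in the two clusters.
        total = sum(rows[i][j] for i in c1[1] for j in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((a[0], b[0], avg_dist(a, b)))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0], merges


tree, merges = upgma(NAMES, ROWS)
for left, right, dist in merges:
    print(f"join {left} + {right} at average distance {dist:.4f}")
```

A full implementation would also place each join at half the average distance to obtain branch lengths; that step is omitted here to keep the sketch short.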

69

Neighbor Joining

• The Neighbor Joining method is the most popular way to build trees from distance measurements (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
– Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.
– The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
– Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))
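The rate adjustment can be illustrated with the pair-selection step of Neighbor Joining. In the standard formulation, each pair (i, j) of n taxa is scored as Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other taxa, and the pair with the smallest Q is joined first. The five-taxon distances below are a hypothetical toy example, and only this first step is sketched, not the full algorithm.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of taxa that Neighbor Joining would join first."""
    n = len(names)
    # r(i): total distance from taxon i to every other taxon.
    totals = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }

    def q(a, b):
        return (n - 2) * dist[frozenset((a, b))] - totals[a] - totals[b]

    return min(combinations(names, 2), key=lambda pair: q(*pair))


# Hypothetical toy distances between five taxa (not from the slides).
TOY_NAMES = ["a", "b", "c", "d", "e"]
TOY_DIST = {
    frozenset(pair): d
    for pair, d in {
        ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
    }.items()
}

closest_raw = min(combinations(TOY_NAMES, 2), key=lambda p: TOY_DIST[frozenset(p)])
print("smallest raw distance:", closest_raw)
print("pair chosen by NJ:   ", nj_pick_pair(TOY_NAMES, TOY_DIST))
```

In this toy data the pair with the smallest raw distance is not the pair Neighbor Joining selects, which shows how the adjustment for each taxon's overall divergence can change the result relative to simple closest-pair clustering.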

70

Neighbour Joining

[Figure: Neighbour Joining example tree; only the node labels 1 to 8 survive from the original diagram.]

71

Cladistic Methods

• For character data about the physical traits of organisms (such as morphology of organs etc.) and for deeper levels of taxonomy, the cladistic approach is almost certainly superior.
• Cladistic methods are often difficult to implement with molecular data because all of the assumptions are generally not satisfied.

72

Cladistic methods

• Cladistic methods are based on the assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers).
• These methods work best if a specific tree, or at least an ancestral sequence, is already known so that comparisons can be made between a finite number of alternate trees rather than calculating all possible trees for a given set of sequences.

73

Parsimony

• Parsimony is the most popular method for reconstructing ancestral relationships.
– The name derives from “parsimonious” (stingy): the preferred tree needs the least number of changes
• Parsimony allows the use of all known evolutionary information in building a tree
– In contrast, distance methods compress all of the differences between pairs of sequences into a single number

74

Building Trees with Parsimony

• Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.
• The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.

75

Building Trees with Parsimony

• Check each topology
• Count the minimum number of changes required to explain the data
• Choose the tree with the smallest number of changes
• Usually performs well with closely related sequences, but often performs badly with very distantly related sequences
• With distantly related sequences homoplasy (similarity due to convergent evolution, but independent origins) becomes a major problem
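The "count the minimum number of changes" step can be sketched with Fitch's small-parsimony algorithm, one standard way to score a fixed rooted binary tree. The slides do not say which algorithm any particular program uses, so treat this as an illustration under that assumption. Trees are written as nested pairs of leaf names, and the four short sequences and their labels s1 to s4 are used purely as toy data.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one alignment column.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping leaf name -> base at this column
    Returns (set of possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one substitution is required at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total minimum number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


SEQS = {"s1": "ATCG", "s2": "TTCG", "s3": "ATCC", "s4": "TCCG"}
tree_a = (("s1", "s3"), ("s2", "s4"))
tree_b = (("s1", "s2"), ("s3", "s4"))

for label, tree in (("tree_a", tree_a), ("tree_b", tree_b)):
    print(label, parsimony_score(tree, SEQS))
```

Under this scoring, the topology with the lower total is the more parsimonious of the two candidates; a real search would repeat the count over many more topologies.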

76

Parsimony Example

• Consider four sequences: ATCG, TTCG, ATCC, and TCCG
• Imagine a tree that branches at the first position, grouping ATCG and ATCC on one branch, TTCG and TCCG on the other branch.
• Then each branch splits, for a total of 3 nodes on the tree (Tree #1)

77

Tree #1

Tree #2

Compare Tree #1 with one that first divides ATCC on its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree #2).

Trees #1 and #2 both have three nodes, but when all of the distances back to the root (# of nodes crossed) are summed, the total is equal to 8 for Tree #1 and 9 for Tree #2.
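The node-counting comparison above can be checked with a short sketch. It assumes the two trees are rooted and binary as described, represents each as nested tuples, and sums, over all four sequences, the number of nodes crossed on the way back to the root (the root itself included).

```python
def depth_sum(tree, depth=0):
    """Sum over all leaves of the number of internal nodes between leaf and root."""
    if isinstance(tree, str):
        return depth
    return sum(depth_sum(child, depth + 1) for child in tree)


# Tree #1: two balanced pairs. Tree #2: ATCC splits first, then ATCG, then the rest.
tree1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
tree2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))

print("Tree #1:", depth_sum(tree1))
print("Tree #2:", depth_sum(tree2))
```

In Tree #1 every sequence crosses two nodes, while in Tree #2 the sequences cross one, two, three, and three nodes respectively.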

78

Maximum Likelihood

• Requires a model of evolution
• Each substitution has an associated likelihood given a branch of a certain length
• A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters
• The function is maximized (equivalently, the negative log-likelihood is minimized)

79

Maximum Likelihood

• The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.
• This method works best when it is used to test (or improve) an existing tree.
• Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.

80

Models can be made more parameter rich to increase their realism

• The most common additional parameters are:
– A correction to allow different substitution rates for each type of nucleotide change
– A correction for the proportion of sites which are unable to change
– A correction for variable site rates at those sites which can change
• The values of the additional parameters will be estimated in the process

81

Ancestral Sequences

• Maximum likelihood predicts ancestral sequences
– at branch points in the tree (nodes)
• This can provide information about the timing of the acquisition of a novel trait or mutation
• PAML (Phylogenetic Analysis by Maximum Likelihood)
– Confidence intervals provided
– Selection can be inferred

82

Assumptions for Maximum Likelihood

• The model assumes particular frequencies of DNA transitions (C<->T, A<->G) and transversions (C or T<->A or G).
• The assumptions for protein sequence changes are taken from the PAM matrix, and are quite likely to be violated in “real” data.
• Since each nucleotide site is assumed to evolve independently, the likelihood of the tree is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data.
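The per-site product can be illustrated for the smallest possible case: two aligned sequences separated by a single branch of length d. The sketch assumes the Jukes-Cantor model (equal base frequencies, one substitution rate), works with the sum of log-likelihoods rather than the raw product, and finds the best d by a simple grid search. Real programs use far more general models and optimizers, and the site counts below are hypothetical.

```python
import math


def jc_log_likelihood(d, n_sites, n_diffs):
    """Log-likelihood of two aligned sequences at distance d under Jukes-Cantor."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    n_same = n_sites - n_diffs
    # Each site contributes (base frequency 1/4) x (transition probability);
    # independence across sites turns the product into a sum of logs.
    return n_same * math.log(0.25 * p_same) + n_diffs * math.log(0.25 * p_diff)


def ml_distance(n_sites, n_diffs, step=0.001, max_d=3.0):
    """Grid-search estimate of the distance that maximizes the likelihood."""
    grid = [i * step for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diffs))


# Hypothetical data: 100 aligned sites, 10 of which differ.
estimate = ml_distance(100, 10)
analytic = -0.75 * math.log(1.0 - 4.0 * (10 / 100) / 3.0)
print(f"grid-search ML distance: {estimate:.3f}")
print(f"analytic JC distance:    {analytic:.4f}")
```

For this two-sequence case the maximum-likelihood distance agrees with the closed-form Jukes-Cantor correction up to the grid resolution, which is a useful sanity check on the likelihood function.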

83

The Molecular Clock

For a given protein the rate of sequence evolution is approximately constant across lineages (Zuckerkandl and Pauling, 1965)

This would allow speciation and duplication events to be dated accurately based on molecular data

Local and approximate molecular clocks are more reasonable

84

Rooting the Tree

• In an unrooted tree the direction of evolution is unknown
• The root is the hypothesized ancestor of the sequences in the tree
• The root can either be placed on a branch or at a node
• You should start by viewing an unrooted tree

85

86

87

Rooting Using an Outgroup

• The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other
• It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition
• The root must be somewhere between the outgroup and the rest (either on the node or in a branch)

88

Are there Correct trees?

• Despite all of these caveats, it is actually quite simple to use computer programs to calculate phylogenetic trees for data sets.
• Provided the data are clean, outgroups are correctly specified, appropriate algorithms are chosen, no assumptions are violated, etc., can the true, correct tree be found and proven to be scientifically valid?
• Unfortunately, it is impossible to ever conclusively state what is the "true" tree for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data are gathered.

89

Is my tree correct?

Bootstrap values

Bootstrapping is a statistical technique that uses random re-sampling of data to estimate sampling error for tree topologies
• Leave-one-out methods
– (leave out a row, not a species)
• Agreement among the resulting trees is summarized with a majority-rule consensus tree
• Each branch of the tree is labelled with the % of bootstrap trees in which it occurred.
• 80% is good, less than 50% is bad
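The re-sampling step can be sketched as follows. In the usual form of the phylogenetic bootstrap, alignment columns (sites) are drawn at random with replacement to build each pseudo-replicate alignment; that is the assumption made here. To stay short, the sketch does not rebuild a full tree for each replicate. As a stand-in it only records which pair of sequences is closest, so the percentages it reports are support for a "closest pair" rather than for a real tree branch. The sequences and function names are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def resample_columns(seqs, rng):
    """Build one bootstrap replicate by sampling alignment columns with replacement."""
    length = len(next(iter(seqs.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in seqs.items()}


def closest_pair(seqs):
    """Stand-in for tree building: the pair of sequences with the fewest differences."""
    def diffs(pair):
        a, b = pair
        return sum(x != y for x, y in zip(seqs[a], seqs[b]))
    return min(combinations(sorted(seqs), 2), key=diffs)


def bootstrap_support(seqs, replicates=100, seed=1):
    """Percentage of replicates in which each pair came out as the closest pair."""
    rng = random.Random(seed)
    counts = Counter(
        closest_pair(resample_columns(seqs, rng)) for _ in range(replicates)
    )
    return {pair: 100.0 * n / replicates for pair, n in counts.items()}


# Hypothetical aligned toy sequences.
TOY = {
    "w": "ACGTACGTACGTACGT",
    "x": "ACGTACGAACGTACGT",
    "y": "TCGAACCTACGAACTT",
    "z": "TCGAACCTTCGAACTA",
}
support = bootstrap_support(TOY)
for pair, pct in sorted(support.items()):
    print(pair, f"{pct:.0f}%")
```

In a real analysis the stand-in function would be replaced by a full tree-building step, and the support for each branch would be read from a consensus of the replicate trees.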

90

Non-Synonymous Substitutions

• There is MORE information hidden in alignments
• For each DNA substitution, we can observe if it changes the corresponding amino acid
• Due to the redundancy of the genetic code, a SYNONYMOUS (Ks) substitution does not change the AA
• A NON-SYNONYMOUS (Ka) substitution changes the AA at that codon
• [Need to correct the # of observed Ka and Ks for the possible number of each kind of change that could occur in each codon]
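A rough sketch of the first step, classifying codon differences as synonymous or non-synonymous, is given below. It uses the standard genetic code and simply compares codons pairwise; it does not apply the correction mentioned in the last bullet, so the output is raw counts, not Ka or Ks. The two sequences are sequence A and the expanded sequence B from the alignment slide, and it is an assumption that the codon grouping shown there is the true reading frame.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(codon1: str, codon2: str) -> str:
    """Label a pair of aligned codons as identical, synonymous, or nonsynonymous."""
    c1, c2 = codon1.upper(), codon2.upper()
    if c1 == c2:
        return "identical"
    if CODON_TABLE[c1] == CODON_TABLE[c2]:
        return "synonymous"
    return "nonsynonymous"


def count_codon_changes(seq1: str, seq2: str):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    syn = nonsyn = 0
    for i in range(0, len(seq1) - len(seq1) % 3, 3):
        kind = classify(seq1[i:i + 3], seq2[i:i + 3])
        syn += kind == "synonymous"
        nonsyn += kind == "nonsynonymous"
    return syn, nonsyn


SEQ_A = "aattcgcttctaggaatctgcctaatcctg"
SEQ_B = "aattcactgctagtaatctgcttaatccta"
print(count_codon_changes(SEQ_A, SEQ_B))
```

Turning such counts into Ka and Ks additionally requires dividing by the number of possible synonymous and non-synonymous sites, which this sketch leaves out.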

91

Ka/Ks

• Neutral mutations will change all bases at an equal rate, so Ka/Ks = 1
• Conserved sequences will have Ka/Ks < 1 [this is true for the vast majority of protein coding sequences]
• Ka/Ks > 1 is a signature of positive selection (AA changes occur at a faster rate than expected by chance)
– discovery of a gene under positive selection by Ka/Ks > 1 is a very big deal

[The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Nekrutenko A, Makova KD, Li WH. Genome Res. 2002 Jan;12(1):198-202.]

92

Ka/Ks varies within a gene

93

Computer Software for Phylogenetics

Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.
– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data. Command line only - hard to use. (Several free web servers provide a functional user interface)
– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining. Very easy to use, but NJ may not always be the best method to handle your data.

94

Other useful software

• Mega - (free, Windows only) alignment, build trees, estimate rates of evolution
• Mesquite - (free, Mac & Win) advanced analysis of trees created by other programs
• Phylowin - (free, Mac & Win) builds trees from a distance matrix (NJ, parsimony, max likelihood)
• PAUP - (Commercial, Mac & Win)
– sophisticated, but fairly easy to use
– Includes NJ, Parsimony, and Max. Likelihood
– Also does bootstrapping
• Phylodendron - (web) redraw trees

95

Other Web Resources

• Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at:
http://evolution.genetics.washington.edu/phylip/software.html
• Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
http://www.science.uts.edu.au/sasb/WestonCrisp.html
• University of California, Berkeley Museum of Paleontology (UCMP)
http://www.ucmp.berkeley.edu/clad/clad4.html

96

Software Hazards

• There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)
• Moving sequences into different programs can be a major hassle due to incompatible file formats.
• Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data.

97

Molecular Phylogeny Conclusions

Given the huge variety of methods for computing phylogenies, how can the biologist determine what is the best method for analyzing a given data set?
– Published papers that address phylogenetic issues generally make use of several different algorithms and data sets in order to support their conclusions.
– In some cases different methods of analysis can work synergistically
• Neighbor Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method
– Using several alternate methods can give an indication of the robustness of a given conclusion.

98

Recall What is Life?

• A state of functional activity and continual change, before death (defined complementarily as end-of-life).
• Characterized by the capability to:
– Reproduce itself,
– Adapt to an environment in a quest for survival, and
– Take actions independent of exterior agents.

99

Nature as a special case of Life

• The Biology of Nature has so far been the scientific study of life on Earth based on carbon-chain chemistry.
• However, nothing restricts the study of properties of life to carbon-chain chemistry; it is merely the only form of life so far available for study.
• Further motivation to study life as a generic concept comes from the hypothesis that we are perhaps just one possible combination of atoms that makes life possible. We haven’t met other examples (Aliens).

100

…which brings us to Artificial-Life

• The lack of any available non-carbon based life-forms motivates us to create an artificial environment and a set of rules for life to evolve.
• Artificial Life, or ALife or AL, is the study of non-organic organisms, beyond the creations of nature, that possess the essential properties of life as we understand it, and whose environment is artificially created in an alternative medium, which very often is a logical device like the computer.

101

ALife as a Synthesis approach

• Rather than being an analytical study of “natural” life, A-Life is a Synthesis approach to studying any form of Life.
• We have:
– An artificially-created environment (usually) within computers,
– A fairly universal set of rules and properties of life, derived from the one example we have of life - Natural life.

102

So what is the motivation?

• A-Life could have been dubbed yet-another-approach to studying intelligent life, had it not been for the emergent properties of life that motivate scientists to explore the possibility of artificially creating life and expecting the unexpected.
• Recall that an emergent property is created when something becomes more than the sum of its parts. For example, half a human is not capable of working without the other half, but the two halves together are capable of very complex behavior (not a representative example).

103

So where does A-Life fit in?

• The A-Life concept helps to:
– Study existing natural life forms by trying to simulate the generic rules they follow, the environmental parameters like entropy/chaos, and the seed, i.e. the initial set of elements on which the rules of life apply under the given environmental conditions, in order to understand evolution in nature.
– Create new life within the digital world by creating a new set of external parameters, seeds, and rules of evolution, and letting life find a way.

104

So is A-Life = AI?

Artificial Life | Artificial Intelligence
Concept: late 1980s | Concept: 1960s
Grounded in Biology, Physics, Chemistry, Mathematics | Pursued primarily in Comp. Sci., Engineering & Psychology
Studies intelligence as part of life itself | Studies intelligent behavior in isolation
Bottom-up approach: study synthesis | Top-down approach: focus is on results
Views life-as-it-could-be | Views life-as-it-is

Both seem to approach similar problems, but…

105

A-Life : Emergence

• Emergence is what you get when something is more than the sum of its parts.

• Human thought relies on nearly all the cells that make up the brain. Single cells are incapable of thought; thought is an emergent property of these cells coming together and interacting to give complex results. This is the motivation behind cellular automata (CA) and neural networks (NN).

• Extreme example: the Earth as one living thing, with the whole of nature in dynamic equilibrium and each part having a bearing on the others.

106

A-Life : Entropy

• Second Law of Thermodynamics: when two isolated systems are joined together, the entropy (loosely, disorder or chaos) of the combined system is at least the sum of the entropies of the individual systems, and usually greater.

• This roughly applies to all systems, including those that exchange information.

• Life is all about fighting against entropy: while other systems lose information to their surroundings, life not only keeps hold of its information but also increases it.

107

A-Life : Complexity

• Life is a complex system: a dynamic system that can keep changing and evolving over a long period of time without dying.

• If the amount of information exchange in a system is varied from low to high, the system is Fixed, then Periodic, then Chaotic, in that order. Somewhere in between, a system exhibits complex behavior.

• Accordingly, each unit in a system either dies, freezes, pulsates, or behaves in a complex manner.

System | Behavior
Fixed | No change, no death
Periodic | Change, no evolution, no death
Chaotic | Change, evolution, death
Complex | Change, evolution, no death
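As a rough illustration of the Fixed, Periodic, Chaotic progression, the sketch below uses the logistic map x → r·x·(1−x). This map is not part of the slides; it is an assumed stand-in in which the parameter r plays the role of the quantity varied from low to high. The function names, tolerance, and maximum period are hypothetical choices, and the "Complex" class is not captured by this simple heuristic.

```python
def logistic_orbit(r, x0=0.2, burn_in=2000, keep=64):
    """Iterate x -> r*x*(1-x), discard a transient, return the next `keep` values."""
    x = x0
    for _ in range(burn_in):
        x = r * x * (1.0 - x)
    orbit = []
    for _ in range(keep):
        x = r * x * (1.0 - x)
        orbit.append(x)
    return orbit


def classify(r, tol=1e-6, max_period=16):
    """Heuristically label the long-run behavior: 'fixed', 'periodic', or 'chaotic'."""
    orbit = logistic_orbit(r)
    for period in range(1, max_period + 1):
        # The orbit repeats with this period if every value matches the one
        # `period` steps later (within the tolerance).
        if all(abs(orbit[i] - orbit[i + period]) < tol
               for i in range(len(orbit) - period)):
            return "fixed" if period == 1 else "periodic"
    # No short repeating cycle found: treat the behavior as chaotic.
    return "chaotic"


for r in (2.5, 3.2, 3.9):
    print(r, classify(r))
```

The classification is only a heuristic: close to a transition between regimes the orbit settles slowly, so the label can depend on `burn_in` and `tol`.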

108

A-Life : Chaos Theory

• Chaos theory explains apparent randomness: many apparently random events are not truly random. They are just the iteration of simple rules on existing states (and possibly previous states), generating complex behavior; they live on the edge of total chaos.

• Many natural processes are chaotic: the sea, the wind.

• Some man-made processes are chaotic: financial markets.

• Lack of knowledge of all the rules, inputs, and the seed prevents us from determining the exact state of such a system at a given point, but knowledge of some of the dominant rules and inputs makes it possible to predict the general behavior of the system.

• This lack of knowledge of all the parameters is what leads us to conclude that the system behaves randomly.
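The role of the seed can be made concrete with a minimal sketch. It assumes the logistic map x → r·x·(1−x) with r = 3.9 as a stand-in chaotic rule (this choice is not from the slides, and the function names are hypothetical). The rule is fully deterministic, yet two seeds that differ by only 1e-10 end up on very different trajectories, which is why incomplete knowledge of the seed makes the system look random.

```python
def iterate(seed, steps, r=3.9):
    """Apply the deterministic rule x -> r*x*(1-x) `steps` times; return all states."""
    states = [seed]
    for _ in range(steps):
        states.append(r * states[-1] * (1.0 - states[-1]))
    return states


def max_gap(seed_a, seed_b, steps, r=3.9):
    """Largest distance between the two trajectories over `steps` iterations."""
    a = iterate(seed_a, steps, r)
    b = iterate(seed_b, steps, r)
    return max(abs(x - y) for x, y in zip(a, b))


# Same seed: identical trajectories, nothing random is involved.
same = max_gap(0.2, 0.2, 100)
# Slightly different seed: the trajectories stay close at first, then diverge.
early = max_gap(0.2, 0.2 + 1e-10, 10)
late = max_gap(0.2, 0.2 + 1e-10, 100)
print(same, early, late)
```

Knowing the rule and an approximate seed is still enough to predict general behavior (the values always stay between 0 and 1), even though the exact state after many steps cannot be recovered.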

109

A-Life : Current research areas

• Mathematical, Philosophical, Biological foundations, Social and Ethical implications of A-Life

• Cellular Automata

• Neural Networks

• Genetic Algorithms

• Origin, Self-organization, Repair and Replication

• Evolutionary / Adaptive Dynamics

• Autonomous, Adaptive and Evolving Robots

• Software Agents (good/evil)

• Emergent Collective Behaviors, Swarms

• Synthetic/Artificial Chemistry/Biology/Materials

• Applications: Finance, Economics, Gaming, MEMS, etc.

110

ALife : Foundation/Implications

• Research on foundations tries to answer questions about the motivation behind such a ground-breaking concept, using our existing knowledge base in mathematics, chemistry, biology, the philosophy of life, etc. The question is: “How, why and where can the ALife approach succeed (or fail)?”

• Research on implications tries to understand and explain how the extension of life as a generic concept affects our understanding of the very basics of natural life, shattering (or possibly not affecting) many a belief about God, creation and destruction. The question here is: “How does ALife fit (if at all) into the present-day social framework of morals and ethics, often laid out by the various religious texts?”

111

ALife : Cellular Automata

• Inspired by the way natural biological cells behave and interact with their neighboring cells by following rules set out by the DNA code in them.

• A Cellular Automaton (CA) is an N-dimensional array of ‘cells’ that interact with their neighboring cells according to a predetermined set of rules to generate actions, which in turn may trigger a new series of reactions in the cells themselves or in their neighbors.

• The best-known example is Conway’s Life, a 2-state 2-D CA with simple rules (listed under “Conway’s Life: Rules”) applied to all cells simultaneously to create generations of cells from an initial pattern.

• Different initial patterns generate different behavioral patterns: some die away (unstable), some blink (periodic), and the rest show complex behavior by continuing to live and evolve.

Conway’s Life: Rules

A living cell with 0-1 neighbors dies of isolation

A living cell with 4+ neighbors dies from overcrowding

A dead cell with exactly 3 neighbors becomes a living cell

All other cells are unaffected
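A minimal sketch of one generation of Conway’s Life follows. It uses the standard rules: a living cell survives with 2 or 3 living neighbors, and a dead cell with exactly 3 living neighbors becomes alive. The representation (a set of (row, col) coordinates on an unbounded grid) and the names `step` and `run` are assumptions made for illustration.

```python
from collections import Counter


def step(live):
    """Advance one generation; `live` is a set of (row, col) living cells."""
    # For every cell adjacent to a living cell, count its living neighbors.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth or survival with exactly 3 neighbors; survival only with 2.
    # Everything else dies (isolation or overcrowding) or stays dead.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}


def run(live, generations):
    """Apply `step` repeatedly; all cells are updated simultaneously each time."""
    for _ in range(generations):
        live = step(live)
    return live


blinker = {(1, 0), (1, 1), (1, 2)}          # periodic pattern
block = {(0, 0), (0, 1), (1, 0), (1, 1)}    # stable pattern
lonely = {(0, 0)}                           # dies of isolation

print(sorted(step(blinker)))
```

Storing only the living cells keeps the grid unbounded and makes the simultaneous update explicit, since each new generation is computed entirely from the previous set.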
