INDIANA UNIVERSITY

Evolutionary Biology and Computational Grids

Craig Stewart
stewart@indiana.edu
Director, Research and Academic Computing
10 November 1999

Please cite as: Stewart, C.A. 1999. Evolutionary Biology and Computational Grids. (Presentation) CASCON Workshop on Computational Biology (Mississauga, Ontario, Canada, 10 Nov 1999). Available from: http://hdl.handle.net/2022/14010.

Page 1

Evolutionary Biology and Computational Grids

Craig Stewart
stewart@indiana.edu
Director, Research and Academic Computing
10 November 1999

Page 2

Intellectual credits

• Collaborators
  – National University of Singapore
    • Tan Tin Wee, Louxin Zhang (NUS), Meena Sakharkar
  – ACSys (Advanced Computational SYStems, Australian National University)
    • Markus Buckhorn
  – Indiana University
    • David Hart, Donald K. Berry, Jeffrey Palmer, Will Fischer, Chris Parkinson, Sean Turner, Eric Wernert
• Code development
  – J. Felsenstein – DNAml (PHYLIP) [U. Washington]
  – G. Olsen – fastDNAml [UIUC]
  – H. Matsuda, R. Overbeek – initial P4 parallel code [ANL]
  – D.K. Berry – PVM and MPI ports [IU]

Page 3

Outline

• Phylogenies
• Statistical methods for estimating phylogenies & the fastDNAml program
  – Models of DNA replication and evolution
  – Algorithm
  – Parallelization
• Grid computing, HPCC, visualization
• What we’ve learned so far
• Future plans

Page 4

• This slide previously contained an image scanned from E. Colbert. 1965. The age of reptiles. W.W. Norton, NY, NY.

Page 5

Lots of DNA sequence data

• Automation of sequencing process
• Many large-scale genomic projects

thermotoga   ATTTGCCCCA GAAATTAAAG CAAAAACCCC AGTAAGTTGG GGATGGCAAA AAAGGAAAAT

Tthermophi   ATTTGCCCCA GGGGTTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA

Taquaticus   ATTTGCCCCA GGGGTTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA

deinonema-   ATTTGCCCCA GGGATTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA

ChlamydiaB   ATTTTCCCCA GAAATTCCCG AAAAAACCCC AATAAATTGG GGATGGCAGG GGAGGAAGGA

flexistips   ATTTTCCCCA CAAAAAAAAG AAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA

borrelia-b   ATTTGCCCCA GAAGTTAAAG CAAAAACCCC AATAAGTTGG GGATGGCAGG GGAGGAAAAA

bacteroide   ATTTGCCCCA GAAATTCCCG CAAAAACCCC AGTAAATTGG GGATGGCAGG GGAGGAAAAA

pseudomona   ATTTGCCCCA GGGATTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA

ecoli-----   GTTTTCCCCA GAAATTCCCG CAAAAACCCC AGTAAGTTGG GGATGGCAGG GGAGGAAAAA

• 3 billion bases in the human genome

Page 6

Statistical Methods in Phylogeny

• The availability of large amounts of genetic data makes it possible to apply statistical analysis to that data and so infer evolutionary phylogenies of organisms, organelles, or gene products.

Page 7

Confluence of events

• Development of computationally intensive methods for estimating phylogenies

• Abundance of DNA data
  – The limiting factor in scientists’ ability to analyze genetic data is often the availability of computer time, not the availability of raw data

• Development of Grids as a high performance computing architecture
  – The concept of computational grids is dramatically changing the way we think about HPC.

• IU’s biologists were eating our computers alive

Page 8

Maximum Likelihood

• Typical statistical inference: calculate probability of data given the hypothesis

• Phylogenetic tree building: the tree, its branch lengths, and the associated likelihood values are all calculated from the data.

• Likelihood values are used only for comparisons
• ML is the most computationally intensive of the mathematically-based phylogeny methodologies

Page 9

Tree estimation

Page 10

Page 11

DNA replication

Purines: Adenine & Guanine
Pyrimidines: Thymine & Cytosine

Page 12

Markov model of base substitution

• In a small interval of time t there is a probability u that a base at a site is replaced

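• For any site: $P_{ij}(t) = e^{-ut}\,\delta_{ij} + (1 - e^{-ut})\,\pi_j$, where $\delta_{ij}$ is 1 if $i = j$ and 0 otherwise, and $\pi_j$ is the frequency of base $j$

• Treat each site as independent (insertions and deletions are outside the capabilities of this program)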

• Must correct for empirical base frequencies, unequal rates for transitions and transversions, and/or independent rates for specific changes
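
The substitution probability on this slide can be checked numerically. The following is a minimal sketch, assuming the formula reads $P_{ij}(t) = e^{-ut}\delta_{ij} + (1 - e^{-ut})\pi_j$ with $\pi_j$ the frequency of base $j$. The function name, the base ordering, and the frequency values are illustrative assumptions and are not taken from fastDNAml.

```python
import math

BASES = "ACGT"  # assumed ordering, for illustration only


def substitution_matrix(t, u, pi):
    """Return P where P[i][j] = exp(-u*t) * delta_ij + (1 - exp(-u*t)) * pi[j]."""
    stay = math.exp(-u * t)  # probability that no replacement event occurred
    return [
        [stay * (1.0 if i == j else 0.0) + (1.0 - stay) * pi[j]
         for j in range(len(BASES))]
        for i in range(len(BASES))
    ]


# Hypothetical base frequencies for A, C, G, T (they must sum to 1).
pi = [0.3, 0.2, 0.2, 0.3]
P = substitution_matrix(t=0.1, u=1.0, pi=pi)
for base, row in zip(BASES, P):
    print(base, [round(p, 4) for p in row])
```

Each row sums to 1, the matrix is the identity at t = 0, and for large t every row approaches the base frequencies, which is what the formula implies.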

Page 13

fastDNAml’s phylogeny construction

• Objective: find the tree and branch lengths that have the greatest probability of giving rise to the present day sequences

• The number of bifurcating unrooted trees for n taxa is $\frac{(2n-5)!}{(n-3)!\,2^{n-3}}$; for 50 taxa the number of possible trees is $O(10^{74})$

• So, build trees incrementally, and search within the space of all possible trees looking for best tree
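
The size of the search space can be computed directly from the tree-count formula $(2n-5)! / ((n-3)!\,2^{n-3})$. This is a small sketch; the function name is an assumption made for illustration.

```python
from math import factorial


def num_unrooted_trees(n):
    """Number of bifurcating unrooted trees for n >= 3 labeled taxa."""
    if n < 3:
        raise ValueError("need at least three taxa")
    return factorial(2 * n - 5) // (factorial(n - 3) * 2 ** (n - 3))


for n in (3, 4, 5, 10, 50):
    print(n, f"{num_unrooted_trees(n):.3e}")
```

For 50 taxa the result has 75 digits, which matches the order of magnitude quoted on the slide and shows why exhaustive search is not an option.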

Page 14

fastDNAml algorithm

• Compute the optimal tree for three taxa (chosen randomly) - only one topology possible

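• Randomly pick another taxon, and consider each of the 2i-5 trees possible by adding this taxon to the first, three-taxon tree (i is the number of taxa in the tree after the addition, so there are three candidate trees when i = 4)

• Keep the best (maximum likelihood) tree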
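
The taxon-addition step can be sketched as follows. This is an illustration only, assuming a tree is stored as a list of branches (pairs of node names); the representation, the function name, and the node labels are hypothetical and do not reflect fastDNAml's internal data structures.

```python
def add_taxon(edges, new_leaf, new_node):
    """Return every tree made by attaching new_leaf to one branch of the tree.

    edges is a list of (node, node) branches; new_node names the internal
    node created where the new leaf joins the tree.
    """
    trees = []
    for a, b in edges:
        rest = [e for e in edges if e != (a, b)]
        # Split branch a-b with a new internal node and hang the new leaf off it.
        trees.append(rest + [(a, new_node), (new_node, b), (new_node, new_leaf)])
    return trees


# The only topology for three taxa: a star around one internal node.
start = [("n0", "A"), ("n0", "B"), ("n0", "C")]
four = add_taxon(start, "D", "n1")    # i = 4: 2*4 - 5 = 3 candidates
five = add_taxon(four[0], "E", "n2")  # i = 5: 2*5 - 5 = 5 candidates
print(len(four), len(five))
```

A tree with i-1 taxa has 2i-5 branches, so there is one candidate per branch. A real run would score each candidate by likelihood and keep the best one.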

Page 15

Initial steps in tree building

Page 16

Local branch rearrangement

• Move any subtree to a neighboring branch (2i-6 possibilities)

• Keep best resulting tree
• Repeat this step until local swapping no longer improves likelihood value
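
The count of 2i-6 rearrangements can be reproduced with a short sketch. It assumes the local rearrangement is the nearest neighbor interchange named on the following slide, and it stores the tree as an adjacency map; the function names and node labels are hypothetical.

```python
def to_adjacency(edges):
    """Build {node: set(neighbors)} from a list of branches."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def internal_edges(adj):
    """Branches whose two endpoints are both internal (degree-3) nodes."""
    return [(u, v) for u in adj for v in adj[u]
            if u < v and len(adj[u]) == 3 and len(adj[v]) == 3]


def swap(adj, u, x, v, y):
    """Copy of the tree with subtree x (attached to u) and subtree y (attached to v) exchanged."""
    new = {node: set(nbrs) for node, nbrs in adj.items()}
    new[u].remove(x); new[x].remove(u)
    new[v].remove(y); new[y].remove(v)
    new[u].add(y); new[y].add(u)
    new[v].add(x); new[x].add(v)
    return new


def nni_neighbors(adj):
    """All trees one nearest neighbor interchange away: two per internal branch."""
    result = []
    for u, v in internal_edges(adj):
        x = sorted(adj[u] - {v})[0]        # one subtree on the u side
        for y in sorted(adj[v] - {u}):     # either subtree on the v side
            result.append(swap(adj, u, x, v, y))
    return result


# Five-taxon example: ((A,B),C,(D,E)) with internal nodes n0, n1, n2.
tree = to_adjacency([("n0", "A"), ("n0", "B"), ("n0", "n1"), ("n1", "C"),
                     ("n1", "n2"), ("n2", "D"), ("n2", "E")])
print(len(nni_neighbors(tree)))
```

A tree with i taxa has i-3 internal branches and each allows two interchanges, giving 2i-6 candidates per pass.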

Page 17

Nearest neighbor interchange

Page 18

Iterate

• Get sequence data for next taxon
• Add new taxa (2i-5)
• Keep best
• Local rearrangements (2i-6)
• Keep best
• Keep going….
• When all taxa have been added, perform a full tree check

Page 19

Because of local effects….

• Where you end up sometimes depends on where you start

• This process searches a huge space of possible trees, and is thus dependent upon the randomly selected initial taxa

• Can get stuck in local optimum, rather than global
• Must do multiple runs with different randomizations of taxa, and compare the results
• Similar trees and likelihood values provide some confidence
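
The dependence on the starting point can be illustrated with a toy analogy. This is not phylogenetic code: the one-dimensional "landscape" of scores, the function names, and the seed are all assumptions chosen only to show why several randomized runs are compared.

```python
import random


def hill_climb(landscape, start):
    """Move to the better neighbor until no neighbor improves the score."""
    pos = start
    while True:
        neighbors = [p for p in (pos - 1, pos + 1) if 0 <= p < len(landscape)]
        best = max(neighbors, key=lambda p: landscape[p])
        if landscape[best] <= landscape[pos]:
            return pos  # local optimum reached
        pos = best


def multi_start(landscape, starts):
    """Run the climb from every start and keep the best end point."""
    ends = [hill_climb(landscape, s) for s in starts]
    return max(ends, key=lambda p: landscape[p])


# Hypothetical scores with several local optima; the global optimum is at index 5.
landscape = [1, 3, 2, 5, 4, 9, 7, 6, 8, 2]
starts = random.Random(1999).sample(range(len(landscape)), 4)
for s in starts:
    print("start", s, "-> end", hill_climb(landscape, s))
print("best of runs:", multi_start(landscape, starts))
```

Different starts end at different optima, so a single run gives no indication of whether the global optimum was found; agreement across runs is what lends confidence.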

Page 20

How many calculations are there?

• For 50 taxa, there are $\sum_{i=4}^{50} \{(2i-5) + (2i-6)\} = 4{,}559$ trees to evaluate, presuming that no local rearrangement ever produces an improved tree. And each step is fairly computationally intensive.

• This algorithm is ideal for parallelization, because communications involve at most a tree and a probability value
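
The figure of 4,559 trees follows from summing the 2i-5 taxon additions and the 2i-6 local rearrangements for i = 4 to 50. A short check, with a function name chosen for illustration:

```python
def trees_evaluated(n_taxa):
    """Minimum number of trees scored when no rearrangement ever improves the tree."""
    return sum((2 * i - 5) + (2 * i - 6) for i in range(4, n_taxa + 1))


print(trees_evaluated(50))
```

Every one of these evaluations is independent of the others within a step, which is what makes the workload easy to divide among processors.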

Page 21

Overview of parallel program flow

Page 22

Geographically distributed computing

• The high computation/communication ratio makes this program a good candidate for geographic distribution

• Time to completion is a constant forever and ever
• The key task is to combine geographically distributed resources so that large jobs can be completed in tolerable (for the biologist) amounts of wall clock time

Page 23

Programming for geographically distributed computing

• Conversion of PVM version to grid-based computations
• Load balancing
• Handles timeouts, system crashes, etc.
• Conversion to MPI/Globus
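
The general shape of a master that hands out trees, balances load, and copes with timeouts and failed workers can be sketched as below. This is only an illustration: it uses Python's standard `concurrent.futures` thread pool as a stand-in for PVM or MPI, and `toy_score`, `best_tree`, and their parameters are hypothetical names, not the actual fastDNAml code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_score(tree):
    """Hypothetical stand-in for a likelihood evaluation (not the real calculation)."""
    return -float(len(tree))


def best_tree(candidates, score=toy_score, workers=4, timeout=60.0, max_retries=1):
    """Master loop: farm out one tree per task, collect the scores, keep the best."""
    best = None
    attempts = {tree: 0 for tree in candidates}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Idle workers pull the next task, which balances load automatically.
        pending = {pool.submit(score, tree): tree for tree in candidates}
        while pending:
            # Raises TimeoutError if no outstanding task finishes in time.
            done = next(as_completed(pending, timeout=timeout))
            tree = pending.pop(done)
            try:
                value = done.result()
            except Exception:
                # Treat a failed worker like a crashed node: resubmit the tree.
                attempts[tree] += 1
                if attempts[tree] > max_retries:
                    raise
                pending[pool.submit(score, tree)] = tree
                continue
            if best is None or value > best[1]:
                best = (tree, value)
    return best


print(best_tree(["((A,B),C,D);", "(A,B,C);"]))
```

Only a tree and a score cross between master and worker, so the same pattern tolerates the high latency of geographically distributed machines.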

Page 24

StarTAP

Page 25

Page 26

SC98 Demonstration

• Indiana University - SP nodes
• NUS - SP nodes
• ACSys – DEC Workstations
• Immersadesk on the SC98 show floor as part of the IU/EVL iGRID demonstration

Page 27

Page 28

Cytoplasmic Coat Proteins

Page 29

Performance of fastDNAml

[Figure: wall clock time (seconds) versus number of processors for three configurations: IU only, IU & NUS, IU & ANU]

Page 30

Applications

• Better understanding of evolution (Coelacanths)

• Medicine – example: our cousins, the fungi
  – classification of genes & gene products

• Maintenance of biodiversity

Page 31

What we’ve learned so far

• We can run the program
• We can do productive biology
• Security is a headache, especially with PVM
• Security is a headache, especially with Globus
• The time difference causes some problems, but more benefits in terms of the partnering opportunities

Page 32

Computing grids and Power Grids

• When you plug your hair dryer into an outlet, you don’t know how the power was generated or where it came from.

• Someday you’ll plug your laptop into a wall and cycles and storage will be available in a similarly magical fashion, but we’re a long way from that (plus it is probably an unrealistic goal for high-end computing).

• Before the current electrical power grid, there were regional electrical suppliers

• Before the regional electrical suppliers, there were battles over power standards, organizations of power companies, what type of generators were best, etc.

Page 33

Models for Computational Grids

• Geographically distributed organizations (NASA, ASCI)

• Alliances and consortia (NCSA, NPACI, CIC)

• A new approach: communities of interest

Page 34

Future Plans

• Make the ‘evolutionary biology grid’ a (periodically available) production service

• Enhance MPI/Globus version of code, make code publicly available

• Step up a level in parallelization
• Key objective: create a geographically-distributed version of fastDNAml that makes possible new advances in understanding of evolutionary biology.

Page 35

Particular benefits of IBM RS/6000 SPs

• Distributed memory ‘preadapts’ code for an individual SP to a geographically distributed scenario

• Excellent interface with storage systems

• Luck never hurts: many of our collaborators and potential collaborators have significant IBM installations

Page 36

Acknowledgements

• In addition to the intellectual debts noted at the beginning of this talk, our research has been greatly aided by Sponsored University Research grants from IBM

• This work would not have been possible without the cooperation and collaboration of Dr. Jeffrey Palmer and his research group.

Page 37

Acknowledgements, cont’d

• The phylogeny depicted in slide 4 when this slide deck was presented was taken from E. Colbert. 1965. The age of reptiles. W.W. Norton, NY, NY. This diagram is not shown in this archived version of the slide show out of respect for copyright.

• The graphic of an unrooted tree in slide 9 is adapted from Olsen et al. 1994.

• Les Teach [IU] created all other graphics for this talk

Page 38

References

• Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368-376.

• Olsen, G.J., H. Matsuda, R. Hagstrom, and R. Overbeek. 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Computer Applications in the Biosciences 10:41-48.

Page 39

References, cont’d

• http://www.indiana.edu/uits/~rac
• Foster, I., and C. Kesselman. 1998. The Grid: blueprint for a new computing infrastructure. Morgan Kaufmann Publishers, San Francisco.
• Baxevanis, A.D., and B.F.F. Ouellette. 1998. Bioinformatics: a practical guide to the analysis of genes and proteins. Wiley-Interscience, NY.

Page 40

Thank you

• Any questions?

Page 41

• Except where otherwise noted, the contents of this presentation are © the Trustees of Indiana University. This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.