Queensland Parallel Supercomputing Foundation
1. Professor Mark Ragan (Institute for Molecular Bioscience)
2. Dr Thomas Huber (Department of Mathematics)
Computational Biology and Bioinformatics Environment (ComBinE)
National Facility Projects
Comparison of protein families among completely sequenced
microbial genomes
The scientific problem:
Handcrafted analyses suggest that gene transfer in nature may occur not only from parents to offspring (“vertical”), but also from one lineage to another (“lateral” or “horizontal”)
From microbial genomics we have complete inventories of genes and proteins in ~80 genomes
Comparative analysis should identify all cases of vertical and lateral gene transfer
The approach
Find all interestingly large protein families in all microbial genomes
Generate structure-sensitive multiple alignments
Infer phylogenetic trees with appropriate statistics
Compare trees, look for topological incongruence
Computational requirement for 80 genomes:
10^12 BLAST comparisons
5000 T-Coffee alignments
5000 Bayesian inference trees
10^7 topological comparisons
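As an illustration of the tree-comparison step, here is a minimal sketch of one common measure of topological incongruence, the Robinson-Foulds distance between two unrooted trees on the same taxa. The slides do not say which comparison method was used, so the choice of measure, the nested-tuple tree encoding, and the function names are all assumptions made for this example. The last line shows that comparing every pair among about 5000 trees gives on the order of ten million comparisons, which is consistent with the scale quoted for topological comparisons, although whether the authors counted it this way is also an assumption.

```python
import math


def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions (splits) of an unrooted tree.

    Each split is stored as the side that does NOT contain a fixed anchor
    leaf, so the same split is always represented by the same frozenset,
    regardless of where the nested tuples happen to be rooted.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits (a single leaf against all the others).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must be on the same set of taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical four-taxon example: the two trees group the taxa differently.
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(rf_distance(tree_1, tree_2))

# Number of pairwise comparisons among 5000 trees.
print(math.comb(5000, 2))
```

A distance of zero means the two trees have the same unrooted topology; a non-zero distance flags a protein family whose tree disagrees with the reference and is therefore a candidate for closer inspection.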
Computations on APAC National Facility
Motif-based multiple alignment
30-50 sequences = 2-5 hours per run
Will need ~5000 runs @ 4-60 seqs
Bayesian inference
Parameterisation of (MC)^3 search
NF used for trials of up to 10^6 Markov chain generations (~200 hours / run)
1.5-2.0 GB RAM per run
Usage of NF:
Code not yet parallelised
With each run costing a few tens of hours and thousands of analyses needed, it is more efficient to use many processors simultaneously
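The case for running many independent serial jobs at once can be made with simple arithmetic. The sketch below takes the figures quoted for the alignment stage (about 5000 runs at 2 to 5 hours each) and estimates wall-clock time for different processor counts. The processor counts are hypothetical, and the model assumes every run takes the same time and that one run occupies one processor, which is a simplification.

```python
import math


def wall_clock_hours(n_runs, hours_per_run, n_processors):
    """Wall-clock time when serial runs are farmed out one per processor.

    Assumes all runs take the same time, so the work proceeds in
    ceil(n_runs / n_processors) batches of hours_per_run each.
    """
    batches = math.ceil(n_runs / n_processors)
    return batches * hours_per_run


n_runs = 5000                      # alignment runs quoted on the slide
for hours_per_run in (2, 5):       # quoted range per run
    for n_processors in (1, 64, 512):  # hypothetical processor counts
        total = wall_clock_hours(n_runs, hours_per_run, n_processors)
        print(f"{hours_per_run} h/run on {n_processors:>3} processors: {total} h")
```

Under these assumptions the serial total is 10,000 to 25,000 hours, so even unparallelised code benefits directly from occupying many processors at the same time.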
Parameterisation of Metropolis-coupled Markov chain Monte Carlo optimisation
through protein tree space
[Figure: two plots of ln-likelihood (y-axis) against number of Markov chain generations (x-axis, 0 to 500,000 and 0 to 600,000). Panel titles: "Ln-likelihood as function of number of generations" and "Log-likelihood as a function of number of Markov chain generations".]
Approach to stationarity under the Jones et al. (1992) and general time-reversible models of protein sequence change.
Bayesian inference (MrBayes 2.0) applied to a 34-sequence Elongation Factor 1 dataset. Eight simultaneous Markov chains, discrete approximation of gamma distribution (α = 0.29), chain temperature 0.1000.
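To make the idea of Metropolis-coupled MCMC concrete, here is a toy sketch in which eight chains explore a one-dimensional bimodal density instead of protein tree space. It does not call MrBayes and is not the authors' configuration. The target function, the proposal step size, and the incremental heating rule heat_i = 1 / (1 + T * i) are assumptions chosen for illustration; only the chain count (8) and temperature (0.1) are taken from the caption above. The function returns the log-density trace of the cold chain, the quantity one would plot against generation number to judge the approach to stationarity.

```python
import math
import random


def log_target(x):
    """Toy bimodal log-density (two peaks at -3 and +3), a stand-in
    for the log-likelihood surface over trees."""
    a = -0.5 * ((x + 3.0) / 0.5) ** 2
    b = -0.5 * ((x - 3.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def heats(n_chains, temperature):
    """Assumed incremental heating: chain 0 is cold (heat 1.0)."""
    return [1.0 / (1.0 + temperature * i) for i in range(n_chains)]


def accept(rng, log_ratio):
    """Metropolis rule: accept with probability min(1, exp(log_ratio))."""
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)


def mc3(n_generations, n_chains=8, temperature=0.1, step=1.0, seed=0):
    """Run a toy (MC)^3 sampler and return the cold chain's log-density trace."""
    rng = random.Random(seed)
    beta = heats(n_chains, temperature)
    state = [-3.0] * n_chains            # all chains start in the left peak
    logp = [log_target(x) for x in state]
    cold_trace = []
    for _ in range(n_generations):
        # Ordinary Metropolis update within each chain; chain i targets
        # the density raised to the power beta[i] (a flattened surface).
        for i in range(n_chains):
            proposal = state[i] + rng.gauss(0.0, step)
            lp = log_target(proposal)
            if accept(rng, beta[i] * (lp - logp[i])):
                state[i], logp[i] = proposal, lp
        # Propose swapping the states of two randomly chosen chains.
        j, k = rng.sample(range(n_chains), 2)
        if accept(rng, (beta[j] - beta[k]) * (logp[k] - logp[j])):
            state[j], state[k] = state[k], state[j]
            logp[j], logp[k] = logp[k], logp[j]
        cold_trace.append(logp[0])
    return cold_trace


trace = mc3(n_generations=2000)
print(len(trace), max(trace))
```

The heated chains see a flatter surface and cross between peaks more easily; swaps let the cold chain inherit those moves. Each generation here costs one density evaluation per chain, which mirrors why runs with many chains and many generations on real alignments become expensive.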
With thanks to collaborators
Mark Borodovsky, Georgia Tech
Robert Charlebois, NGI Inc. (Ottawa)
Tim Harlow, University of Queensland
Jeffrey Lawrence, University of Pittsburgh
Thomas Rand, St Mary’s University