Modeling Bayesian Phylogenetic Inference in Protein Data Analysis by Using Mr. Bayes, Proml, Consensus Applications
Jul 06, 2015
Mr. Bayes vs. Proml (maximum likelihood)
[Chart: Series1 and Series2 plotted for datasets 1 to 21; value axis 0 to 12000.]
CPU time/Mr. Bayes/Proml
[Chart: Series1, Series2 and Series3 plotted per dataset; value axis 0 to 4500.]
Diff. of Maximum Likelihood (Mr. Bayes - Proml) vs. CPU (sec)
[Chart "maximum likelihood": x-axis diff (postml - proml), 0 to 800; y-axis CPU time (sec), 0 to 4500; one series.]
Diff. of Maximum Likelihood (Mr. Bayes - Proml) vs. CPU (sec)
[Chart "maximum likelihood": x-axis diff (postml - preml), datasets 1 to 19; y-axis CPU time (sec), 0 to 4500; Series1 and Series2.]
Linear Regression in Testing Datasets
[Chart "linear regression": x-axis 0 to 15000; y-axis 0 to 12000; one series.]
Testing Datasets Plus One/Two Long Branch Datasets
[Chart "mrbayes vs proml (plus AB, CD data)": datasets 1 to 34; value axis 0 to 16000; Series1 and Series2.]
Linear Regression After Bayesian Correction for Testing Datasets & One/Two Long Branch Datasets
[Chart: x-axis 0 to 20000; y-axis 0 to 16000; one series.]
Phylogeny for All Testing Datasets
[Chart "phy all": x-axis no., 0 to 300; y-axis length, -0.5 to 3; one series.]
Phylogeny for All Datasets
[Chart "phylogeny for all datasets": x-axis no., 0 to 400; y-axis length, -0.5 to 3; one series.]
One Long Branch Datasets
[Chart "one long branch": x-axis no., 0 to 60; y-axis length, -0.5 to 2.5; one series.]
Two Long Branches Datasets
[Chart "two long branches": x-axis no., 0 to 60; y-axis length, 0 to 1.6; one series.]
Phylogeny (sequence length from Proml)
[Chart "phy07": x-axis species, 0 to 15; y-axis length, -0.05 to 0.3; one series.]
[Chart "AB50J": x-axis no., 0 to 8; y-axis length, -0.5 to 2; one series.]
[Chart "CD20J": x-axis no., 0 to 8; y-axis length, 0 to 1.2; one series.]
[Chart "phy06": x-axis species, 0 to 10; y-axis length, -0.1 to 0.6; one series.]
One/Two Long Branch Datasets (Maximum Likelihood)
CD10J   1200   0.74   179   1132   5.10   468
CD50J   1264   8.05   314   1193   6.72   682
[Chart: value axis 11000 to 14000; one series.]
Data Analysis
• Testing datasets: phy01 ~ phy21, nexus01 ~ nexus21
• Experimental datasets: one long branch (AB10J ~ AB70aj), two long branches (CD10J ~ CD70aj)
• Operating system: Mac OS X ver. 10.3.9
• Dual 800 MHz PowerPC G4
• 256 MB SDRAM
• Mr. Bayes 3.1.1
• Phylip 3.67 (Proml, Consensus)
Data Analysis (continued)
• Testing sample size: 21x2
• Experimental samples: 7x2
• Degrees of freedom: 20
• Chi square: 283.1561 > 31.41 (alpha=0.05)
• Proml and Mr. Bayes are the two dependent variables
• ANOVA: Ssw=2669051, Ssb=24253093
• Sstotal=50943143.71
• Eta square=0.476081596
• Type I error=0.05
• Type II error=1.83%
• Power=98.17%
• Instrument threshold=1E-8
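The summary statistics listed above can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes eta squared is the ratio Ssb/Sstotal and that power is one minus the type II error, and it takes the chi-square critical value 31.41 (20 degrees of freedom, alpha=0.05) as quoted on the slide rather than computing it. The function names are hypothetical.

```python
# Values copied from the slide above.
SS_BETWEEN = 24253093.0
SS_TOTAL = 50943143.71
TYPE_II_ERROR = 0.0183          # 1.83%
CHI_SQUARE = 283.1561
CHI_SQUARE_CRITICAL = 31.41     # df = 20, alpha = 0.05, as quoted on the slide


def eta_squared(ss_between, ss_total):
    """Share of the total sum of squares attributed to the between-group term.

    ASSUMPTION: the slide's eta square is Ssb / Sstotal.
    """
    return ss_between / ss_total


def power(type_ii_error):
    """Statistical power, taken here as 1 minus the type II error rate."""
    return 1.0 - type_ii_error


def exceeds_critical(statistic, critical_value):
    """True when the test statistic is larger than the critical value."""
    return statistic > critical_value


if __name__ == "__main__":
    print("eta squared:", round(eta_squared(SS_BETWEEN, SS_TOTAL), 6))
    print("power:", round(power(TYPE_II_ERROR), 4))
    print("chi-square exceeds critical value:",
          exceeds_critical(CHI_SQUARE, CHI_SQUARE_CRITICAL))
```

Under these assumptions the ratio Ssb/Sstotal agrees with the reported eta square, and the power agrees with the reported 98.17%.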
Testing Datasets
Testing samples:
        CPU        character   diff(Mr.Bayes-Proml)
mean    1576.467   296.3333    226.959
sd      878.5355   131.7778    104.6243
Linear regression between Mr. Bayes and Proml (testing datasets):
slope       1.058352
intercept   14.79771
correl      0.999724
y(Mr.Bayes) = 1.058351726 x(Proml) + 14.79771
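For readers who want to see how a slope, intercept and correlation coefficient of this kind are obtained from paired values, here is a minimal ordinary least-squares sketch. It is not the procedure used for the slides, which is not described. The function names are hypothetical and the sample Proml values are invented, placed exactly on the reported line so that the fit can be checked against it.

```python
from math import sqrt


def linear_fit(xs, ys):
    """Ordinary least squares for paired values.

    Returns (slope, intercept, pearson_r).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / sqrt(sxx * syy)
    return slope, intercept, pearson_r


def predict_mrbayes(proml_value, slope=1.058352, intercept=14.79771):
    """Apply the regression line reported on the slide to one Proml value."""
    return slope * proml_value + intercept


if __name__ == "__main__":
    # HYPOTHETICAL Proml values, not taken from the slides.
    proml = [1000.0, 2500.0, 4000.0, 8000.0]
    mrbayes = [predict_mrbayes(x) for x in proml]
    print(linear_fit(proml, mrbayes))
```

Because the invented points lie exactly on the reported line, the fit returns that slope and intercept with a correlation of 1; real paired measurements would scatter around the line.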
Experimental samples:
        AB (one long branch)   CD (two long branches)
mean    11802.88               13122.86
sd      179.5717               364.0589
Linear regression between experimental samples:
slope       0.492193
intercept   5343.856
correl      0.996717
t-test      3.47E-05
f-test      0.109857
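A least-squares line always passes through the point of means, which gives a quick sanity check on the numbers above. The sketch below assumes that AB (one long branch) is the dependent variable and CD (two long branches) the independent one. The slides do not state this, but it is the orientation under which the reported intercept is approximately reproduced.

```python
# Values copied from the experimental-samples slide above.
MEAN_AB = 11802.88
MEAN_CD = 13122.86
SLOPE = 0.492193
REPORTED_INTERCEPT = 5343.856


def implied_intercept(mean_y, mean_x, slope):
    """Intercept implied by the means, since the fitted line passes
    through (mean_x, mean_y)."""
    return mean_y - slope * mean_x


if __name__ == "__main__":
    # ASSUMPTION: y = AB, x = CD (not stated on the slide).
    value = implied_intercept(MEAN_AB, MEAN_CD, SLOPE)
    print("implied intercept:", round(value, 3))
    print("reported intercept:", REPORTED_INTERCEPT)
```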
Linear Regression for All
After Bayesian modeling:
        AB (one long branch)   CD (two long branches)
mean    12506.4                13903.5
sd      190.05                 385.302
Linear regression for all datasets (including experimental and testing):
slope       1.058352
intercept   14.79764
correl      0.999959
y(Mr. Bayes) = 1.058352 x(Proml) + 14.79764
Tree Hierarchical Structure: AB10J
AB10J.JTT
        +----------seq.7
        |
  +-----5   +---------seq.4
  |     |   |
  |     +---2         +-------seq.6
  |         |    +----4
  |         +----3    +----------seq.5
  |              |
  |              +-----------seq.2
  |
  1------------------------------------------seq.3
  |
  +--------seq.1
AB10J.consensus
                +--------------------seq.4
                |
         +--1.0-|             +------seq.6
         |      |      +--1.0-|
         |      +--1.0-|      +------seq.5
  +------|             |
  |      |             +-------------seq.2
  |      |
  |      |                    +------seq.1
  |      +----------------1.0-|
  |                           +------seq.3
  |
  +----------------------------------seq.7
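One way to check whether the Proml tree and the consensus tree for AB10J describe the same branching pattern is to compare their bipartitions (the two groups of tips separated by each internal branch). The sketch below is illustrative only: the nested tuples are a hand transcription of the two drawings above, so they are an assumption about the topology, and the helper names are hypothetical. Branch lengths and support values are ignored.

```python
def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    tips = frozenset()
    for child in tree:
        tips |= leaves(child)
    return tips


def splits(tree):
    """Return the non-trivial bipartitions of a tree read as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference tip, so the same split always gets the same representation.
    """
    all_tips = leaves(tree)
    reference = min(all_tips)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child in node:
            visit(child)
        side = leaves(node)
        other = all_tips - side
        if len(side) > 1 and len(other) > 1:
            found.add(other if reference in side else side)

    visit(tree)
    return found


# ASSUMPTION: topologies transcribed by hand from the drawings above.
PROML_AB10J = (
    ("seq.7", ("seq.4", (("seq.6", "seq.5"), "seq.2"))),
    "seq.3",
    "seq.1",
)
CONSENSUS_AB10J = (
    (("seq.4", (("seq.6", "seq.5"), "seq.2")), ("seq.1", "seq.3")),
    "seq.7",
)

if __name__ == "__main__":
    same = splits(PROML_AB10J) == splits(CONSENSUS_AB10J)
    print("same unrooted topology:", same)
```

Comparing bipartitions rather than the drawings themselves avoids being misled by where each program happens to place the root.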
Histogram AB10J
[Chart "AB10J": x-axis no., 0 to 8; y-axis length, 0 to 0.8; one series.]
Tree Hierarchical Structure: CD10J
CD10J.JTT
     +----seq.7
     |
  +--5 +----seq.4
  |  | |
  |  +-2    +-------------------seq.6
  |    |  +-4
  |    +--3 +-----seq.5
  |       |
  |       +-----seq.2
  |
  1---seq.3
  |
  +---------------------seq.1
CD10J.consensus
                +--------------------seq.4
                |
         +--1.0-|             +------seq.6
         |      |      +--1.0-|
         |      +--1.0-|      +------seq.5
  +------|             |
  |      |             +-------------seq.2
  |      |
  |      |                    +------seq.1
  |      +----------------1.0-|
  |                           +------seq.3
  |
  +----------------------------------seq.7
Histogram CD10J
[Chart "CD10J": x-axis no., 0 to 8; y-axis length, 0 to 0.8; one series.]
Discussion
• Bayesian modeling can be used to evaluate type I and type II errors, eta squared, power, chi-square (χ²), ANOVA, the correlation coefficient, linear regression, etc.
• A 2x2 table can be designed to evaluate risk measures such as risk difference (RD), relative risk (RR), and relative odds (RO).
• The Proml and Consensus outputs yield a histogram profile together with a hierarchical tree structure, which makes peak area integration possible.
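The 2x2 table mentioned above can be made concrete with a short sketch. The slides give no table, so the counts below are invented. The cell layout (a, b for the exposed row, c, d for the unexposed row) and the reading of "RO" as relative odds, i.e. the odds ratio, are both assumptions.

```python
def risk_measures(a, b, c, d):
    """Risk measures from a 2x2 table.

    a: exposed, outcome present      b: exposed, outcome absent
    c: unexposed, outcome present    d: unexposed, outcome absent
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "RD": risk_exposed - risk_unexposed,   # risk difference
        "RR": risk_exposed / risk_unexposed,   # relative risk
        "RO": (a * d) / (b * c),               # relative odds (odds ratio), ASSUMED meaning
    }


if __name__ == "__main__":
    # HYPOTHETICAL counts, for illustration only.
    for name, value in risk_measures(20, 80, 10, 90).items():
        print(name, round(value, 4))
```

Zero counts in b, c, or in a whole row would cause a division by zero, so a real analysis would need to handle empty cells before calling such a function.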
Questions
• CPU time can be used to count all hydrogen-bond activity through a kinesthetic module in the computer, including the hydrogen-bond configurations of matched base pairs (A-T, A-U, C-G) and/or DNA alignments built from the separate genetic codes A, T, U, C, G.
• CPU time could possibly be used to count all triggering of stem cell activity through functional proteins.
• CPU time has already been used in forensic science to count pattern differences in suspect samples during judicial investigations.