Page 1
Alex Lewin (Imperial College)
Sylvia Richardson (IC Epidemiology)
Tim Aitman (IC Microarray Centre)
In collaboration with Anne-Mette Hein, Natalia Bochkina (IC Epidemiology)
Helen Causton (IC Microarray Centre)
Peter Green (Bristol)
Bayesian Modelling for Differential Gene Expression
Page 2
Insulin-resistance gene Cd36
cDNA microarray: hybridisation signal for SHR much lower than for Brown Norway and SHR.4 control strains
Aitman et al 1999, Nature Genet 21:76-83
Page 3
Larger microarray experiment: look for other genes associated with Cd36
Microarray Data
3 SHR compared with 3 transgenic rats (with Cd36)
3 wildtype (normal) mice compared with 3 mice with Cd36 knocked out
12000 genes on each array
Biological Question
Find genes which are expressed differently between animals with and without Cd36.
Page 4
• Bayesian Hierarchical Model for Differential Expression
• Decision Rules
• Predictive Model Checks
• Simultaneous estimation of normalization and differential expression
• Gene Ontology analysis for differentially expressed genes
Page 5
Low-level Model (how gene expression is estimated from signal)
Normalisation (to make arrays comparable)
Differential Expression
Clustering, Partition Model
We aim to integrate all the steps in a common statistical framework
Microarray analysis is a multi-step process
Page 6
Bayesian Modelling Framework
• Model different sources of variability simultaneously: within array, between array, …
• Uncertainty propagated from data to parameter estimates (so not over-optimistic in conclusions).
• Share information in appropriate ways to get robust estimates.
Page 7
Gene Expression Data
3 wildtype mice, Fat tissue hybridised to Affymetrix chips
Newton et al. 2001: showed data fit well by Gamma or log-Normal distributions
Kerr et al. 2000: linear model on log scale
[Figure: sd plotted against mean]
Page 8
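Data: $y_{gsr}$ = log expression for gene $g$, condition $s$, replicate $r$
$\alpha_g$ = gene effect
$\delta_g$ = differential effect for gene $g$ between 2 conditions
$\beta_{r(g)s}$ = array effect (expression-level dependent)
$\sigma_{gs}^2$ = gene variance
• 1st level
$y_{g1r} \mid \alpha_g, \delta_g, \sigma_{g1} \sim N(\alpha_g - \tfrac{1}{2}\delta_g + \beta_{r(g)1},\ \sigma_{g1}^2)$,
$y_{g2r} \mid \alpha_g, \delta_g, \sigma_{g2} \sim N(\alpha_g + \tfrac{1}{2}\delta_g + \beta_{r(g)2},\ \sigma_{g2}^2)$,
$\sum_r \beta_{r(g)s} = 0$
$\beta_{r(g)s}$ = function of $\alpha_g$, parameters $\{a\}$ and $\{b\}$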
Bayesian hierarchical model for differential expression
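To make the first level of the model concrete, the following is a minimal Python sketch that draws replicate log expressions for a single gene under the two conditions. The function name `simulate_gene` and the example values are illustrative, and the array effect is set to zero purely to keep the sketch short; neither choice comes from the slides.

```python
import random


def simulate_gene(alpha, delta, sigma1, sigma2, n_reps=3, rng=None):
    """Draw replicate log expressions for one gene under both conditions.

    Condition 1 has mean alpha - delta/2, condition 2 has mean alpha + delta/2.
    Assumption: the array effect is zero (the full model adds it to the mean).
    """
    rng = rng or random.Random(0)
    y1 = [rng.gauss(alpha - 0.5 * delta, sigma1) for _ in range(n_reps)]
    y2 = [rng.gauss(alpha + 0.5 * delta, sigma2) for _ in range(n_reps)]
    return y1, y2


# Hypothetical gene: overall level 7, log fold change 1, sd 0.2 in each condition.
y1, y2 = simulate_gene(alpha=7.0, delta=1.0, sigma1=0.2, sigma2=0.2)
```

With only 3 replicates per condition, repeated calls give visibly different sample variances, which is the motivation for sharing information between genes.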
Page 9
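Mean effect $\alpha_g$
$\alpha_g \sim \text{Unif}$ (much wider than data range)
Differential effect $\delta_g$
$\delta_g \sim N(0, 10^4)$ – “fixed” effects (no structure in prior)
OR mixture:
$\delta_g \sim \pi_0 \delta_0 + \pi_1 G_-(1.5, \lambda_1) + \pi_2 G_+(1.5, \lambda_2)$
Priors for gene effects
[Figure annotations: point mass at zero = H0; Gamma components = explicit modelling of the alternative]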
Page 10
Fixed Effects
Kerr et al. 2000
Mixture Models
Newton et al. 2004 (non-parametric mixture)
Lönnstedt and Speed 2003, Smyth 2004
(conjugate mixture prior)
Broet et al. 2002 (several levels of DE)
References
Page 11
Two extreme cases:
(1) Constant variance: $\epsilon_{gsr} \sim N(0, \sigma^2)$. Too stringent, poor fit.
(2) Independent variances: $\epsilon_{gsr} \sim N(0, \sigma_g^2)$
! Variance estimates based on few replications are highly variable
Need to share information between genes to better estimate their variance, while allowing some variability: hierarchical model
Prior for gene variances
Page 12
• 2nd level
$\sigma_{gs}^2 \mid \mu_s, \tau_s \sim \text{logNormal}(\mu_s, \tau_s)$
Hyper-parameters $\mu_s$ and $\tau_s$ can be influential.
Empirical Bayes (e.g. Lönnstedt and Speed 2003, Smyth 2004): fixes $\mu_s$, $\tau_s$
Fully Bayesian
• 3rd level
$\mu_s \sim N(c, d)$, $\tau_s \sim \text{Gamma}(e, f)$
Prior for gene variances
Page 13
Variances estimated using information from all G x R measurements (~12000 x 3) rather than just 3
Variances stabilised and shrunk towards average variance
Gene specific variances are stabilised
Page 14
Spline curve: $\beta_{r(g)s}$ = quadratic in $\alpha_g$ for $a_{rs(k-1)} \le \alpha_g \le a_{rs(k)}$,
with coefficients $(b_{rsk}^{(1)}, b_{rsk}^{(2)})$, $k = 1, \dots,$ #breakpoints
Prior for array effects (Normalization)
Locations of break points not fixed
Must do sensitivity checks on # break points
[Figure: spline curve with break points a0, a1, a2, a3]
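As an illustration of a piecewise-quadratic array effect, here is a minimal Python sketch that evaluates a continuous curve built from one pair of coefficients per segment between break points. The parametrisation (linear and quadratic terms measured from the left break point, with continuity enforced by accumulating the rise over earlier segments) is an assumption made for this sketch; the slides do not state the exact form used in the model.

```python
from bisect import bisect_right


def piecewise_quadratic(x, knots, coeffs, start=0.0):
    """Evaluate a continuous piecewise-quadratic curve at x.

    knots  : increasing break points a_0 < a_1 < ... < a_K
    coeffs : one (b1, b2) pair per segment [a_{k-1}, a_k]
    start  : value of the curve at a_0
    Assumed form on segment k: b1 * h + b2 * h**2, with h measured from a_{k-1}.
    """
    if not knots[0] <= x <= knots[-1]:
        raise ValueError("x outside the range of the break points")
    # Segment containing x; the last break point belongs to the last segment.
    k = min(bisect_right(knots, x) - 1, len(coeffs) - 1)
    value = start
    # Add the rise over every complete segment to the left of x.
    for j in range(k):
        h = knots[j + 1] - knots[j]
        value += coeffs[j][0] * h + coeffs[j][1] * h * h
    h = x - knots[k]
    return value + coeffs[k][0] * h + coeffs[k][1] * h * h


# Hypothetical curve: linear on [0, 1], then purely quadratic on [1, 2].
effect = piecewise_quadratic(1.5, knots=[0.0, 1.0, 2.0], coeffs=[(1.0, 0.0), (0.0, 1.0)])
```

Changing the number of entries in `knots` is the kind of sensitivity check the slide asks for.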
Page 15
Array effect as a function of gene effect
[Figure: array effect curves from loess and from the Bayesian posterior mean]
Page 16
Before ($y_{gsr}$)
After ($y_{gsr} - \hat{\beta}_{r(g)s}$)
Panels: Wildtype, Knockout
Effect of normalisation on density
Page 17
• 1st level
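– $y_{gsr} \mid \alpha_g, \delta_g, \sigma_{gs} \sim N(\alpha_g \mp \tfrac{1}{2}\delta_g + \beta_{r(g)s},\ \sigma_{gs}^2)$ ($-$ for $s = 1$, $+$ for $s = 2$)
• 2nd level
– Fixed effect priors for $\alpha_g$, $\delta_g$
– Array effect coefficients, Normal and Uniform
– $\sigma_{gs}^2 \mid \mu_s, \tau_s \sim \text{logNormal}(\mu_s, \tau_s)$
• 3rd level
– $\mu_s \sim N(c, d)$
– $\tau_s \sim \text{Gamma}(e, f)$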
Bayesian hierarchical model for differential expression
Page 18
Declare the model
WinBUGS software for fitting Bayesian models
    for( i in 1 : ngenes ) {
       for( j in 1 : nreps) {
          y1[i, j] ~ dnorm(x1[i, j], tau1[i])
          x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j]
       }
    }
    for( i in 1 : ngenes ) {
       tau1[i] <- 1.0/sig21[i]
       sig21[i] <- exp(lsig21[i])
       lsig21[i] ~ dnorm(mm1,tt1)
    }
    mm1 ~ dnorm( 0.0,1.0E-3)
    tt1 ~ dgamma(0.01,0.01)
WinBUGS does the calculations
Page 19
Whole posterior distribution
Posterior means, medians, quantiles
WinBUGS software for fitting Bayesian models
Page 20
• Bayesian Hierarchical Model for Differential Expression
• Decision Rules
• Predictive Model Checks
• Simultaneous estimation of normalization and differential expression
• Gene Ontology analysis for differentially expressed genes
Page 21
So far, discussed fitting the model.
How do we decide which genes are differentially
expressed?
Parameters of interest: $\alpha_g$, $\delta_g$, $\sigma_g$
– What quantity do we consider, $\delta_g$, $(\delta_g / \sigma_g)$, … ?
– How do we summarize the posterior distribution?
Decision Rules for Inference
Page 22
Inference on δ
(1) dg = E(δg | data) posterior mean
Like point estimate of log fold change.
Decision Rule: gene g is DE if |dg| > δcut
(2) pg = P( |δg| > δcut | data)
posterior probability (incorporates uncertainty)
Decision Rule: gene g is DE if pg > pcut
This allows biologist to specify what size of
effect is interesting (not just statistical significance)
Fixed Effects Model
[Slide annotations: δcut = biological interest (both rules); pcut = statistical confidence]
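Both decision rules on this slide can be computed directly from saved MCMC draws of the differential effect for a gene. The following Python sketch assumes the draws are available as a plain list; the function names and the example numbers are illustrative only.

```python
import math


def posterior_mean(samples):
    """Monte Carlo estimate of E(delta_g | data)."""
    return sum(samples) / len(samples)


def prob_exceeds(samples, delta_cut):
    """Monte Carlo estimate of P(|delta_g| > delta_cut | data)."""
    return sum(abs(d) > delta_cut for d in samples) / len(samples)


def is_de_by_mean(samples, delta_cut):
    """Rule (1): threshold the posterior mean of the log fold change."""
    return abs(posterior_mean(samples)) > delta_cut


def is_de_by_probability(samples, delta_cut, p_cut):
    """Rule (2): threshold the posterior probability of a large effect."""
    return prob_exceeds(samples, delta_cut) > p_cut


# Hypothetical posterior draws for one gene, with a two-fold change as the cut.
draws = [0.9, 1.2, 0.4, 1.5, 1.1]
delta_cut = math.log(2)
p_g = prob_exceeds(draws, delta_cut)
```

Rule (1) ignores how spread out the draws are, whereas rule (2) uses the whole posterior distribution, which is the distinction the slide draws between a point estimate and a probability that incorporates uncertainty.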
Page 23
Inference on δ, σ
(1) $t_g = E(\delta_g \mid \text{data}) / E(\sigma_g \mid \text{data})$
Like t-statistic.
Decision Rule: gene g is DE if $|t_g| > t_{cut}$
(2) $p_g = P(|\delta_g / \sigma_g| > t_{cut} \mid \text{data})$
Decision Rule: gene g is DE if pg > pcut
Bochkina and Richardson (in preparation)
Fixed Effects Model
[Slide annotations: tcut = statistical confidence; pcut = statistical confidence]
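The standardised rules differ from the rules on the effect alone in that rule (2) needs the effect and the standard deviation from the same MCMC iteration, so that the ratio is formed draw by draw. A minimal Python sketch, assuming paired lists of draws (names and numbers are illustrative):

```python
def t_like_statistic(delta_samples, sigma_samples):
    """Rule (1): ratio of posterior means, analogous to a t-statistic."""
    mean_delta = sum(delta_samples) / len(delta_samples)
    mean_sigma = sum(sigma_samples) / len(sigma_samples)
    return mean_delta / mean_sigma


def prob_standardised_exceeds(delta_samples, sigma_samples, t_cut):
    """Rule (2): posterior probability that |delta / sigma| exceeds t_cut.

    The ratio is taken within each iteration, so the two lists must be
    aligned draws from the same MCMC run.
    """
    ratios = [abs(d / s) for d, s in zip(delta_samples, sigma_samples)]
    return sum(r > t_cut for r in ratios) / len(ratios)


# Hypothetical paired draws for one gene.
delta_draws = [1.0, 2.0, 3.0]
sigma_draws = [0.5, 1.0, 0.5]
t_g = t_like_statistic(delta_draws, sigma_draws)
p_g_std = prob_standardised_exceeds(delta_draws, sigma_draws, t_cut=2.5)
```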
Page 24
$\delta_g \sim \pi_0 \delta_0 + \pi_1 G_-(1.5, \lambda_1) + \pi_2 G_+(1.5, \lambda_2)$
Mixture Model
(1) dg = E(δg | data) posterior mean
Shrunk estimate of log fold change.
Decision Rule: gene g is DE if |dg| > δcut
(2) Classify genes into the mixture components.
pg = P(gene g not in H0 | data)
Decision Rule: gene g is DE if pg > pcut
[Figure annotations: point mass at zero = H0; Gamma components = explicit modelling of the alternative]
Page 25
Illustration of decision rule
$p_g = P(|\delta_g| > \log(2) \text{ and } \alpha_g > 4 \mid \text{data})$
[Plot legend: × marks genes with $p_g > 0.8$; Δ marks genes with t-statistic > 2.78 (95% CI)]
Page 26
• Bayesian Hierarchical Model for Differential Expression
• Decision Rules
• Predictive Model Checks
• Simultaneous estimation of normalization and differential expression
• Gene Ontology analysis for differentially expressed genes
Page 27
Bayesian P-values
• Compare observed data to a “null” distribution
• P-value: probability of an observation from the null distribution being more extreme than the actual observation
• If all observations come from the null distribution, the distribution of p-values is Uniform
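The uniformity claim can be checked by simulation. The sketch below draws observations from a standard Normal null, computes upper-tail p-values, and counts how many fall in each tenth of the unit interval. The choice of a standard Normal null, the sample size, and the seed are assumptions made only for illustration.

```python
import math
import random


def upper_tail_p(x):
    """P(Z > x) for a standard Normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))


rng = random.Random(1)
p_values = [upper_tail_p(rng.gauss(0, 1)) for _ in range(10000)]

# Under the null, roughly 10% of the p-values should land in each bin.
counts = [0] * 10
for p in p_values:
    counts[min(int(p * 10), 9)] += 1
```

A histogram of p-values that piles up near 0 or 1 instead would indicate that the observations are not well described by the null distribution.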
Page 28
Cross-validation p-values
Distribution of p-values {pi, i=1,…,n} is approximately Uniform if model adequately describes the data.
Idea of cross validation is to split the data: one part for fitting the model, the rest for validation
n units of observation
For each observation $y_i$, run model on rest of data $y_{-i}$, predict new data $y_i^{new}$ from posterior distribution.
Bayesian p-value $p_i = \text{Prob}(y_i^{new} > y_i \mid \text{data } y_{-i})$
Page 29
Posterior Predictive p-values
For large n, not possible to run model n times.
Run model on all data. For each observation $y_i$, predict new data $y_i^{new}$ from posterior distribution.
Bayesian p-value $p_i = \text{Prob}(y_i^{new} > y_i \mid \text{all data})$
“all data” includes $y_i$, so p-values are less extreme than they should be: p-values are conservative (not quite Uniform).
Page 30
Example: Check priors on gene variances
1) Compare equal and exchangeable variance models
2) Compare different exchangeable priors
Want to compare data for each gene, not gene and replicate, so use sample variance $S_g^2$ (suppress index s here)
Bayesian p-value $\text{Prob}(S_g^{2\,new} > S_g^{2\,obs} \mid \text{data})$
Page 31
WinBUGS code for posterior predictive checks
    for( i in 1 : ngenes ) {
       for( j in 1 : nreps) {
          y1[i, j] ~ dnorm(x1[i, j], tau1[i])
          # replicate relevant sampling distribution
          ynew1[i, j] ~ dnorm(x1[i, j], tau1[i])
          x1[i, j] <- alpha[i] - 0.5*delta[i] + beta1[i, j]
       }
       # calculate sample variances
       s21[i] <- pow(sd(y1[i, ]), 2)
       s2new1[i] <- pow(sd(ynew1[i, ]), 2)
       # count no. times predicted sample variance is bigger than observed sample variance
       pval1[i] <- step(s2new1[i] - s21[i])
    }
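In the WinBUGS code, the p-value node is a 0/1 indicator at each iteration, so the posterior predictive p-value for a gene is the average of that indicator over the MCMC run. The same calculation can be sketched outside WinBUGS in Python, assuming posterior draws of the mean and standard deviation for one gene and condition are available as lists (function and argument names are illustrative):

```python
import random
import statistics


def posterior_predictive_pvalue(y_obs, mean_draws, sd_draws, rng):
    """Fraction of posterior draws whose replicated sample variance is at
    least as large as the observed sample variance (mirrors step(), which
    returns 1 when its argument is >= 0)."""
    s2_obs = statistics.variance(y_obs)
    exceed = 0
    for m, s in zip(mean_draws, sd_draws):
        # Replicate the sampling distribution: same number of replicates as observed.
        y_new = [rng.gauss(m, s) for _ in y_obs]
        if statistics.variance(y_new) >= s2_obs:
            exceed += 1
    return exceed / len(mean_draws)


# Hypothetical observed replicates and posterior draws for one gene.
rng = random.Random(0)
p_val = posterior_predictive_pvalue(
    y_obs=[6.8, 7.1, 7.3],
    mean_draws=[7.0, 7.1, 6.9, 7.05],
    sd_draws=[0.2, 0.3, 0.25, 0.2],
    rng=rng,
)
```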
Page 32
Posterior predictive
[Model graph. Nodes: prior parameters, mean parameters, $\sigma_g^2$, $y_{gr}$, $y_{gr}^{new}$, $S_g^2$, $S_g^{2\,new}$; plates r = 1:R, g = 1:G]
Graph shows structure of model
Page 33
Mixed predictive
[Model graph. Nodes: prior parameters, mean parameters, $\sigma_g^2$, $\sigma_g^{2\,new}$, $y_{gr}$, $y_{gr}^{new}$, $S_g^2$, $S_g^{2\,new}$; plates r = 1:R, g = 1:G]
Less conservative than posterior predictive (Marshall and Spiegelhalter, 2003)
Page 34
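Equal variance model:
Model 1: $\sigma^2 \sim \text{logNormal}(0, 10000)$
Exchangeable variance models:
Model 2: $\sigma_g^{-2} \sim \text{Gamma}(2, \beta)$
Model 3: $\sigma_g^{-2} \sim \text{Gamma}(\alpha, \beta)$
Model 4: $\sigma_g^2 \sim \text{logNormal}(\mu, \tau)$
(α, β, μ, τ all parameters)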
Four models for gene variances
Page 35
Bayesian predictive p-values
Page 36
• Bayesian Hierarchical Model for Differential Expression
• Decision Rules
• Predictive Model Checks
• Simultaneous estimation of normalization and differential expression
• Gene Ontology analysis for differentially expressed genes
Page 37
Expression level dependent normalization
Many gene expression data sets need normalization which depends on expression level.
Usually normalization is performed in a pre-processing step before the model for differential expression is used.
These analyses ignore the fact that the expression level is measured with variability.
Ignoring this variability leads to bias in the function used for normalization.
Page 38
Simulated Data
Gene variances similar range and distribution to mouse data
Array effects cubic functions of expression level
Differential effects
900 genes: $\delta_g = 0$
50 genes: $\delta_g \sim N(\log(3), 0.1^2)$
50 genes: $\delta_g \sim N(-\log(3), 0.1^2)$
Page 39
Array Effects and Variability for Simulated Data
Data points: $y_{gsr} - \bar{y}_g$ (r = 1…3)
Curves: $\beta_{r(g)s}$ (r = 1…3)
Page 40
Two-step method (using loess)
1) Use loess smoothing to obtain array effects $\beta^{loess}_{r(g)s}$
2) Subtract loess array effects from data: $y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$
3) Run our model on $y^{loess}_{gsr}$ with no array effects
Page 41
Decision rules for selecting differentially expressed genes
If P( |δg| > δcut | data) > pcut then gene g is called differentially expressed.
δcut chosen according to biological hypothesis of interest (here we use log(3) ).
pcut corresponds to the error rate (e.g. False Discovery Rate or Mis-classification Penalty) considered acceptable.
Page 42
Full model v. two-step method
Plot observed False Discovery Rate against pcut (averaged over 5 simulations)
Solid line for full model
Dashed line for pre-normalized method
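For simulated data the true status of each gene is known, so the observed False Discovery Rate at a given pcut is simply the fraction of called genes that are truly null. A minimal Python sketch of that calculation for one simulation, with illustrative names and toy values:

```python
def observed_fdr(p, truly_de, p_cut):
    """Observed FDR among genes called DE, i.e. those with p[g] > p_cut.

    p        : posterior probabilities P(|delta_g| > delta_cut | data)
    truly_de : known simulation truth, True where the gene is really DE
    Returns 0.0 when no gene is called (a convention chosen for this sketch).
    """
    called = [g for g in range(len(p)) if p[g] > p_cut]
    if not called:
        return 0.0
    false_calls = sum(1 for g in called if not truly_de[g])
    return false_calls / len(called)


# Toy example with four genes, evaluated over a grid of cut-offs.
p = [0.95, 0.90, 0.60, 0.30]
truth = [True, False, True, False]
curve = [(cut, observed_fdr(p, truth, cut)) for cut in (0.2, 0.5, 0.92)]
```

Averaging such curves over several simulated data sets gives the kind of comparison plotted on this slide.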
Page 43
1) $y^{loess}_{gsr} = y_{gsr} - \beta^{loess}_{r(g)s}$
2) $y^{model}_{gsr} = y_{gsr} - E(\beta_{r(g)s} \mid \text{data})$
Results from 2 different two-step methods are much closer to each other than to full model results.
Different two-step methods
Page 44
• Bayesian Hierarchical Model for Differential Expression
• Decision Rules
• Predictive Model Checks
• Simultaneous estimation of normalization and differential expression
• Gene Ontology analysis for differentially expressed genes
Page 45
Gene Ontology (GO)
Database of biological terms
Arranged in graph connecting related terms
Directed Acyclic Graph: links indicate more specific terms
~16,000 terms
from QuickGO website (EBI)
Page 46
Gene Ontology (GO)
from QuickGO website (EBI)
Page 47
Gene Annotations
• Genes/proteins annotated to relevant GO terms
• Gene may be annotated to several GO terms
• GO term may have 1000s of genes annotated to it (or none)
• Gene annotated to term A annotated to all ancestors of A (terms that are related and more general)
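The last point, that an annotation to a term implies annotation to all of its ancestors, amounts to a graph traversal over the directed acyclic graph. The following Python sketch uses a small made-up graph with terms labelled A to D; the dictionary layout and function names are assumptions for illustration and do not correspond to any particular GO tool.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Extend each gene's direct annotations with every ancestor term."""
    full = {}
    for gene, terms in annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Hypothetical DAG: D is a child of both B and C, which are children of A.
parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"]}
annotations = {"gene1": {"D"}, "gene2": {"B"}, "gene3": set()}
full_annotations = propagate(annotations, parents)
```

Because a term can have more than one parent, the `seen` set prevents shared ancestors from being visited repeatedly.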
Page 48
GO annotations of genes associated with the insulin-resistance gene Cd36
Compare GO annotations of genes most and least differentially expressed
Most differentially expressed ↔ pg > 0.5 (280 genes)
Least differentially expressed ↔ pg < 0.2 (11171 genes)
Page 49
GO annotations of genes associated with the insulin-resistance gene Cd36
For each GO term, Fisher’s exact test on proportion of differentially expressed genes with annotations v. proportion of non-differentially expressed genes with annotations

                          genes annot. to GO term    genes not annot. to GO term
genes most diff. exp.               A                            B
genes least diff. exp.              C                            D

observed O = A
expected E = C*(A+B)/(C+D) if no association of GO annotation with DE
FatiGO website: http://fatigo.bioinfo.cnio.es/
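The observed and expected counts, and an exact test for the 2 x 2 table, can be sketched in Python using only the standard library. The expected count follows the formula on this slide. The one-sided (over-representation) form of the test is a choice made for this sketch, not something the slide specifies, and the counts in the example are made up.

```python
from math import comb


def expected_count(a, b, c, d):
    """E = C*(A+B)/(C+D): annotated genes expected among the most DE genes
    if they were annotated at the same rate as the least DE genes."""
    return c * (a + b) / (c + d)


def fisher_exact_one_sided(a, b, c, d):
    """P(X >= a) for the top-left cell under the hypergeometric null,
    with row and column totals held fixed."""
    row1 = a + b          # number of most DE genes
    col1 = a + c          # number of genes annotated to the term
    n = a + b + c + d
    total = comb(n, row1)
    upper = min(row1, col1)
    tail = sum(comb(col1, x) * comb(n - col1, row1 - x) for x in range(a, upper + 1))
    return tail / total


# Hypothetical table: 4 of 10 most DE genes annotated, 10 of 100 least DE genes annotated.
O = 4
E = expected_count(4, 6, 10, 90)
p_over = fisher_exact_one_sided(4, 6, 10, 90)
```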
Page 50
GO annotations of genes associated with the insulin-resistance gene Cd36
O = observed no. differentially expressed genes
E = expected no. differentially expressed genes
Page 51
Response to external stimulus (O=12, E=4.7)
Response to biotic stimulus (O=14, E=6.9)
Response to stimulus
Physiological process
Organismal movement
Biological process
Response to external biotic stimulus *
Inflammatory response (O=4, E=1.2)
Immune response (O=9, E=4.5)
Response to wounding (O=6, E=1.8)
Response to stress (O=12, E=5.9)
Defense response (O=11, E=5.8)
Response to pest, pathogen or parasite (O=8, E=2.6)
All GO ancestors of Inflammatory response
* This term was not accessed by FatiGO
Relations between GO terms were found using QuickGO: http://www.ebi.ac.uk/ego/
Page 52
Further Work to do on GO
• Account for dependencies between GO terms
• Multiple testing corrections
• Uncertainty in annotation
( work in preparation )
Page 53
Summary
• Bayesian hierarchical model flexible, estimates variances robustly
• Predictive model checks show exchangeable prior good for gene variances
• Useful to find GO terms over-represented in the most differentially-expressed genes
Paper available (Lewin et al. 2005, Biometrics, in press)
http://www.bgx.org.uk/
Page 55
• In full Bayesian framework, introduce latent allocation variable zg = 0,1 for gene g in null, alternative
• For each gene, calculate posterior probability of belonging to unmodified component: pg = Pr( zg = 0 | data )
• Classify using cut-off on pg (Bayes rule corresponds to 0.5)
• For any given pg , can estimate FDR, FNR.
Decision Rules
For gene-list S, est. $(FDR \mid \text{data}) = \sum_{g \in S} p_g / |S|$
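The FDR estimate for a gene list is the average posterior probability of the null over the genes in the list. A minimal Python sketch, assuming the per-gene posterior probabilities of the null component are stored in a list indexed by gene (names and values are illustrative):

```python
def select_genes(p_null, cut=0.5):
    """Genes classified as differentially expressed: posterior probability
    of the null component below the cut-off (0.5 is the Bayes rule)."""
    return [g for g, p in enumerate(p_null) if p < cut]


def estimated_fdr(p_null, selected):
    """Average posterior null probability over the selected gene list S.

    Returns 0.0 for an empty list (a convention chosen for this sketch).
    """
    if not selected:
        return 0.0
    return sum(p_null[g] for g in selected) / len(selected)


# Hypothetical posterior null probabilities for four genes.
p_null = [0.10, 0.30, 0.80, 0.05]
gene_list = select_genes(p_null)
fdr_hat = estimated_fdr(p_null, gene_list)
```

Lowering the cut-off shortens the list and reduces the estimated FDR, at the cost of more false negatives.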
Page 56
The Null Hypothesis
Composite Null
Point Null, alternative not modelled
Point Null, alternative modelled