Analysis of gene expression data (Nominal explanatory variables)
Post on 13-Jan-2016
Shyamal D. Peddada
Biostatistics Branch
National Inst. Environmental Health Sciences (NIH)
Research Triangle Park, NC
Outline of the talk
Two types of explanatory variables (“experimental conditions”)
Some scientific questions of interest
A brief discussion on false discovery rate (FDR) analysis
Some existing statistical methods for analyzing microarray data
Types of explanatory variables
Types of explanatory variables (“experimental conditions”)
Nominal variables:
– No intrinsic order among the levels of the explanatory variable(s).
– No loss of information if we permute the labels of the conditions.
E.g. Comparison of gene expression of samples from “normal” tissue with those from “tumor” tissue.
Types of explanatory variables (“experimental conditions”)
Ordinal/interval variables:
– Levels of the explanatory variables are ordered.
– E.g.
Comparison of gene expression of samples from different stages of severity of lesions such as “normal”, “hyperplasia”, “adenoma” and “carcinoma”. (categorically ordered)
Time-course/dose-response experiments. (numerically ordered)
Focus of this talk: Nominal explanatory variables
Types of microarray data
Independent samples
– E.g. comparison of gene expression of independent samples drawn from normal patients versus independent samples from tumor patients.
Dependent samples
– E.g. comparison of gene expression of samples drawn from normal tissues and tumor tissues from the same patient.
Possible questions of interest
Identify significant “up/down” regulated genes for a given “condition” relative to another “condition” (adjusted for other covariates).
Identify genes that discriminate between various “conditions” and predict the “class/condition” of a future observation.
Cluster genes according to patterns of expression over “conditions”.
Other questions?
Challenges
Small sample size but a large number of genes.
Multiple testing – Since each microarray has thousands of genes/probes, several thousand hypotheses are being tested. This impacts the overall Type I error rates.
Complex dependence structure between genes and possibly among samples.
– Difficult to model and/or account for the underlying dependence structures among genes.
Multiple Testing: Type I Errors
- False Discovery Rates …
The Decision Table

                        Number not rejected   Number rejected   Total
Number of true $H_0$             U                   V           $m_0$
Number of true $H_a$             T                   S           $m_1$
Total                            W                   R           m

The only observable values are W and R (m is known).
Strong and weak control of type I error rates
Strong control: control the type I error rate under any combination of true $H_0$ and $H_a$.
Weak control: control the type I error rate only when all null hypotheses are true.
Since we do not know a priori which hypotheses are true, we will focus on strong control of the type I error rate.
Consequences of multiple testing
Suppose we test each hypothesis at 5% level of significance.
– Suppose n = 10 independent tests are performed. Then, if all null hypotheses are true, the probability of declaring at least 1 of the 10 tests significant is $1 - 0.95^{10} = 0.401$.
– If 50,000 independent tests are performed, as in Affymetrix microarray data, then you should expect 2500 false positives when all null hypotheses are true!
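The two numbers on this slide can be checked with a few lines of Python. This is only a sketch; the function names are illustrative and both calculations assume independent tests with every null hypothesis true.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """P(at least one rejection) for n independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false rejections when every null is true."""
    return n_tests * alpha


print(round(prob_at_least_one_false_positive(10), 3))  # 0.401
print(expected_false_positives(50_000))
```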
Types of errors in the context of multiple testing
Per-Family Error “Rate” (PFER): $E(V)$
– Expected number of false rejections of $H_0$.
Per-Comparison Error Rate (PCER): $E(V)/m$
– Expected proportion of false rejections of $H_0$ among all m hypotheses.
Family-Wise Error Rate (FWER): $P(V > 0)$
– Probability of at least one false rejection of $H_0$ among all m hypotheses.
Types of errors in the context of multiple testing
False Discovery Rate (FDR):
– Expected proportion of Type I errors among all rejected hypotheses.
Benjamini-Hochberg (BH): Set V/R = 0 if R = 0.
$FDR = E\left( \frac{V}{R} 1_{\{R>0\}} \right) = E\left( \frac{V}{R} \mid R>0 \right) P(R>0)$
Storey: Only interested in the case R > 0. (Positive FDR)
$pFDR = E\left( \frac{V}{R} \mid R>0 \right)$
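Some useful inequalities
Since $V \le R \le m$, therefore
$\frac{V}{m} \le \frac{V}{R} 1_{\{R>0\}}$   (1)
Again, since $V \le R$, $R = 0$ implies $V = 0$. Therefore
$\frac{V}{R} 1_{\{R>0\}} = \frac{V}{R} 1_{\{V>0\}}$.
Thus, because $V/R \le 1$,
$\frac{V}{R} 1_{\{R>0\}} \le 1_{\{V>0\}}$   (2)
Also, since V is a non-negative integer,
$1_{\{V>0\}} \le V$   (3)
Combining (1), (2) and (3), we have:
$\frac{V}{m} \le \frac{V}{R} 1_{\{R>0\}} \le 1_{\{V>0\}} \le V$   (4)
Taking expectations in (4), we have:
$\frac{E(V)}{m} \le E\left( \frac{V}{R} 1_{\{R>0\}} \right) \le E\left( 1_{\{V>0\}} \right) \le E(V)$   (5)
Thus we have:
$PCER \le FDR \le FWER \le PFER$   (6)
Trivially
$FDR \le pFDR$   (7)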
Conclusion
It is conservative to control FWER rather than FDR!
It is conservative to control pFDR rather than FDR!
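The ordering PCER ≤ FDR ≤ FWER ≤ PFER can also be seen empirically. The sketch below is a small Monte Carlo experiment under assumptions that are not in the talk: m = 20 independent tests, 15 true nulls with uniform p-values, alternatives with p-values drawn as u**4 (an arbitrary choice that favours small p-values), and every test rejected when its p-value is at most 0.05.

```python
import random


def simulate_error_rates(m=20, m0=15, alpha=0.05, n_sim=2000, seed=0):
    """Estimate PCER, FDR, FWER and PFER by simulation (illustrative setup)."""
    rng = random.Random(seed)
    totals = {"PCER": 0.0, "FDR": 0.0, "FWER": 0.0, "PFER": 0.0}
    for _ in range(n_sim):
        # True nulls: p-values are Uniform(0, 1); V counts false rejections.
        v = sum(rng.random() <= alpha for _ in range(m0))
        # True alternatives: p = u**4 (assumed); S counts correct rejections.
        s = sum(rng.random() ** 4 <= alpha for _ in range(m - m0))
        r = v + s
        totals["PCER"] += v / m
        totals["FDR"] += v / r if r > 0 else 0.0   # V/R set to 0 when R = 0
        totals["FWER"] += 1.0 if v > 0 else 0.0
        totals["PFER"] += v
    return {name: total / n_sim for name, total in totals.items()}


rates = simulate_error_rates()
print(rates)
```

Because the pointwise inequality holds in every simulated data set, the four estimated rates come out in the same order for any seed.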
Some useful inequalities
Question: Is $pFDR \le FWER$?
Note: $m = m_0 + m_1$ and $R = V + S$.
Example: Suppose $m_0 = m$. Then $S = 0$ and $V = R$, so
$FDR = E\left( \frac{V}{R} 1_{\{R>0\}} \right) = E\left( 1_{\{V>0\}} \right) = P(V > 0) = FWER$.
But
$pFDR = E\left( \frac{V}{R} \mid R>0 \right) = E(1 \mid R>0) = 1$.
Hence if $m_0 = m$ then $pFDR = 1 \ge FDR = FWER$.
However, in most applications such as microarrays, one expects $m_1 > 0$.
In general, there is no proof of the statement $pFDR \le FWER$.
Some popular Type I error controlling procedures
Let $P_{(1)} \le P_{(2)} \le \ldots \le P_{(m)}$ denote the ordered p-values for the ‘m’ tests that are being performed.
Let $\alpha_{(1)} \le \alpha_{(2)} \le \ldots \le \alpha_{(m)}$ denote the ordered levels of significance used for testing the ‘m’ null hypotheses $H_{0(1)}, H_{0(2)}, \ldots, H_{0(m)}$, respectively.
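Some popular controlling procedures
Step-down procedure:
Step 1: If $P_{(1)} \le \alpha_{(1)}$ then reject $H_{0(1)}$ and go to Step 2. Else stop.
Step 2: If $P_{(2)} \le \alpha_{(2)}$ then reject $H_{0(2)}$ and go to Step 3. Else stop.
Step 3: If $P_{(3)} \le \alpha_{(3)}$ then reject $H_{0(3)}$ and go to Step 4. Else stop.
and so on.
If the procedure stops at step r + 1, the rejected hypotheses are $H_{0(1)}, H_{0(2)}, \ldots, H_{0(r)}$.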
Some popular controlling procedures
Step-up procedure:
Step 1: If $P_{(m)} \le \alpha_{(m)}$ then reject $H_{0(i)}$, $i = 1, 2, \ldots, m$, and stop. Else go to Step 2.
Step 2: If $P_{(m-1)} \le \alpha_{(m-1)}$ then reject $H_{0(i)}$, $i = 1, 2, \ldots, m-1$, and stop. Else go to Step 3.
Step 3: If $P_{(m-2)} \le \alpha_{(m-2)}$ then reject $H_{0(i)}$, $i = 1, 2, \ldots, m-2$, and stop. Else go to Step 4.
and so on!
Some popular controlling procedures
Single-step procedure:
A stepwise procedure with the same critical constant for all ‘m’ hypotheses: $\alpha_{(1)} = \alpha_{(2)} = \ldots = \alpha_{(m)}$.
Some typical stepwise procedures: FWER controlling procedures
Bonferroni: A single-step procedure with $\alpha_{(i)} = \alpha/m$.
Sidak: A single-step procedure with $\alpha_{(i)} = 1 - (1-\alpha)^{1/m}$.
Holm: A step-down procedure with $\alpha_{(i)} = \alpha/(m-i+1)$.
Hochberg: A step-up procedure with $\alpha_{(i)} = \alpha/(m-i+1)$.
minP method: A resampling-based single-step procedure with $\alpha_{(i)} = c_\alpha$, where $c_\alpha$ is the α quantile of the distribution of the minimum p-value.
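The four closed-form procedures above can be sketched in plain Python as follows. The function names and the example p-values are illustrative, not taken from the talk; the minP method is left out because it needs a resampling scheme that the slides do not specify. Each function returns one reject/keep flag per hypothesis, in the original order of the p-values.

```python
def bonferroni(pvals, alpha=0.05):
    """Single-step: every p-value is compared with alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def sidak(pvals, alpha=0.05):
    """Single-step: every p-value is compared with 1 - (1 - alpha)**(1/m)."""
    m = len(pvals)
    cutoff = 1.0 - (1.0 - alpha) ** (1.0 / m)
    return [p <= cutoff for p in pvals]


def holm(pvals, alpha=0.05):
    """Step-down with constants alpha / (m - i + 1): stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha / (m - rank + 1):
            reject[idx] = True
        else:
            break
    return reject


def hochberg(pvals, alpha=0.05):
    """Step-up with the same constants: start from the largest p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank in range(m, 0, -1):
        if pvals[order[rank - 1]] <= alpha / (m - rank + 1):
            for idx in order[:rank]:
                reject[idx] = True
            break
    return reject


example = [0.01, 0.04, 0.03, 0.045]   # made-up p-values
print(holm(example), hochberg(example))
```

With these made-up p-values Holm rejects only the smallest one, while Hochberg rejects all four because the largest p-value (0.045) is already below α = 0.05.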
Comments on the methods
Bonferroni: Very general but can be too conservative for large number of hypotheses.
Sidak: More powerful than Bonferroni, but applicable when the test statistics are independent or have certain types of positive dependence.
Comments on the methods
Holm: More powerful than Bonferroni and is applicable for any type of dependence structure between test statistics.
Hochberg: More powerful than Holm’s procedure, but the test statistics should either be independent or have the MTP2 property.
Comments on the methods
Multivariate Total Positivity of Order 2 (MTP2)
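f(x) is said to be MTP2 if for all $x, y \in R^p$,
$f(x \wedge y) f(x \vee y) \ge f(x) f(y)$,
where $x \wedge y$ and $x \vee y$ denote the componentwise minimum and maximum.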
Some typical stepwise procedures: FDR controlling procedure
Benjamini-Hochberg:
A step-up procedure with $\alpha_{(i)} = i\alpha/m$.
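A minimal Python sketch of the Benjamini-Hochberg step-up rule follows. The function name and the example p-values are illustrative only.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure with critical constants i * alpha / m.

    Returns one reject/keep flag per hypothesis, in the input order.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    # Start from the largest p-value and move down until one passes.
    for rank in range(m, 0, -1):
        if pvals[order[rank - 1]] <= rank * alpha / m:
            for idx in order[:rank]:
                reject[idx] = True
            break
    return reject


example = [0.041, 0.001, 0.205, 0.008, 0.039, 0.06, 0.074, 0.042]  # made up
print(benjamini_hochberg(example))
```

In this example only the two smallest p-values (0.001 and 0.008) are rejected: 0.008 is below 2 × 0.05 / 8 = 0.0125, and no larger ordered p-value meets its own constant.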
An Illustration
Lobenhofer et al. (2002) data:
Expose breast cancer cells to estradiol for 1 hour or for 12, 24 or 36 hours.
Number of genes on the cDNA 2-spot array: 1900.
Number of samples per time point: 8.
Compare 1 hour with 12, 24 and 36 hours using a two-sided bootstrap t-test.
Some Popular Methods of Analysis
1. Fold-change
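1. Fold-change in gene expression
For gene “g” compute the fold change between two conditions (e.g. treatment and control):
$f_g = \bar{X}_{g,trt} / \bar{X}_{g,cont}$
where $\bar{X}_{g,trt}$ and $\bar{X}_{g,cont}$ are the mean expression levels of gene “g” under treatment and control.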
1. Fold-change in gene expression
$R_1, R_2$: pre-defined constants.
$f_g \ge R_1$: gene “g” is “up-regulated”.
$f_g \le R_2$: gene “g” is “down-regulated”.
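As a sketch of the fold-change rule, the Python function below labels each gene from its ratio of group means. The thresholds R1 = 2 and R2 = 0.5, the dictionary layout and the gene names are assumptions made for illustration, and expression values are taken to be positive and on the raw (not log) scale.

```python
from statistics import mean


def fold_change_calls(trt, cont, r1=2.0, r2=0.5):
    """Label each gene from the ratio of its treatment and control means.

    trt, cont: dicts mapping a gene name to a list of expression values.
    r1, r2: hypothetical thresholds for up- and down-regulation.
    """
    calls = {}
    for gene in trt:
        f_g = mean(trt[gene]) / mean(cont[gene])
        if f_g >= r1:
            status = "up"
        elif f_g <= r2:
            status = "down"
        else:
            status = "unchanged"
        calls[gene] = (f_g, status)
    return calls


trt = {"g1": [8.0, 12.0], "g2": [2.0, 2.0], "g3": [5.0, 5.0]}    # made-up data
cont = {"g1": [4.0, 6.0], "g2": [8.0, 8.0], "g3": [4.0, 6.0]}
print(fold_change_calls(trt, cont))
```

Note that the rule uses only the two means, so a gene with very noisy replicates is treated the same as one with tight replicates.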
1. Fold-change in gene expression
Strengths:
– Simple to implement.
– Biologists find it very easy to interpret.
– It is widely used.
Drawbacks:
– Ignores variability in mean gene expression.
– Genes with subtle gene expression values can be overlooked, i.e. potentially high false negative rates.
– Conversely, high false positive rates are also possible.
2. t-test type procedures
2.1 Permutation t-test
For each gene “g” compute the standard two-sample t-statistic:
$t_g = \frac{\bar{X}_{g,trt} - \bar{X}_{g,cont}}{S_g \sqrt{1/n_{trt} + 1/n_{cont}}}$
where $\bar{X}_{g,trt}, \bar{X}_{g,cont}$ are the sample means and $S_g$ is the pooled sample standard deviation.
2.1 Permutation t-test
Statistical significance of a gene is determined by computing the null distribution of $t_g$ using either a permutation or a bootstrap procedure.
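For a single gene, the permutation version can be sketched as below. The function names, the number of permutations and the example data are assumptions; the p-value is two-sided and uses random relabelling rather than full enumeration.

```python
import math
import random
from statistics import mean


def pooled_t(trt, cont):
    """Two-sample t-statistic with pooled standard deviation."""
    n1, n2 = len(trt), len(cont)
    m1, m2 = mean(trt), mean(cont)
    ss = sum((x - m1) ** 2 for x in trt) + sum((x - m2) ** 2 for x in cont)
    s_pooled = math.sqrt(ss / (n1 + n2 - 2))
    return (m1 - m2) / (s_pooled * math.sqrt(1.0 / n1 + 1.0 / n2))


def permutation_pvalue(trt, cont, n_perm=2000, seed=0):
    """Two-sided p-value from random relabelling of the samples."""
    rng = random.Random(seed)
    observed = abs(pooled_t(trt, cont))
    pooled = list(trt) + list(cont)
    n1 = len(trt)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled_t(pooled[:n1], pooled[n1:])) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


trt = [5.1, 5.4, 5.8, 6.0, 6.3]     # made-up expression values for one gene
cont = [3.9, 4.2, 4.4, 4.7, 5.0]
print(pooled_t(trt, cont), permutation_pvalue(trt, cont))
```

The sketch fails with a division by zero if a relabelling produces a pooled standard deviation of exactly zero, which is a small-scale version of the small-variance problem that affects this statistic.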
2.1 Permutation t-test
Strengths:
– Simple to implement.
– Biologists find it very easy to interpret.
– It is widely used.
Drawback:
– Potentially, for some genes the pooled sample standard deviation could be very small and hence it may result in inflated Type I errors and inflated false discovery rates.
2.2 SAM procedure (Significance Analysis of Microarrays)
(Tusher et al., PNAS 2001)
For each gene “g” modify the standard two-sample t-statistic as:
$d_g = \frac{\bar{X}_{g,trt} - \bar{X}_{g,cont}}{s_0 + S_g \sqrt{1/n_{trt} + 1/n_{cont}}}$
The “fudge” factor $s_0$ is obtained such that the coefficient of variation in the above test statistic is minimized.
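The effect of the fudge factor can be seen in a short sketch. Here $s_0$ is simply passed in as a number; the data-driven choice of $s_0$ used by SAM is not implemented, and the function name and example values are illustrative.

```python
import math
from statistics import mean


def sam_d(trt, cont, s0):
    """Modified t-statistic: s0 is added to the usual standard error.

    s0 is supplied by the caller here (an assumption of this sketch).
    """
    n1, n2 = len(trt), len(cont)
    m1, m2 = mean(trt), mean(cont)
    ss = sum((x - m1) ** 2 for x in trt) + sum((x - m2) ** 2 for x in cont)
    s_pooled = math.sqrt(ss / (n1 + n2 - 2))
    std_err = s_pooled * math.sqrt(1.0 / n1 + 1.0 / n2)
    return (m1 - m2) / (s0 + std_err)


# A gene with a tiny mean difference and an even smaller variance (made up).
trt = [1.01, 1.02, 1.01]
cont = [1.00, 1.00, 1.01]
print(sam_d(trt, cont, s0=0.0))   # ordinary t-statistic
print(sam_d(trt, cont, s0=0.5))   # shrunk towards zero by the fudge factor
```

With s0 = 0 the statistic reduces to the ordinary pooled t-statistic; a positive s0 pulls down the statistic for genes whose standard error is very small.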
3. F-test and its variations for more than 2 nominal conditions
Usual F-test: the p-values can be obtained by a suitable permutation procedure.
Regularized F-test: Generalization of the Baldi and Long methodology to multiple groups.
– It controls the false discovery rate better, with power comparable to that of the F-test.
Cui and Churchill (2003) is a good review paper.
4. Linear fixed effects models
Effects:
– Array (A) – sample
– Dye (D)
– Variety (V) – test groups
– Genes (G)
– Expression (Y)
4. Linear fixed effects models
(Kerr, Martin, and Churchill, 2000)
Linear fixed effects model:
$\log(Y_{ijkg}) = \mu + A_i + D_j + G_g + (AD)_{ij} + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \epsilon_{ijkg}$,
where $\epsilon_{ijkg} \sim$ iid $N(0, \sigma^2)$.
$H_0: (VG)_{kg} = 0$ for all $k = 1, 2, \ldots, v$.
4. Linear fixed effects models
All effects are assumed to be fixed effects.
Main drawback – all genes are assumed to have the same variance!
5. Linear mixed effects models
(Wolfinger et al. 2001)
Stage 1 (Global normalization model):
$\log(Y_{gij}) = \mu + T_i + A_j + (TA)_{ij} + \epsilon_{gij}$
Stage 2 (Gene specific model):
$\hat{\epsilon}_{gij} = G_g + (GT)_{gi} + (GA)_{gj} + \gamma_{gij}$
5. Linear mixed effects models
Assumptions:
$A_j \sim$ iid $N(0, \sigma^2_A)$, $(TA)_{ij} \sim$ iid $N(0, \sigma^2_{TA})$,
$\epsilon_{gij} \sim$ iid $N(0, \sigma^2_\epsilon)$, $(GA)_{gj} \sim$ iid $N(0, \sigma^2_{GA_g})$,
$\gamma_{gij} \sim$ iid $N(0, \sigma^2_g)$.
5. Linear mixed effects models
(Wolfinger et al. 2001)
Perform inferences on the interaction term $(GT)_{gi}$.
A popular graphical representation: The Volcano Plots
A scatter plot of $-\log_{10}(p\text{-value})$ vs $\log_2(\text{fold change})$.
Genes with large fold change will lie outside a pair of vertical “threshold” lines. Further, genes which are highly significant with large fold change will lie either in the upper right hand or upper left hand corner.
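The coordinates of a volcano plot are easy to compute once fold changes and p-values are available. The sketch below only produces the points and a flag for the two upper corners; the thresholds (two-fold change, p ≤ 0.01) and the function name are assumptions, and the drawing itself is left to any plotting tool.

```python
import math


def volcano_coordinates(fold_changes, p_values, fc_threshold=2.0, p_threshold=0.01):
    """Return (log2 fold change, -log10 p, in_upper_corner) for each gene.

    fc_threshold and p_threshold are hypothetical cut-offs.
    """
    x_cut = math.log2(fc_threshold)       # vertical threshold lines at +/- x_cut
    y_cut = -math.log10(p_threshold)      # horizontal significance line
    points = []
    for f, p in zip(fold_changes, p_values):
        x = math.log2(f)
        y = -math.log10(p)
        points.append((x, y, abs(x) >= x_cut and y >= y_cut))
    return points


# Made-up fold changes and p-values for three genes.
for point in volcano_coordinates([4.0, 0.25, 1.1], [0.001, 0.0001, 0.5]):
    print(point)
```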
A useful review article
Cui, X. and Churchill, G. (2003), Genome Biology.
Software:
R package: statistics for microarray analysis.
http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
SAM: Significance Analysis of Microarray. http://www-stat.stanford.edu/%7Etibs/SAM
Supervised classification algorithms
Discriminant analysis based methods
A. Linear and Quadratic Discriminant analysis based methods:
Strength:
– Well studied in the classical statistics literature
Limitations:
– Based on normality
– Imposes constraints on the covariance matrices. Need to be concerned about the singularity issue.
– No convenient strategy has been proposed in the literature to select the “best” discriminating subset of genes.
Discriminant analysis based methods
B. Nonparametric classification using Genetic Algorithm and K-nearest neighbors.
– Li et al. (Bioinformatics, 2001)
Strengths:
– Entirely nonparametric
– Takes into account the underlying dependence structure among genes
– Does not require the estimation of a covariance matrix
Weakness:
– Computationally very intensive
GA/KNN methodology – very brief description
Computes the Euclidean distance between all pairs of samples based on a sub-vector of, say, 50 genes.
Assigns each sample to a treatment group (i.e. condition) based on its K nearest neighbors.
Computes a fitness score for each subset of genes based on how many samples are correctly classified. This is the objective function.
The objective function is optimized using a Genetic Algorithm.
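The fitness score described above can be sketched as a leave-one-out K-nearest-neighbors count. This is an illustration, not the published GA/KNN code: it uses a simple majority vote among the k neighbors, and the function name, data layout and toy data are assumptions.

```python
import math
from collections import Counter


def knn_fitness(samples, labels, genes, k=3):
    """Number of samples classified correctly by leave-one-out KNN.

    samples: list of expression vectors (one list per sample).
    genes: indices of the genes in the candidate subset.
    Majority voting among the k neighbours is an assumption of this sketch.
    """
    correct = 0
    for i, x in enumerate(samples):
        distances = []
        for j, y in enumerate(samples):
            if j == i:
                continue  # leave the sample itself out
            d = math.sqrt(sum((x[g] - y[g]) ** 2 for g in genes))
            distances.append((d, labels[j]))
        distances.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in distances[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted == labels[i]:
            correct += 1
    return correct


# Toy data: gene 0 separates the groups, gene 1 does not.
samples = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 5.1], [3.1, 0.9], [2.9, 3.1]]
labels = ["control", "control", "control", "treated", "treated", "treated"]
print(knn_fitness(samples, labels, genes=[0]), knn_fitness(samples, labels, genes=[1]))
```

A subset containing the informative gene classifies all six toy samples correctly, while the uninformative gene alone does not, which is the contrast the fitness score is meant to capture.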
[Figure: K-nearest neighbors classification (k=3). x-axis: expression levels of gene 1; y-axis: expression levels of gene 2; the sample to be classified is marked “X”.]
[Figure: Subcategories within a class. x-axis: expression levels of gene 1; y-axis: expression levels of gene 2.]
Advantages of KNN approach
Simple, performs as well as or better than more complex methods
Free from assumptions such as normality of the distribution of expression levels
Multivariate: takes account of dependence in expression levels
Accommodates or even identifies distinct subtypes within a class
Expression data: many genes and few samples
There may be many subsets of genes that can statistically discriminate between the treated and untreated.
There are too many possible subsets to look at. With 3,000 genes, there are about $10^{72}$ ways to make subsets of size 30.
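The size of the search space can be checked directly, since Python integers are exact. The variable name is illustrative.

```python
import math

# Number of ways to choose 30 genes out of 3,000.
n_subsets = math.comb(3000, 30)
print(len(str(n_subsets)))   # number of decimal digits
```

The count has 72 digits, which matches the order of magnitude quoted on the slide.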
The genetic algorithm
Computer algorithm (John Holland) that works by mimicking Darwin's natural selection
Has been applied to many optimization problems ranging from engine design to protein folding and sequence alignment
Effective in searching high dimensional space
GA works by mimicking evolution
Randomly select sets (“chromosomes”) of 30 genes from all the genes on the chip
Evaluate the “fitness” of each “chromosome” – how well can it separate the treated from the untreated?
Pass “chromosomes” randomly to next generation, with preference for the fittest
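The three steps above can be sketched as a very small genetic algorithm. This is not the GA/KNN implementation: the population size, the number of generations, the single-gene mutation step and the rule of always carrying the best chromosome forward are assumptions of this sketch, and the toy fitness function stands in for the KNN classification score.

```python
import random


def genetic_search(n_genes, subset_size, fitness, pop_size=30, generations=40, seed=0):
    """Search for a gene subset ("chromosome") with high fitness.

    fitness must return a non-negative number; subset_size must be < n_genes.
    """
    rng = random.Random(seed)
    # Randomly select the initial chromosomes.
    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Selection weights: fitter chromosomes are more likely to be passed on.
        weights = [fitness(c) + 1e-9 for c in population]
        next_population = [list(best)]   # keep the best chromosome (assumption)
        while len(next_population) < pop_size:
            child = list(rng.choices(population, weights=weights, k=1)[0])
            # Mutation (assumption): swap one gene for a gene not in the subset.
            position = rng.randrange(subset_size)
            child[position] = rng.choice([g for g in range(n_genes) if g not in child])
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best


def toy_fitness(chromosome):
    """Hypothetical fitness: how many of the genes 0, 1, 2 are in the subset."""
    return len(set(chromosome) & {0, 1, 2})


best = genetic_search(n_genes=20, subset_size=3, fitness=toy_fitness)
print(sorted(best), toy_fitness(best))
```

In the GA/KNN setting the toy fitness would be replaced by the number of samples that the chosen genes classify correctly.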
Summary
Pay attention to the multiple testing problem.
– Use FDR over FWER for large data sets such as gene expression microarrays.
Linear mixed effects models may be used for comparing expression data between groups.
For classification problems, one may want to consider the GA/KNN approach.