Summer Inst. of Epidemiology and Biostatistics, 2009: Gene Expression Data Analysis
8:30am-12:00pm in Room W2017
Carlo Colantuoni – [email protected]
http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2009.htm
Some genes are more variable than others
Slides from Rob Scharpf
X1-X2 is normally distributed if X1 and X2 are normally distributed – is this the case in microarray data?
Problem 1: T-statistic not t-distributed. Implication: p-values/inference incorrect
P-values by permutation
• It is common that the assumptions used to derive the statistics do not hold well enough to yield useful p-values (e.g. when t-statistics are not t-distributed).
• An alternative is to use permutations.
p-values by permutations
We focus on one gene only. For the bth iteration, b = 1, …, B:
• Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the remaining n2 as “controls”.
• For the gene, calculate the corresponding two-sample t-statistic, tb.
After all the B permutations are done:
• p = # { b: |tb| ≥ |tobserved| } / B
• This does not yet address the issue of multiple tests!
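The permutation steps above can be sketched for a single gene as follows. This is a minimal illustration, not the course's code: the function names, the toy data, the pooled-variance form of the t-statistic, and the defaults for `B` and `seed` are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_stat(treat, ctrl):
    """Two-sample t-statistic using the pooled (equal-variance) estimate."""
    n1, n2 = len(treat), len(ctrl)
    pooled = ((n1 - 1) * variance(treat) + (n2 - 1) * variance(ctrl)) / (n1 + n2 - 2)
    return (mean(treat) - mean(ctrl)) / sqrt(pooled * (1 / n1 + 1 / n2))

def permutation_p_value(treat, ctrl, B=2000, seed=0):
    """p = #{b : |t_b| >= |t_observed|} / B for one gene."""
    rng = random.Random(seed)
    x = list(treat) + list(ctrl)
    n1 = len(treat)
    t_obs = t_stat(treat, ctrl)
    hits = 0
    for _ in range(B):
        rng.shuffle(x)                   # permute the n data points
        t_b = t_stat(x[:n1], x[n1:])     # first n1 are "treatments", rest "controls"
        if abs(t_b) >= abs(t_obs) - 1e-12:   # small tolerance for rounding
            hits += 1
    return hits / B

# Hypothetical log-expression values for one gene
treat = [8.1, 8.4, 7.9, 8.6, 8.3]
ctrl = [7.2, 7.5, 7.0, 7.6, 7.3]
p = permutation_p_value(treat, ctrl)
print(p)
```

With small groups the number of distinct permutations is limited, so the smallest attainable p-value is bounded below regardless of B.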
The volcano plot shows, for a particular test, negative log p-value against the effect size (M).
Another problem with t-tests
Remember this?
Problem 2: t-statistic bigger for genes with smaller standard error estimates. Implication: Ranking might not be optimal
Problem 2
• With low N, SD estimates are unstable
• Solutions:
– Significance Analysis of Microarrays (SAM)
– Empirical Bayes methods and Stein estimators
Significance analysis of microarrays (SAM)
• A clever adaptation of the t-ratio to borrow information across genes
• Implemented in Bioconductor in the siggenes package
Significance analysis of microarrays applied to the ionizing radiation response, Tusher et al., PNAS 2001
SAM d-statistic
• For gene i:
$$ d_i = \frac{\bar{y}_i - \bar{x}_i}{s_i + s_0} $$
– $\bar{y}_i$: mean of sample 1
– $\bar{x}_i$: mean of sample 2
– $s_i$: standard deviation of repeated measurements for gene i
– $s_0$: exchangeability factor estimated using all genes, chosen to minimize the average CV across all genes
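A small sketch of the d-statistic, showing how adding s_0 to the denominator changes the ranking of a gene with a very small standard deviation. The gene names and values are made up, s_i is computed here as a pooled standard error of the difference in means, and s_0 is simply set to the median of the s_i as a placeholder. SAM's actual choice of s_0 (minimizing the CV) is not implemented.

```python
from math import sqrt
from statistics import mean, median

def pooled_se(y, x):
    """s_i: pooled standard error of the difference in means for one gene."""
    n1, n2 = len(y), len(x)
    my, mx = mean(y), mean(x)
    ss = sum((v - my) ** 2 for v in y) + sum((v - mx) ** 2 for v in x)
    return sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))

def sam_d(y, x, s0):
    """d_i = (mean(y) - mean(x)) / (s_i + s0); s0 = 0 gives the ordinary t-statistic."""
    return (mean(y) - mean(x)) / (pooled_se(y, x) + s0)

# Hypothetical genes: (sample 1 values, sample 2 values)
genes = {
    "geneA": ([10.0, 10.02, 10.01], [9.9, 9.92, 9.91]),  # tiny difference, tiny s_i
    "geneB": ([12.0, 13.0, 12.5], [10.0, 11.0, 10.5]),   # large difference, larger s_i
}

# Placeholder for s0 (an assumption, NOT SAM's CV-minimizing estimate)
s0 = median(pooled_se(y, x) for y, x in genes.values())

for name, (y, x) in genes.items():
    print(name, round(sam_d(y, x, 0.0), 2), round(sam_d(y, x, s0), 2))
```

In this toy case geneA outranks geneB on the ordinary t-statistic but falls below it once s_0 is added.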
Scatter plots of relative difference (d) vs standard deviation (s) of repeated expression measurements
Random fluctuations in the data, measured by balanced permutations (for cell line 1 and 2)
Relative difference for a permutation of the data that was balanced between cell lines 1 and 2.
A fix for this problem:
SAM produces a modified T-statistic (d), and has an approach to the multiple comparison problem.
Selected genes: Beyond expected distribution
• An advantage of having tens of thousands of genes is that we can try to learn about typical standard deviations by looking at all genes
• Empirical Bayes gives us a formal way of doing this
• “Shrinkage” of variance estimates toward a “prior”: moderated t-statistics – eliminates extreme stats due to small variances.
• Implemented in the limma package in R. In addition, limma provides methods for more complex experimental designs beyond simple, two-sample designs.
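The shrinkage idea can be illustrated with a moderated t-statistic in which the gene's variance is replaced by a weighted average of that variance and a prior variance. This is a sketch only: in limma the prior variance and prior degrees of freedom are estimated from all genes, whereas here `prior_var` and `prior_df` are hypothetical fixed inputs, and the function name is made up.

```python
from math import sqrt
from statistics import mean, variance

def ordinary_and_moderated_t(treat, ctrl, prior_var, prior_df):
    """Return (ordinary t, moderated t) for a two-sample comparison."""
    n1, n2 = len(treat), len(ctrl)
    df = n1 + n2 - 2
    s2 = ((n1 - 1) * variance(treat) + (n2 - 1) * variance(ctrl)) / df
    # Shrink the gene-wise variance toward the prior, weighted by degrees of freedom
    s2_shrunk = (prior_df * prior_var + df * s2) / (prior_df + df)
    diff = mean(treat) - mean(ctrl)
    scale = sqrt(1 / n1 + 1 / n2)
    return diff / (sqrt(s2) * scale), diff / (sqrt(s2_shrunk) * scale)

# Hypothetical gene with a very small variance and a small difference
treat = [10.0, 10.02, 10.01]
ctrl = [9.9, 9.92, 9.91]
t_ord, t_mod = ordinary_and_moderated_t(treat, ctrl, prior_var=0.04, prior_df=4)
print(round(t_ord, 2), round(t_mod, 2))
```

The ordinary statistic is extreme only because the variance estimate is tiny; the moderated version is much smaller.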
• False positive rate is the rate at which truly null genes are called significant
• False discovery rate is the rate at which significant genes are truly null
$$ \mathrm{FDR} = \frac{\#\ \text{false positives}}{\#\ \text{called significant}} $$
$$ \mathrm{FPR} = \frac{\#\ \text{false positives}}{\#\ \text{truly null}} $$
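The two rates share a numerator but differ in the denominator, which a toy count makes concrete. The labels below are invented; in real data the truly null genes are of course unknown.

```python
def error_rates(is_null, called):
    """Return (FPR, FDR) given truth labels and significance calls."""
    false_pos = sum(1 for n, c in zip(is_null, called) if n and c)
    n_null = sum(is_null)
    n_called = sum(called)
    fpr = false_pos / n_null if n_null else 0.0
    fdr = false_pos / n_called if n_called else 0.0
    return fpr, fdr

# Toy example: 10 genes, 8 truly null; 3 called significant, 1 of them truly null
is_null = [True] * 8 + [False] * 2
called = [True] + [False] * 7 + [True, True]
fpr, fdr = error_rates(is_null, called)
print(fpr, fdr)
```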
False Positive Rate and P-values
• The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate)
• P-value is defined to be the minimum false positive rate at which the statistic can be called significant
• Can be described as the probability a truly null statistic is “as or more extreme” than the observed one
False Discovery Rate and Q-values
• The q-value is a measure of significance in terms of the false discovery rate
• Q-value is defined to be the minimum false discovery rate at which the statistic can be called significant
• Can be described as the probability a statistic “as or more extreme” is truly null
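One common way to attach an FDR-based significance measure to each gene is the Benjamini-Hochberg adjustment, sketched below. This is a swapped-in, simpler technique: it corresponds to a q-value estimate under the conservative assumption that all genes are null, and it is not necessarily the estimator used in the course. The p-values are made up.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values (minimum FDR at which each is called)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = bh_q_values(pvals)
print(qvals)
```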
Power and Sample Size Calculations are Hard
• Need to specify:
– α (Type I error rate, false positives) or FDR
– σ (stdev: will be sample- and gene-specific)
– Effect size (how do we estimate?)
– Power (1-β, β = Type II error rate)
– Sample Size
• Some papers:
– Mueller, Parmigiani et al. JASA (2004)
– Rich Simon’s group Biostatistics (2005)
– Tibshirani. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics. 2006 Mar 2;7:106.
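To show how the quantities in the list interact, here is the textbook normal-approximation sample size for a two-sided, two-sample comparison of a single gene. It is a rough sketch, not one of the methods in the cited papers: it assumes a common known standard deviation and ignores the gene-to-gene differences that make the real problem hard. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(alpha, power, sigma, delta):
    """n per group = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# A single test at alpha = 0.05 versus a stricter per-gene alpha
n_single = samples_per_group(alpha=0.05, power=0.8, sigma=1.0, delta=1.0)
n_strict = samples_per_group(alpha=0.001, power=0.8, sigma=1.0, delta=1.0)
print(n_single, n_strict)
```

A stricter per-gene α, as needed when testing thousands of genes, raises the required sample size for the same effect size and power.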
Beyond Individual Genes: Functional Gene Groups
• Borrow statistical power across entire dataset
• Beyond threshold enrichment
• Integrate preexisting biological knowledge
Correlation of Age with Gene Expression
[Figure: Distribution of Observed (black) and Permuted (red+blue) Correlations (r); x-axis: Correlation (r), from -0.4 to 0.4; y-axis: Density]
Functional Annotation of Lists of Genes
• KEGG
• PFAM
• SWISS-PROT
• GO
• DRAGON
• DAVID/EASE
• MatchMiner
• BioConductor (R)
Gene Cross-Referencing and Gene Annotation Tools In BioConductor
Functions for accessing data in metadata packages.
Functions for accessing NCBI databases.
Functions for assembling HTML tables.
Annotation Tools In BioConductor: Annotation for Commercial Microarrays
Array-specific metadata packages
Annotation Tools In BioConductor: Functional Annotation with other DB’s
GO metadata package
Annotation Tools In BioConductor: Functional Annotation with other DB’s
KEGG metadata package
Is there enrichment in our list of differentially expressed genes for a particular functional gene group or pathway?
Threshold Enrichment: One Way of Assessing Differential Expression of Functional Gene Groups
The argument lower.tail indicates whether you are looking for over- or under-representation of differentially expressed genes within a particular functional group (use lower.tail=F for over-representation).
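Threshold enrichment is usually assessed with a hypergeometric tail probability. The sketch below computes the over-representation tail directly. The counts are invented, the function name is hypothetical, and the correspondence to R noted in the comment (`phyper(x - 1, K, N - K, n, lower.tail = FALSE)`) is given on the assumption that the slide refers to R's `phyper`.

```python
from math import comb

def hypergeom_upper_tail(x, N, K, n):
    """P(X >= x): N genes in total, K in the functional group, n called significant.

    Assumed R equivalent: phyper(x - 1, K, N - K, n, lower.tail = FALSE).
    """
    total = comb(N, n)
    tail = sum(comb(K, k) * comb(N - K, n - k) for k in range(x, min(K, n) + 1))
    return tail / total

# Hypothetical counts: 1000 genes, 50 in the group, 100 significant, 12 in both
p_over = hypergeom_upper_tail(x=12, N=1000, K=50, n=100)
print(p_over)
```

Here about 5 group members would be expected among the 100 significant genes by chance, so 12 is a clear excess.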
Can we use more of our data than Threshold Enrichment (that only uses the top of our gene list)?
Functional Gene Subgroups within An Experiment
[Figure: EXP#1, with gene subgroups labeled Swiss-Prot, PFAM, and KEGG]
Statistics for Analysis of Differential Expression of Gene Subgroups
Is THIS …
… Different from THIS?
Over-Expression of a Group of Functionally Related Genes
[Figure: x-axis: T statistic; p<7.42e-08]
Statistical Tests:
• χ²
• Kolmogorov-Smirnov
• Product of Probabilities
• GSEA
• PAGE
• geneSetTest (Wilcoxon rank sum)
Is THIS …
… Different from THIS?
Conceptually Distinct from Threshold Enrichment and the Hypergeometric test!
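χ² is the sum of D values over histogram bins, where O is the observed count and E the expected count in a bin:
$$ \chi^2 = \sum_{\text{bins}} D, \qquad D = \frac{(O - E)^2}{E} $$
[Figure: histograms of All Genes and Subset of Interest]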
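A sketch of the binned χ² comparison: the statistics for all genes define the expected count in each histogram bin, and the subset supplies the observed counts. The bin edges, data, and function names are illustrative assumptions. Converting the statistic to a p-value (χ² distribution with bins minus one degrees of freedom) is left out.

```python
def bin_counts(values, edges):
    """Count values in half-open bins [edges[b], edges[b+1]); others are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            if edges[b] <= v < edges[b + 1]:
                counts[b] += 1
                break
    return counts

def chi_square_binned(all_values, subset_values, edges):
    """Sum over bins of (O - E)^2 / E, with E scaled from the all-gene histogram."""
    all_counts = bin_counts(all_values, edges)
    observed = bin_counts(subset_values, edges)
    n_all, n_sub = sum(all_counts), sum(observed)
    chi2 = 0.0
    for o, a in zip(observed, all_counts):
        e = n_sub * a / n_all   # expected count if the subset looked like all genes
        if e > 0:
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical t-statistics
all_stats = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
subset = [1, 1, 2, 2, 0]
edges = [-3, -0.5, 0.5, 3]
chi2 = chi_square_binned(all_stats, subset, edges)
print(round(chi2, 3))
```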
Kolmogorov-Smirnov
All Genes
Subset of Interest
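The Kolmogorov-Smirnov comparison uses the largest vertical gap between the two empirical cumulative distributions rather than binned counts. The sketch below computes only that statistic for made-up data. Turning it into a p-value (for example by permutation) is not shown, and the function name is hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:                        # the maximum occurs at an observed value
        fa = bisect_right(a, v) / len(a)   # fraction of a that is <= v
        fb = bisect_right(b, v) / len(b)   # fraction of b that is <= v
        d = max(d, abs(fa - fb))
    return d

# Hypothetical statistics for all genes and for a subset of interest
all_stats = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
subset = [1, 1, 2, 2, 0]
d = ks_statistic(all_stats, subset)
print(d)
```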
Product of Individual Probabilities
What shape/type of distributions would each of these tests be sensitive to?
All statistics
Statistics from gene subgroup
Gene Set Enrichment Analysis (GSEA)
Subramanian et al, 2005 PNAS
Parametric Analysis of Gene Set Enrichment (PAGE)
Kim et al, 2005 BMC Bioinformatics
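$$ Z = \frac{S_m - \mu}{\sigma / m^{0.5}} $$
where S_m is the mean of the statistic over the m genes in the gene set, and μ and σ are the mean and standard deviation of the statistic over all genes.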
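A sketch of the PAGE Z score for one gene set, assuming the statistic is a per-gene value such as a fold change or t-statistic. The data are made up, the use of the population standard deviation (`pstdev`) is an assumption, and the two-sided normal p-value relies on the set being large enough for the normal approximation, which this tiny example does not really satisfy.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def page_z(all_stats, set_stats):
    """Return (Z, two-sided normal p-value) for Z = (S_m - mu) / (sigma / sqrt(m))."""
    mu = mean(all_stats)
    sigma = pstdev(all_stats)
    m = len(set_stats)
    z = (mean(set_stats) - mu) / (sigma / sqrt(m))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical statistics for all genes and for one gene set
all_stats = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
set_stats = [1, 1, 2, 2]
z, p = page_z(all_stats, set_stats)
print(round(z, 3), round(p, 4))
```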
The test statistic used for the gene-set-test is the mean of the statistics in the set. If ranks.only is TRUE, then only the ranks of the statistics are used. In this case the p-value is obtained from a Wilcoxon test. If ranks.only is FALSE, then the p-value is obtained by simulation using nsim randomly selected sets of genes.
Argument: alternative = “mixed” or “either”: fundamentally different questions.
Test whether a set of genes is enriched for differential expression.
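The difference between the two alternatives can be seen in a simulation-based sketch of the ranks.only = FALSE case. This is not limma's implementation: the function name, the add-one p-value, and the interpretation of “mixed” (mean of absolute statistics) and “either” (absolute value of the mean) are assumptions made for illustration, and the data are simulated.

```python
import random
from statistics import mean

def gene_set_test_sim(stats, set_index, alternative="mixed", nsim=2000, seed=0):
    """Compare the set's score with scores of nsim random sets of the same size."""
    if alternative == "mixed":
        score = lambda vals: mean([abs(v) for v in vals])   # direction ignored
    elif alternative == "either":
        score = lambda vals: abs(mean(vals))                # set moves as a whole
    else:
        raise ValueError("alternative must be 'mixed' or 'either' in this sketch")
    rng = random.Random(seed)
    m = len(set_index)
    observed = score([stats[i] for i in set_index])
    hits = sum(1 for _ in range(nsim) if score(rng.sample(stats, m)) >= observed)
    return (hits + 1) / (nsim + 1)

# Simulated statistics: 200 genes, the first 10 form the set of interest,
# with 5 strongly up and 5 strongly down
gen = random.Random(1)
stats = [gen.gauss(0, 1) for _ in range(200)]
stats[0:5] = [4.0] * 5
stats[5:10] = [-4.0] * 5
set_index = list(range(10))

p_mixed = gene_set_test_sim(stats, set_index, alternative="mixed")
p_either = gene_set_test_sim(stats, set_index, alternative="either")
print(p_mixed, p_either)
```

A set whose members change strongly in opposite directions is detected under “mixed” but not under “either”, because its mean is zero.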
Common question in experimental design
• Should I pool mRNA samples across subjects in an effort to reduce the effect of biological variability (or cost)?
Two simple designs
• The following two designs have roughly the same cost:
– 3 individuals, 3 arrays
– Pool of three individuals, 3 technical replicates
• To a statistician the second design seems obviously worse. But I found it hard to convince many biologists of this.
– 3 pools of 3 animals on individual arrays?
Cons of Pooling Everything
• You cannot measure within-class variation
• Therefore, no population inference possible
• Mathematical averaging is an alternative way of reducing variance.
• Pooling may have non-linear effects
• You cannot take the log before you average: E[log(X+Y)] ≠ E[log(X)] + E[log(Y)]
• You cannot detect outliers
*If the measurements are independent and identically distributed
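The point about logs can be checked numerically. The sketch assumes that pooling equal amounts of mRNA acts like arithmetic averaging of raw intensities, which is itself an idealization, and the intensity values are made up.

```python
from math import log2
from statistics import mean

# Hypothetical raw intensities of one gene in three individuals
individuals = [100.0, 400.0, 1600.0]

pooled_log = log2(mean(individuals))                 # pool first, then take the log
averaged_log = mean([log2(x) for x in individuals])  # log each array, then average

print(round(pooled_log, 3), round(averaged_log, 3))
```

The pooled value is pulled toward the individual with the highest expression, so the two summaries differ, and the difference grows with the spread between individuals.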
Cons specific to microarrays
• Different genes have dramatically different biological variances.
• Not measuring this variance will result in genes with larger biological variance having a better chance of being considered more important