Selection of Differential Expression Genes in Microarray Experiments
James J. Chen, Ph.D.
Division of Biometry and Risk Assessment
National Center for Toxicological Research
Food and Drug Administration
e-mail: Jchen@nctr.fda.gov
FDA/Industry Workshop, September 19, 2003
Class comparison: identifying differentially expressed genes.
Class prediction: association between genes and samples; selecting a minimal combination of genes (classification).
Class discovery: discovering sample sub-types or gene clusters; selecting genes with similar expression patterns (cluster analysis).
Gene expression matrix (rows: genes, columns: samples)

Genes   S1    S2    ...   Sn
g1      y11   y12   ...   y1n
g2      y21   y22   ...   y2n
g3      y31   y32   ...   y3n
.       .     .           .
gm      ym1   ym2   ...   ymn
Identifying Differentially Expressed Genes
An important goal in the data analysis is to identify a set of genes that are differentially expressed among control and treated samples (groups).
To identify disease-related, drug-response, or biomarker genes (class comparison).
To enhance relationships among genes and samples for clustering or prediction (class prediction or class discovery).
Ranking Genes
The normalized data are analyzed one gene at a time (when there is a sufficient number of replicates n) using statistical methods such as ANOVA, permutation tests, or ROC analysis. Each gene receives a p-value and a rank.
Gene expression matrix with per-gene rank and p-value (rows: genes, columns: samples)

Genes   S1    S2    ...   Sn    Rank
g1      y11   y12   ...   y1n   r1 (p1)
g2      y21   y22   ...   y2n   r2 (p2)
g3      y31   y32   ...   y3n   r3 (p3)
.       .     .           .     .
gm      ym1   ym2   ...   ymn   rm (pm)
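The gene-by-gene ranking can be sketched in a few lines. This is a minimal illustration, not the analysis used in the talk: it assumes a one-way ANOVA F statistic, draws random label permutations instead of enumerating all of them, and runs on synthetic data. The function names `f_statistic` and `permutation_pvalues` are hypothetical.

```python
import numpy as np

def f_statistic(y, labels):
    """One-way ANOVA F statistic for one gene (y: 1-D array, labels: group ids)."""
    groups = [y[labels == g] for g in np.unique(labels)]
    grand_mean = y.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(y) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def permutation_pvalues(data, labels, n_perm=1000, seed=0):
    """Per-gene permutation p-values; data has shape (genes, samples)."""
    rng = np.random.default_rng(seed)
    observed = np.array([f_statistic(row, labels) for row in data])
    exceed = np.ones(len(data))  # the observed labelling counts as one permutation
    for _ in range(n_perm):
        perm = rng.permutation(labels)  # shuffle the group labels across samples
        stats = np.array([f_statistic(row, perm) for row in data])
        exceed += stats >= observed
    return exceed / (n_perm + 1)

# Synthetic example (assumption): 20 genes x 12 samples, 3 groups of 4.
# Gene 0 is shifted in the third group, so it should rank near the top.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 4)
data = rng.normal(size=(20, 12))
data[0, labels == 2] += 5.0

pvals = permutation_pvalues(data, labels, n_perm=500)
ranks = pvals.argsort().argsort() + 1  # rank 1 = smallest p-value
print(ranks[0], pvals[0])
```

Ties in the p-values are broken arbitrarily by `argsort` here; a real analysis would need a tie rule.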
P-value Approaches to Gene Selection
The p-values come from a mixture of altered and unaltered genes; altered genes should have smaller p-values.
How to choose a cut-off?
P-value for Gene Ranking:
Use p-values to rank the genes in order of evidence for differential expression: p(1) ≤ ... ≤ p(m) (an ordered evidence of differences).
Determining the cut-off: fixed p-value, number of rejections, estimated number of altered genes, decision analysis (ROC).
Multiple testing issue: FWE or FDR approach.
Approaches to Multiplicity Testing
Two approaches to multiplicity testing:
Family-wise error (FWE) rate approach: controlling the probability of false rejection of unaltered genes among all hypotheses (genes in the array) tested.
False discovery rate (FDR) approach: estimating the probability of false rejection of unaltered genes among the rejected hypotheses (significant genes).
Testing m hypotheses

True state   Significant   Non-significant   Total
Unaltered    V (α)         S (1 - α)         m0
Altered      U (1 - β)     T (β)             m1
Total        R             m - R             m

The number of true null hypotheses m0 is fixed but unknown. V and U are unobservable; R = U + V is observable. The FWE is the probability Pr(V > 0). The FDR is E(V/R), the expected proportion of unaltered genes among the genes declared significant.
P-Value FWE Approach
FWE: the probability of rejecting at least one true null hypothesis in the given family of hypotheses.
Fixed CWE = α (Storey, 2002): estimate pFDR.
Fixed R = r (Tsai, 2003): estimate cFDR = E(V | R = r)/r. The expected number of false significances is r x cFDR.
FDRs depend on the distribution of R and the conditional distribution of V given R.
Given m0 and α, the number of rejections is R = V + U, where V ~ Bin(m0, α) and U ~ Bin(m1, 1 - β).
The conditional distribution of V given R = r is a non-central hypergeometric distribution.
The cFDR = E(V | R = r)/r is estimated from the mean of V | R = r. It can also be computed from the distribution of R.
Estimating the cFDR requires an estimate of m0 (m0{MD}) and the distribution of R (by a parametric or bootstrap method).
\[
\mathrm{cFDR}(r) = \frac{m_0 \alpha}{r} \cdot \frac{P(R = r - 1 \mid m_0 - 1, m_1)}{P(R = r \mid m_0, m_1)}
\]
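As a numerical check, the cFDR can be computed in two ways under the binomial model for V and U: directly as E(V | R = r)/r from the conditional distribution, and as (m0 α / r) × P(R = r - 1 | m0 - 1, m1) / P(R = r | m0, m1). The sketch below is only an illustration. The values m0 = 444 and m1 = 96 come from the Sinica analysis; α = 0.01, r = 50 and especially β = 0.5 are arbitrary inputs chosen for the example, and all function names are hypothetical.

```python
from math import comb, isclose

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pmf_R(r, m0, m1, alpha, beta):
    """P(R = r) for R = V + U, V ~ Bin(m0, alpha), U ~ Bin(m1, 1 - beta)."""
    return sum(binom_pmf(v, m0, alpha) * binom_pmf(r - v, m1, 1 - beta)
               for v in range(r + 1))

def cfdr_direct(r, m0, m1, alpha, beta):
    """E(V | R = r) / r from the conditional distribution of V given R = r."""
    weighted = sum(v * binom_pmf(v, m0, alpha) * binom_pmf(r - v, m1, 1 - beta)
                   for v in range(r + 1))
    return weighted / pmf_R(r, m0, m1, alpha, beta) / r

def cfdr_ratio(r, m0, m1, alpha, beta):
    """(m0 * alpha / r) * P(R = r-1 | m0-1, m1) / P(R = r | m0, m1)."""
    return (m0 * alpha / r) * (pmf_R(r - 1, m0 - 1, m1, alpha, beta)
                               / pmf_R(r, m0, m1, alpha, beta))

# beta = 0.5 is a made-up value for illustration only.
direct = cfdr_direct(50, 444, 96, alpha=0.01, beta=0.5)
ratio = cfdr_ratio(50, 444, 96, alpha=0.01, beta=0.5)
print(direct, ratio)
```

The two numbers agree up to floating-point error, which reflects the identity v P(V = v) = m0 α P(V' = v - 1) for V' ~ Bin(m0 - 1, α).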
Taiwan Academia Sinica (Metal) Data*
Control and 8 metals, 55 one-channel arrays, 684 genes
* Data from Dr. D. T. Lee’s laboratory
Identifying DE Genes: Sinica Data
Objective: Control vs. As vs. Cd.
Design: 6 arrays per group (I, III, IV, VI, VII, IX; 18 arrays).
Microarray: As-chip-TCL01 (one-channel membrane array).
Probes: 708 genes with 16 housekeeping genes.
Data filtering: spots with more than 3 zero or negative intensities were removed, resulting in 540 genes.
Gene Expression matrix: 540 (genes) x 18 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Significance test: the p-values were computed using the F statistic from all 18C12 x 12C6 permutations.
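The size of the permutation set can be worked out directly. The short sketch below counts the ways to assign 18 arrays to three labelled groups of six, reading the permutation count as the product of the two binomial coefficients.

```python
from math import comb, factorial

# Choose 12 of the 18 arrays for two of the groups, then split those 12 into 6 and 6.
n_permutations = comb(18, 12) * comb(12, 6)

# Equivalent multinomial form: 18! / (6! * 6! * 6!).
multinomial = factorial(18) // factorial(6) ** 3

print(n_permutations)  # 17153136
```

At about 17 million label assignments, full enumeration is feasible for 540 genes, which fits the use of all permutations rather than a random sample.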
MCP Analysis of Sinica Data
Total number of genes: m = 540
Estimated number of unaltered genes: m0{MD} = 444
Number of rejections (r):
FWE = 0.05: α = 0.05/444: r = 11; α = 0.05/540: r = 9
FDR = 0.05: α = (0.05 x r)/444: r = 39; α = (0.05 x r)/540: r = 27
CWE = α = 0.01: r = 50
m1{MD} = 96: r = 96
The FDR, pFDR, cFDR, and eFDR estimates are close.
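The FWE and FDR cut-offs in this analysis can be sketched as two counting rules on the p-values. The sketch assumes the FWE rule is a Bonferroni-type threshold (level / m0) and reads the FDR rule as a step-up procedure that takes the largest r with p(r) ≤ level × r / m0. The function names and the p-values are made up for illustration; they are not the Sinica data.

```python
import numpy as np

def n_reject_fwe(pvals, level=0.05, m0=None):
    """Number of p-values at or below level / m0 (m0 defaults to all genes)."""
    p = np.asarray(pvals)
    m0 = len(p) if m0 is None else m0
    return int((p <= level / m0).sum())

def n_reject_fdr(pvals, level=0.05, m0=None):
    """Step-up rule: largest r such that p(r) <= level * r / m0."""
    p = np.sort(np.asarray(pvals))
    m0 = len(p) if m0 is None else m0
    r = np.arange(1, len(p) + 1)
    passing = np.nonzero(p <= level * r / m0)[0]
    return int(passing[-1] + 1) if passing.size else 0

# Illustrative p-values for m = 10 genes (not from the talk).
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]

print(n_reject_fwe(example), n_reject_fdr(example))              # using m
print(n_reject_fwe(example, m0=5), n_reject_fdr(example, m0=5))  # using an estimate of m0
```

Replacing m by a smaller estimate of m0 loosens both thresholds, which matches the pattern in the slide: more rejections with 444 than with 540.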
pFDR and cFDR Estimates using Different MCP Methods
Relationships between genes and samples: Effects of drugs (toxicants) on gene expression profiles, DNA diagnostic testing, or pathogen detection (classification).
Relationships among samples: Molecular classification of different tissue types or samples on the basis of gene expression (cluster analysis).
Relationships among genes: Genes of similar function yield similar expression patterns in microarray experiments (metabolic pathways, molecular function, biological process, etc.) (cluster analysis).
Class Prediction
Class prediction (classification): to develop a decision rule to predict the class membership of a new sample based on the expression profiles of some key genes.
Three Steps:
1. Selection of the discriminatory (key) gene set.
2. Formation of the discrimination rule: Fisher’s linear discriminant function, nearest-neighbor classifiers, support vector machines, and classification trees.
3. Cross-validation to estimate the accuracy of the prediction.
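The last two steps can be illustrated with a small nearest-neighbor classifier and a k-fold cross-validation loop. This is a minimal sketch on synthetic data, assuming Euclidean distance and majority voting; the functions `knn_predict` and `cv_accuracy` are hypothetical names, and the gene-selection step is taken as already done.

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    preds = []
    for x in test_x:
        dist = np.linalg.norm(train_x - x, axis=1)
        nearest = train_y[np.argsort(dist)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

def cv_accuracy(x, y, n_folds=5, k=3, seed=0):
    """Fraction of samples classified correctly when each fold is held out once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    correct = 0
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        pred = knn_predict(x[train_idx], y[train_idx], x[test_idx], k)
        correct += (pred == y[test_idx]).sum()
    return correct / len(y)

# Synthetic data (assumption): 30 samples x 10 key genes, 3 well-separated classes.
rng = np.random.default_rng(2)
y = np.repeat([0, 1, 2], 10)
x = rng.normal(size=(30, 10)) + 4.0 * y[:, None]

acc = cv_accuracy(x, y)
print(acc)
```

Here the samples are rows, so an expression matrix stored as genes x arrays would be transposed before use.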
Class Prediction: Sinica Data
Nine different treatments: Control, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV for a total of 55 samples (arrays).
Number of genes: 684 genes (some in duplicate or triplicate).
Gene expression matrix: 684 (genes) x 55 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Gene sets: Five gene sets are considered.
Classification methods: Fisher’s linear discriminant function, nearest-neighbor classifiers (k-nn)
1. F: Genes differentially expressed (globally) among the 9 groups, using the F test with FWE = 0.05. 38 genes
2. T: Treatment-specific marker genes. A one-vs-all t-test compares each group with the 8 remaining groups at adjusted p = 0.01, giving a gene set Gi for group i; T = G1 ∪ ... ∪ G9. 89 genes
3. I = F ∩ T: Intersection of F and T. 25 genes
4. U = F ∪ T: Union of F and T. 102 genes
5. A: Original gene set. 684 genes
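The sizes of the derived gene sets are consistent with inclusion-exclusion: 38 + 89 - 25 = 102. The short sketch below reproduces that arithmetic with Python sets; the integer gene identifiers are placeholders chosen only so that the set sizes match the counts above.

```python
# Placeholder gene identifiers (assumption): only the set sizes matter here.
F = set(range(0, 38))     # 38 genes from the global F test
T = set(range(13, 102))   # 89 treatment-specific marker genes, 25 shared with F

I = F & T                 # intersection of F and T
U = F | T                 # union of F and T

print(len(I), len(U))     # 25 102
```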
Average accuracy (%) of k-NN multi-class classification, based on 11-fold cross-validation over 1,000 permutations.

Metal        n     I      F      T      U      A
# of genes         25     38     89     102    684
Control      14    100    100    98.4   98.8   79.0
As           7     99.1   99.1   98.7   98.8   96.6
AsV          5     75.5   78.6   99.8   97.1   38.7
Cd           6     82.0   84.4   81.8   81.5   50.4
Cu           5     61.6   78.5   38.2   41.5   57.1
Ni           4     76.3   99.7   99.5   97.8   94.9
Cr           5     60.0   42.4   37.1   97.1   18.3
Sb           7     81.3   81.4   98.7   81.7   78.7
Pb           5     51.8   72.7   46.0   45.8   45.8
Total        55    81.6   85.3   82.0   80.5   65.6

The FLDA algorithm performed poorly; for example, the overall accuracies were 67.9% and 40.5% for I and F, respectively.
Cluster analysis with a 2-MDS plot for the treatment-specific marker genes in I: Each gene is labeled with the compound to which it gives a unique expression.

Metal   I
Ctrl    7
As      1
AsV     1
Cd      3
Cu      2
Ni      4
Cr      1
Sb      8
Pb      0

(1-) metric, complete linkage
Clustering results with 2-MDS plots for the 55 arrays, for gene sets I and A.
Panels: Gene set I (25 genes); Gene set A (684 genes)
Acknowledgements: Collaborators and Contributors
Dr. Frank Sistare & Staff (CDER/FDA; Merck)
Dr. Sue-Jane Wang (CDER/FDA)
Dr. T-C Lee & Staff (Academia Sinica, Taiwan)
Dr. C-h Chen & Staff (Academia Sinica, Taiwan)
Dr. Suzanne Morris & Staff (NCTR)
Dr. Jim Fuscoe & Staff (NCTR)
Dr. Ralph Kodell (NCTR)
Dr. Robert Delongchamp (NCTR)
Dr. Hueymiin Hsueh (Cheng-chi Univ., Taiwan)
Dr. Chen-an Tsai (NCTR)
Ms. Yi-Ju Chen (Penn State, NCTR)