Selection of Differential Expression Genes in Microarray Experiments James J. Chen, Ph.D. Division of Biometry and Risk Assessment National Center for.

Selection of Differential Expression Genes in Microarray Experiments

James J. Chen, Ph.D.

Division of Biometry and Risk Assessment

National Center for Toxicological Research

Food and Drug Administration

e-mail: [email protected]

FDA/Industry Workshop

September 19, 2003

Analysis of Microarray Data

Class comparison: identifying differentially expressed genes.

Class prediction: association between genes and samples; selecting a minimal combination of genes (classification).

Class discovery: discovering sample sub-types or gene clusters; selecting genes with similar expression patterns (cluster analysis).

Gene expression matrix (rows: genes; columns: samples)

Genes    S1     S2     ...    Sn
g1       y11    y12    ...    y1n
g2       y21    y22    ...    y2n
g3       y31    y32    ...    y3n
.        .      .             .
gm       ym1    ym2    ...    ymn

Identifying Differentially Expressed Genes

An important goal in the data analysis is to identify a set of genes that are differentially expressed among control and treated samples (groups).

To identify disease-related, drug-response, or biomarker genes (class comparison).

To enhance relationships among genes and samples for clustering or prediction (class prediction or class discovery).

Ranking Genes

The normalized data are analyzed one gene at a time (when there is a sufficient number of replicates n) using statistical methods: ANOVA, permutation tests, ROC, etc.

Gene expression matrix with the rank (and p-value) of each gene

Genes    S1     S2     ...    Sn     Rank (p-value)
g1       y11    y12    ...    y1n    r1 (p1)
g2       y21    y22    ...    y2n    r2 (p2)
g3       y31    y32    ...    y3n    r3 (p3)
.        .      .             .      .
gm       ym1    ym2    ...    ymn    rm (pm)
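The gene-by-gene ranking can be sketched as follows. This is a minimal illustration, assuming two groups (control and treated) and a permutation test on the difference of group means; the function names `perm_pvalue` and `rank_genes`, the toy expression values, and the number of random permutations are all hypothetical and are not taken from the talk.

```python
import random
from statistics import mean

def perm_pvalue(control, treated, n_perm=2000, rng=None):
    """Two-sided permutation p-value for a difference in group means."""
    rng = rng or random.Random(0)
    observed = abs(mean(treated) - mean(control))
    pooled = list(control) + list(treated)
    k = len(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the samples
        diff = abs(mean(pooled[k:]) - mean(pooled[:k]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

def rank_genes(expr, n_control, n_perm=2000, seed=0):
    """expr maps gene name -> expression values, control samples first."""
    rng = random.Random(seed)
    pvals = {gene: perm_pvalue(y[:n_control], y[n_control:], n_perm, rng)
             for gene, y in expr.items()}
    # Smallest p-value (strongest evidence) gets rank 1.
    return sorted(pvals.items(), key=lambda item: item[1])

# Toy data: 4 control arrays followed by 4 treated arrays per gene.
expr = {
    "g1": [5.0, 5.1, 4.9, 5.2, 5.0, 5.1, 4.8, 5.0],  # no shift
    "g2": [5.0, 5.1, 4.9, 5.2, 7.0, 7.2, 6.9, 7.1],  # clear shift
    "g3": [5.0, 5.3, 4.7, 5.1, 5.4, 5.0, 5.3, 5.2],  # weak shift
}
ranking = rank_genes(expr, n_control=4)
for rank, (gene, p) in enumerate(ranking, start=1):
    print(rank, gene, round(p, 4))
```

With only four arrays per group there are just 70 distinct relabelings, so the smallest attainable two-sided p-value is about 2/70; this is why the slide stresses a sufficient number of replicates.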

P-value Approaches to Gene Selection

The genes on an array are a mixture of altered and unaltered genes; altered genes should have smaller p-values.

How to choose a cut-off?

P-value for Gene Ranking:

Use p-values to rank the genes in order of evidence for differential expression: p(1) ≤ p(2) ≤ . . . ≤ p(m) (an ordered evidence of differences).

Determining the cut-off: fixed p-value, number of rejections, estimating the number of altered genes, decision (ROC).

Multiple testing issue: FWE or FDR approach.

Approaches to Multiplicity Testing

Two approaches to multiplicity testing:

Family-wise error (FWE) rate approach: controlling the probability of false rejection of unaltered genes among all hypotheses (genes in the array) tested.

False discovery rate (FDR) approach: estimating the probability of false rejection of unaltered genes among the rejected hypotheses (significant genes).

Testing m hypotheses

                Decision
True State      Significance    Non-significance    Total
Unaltered       V (α)           S (1 - α)           m0
Altered         U (1 - β)       T (β)               m1
Total           R               m - R               m

The number of true null hypotheses m0 is fixed but unknown. V and U are unobservable; R = U + V is observable. The FWE is the probability Pr(V > 0). The FDR is E(V/R), with V/R taken as 0 when R = 0 (rejecting unaltered genes among the significant genes).
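The two error rates can be illustrated with a small Monte Carlo sketch under independence. The assumptions here are not from the slides: each unaltered gene is rejected with probability `alpha`, each altered gene with probability `power` (that is, 1 - β), and `simulate_error_rates` is a hypothetical helper name.

```python
import random

def simulate_error_rates(m0, m1, alpha, power, n_rep=5000, seed=1):
    """Monte Carlo estimates of FWE = Pr(V > 0) and FDR = E(V/R)."""
    rng = random.Random(seed)
    any_false = 0
    fdp_sum = 0.0
    for _ in range(n_rep):
        v = sum(rng.random() < alpha for _ in range(m0))  # false rejections
        u = sum(rng.random() < power for _ in range(m1))  # true rejections
        r = v + u
        if v > 0:
            any_false += 1
        if r > 0:
            fdp_sum += v / r  # V/R counts as 0 when R = 0
    return any_false / n_rep, fdp_sum / n_rep

# All-null case: m = m0 = 100 at alpha = .01, so every rejection is false.
fwe, fdr = simulate_error_rates(m0=100, m1=0, alpha=0.01, power=0.0)
exact_fwe = 1 - (1 - 0.01) ** 100  # Pr(V > 0) under independence
print(round(fwe, 3), round(fdr, 3), round(exact_fwe, 4))
```

When all genes are unaltered, V/R equals 1 whenever R > 0, so the FDR coincides with the FWE; the exact value 1 - 0.99^100 is about 0.634.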

P-Value FWE Approach

FWE : The probability of rejecting at least one true null hypothesis in the given family of the hypotheses.

Bonferroni adjustment: set the CWE at α/m; then FWE ≤ α.

Improvements: Holm (Scand. J. Statist., 1979) step-down procedure:

(mp(1), (m-1)p(2), (m-2)p(3), . . . )

Estimating the number of unaltered genes m0: α = FWE/m0

(m0p(1), m0p(2), m0p(3), . . . )

Since m0 << m, great improvement!
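The three adjustments on this slide can be compared on a toy example. This is a sketch, not the speaker's code: the five p-values and the value m0 = 3 are made up for illustration, and the Holm adjusted p-values are made monotone with a running maximum, which is the usual convention.

```python
def bonferroni(pvals):
    """Single-step adjustment: m * p, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Step-down adjustment: m*p(1), (m-1)*p(2), ... with a running maximum."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

def m0_adjusted(pvals, m0):
    """Single-step adjustment using an estimate of the number of true nulls."""
    return [min(1.0, m0 * p) for p in pvals]

pvals = [0.001, 0.012, 0.020, 0.040, 0.300]  # hypothetical p-values
for name, adj in [("Bonferroni", bonferroni(pvals)),
                  ("Holm", holm(pvals)),
                  ("m0 = 3", m0_adjusted(pvals, 3))]:
    n_reject = sum(a < 0.05 for a in adj)
    print(name, [round(a, 3) for a in adj], "rejections:", n_reject)
```

In this example Bonferroni rejects one hypothesis at the 0.05 level, while the Holm and m0-based adjustments each reject two.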

Estimating Number of True Nulls

Difference of two adjacent p-values:

$d_j = p_{(j)} - p_{(j-1)}$, $j = 1, \ldots, m+1$, with $p_{(0)} = 0$ and $p_{(m+1)} = 1$.

Under independence and $H_0$, $d_j \sim \mathrm{Beta}(1, m_0)$ with mean $E(d_j) = 1/(m_0 + 1)$.

An estimate of $m_0$ is $m_0\{\mathrm{MD}\} = 1/\bar{d} - 1 \approx 1/E(d) - 1$.

Graphic algorithm to estimate m0

Benjamini and Hochberg (J. Educ. Behav. Stat., 2000); Hsueh et al. (J. Biopharm. Stat., 2003)
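One way to turn the mean-difference idea into an algorithm is sketched below. The slide does not say which differences are averaged, so this sketch makes two assumptions: the average is taken over the differences above p(j), which sum to 1 - p(j), and j is chosen by the stopping rule of the lowest-slope graphical algorithm (stop at the first j where the tail average decreases). The name `estimate_m0` is hypothetical, and the estimator used in the talk may differ in detail.

```python
def estimate_m0(pvals):
    """Tail mean-difference estimate of the number of true nulls (sketch)."""
    p = sorted(pvals)
    m = len(p)
    prev_slope = None
    slope = None
    for j in range(1, m + 1):
        # Mean of the m + 1 - j differences above p(j); they sum to 1 - p(j).
        slope = (1.0 - p[j - 1]) / (m + 1 - j)
        if prev_slope is not None and slope < prev_slope:
            break  # first decrease of the tail mean: stop here
        prev_slope = slope
    if slope <= 0.0:
        return float(m)
    # m0 = 1 / (mean difference) - 1, never more than m.
    return min(float(m), 1.0 / slope - 1.0)

# Toy example: 10 altered genes with tiny p-values and 90 unaltered genes
# whose p-values sit exactly on the expected uniform grid k / 91.
altered = [i * 1e-6 for i in range(1, 11)]
unaltered = [k / 91 for k in range(1, 91)]
m0_hat = estimate_m0(altered + unaltered)
print(round(m0_hat, 3))
```

On this idealized input the tail mean difference is 1/91, so the estimate recovers the 90 unaltered genes. With real p-values the estimate is noisy, as the simulation standard deviations in the talk indicate.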


Simulation results for the m0{MD} estimator for m = 1,000, based on 10,000 replicates.

Estimation: means and standard deviations (s.d.) of the estimates. The effect size is set to have 80% power at FWE = 0.25.

         Independent hypotheses    Correlated hypotheses (ρ = .25)
m0       Mean       s.d.           Mean       s.d.
1000     999.35     10.89          992.30     42.29
900      904.43     3.43           899.16     36.47
700      709.40     5.26           703.13     37.07

Testing: empirical familywise error rates at FWE = 0.05, 0.10, 0.25.

         Independent hypotheses    Correlated hypotheses (ρ = .25)
m0       0.05     0.10     0.25    0.05     0.10     0.25
1000     0.049    0.098    0.223   0.039    0.071    0.151
900      0.049    0.095    0.224   0.040    0.070    0.142
700      0.047    0.090    0.213   0.039    0.070    0.142

P-value FDR Methods

FDR: the expected proportion of falsely rejected null hypotheses among the rejected hypotheses.

FDR-controlled (BH, 1995): q-value = mp(r)/r < FDR

Fixed CWE = α (Storey, 2002): estimate pFDR.

Fixed R = r (Tsai, 2003): estimate cFDR = E(V | R = r)/r. The expected number of false significances is r × cFDR.

FDRs depend on the distribution of R and the conditional distribution of V | R:

FDR = pFDR × P(R > 0) = Σ_r cFDR(r) × Pr(R = r)

Chen (ICSA Bulletin, 2003)
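The FDR-controlling rule m p(r)/r can be sketched as a step-up search for the largest r that satisfies the bound. This is a minimal illustration on made-up p-values; the optional `m0` argument, which replaces m by an estimate of the number of unaltered genes, and the function name `bh_rejections` are assumptions of this sketch.

```python
def bh_rejections(pvals, q=0.05, m0=None):
    """Number of rejections: the largest r with m * p(r) / r <= q."""
    p = sorted(pvals)
    m = len(p) if m0 is None else m0
    r = 0
    for j, pj in enumerate(p, start=1):
        if m * pj / j <= q:
            r = j  # keep the largest j that satisfies the bound
    return r

# Hypothetical p-values for m = 10 genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
         0.065, 0.074, 0.205, 0.212, 0.216]
print(bh_rejections(pvals))          # uses m = 10
print(bh_rejections(pvals, m0=5))    # uses an assumed estimate m0 = 5
```

Replacing m by a smaller m0 relaxes every threshold, so the number of rejections can only stay the same or grow: here it rises from 2 to 5.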

Distribution of R and the cFDR for m = 1000 and m0 = 900 at α = .01 and 1= 2, assuming a paired t-test with five replicated arrays.

r      Pr(R=r)   cFDR      r      Pr(R=r)   cFDR      r      Pr(R=r)   cFDR
68     .0009     .0748     79     .0509     .0947     90     .0369     .1231
69     .0016     .0763     80     .0592     .0969     91     .0289     .1262
70     .0026     .0779     81     .0664     .0992     92     .0218     .1293
71     .0042     .0795     82     .0719     .1015     93     .0158     .1326
72     .0065     .0812     83     .0750     .1039     94     .0111     .1359
73     .0097     .0830     84     .0756     .1064     95     .0075     .1393
74     .0140     .0848     85     .0734     .1090     96     .0049     .1428
75     .0195     .0866     86     .0688     .1117     97     .0031     .1463
76     .0261     .0885     87     .0622     .1144     98     .0019     .1500
77     .0338     .0905     88     .0542     .1172     99     .0011     .1537
78     .0422     .0926     89     .0455     .1201     100    .0006     .1574

Unconditional estimates: FDR = .1067, pFDR = .1067, mFDR = .1075.

Conditional on E(R) = 83.7 ≈ 84 (mode): cFDR = .1064, eFDR = .1071.

FDR, pFDR, cFDR, and mFDR at α = .01 and .001, for m = 100 and 1000, F0 F1 under independence. The cFDR values are evaluated at [E(R)+1].

                 α = .01                           α = .001
m      m0        FDR     pFDR    cFDR    mFDR      FDR     pFDR    cFDR    mFDR
100    50        .0257   .0257   .0261   .0262     .0071   .0071   .0071   .0072
       80        .0933   .0933   .0960   .0971     .0258   .0271   .0270   .0282
       90        .1824   .1831   .1857   .1948     .0462   .0583   .0586   .0613
       95        .3012   .3129   .3119   .3380     .0650   .1147   .1163   .1212
       100       .6340   1.      1.      1.        .0952   1.      1.      1.
1000   500       .0261   .0261   .0261   .0262     .0072   .0072   .0072   .0072
       800       .0967   .0967   .0969   .0971     .0281   .0281   .0282   .0282
       900       .1935   .1935   .1946   .1948     .0608   .0608   .0609   .0613
       950       .3351   .3351   .3383   .3380     .1193   .1194   .1194   .1212
       1000      .9999   1.      1.      1.        .6324   1.      1.      1.

Conditional Distribution of V | R=r

Given m0 and α, the number of rejections is R = V + U, where V ~ Bin(m0, α) and U ~ Bin(m1, 1 - β).

The conditional distribution of V | R = r is a non-central hypergeometric distribution.

The cFDR = E(V | R = r)/r is estimated from the mean of V | R. It can also be computed from the distribution of R.

To estimate the cFDR: use m0{MD} and the distribution of R (parametric or bootstrap method).

$$\mathrm{cFDR}(r) = \frac{m_0\,\alpha}{r} \cdot \frac{P(R = r-1 \mid m_0 - 1,\, m_1)}{P(R = r \mid m_0,\, m_1)}$$
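The two routes to the cFDR, the mean of V | R = r and the ratio of two distributions of R, can be checked against each other numerically. The sketch below assumes independent V ~ Bin(m0, α) and U ~ Bin(m1, 1 - β); the values m0 = 90, m1 = 10, and power 0.75 are hypothetical, as are the function names.

```python
from math import exp, lgamma, log

def binom_pmf(n, p):
    """P(X = k) for k = 0..n, X ~ Bin(n, p), 0 < p < 1 (log-space for stability)."""
    return [exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                + k * log(p) + (n - k) * log(1 - p)) for k in range(n + 1)]

def dist_R(m0, m1, alpha, power):
    """P(R = r) for R = V + U with independent binomial V and U."""
    pv, pu = binom_pmf(m0, alpha), binom_pmf(m1, power)
    pr = [0.0] * (m0 + m1 + 1)
    for v, a in enumerate(pv):
        for u, b in enumerate(pu):
            pr[v + u] += a * b
    return pr

def cfdr_direct(r, m0, m1, alpha, power):
    """E(V | R = r) / r from the conditional distribution of V given R = r."""
    pv, pu = binom_pmf(m0, alpha), binom_pmf(m1, power)
    num = den = 0.0
    for v in range(max(0, r - m1), min(m0, r) + 1):
        w = pv[v] * pu[r - v]
        num += v * w
        den += w
    return num / den / r

def cfdr_ratio(r, m0, m1, alpha, power):
    """(m0 * alpha / r) * P(R = r-1 | m0-1, m1) / P(R = r | m0, m1)."""
    top = dist_R(m0 - 1, m1, alpha, power)[r - 1]
    bottom = dist_R(m0, m1, alpha, power)[r]
    return (m0 * alpha / r) * top / bottom

m0, m1, alpha, power = 90, 10, 0.01, 0.75  # hypothetical settings
for r in (6, 8, 10):
    print(r, round(cfdr_direct(r, m0, m1, alpha, power), 4),
          round(cfdr_ratio(r, m0, m1, alpha, power), 4))
```

Both functions give the same value for each r. Summing r × cFDR(r) × Pr(R = r) over r returns E(V) = m0 α, which is a convenient sanity check.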

Taiwan Academia Sinica (Metal) Data*

Control and 8 metals, 55 one-channel arrays, 684 genes

* Data from Dr. D. T. Lee’s laboratory

Identifying DE Genes: Sinica Data

Objective: Control vs. As vs. Cd.

Design: 6 arrays per group (I, III, IV, VI, VII, IX; 18 arrays).

Microarray: As-chip-TCL01 (one-channel membrane array).

Probes: 708 genes with 16 housekeeping genes.

Data filtering: spots with more than 3 zero or negative intensities were removed, resulting in 540 genes.

Gene expression matrix: 540 (genes) × 18 (arrays).

Normalization: GAM (lowess) to adjust for array effects.

Significance test: the p-values were computed using the F statistic from all 18C12 × 12C6 permutations.
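A permutation F test for one gene across three groups of six arrays can be sketched as follows. The talk enumerates all 18C12 × 12C6 relabelings; this sketch instead draws a random subset of relabelings, which is an assumption made to keep the example fast, and the expression values and function names are made up.

```python
import random
from math import comb

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    values = [y for g in groups for y in g]
    grand = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    k, n = len(groups), len(values)
    between = sum(len(g) * (mu - grand) ** 2
                  for g, mu in zip(groups, means)) / (k - 1)
    within = sum((y - mu) ** 2
                 for g, mu in zip(groups, means) for y in g) / (n - k)
    return between / within

def permutation_pvalue(groups, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value of the F statistic."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [y for g in groups for y in g]
    observed = f_statistic(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of arrays to groups
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(pooled[start:start + s])
            start += s
        if f_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Number of ways to split 18 arrays into three labeled groups of 6.
n_relabelings = comb(18, 12) * comb(12, 6)

# Hypothetical gene with a clear group effect (Control, As, Cd).
gene = [[1.0, 1.1, 0.9, 1.2, 1.0, 0.8],
        [2.0, 2.1, 1.9, 2.2, 2.0, 1.8],
        [3.0, 3.1, 2.9, 3.2, 3.0, 2.8]]
p_value = permutation_pvalue(gene)
print(n_relabelings, p_value)
```

The full enumeration has 17,153,136 relabelings, so the smallest p-value attainable from complete enumeration is far below the cut-offs discussed in the talk, whereas a Monte Carlo run with 2,000 draws cannot go below 1/2001.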

MCP Analysis of Sinica Data

Total number of genes: m = 540

Estimated number of unaltered genes: m0{MD} = 444

Number of rejections (r):

FWE = 0.05: α = 0.05/444 gives r = 11; α = 0.05/540 gives r = 9

FDR = 0.05: α = (0.05 × r)/444 gives r = 39; α = (0.05 × r)/540 gives r = 27

CWE = α = 0.01: r = 50

m1{MD} = 96: r = 96

The FDR, pFDR, cFDR, and eFDR estimates are close.

pFDR and cFDR Estimates using Different MCP Methods

MCP          α             r     p(r)          pFDR          cFDR          v̂
FWE(0.05)    1.13×10^-4*   11    1.12×10^-4    4.52×10^-3    4.50×10^-3    .05
FDR(0.05)    4.39×10^-3*   39    4.29×10^-3    4.87×10^-2    4.88×10^-2    2
CWE(0.01)    0.01          50    9.97×10^-3    8.85×10^-2    8.59×10^-2    4
m1{MD}                     96    3.28×10^-2    1.51×10^-1    1.53×10^-1    15

* FWE: α = 0.05/444; FDR: α = (0.05 × r)/444. v̂ = r × cFDR.

m = 540 and m0{MD} = 444
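The cut-offs and the expected numbers of false significances follow from simple arithmetic on the reported values. The short check below uses only numbers stated in the talk (m = 540, m0{MD} = 444, and the r and cFDR values for each method); the variable names are illustrative.

```python
m, m0_hat = 540, 444

# Per-comparison cut-offs when m0{MD} replaces m.
fwe_cut = 0.05 / m0_hat        # FWE = 0.05
fdr_cut = 0.05 * 39 / m0_hat   # FDR = 0.05 at r = 39
m1_hat = m - m0_hat            # estimated number of altered genes

# (r, cFDR) pairs as reported for each method.
reported = {
    "FWE(0.05)": (11, 4.50e-3),
    "FDR(0.05)": (39, 4.88e-2),
    "CWE(0.01)": (50, 8.59e-2),
    "m1{MD}": (96, 1.53e-1),
}
# Expected number of false significances: r x cFDR.
expected_false = {name: r * cfdr for name, (r, cfdr) in reported.items()}

print(f"{fwe_cut:.2e}", f"{fdr_cut:.2e}", m1_hat)
for name, v in expected_false.items():
    print(name, round(v, 2))
```

The products are roughly 0.05, 1.9, 4.3, and 14.7 expected false significances among 11, 39, 50, and 96 selected genes, respectively.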

Association Study

Relationships between genes and samples: Effects of drugs (toxicants) on gene expression profiles, DNA diagnostic testing, or pathogen detection (classification).

Relationships among samples: Molecular classification of different tissue types or samples on the basis of gene expression (cluster analysis).

Relationships among genes: Genes of similar function yield similar expression patterns in microarray experiments (metabolic pathways, molecular function, biological process, etc.) (cluster analysis).

Class Prediction

Class prediction (classification): to develop a decision rule to predict the class membership of a new sample based on the expression profiles of some key genes.

Three steps:

1. Selection of the discriminatory (key) gene set.

2. Formation of the discrimination rule: Fisher’s linear discriminant function, nearest-neighbor classifiers, support vector machines, and classification trees.

3. Cross-validation to estimate the accuracy of the prediction.
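Steps 2 and 3 can be sketched with a k-nearest-neighbor rule and k-fold cross-validation. This is an illustration only: it assumes the key genes have already been selected, uses Euclidean distance and majority voting, and runs on synthetic two-gene profiles; none of the names or settings come from the talk.

```python
import random
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k training samples closest to x."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(row, x)), label)
                   for row, label in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def cross_validate(x, y, n_folds=5, k=3, seed=0):
    """Fraction of samples classified correctly when held out."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tx = [x[i] for i in train]
        ty = [y[i] for i in train]
        correct += sum(knn_predict(tx, ty, x[i], k) == y[i] for i in fold)
    return correct / len(x)

# Synthetic expression profiles on two key genes, 10 arrays per class.
rng = random.Random(42)
x = ([[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(10)]
     + [[rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)] for _ in range(10)])
y = ["control"] * 10 + ["treated"] * 10
accuracy = cross_validate(x, y)
print(accuracy)
```

In a real analysis the gene selection of step 1 would normally be repeated inside each training fold; selecting genes on all samples first tends to make the cross-validated accuracy look better than it is.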

Class Prediction: Sinica Data

Nine different treatments: Control, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV for a total of 55 samples (arrays).

Number of genes: 684 genes (some in duplicate or triplicate).

Gene expression matrix: 684 (genes) × 55 (arrays).

Normalization: GAM (lowess) to adjust for array effects.

Gene sets: five gene sets are considered.

Classification methods: Fisher’s linear discriminant function, nearest-neighbor classifiers (k-nn)

Cross-validation: 10-fold cross-validation, 11 arrays/group.

Selection of Discriminatory Genes

Significance testing approach to gene selection:

1. F: Differentially expressed (global) genes among the 9 groups, using the F test with FWE = 0.05 (38 genes).

2. T: Treatment-specific marker genes. A one-vs-all t-test compares each group with the 8 remaining groups at adjusted p = 0.01, giving Gi; T = G1 ∪ … ∪ G9 (89 genes).

3. I = F ∩ T: intersection of F and T (25 genes).

4. U = F ∪ T: union of F and T (102 genes).

5. A: original gene set (684 genes).
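The sizes of the derived gene sets are tied together by inclusion-exclusion: |F ∪ T| = |F| + |T| - |F ∩ T| = 38 + 89 - 25 = 102. The sketch below reproduces that bookkeeping with Python sets; the gene identifiers are placeholders chosen only so that the set sizes match the reported ones.

```python
# Placeholder gene IDs: 38 genes in F, 89 in T, overlapping in 25.
F = {f"g{i}" for i in range(1, 39)}      # global F-test genes
T = {f"g{i}" for i in range(14, 103)}    # union of one-vs-all marker genes

I = F & T    # intersection of F and T
U = F | T    # union of F and T

print(len(F), len(T), len(I), len(U))
```

In practice T itself would be built as the union of the nine per-treatment sets G1, ..., G9, for example with `set().union(*marker_sets)` for a list of sets.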

Average accuracy (%) of k-NN multi-class classification, based on 11-fold cross-validation over 1,000 permutations.

Metal       n     I       F       T       U       A
# of genes        25      38      89      102     684
Control     14    100     100     98.4    98.8    79.0
As          7     99.1    99.1    98.7    98.8    96.6
AsV         5     75.5    78.6    99.8    97.1    38.7
Cd          6     82.0    84.4    81.8    81.5    50.4
Cu          5     61.6    78.5    38.2    41.5    57.1
Ni          4     76.3    99.7    99.5    97.8    94.9
Cr          5     60.0    42.4    37.1    97.1    18.3
Sb          7     81.3    81.4    98.7    81.7    78.7
Pb          5     51.8    72.7    46.0    45.8    45.8
Total       55    81.6    85.3    82.0    80.5    65.6

The FLDA algorithm performed poorly; for example, the overall accuracies are 67.9% and 40.5% for I and F, respectively.

Cluster analysis with a 2-MDS plot for the treatment-specific marker genes in I. Each gene is labeled with the compound for which it gives a unique expression.

Metal    I
Ctrl     7
As       1
AsV      1
Cd       3
Cu       2
Ni       4
Cr       1
Sb       8
Pb       0

(1-) metric, complete linkage

Clustering results with 2-MDS plots for the 55 arrays, for gene sets I and A

Panels: Gene set I (25 genes); Gene set A (684 genes)

Acknowledgements: Collaborators and Contributors

Dr. Frank Sistare & Staff (CDER/FDA; Merck)
Dr. Sue-Jane Wang (CDER/FDA)
Dr. T-C Lee & Staff (Academia Sinica, Taiwan)
Dr. C-h Chen & Staff (Academia Sinica, Taiwan)

Dr. Suzanne Morris & Staff (NCTR)
Dr. Jim Fuscoe & Staff (NCTR)
Dr. Ralph Kodell (NCTR)
Dr. Robert Delongchamp (NCTR)

Dr. Hueymiin Hsueh (Cheng-chi Univ., Taiwan)
Dr. Chen-an Tsai (NCTR)
Ms. Yi-Ju Chen (Penn State, NCTR)
