A Marginal Mixture Model for Selecting Differentially Expressed Genes across Two Types of Tissue Samples

The International Journal of Biostatistics

Volume 4, Issue 1 2008 Article 20


Weiliang Qiu∗, Wenqing He†, Xiaogang Wang‡, Ross Lazarus∗∗

∗Brigham and Women’s Hospital and Harvard Medical School
†University of Western Ontario
‡York University
∗∗Brigham and Women’s Hospital and Harvard Medical School

Copyright ©2008 The Berkeley Electronic Press. All rights reserved.


Abstract

Bayesian hierarchical models that characterize the distributions of (transformed) gene profiles have been proven very useful and flexible in selecting differentially expressed genes across different types of tissue samples (e.g., Lo and Gottardo, 2007). However, the marginal mean and variance of these models are assumed to be the same for different gene clusters and for different tissue types. Moreover, it is not easy to determine which of the many competing Bayesian hierarchical models provides the best fit for a specific microarray data set. To address these two issues, we propose a marginal mixture model that directly models the marginal distribution of transformed gene profiles. Specifically, we approximate the marginal distributions of transformed gene profiles via a mixture of three-component multivariate Normal distributions, each component of which has the same structures of marginal mean vector and covariance matrix as those for Bayesian hierarchical models, but the values can differ. Based on the proposed model, a method is derived to select genes differentially expressed across two types of tissue samples. The derived gene selection method performs well on a real microarray data set and consistently has the best performance (based on class agreement indices) compared with several other gene selection methods on simulated microarray data sets generated from three different mixture models.

KEYWORDS: Box-Cox transformation, differentially expressed gene, EM algorithm, hierarchical structure, mixture models, posterior probability

∗Thanks to Drs. Harry Joe, Vincent Carey, Alvin Kho, and two referees for valuable comments and suggestions. This research was supported by NIH grant 1 R01 HG003646-01A1, and He’s and Wang’s research was funded by the Natural Sciences and Engineering Research Council of Canada.

1 Introduction

Microarray technology allows simultaneous measurement of the expression levels of thousands of genes within a biological tissue sample. By comparing the arrays of different types of tissue samples (e.g., abnormal versus normal), researchers can investigate the joint effects of groups of genes on diseases. This information may help develop methods to diagnose, or even cure, diseases. An important step in the analysis of microarray data is to identify genes differentially expressed across types of tissue samples. Powerful gene selection helps pinpoint genes affecting disease status.

Microarray data are so-called high-dimensional-low-sample-size data, because the number of genes (variables) is far bigger than the number of tissue samples (data points). Hence conventional variable selection methods cannot be directly applied to select differentially expressed genes. One common gene selection approach is to first invoke a statistical hypothesis test (e.g., two-sample t-test or Wilcoxon test) for each gene, and then to claim that genes are differentially expressed if the corresponding p-values of the tests fall below a threshold. A potential limitation of this approach is that the variance estimates might be unstable due to small sample size. Several approaches have been proposed to stabilize the variance estimates for the two-sample t-test (e.g., SAM, proposed by Tusher et al. (2001)). Another potential limitation is the so-called multiple testing problem. One remedy is to adjust p-values of the tests so that family-wise error rate (FWER) or false-discovery rate (FDR) is controlled.
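The p-value adjustment step mentioned above can be sketched as follows. This is a minimal illustration, assuming the Benjamini-Hochberg step-up procedure as the FDR-controlling adjustment; the text does not name a specific procedure, and the function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(pvalues, fdr_level=0.05):
    """Indices of genes whose adjusted p-value is at or below fdr_level."""
    adjusted = benjamini_hochberg(pvalues)
    return [i for i, q in enumerate(adjusted) if q <= fdr_level]


# Hypothetical gene-wise p-values (illustrative only, not from the paper).
example_pvalues = [0.01, 0.04, 0.03, 0.005]
example_adjusted = benjamini_hochberg(example_pvalues)
```

In practice the p-values would come from the gene-wise tests described above; only the adjustment and thresholding are shown here.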

Instead of adjusting p-values, Efron et al. (2001) directly modeled the distribution of the summary statistics (e.g., two-sample t-test statistics) of gene profiles via a mixture of two-component univariate distributions. One component of the mixture of distributions corresponds to the distribution of summary statistics for differentially expressed genes. The other corresponds to the distribution of summary statistics for nondifferentially expressed genes. The gene-cluster membership is then determined based on the posterior probability that a gene belongs to a cluster given its summary statistic. An advantage of this approach is that multiple testing is not involved, and information from a large number of genes can be used. Publications by Pan (2002), He (2004), Do et al. (2005), and McLachlan et al. (2006) also discussed this approach. Broet et al. (2002) extended this approach by proposing a summary statistic that can detect different levels of gene expression changes. Broet et al. (2004) later improved the method of Broet et al. (2002) to handle more than two classes of tissue samples.

Lee et al. (2000) viewed the microarray data from a different angle: regarding gene profiles as data points and tissue samples as variables, allowing microarray data to be viewed as low-dimensional-large-sample-size data. Lee et al. (2000) and its extension (Lee et al., 2002) directly modeled the distributions of the gene profiles by a mixture of two-component multivariate distributions. Gene-cluster membership can then be determined by the posterior probability that a gene belongs to a gene cluster given its profile. Multiple testing is not involved in this approach. Moreover, information could be borrowed across genes, hence this approach has the potential to perform well even for a small number of tissue samples. One potential limitation of this approach is the assumption that different genes in the same cluster have the same conditional mean vectors and covariance matrices.

Modeling component distributions by hierarchical distributions allows more flexible models that have different conditional mean vectors and covariance matrices for genes in the same cluster. Several such Bayesian hierarchical models have been proposed, such as GG (Gamma-Gamma) (Newton et al., 2001), LNN (LogNormal-Normal) (Kendziorski et al., 2003), eGG (extended GG) (Lo and Gottardo, 2007), and eLNN (extended LNN) (Lo and Gottardo, 2007).

We observed that Bayesian hierarchical models have special structures of marginal mean vectors and covariance matrices (see Appendix A). Specifically, the marginal means and variances of gene expression levels for different tissue types are assumed to be the same. It is possible to allow Bayesian hierarchical models to have tissue-type specific hyperparameters so that we can distinguish the marginal mean-vector and covariance-matrix structures of different tissue types. However, it would be difficult to interpret these hyperparameters. Moreover, different choices of the conditional distribution and prior distribution will result in different Bayesian hierarchical models. This raises the question of which model is the “best” for a given set of microarray data. Moreover, when one applies Bayesian hierarchical models, the marginal distributions have to be derived first to calculate the posterior probability that a gene belongs to a gene cluster given its profile (e.g., Lo and Gottardo, 2007). Sometimes these marginal distributions are not easy to derive. Furthermore, in practice investigators are usually interested in a three-cluster partition of genes: genes over-expressed in abnormal tissue samples, genes non-differentially expressed, and genes under-expressed in abnormal tissue samples.

These observations motivate us to propose a new model that (1) allows tissue-type-specific hyperparameters; and (2) explicitly distinguishes between genes over-expressed and under-expressed in abnormal tissue samples. Most importantly, instead of constructing Bayesian hierarchical models, we directly model the marginal distributions of transformed gene profiles. Specifically, we approximate the marginal distributions of transformed gene profiles via a mixture of three-component multivariate Normal distributions, each component of which has the same structures of mean-vector and covariance-matrix as those of Bayesian hierarchical models. We call this the marginal mixture model (MMD).

The remainder of this article is structured as follows. In Section 2, we describe MMD, derive a gene selection method, and present estimation formulae of several error rates for real microarray data sets. In Section 3, we use a real microarray data set to illustrate MMD. In Section 4, we use simulated microarray data sets generated from three different mixture models to assess the performance of the proposed MMD framework. We will use the same names of the mixture models to refer to gene selection methods based on mixture models. Finally we comment on the proposed framework and possible extensions in Section 5.

2 Marginal mixture model

Let $X = (X_1, X_2, \ldots, X_{m_c}, X_{m_c+1}, X_{m_c+2}, \ldots, X_{m_c+m_n})^T$, an $m \times 1$ vector, be the transformed gene profile for a randomly selected gene over $m$ tissue samples ($m = m_c + m_n$, where $m_c$ is the number of abnormal tissue samples and $m_n$ the number of normal tissue samples). We assume that data have been normalized to remove the effects of confounding factors, such as dye effect, chip effect, and batch effect. The distribution of $X$ is assumed to be a three-component mixture of multivariate Normal distributions with marginal density:

$$
f(x|\theta_1, \theta_2, \theta_3) = \pi_1 f_1(x|\theta_1) + \pi_2 f_2(x|\theta_2) + \pi_3 f_3(x|\theta_3),
\qquad \pi_1 + \pi_2 + \pi_3 = 1, \quad \pi_i > 0, \quad i = 1, 2, 3, \tag{1}
$$

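where $\pi_1$, $\pi_2$, $\pi_3$ are mixture proportions. The $m \times 1$ vector $x$ is a realization of the random vector $X$; $\theta_k$ is the parameter set for the $k$-th component distribution $f_k$, $k = 1, 2, 3$; and $f_1$, $f_2$, and $f_3$ are the density functions for multivariate Normal distributions with the mean vectors

$$
\boldsymbol{\mu}_1 = \begin{pmatrix} \mu_{c1} \mathbf{1}_{m_c} \\ \mu_{n1} \mathbf{1}_{m_n} \end{pmatrix}, \qquad
\boldsymbol{\mu}_2 = \mu_2 \mathbf{1}_m, \qquad
\boldsymbol{\mu}_3 = \begin{pmatrix} \mu_{c3} \mathbf{1}_{m_c} \\ \mu_{n3} \mathbf{1}_{m_n} \end{pmatrix} \tag{2}
$$

and covariance matrices

$$
\Sigma_1 = \begin{pmatrix} \sigma^2_{c1} R_{c1} & 0 \\ 0 & \sigma^2_{n1} R_{n1} \end{pmatrix}, \qquad
\Sigma_2 = \sigma^2_2 R_2, \qquad
\Sigma_3 = \begin{pmatrix} \sigma^2_{c3} R_{c3} & 0 \\ 0 & \sigma^2_{n3} R_{n3} \end{pmatrix}, \tag{3}
$$

respectively, where the correlation matrix

$$
R_t = (1 - \rho_t) \left[ I_{n_t} + \frac{\rho_t}{1 - \rho_t} \mathbf{1}_{n_t} \mathbf{1}_{n_t}^T \right], \tag{4}
$$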

$t = c1, n1, 2, c3$, or $n3$; $n_t = m_c$ if $t = c1$ or $c3$; $n_t = m$ if $t = 2$; $n_t = m_n$ if $t = n1$ or $n3$. Here we assume, without loss of generality, that the first $m_c$ elements of the random vector $X$ are for the abnormal tissue samples and the remaining $m_n$ elements are for the normal tissue samples. Let $\theta_1 = (\mu_{c1}, \sigma^2_{c1}, \rho_{c1}, \mu_{n1}, \sigma^2_{n1}, \rho_{n1})^T$, $\theta_2 = (\mu_2, \sigma^2_2, \rho_2)^T$, $\theta_3 = (\mu_{c3}, \sigma^2_{c3}, \rho_{c3}, \mu_{n3}, \sigma^2_{n3}, \rho_{n3})^T$. Note that $\mu_{c1} > \mu_{n1}$ for component 1, in which genes are overexpressed in abnormal tissue samples, and $\mu_{c3} < \mu_{n3}$ for component 3, in which genes are underexpressed in abnormal samples. Our prior belief is that the majority of genes are usually non-differentially expressed, so we assume $\pi_2 > \pi_1$ and $\pi_2 > \pi_3$.
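Because every block of the covariance matrices has the equal-correlation form (4), the component densities can be evaluated without forming or inverting a matrix. The sketch below is illustrative only: it relies on the standard closed-form determinant and inverse of an equal-correlation matrix, and the function names and the layout of the parameter tuples are assumptions made for this example, not the authors' implementation.

```python
import math


def cs_normal_logpdf(x, mu, sigma2, rho):
    """Log density of N(mu * 1, sigma2 * R), R = (1 - rho) I + rho 1 1^T.

    Requires sigma2 > 0 and -1 / (n - 1) < rho < 1, where n = len(x).
    """
    n = len(x)
    dev = [xi - mu for xi in x]
    sum_sq = sum(d * d for d in dev)
    sum_dev = sum(dev)
    # Quadratic form (x - mu 1)^T Sigma^{-1} (x - mu 1).
    quad = (sum_sq - rho * sum_dev ** 2 / (1.0 + (n - 1) * rho)) / (
        sigma2 * (1.0 - rho)
    )
    # log det(Sigma) = n log(sigma2) + (n - 1) log(1 - rho) + log(1 + (n - 1) rho).
    log_det = (
        n * math.log(sigma2)
        + (n - 1) * math.log(1.0 - rho)
        + math.log(1.0 + (n - 1) * rho)
    )
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


def component_logpdf(x, m_c, theta):
    """Log density of one mixture component for a gene profile x.

    theta holds (mu_c, s2_c, rho_c, mu_n, s2_n, rho_n) for components 1
    and 3, or (mu, s2, rho) for component 2 (hypothetical layout).
    """
    if len(theta) == 3:
        return cs_normal_logpdf(x, *theta)
    mu_c, s2_c, rho_c, mu_n, s2_n, rho_n = theta
    # Block-diagonal covariance: the two tissue types contribute independently.
    return cs_normal_logpdf(x[:m_c], mu_c, s2_c, rho_c) + cs_normal_logpdf(
        x[m_c:], mu_n, s2_n, rho_n
    )
```

Working on the log scale keeps the values finite even when a profile lies far from a component mean.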

Model (1) assumes that (a) marginal means and variances of expression levels for a given gene from the same type of tissue samples are the same; (b) marginal correlations between any pair of expression levels for a given differentially expressed gene from the same type of tissue samples are the same; (c) marginal correlations between any pair of expression levels for a given gene from different types of tissue samples are zero; (d) gene profiles in the same gene cluster have the same marginal distributions; and (e) gene profiles are marginally independent. These assumptions capture the structural information of microarray data. For instance, tissue samples of the same type are usually assumed to be from the same population and hence have the same marginal distribution; tissue samples of different types are usually assumed to be from different populations and hence marginally independent. These assumptions are commonly imposed on microarray data to characterize their structure (e.g., Newton et al., 2001; Kendziorski et al., 2003; Lo and Gottardo, 2007).

We would like to emphasize that MMD could approximate a wide range of Bayesian hierarchical models, including those that allow both gene-specific conditional means and gene-specific conditional variances, as long as we can find appropriate transformations to transform the distribution of gene expression levels close to a Normal distribution.

In practice, transforming data (e.g., natural logarithm transformation) is routine in the analysis of microarray data to stabilize variation of gene expression levels and to obtain desirable statistical properties, such as normality (Lee, 2004). We use the Box-Cox transformation (Box and Cox, 1964), which includes natural logarithm transformation as a special case. Although Box-Cox transformation, which is a family of power transformations, cannot always transform a non-Normal distribution to an exact Normal distribution, Draper and Cox (1969) pointed out that “the transformation estimated will correspond to a distribution . . . that is usually to a nearly symmetrical distribution.” We apply the same Box-Cox transformation for all gene expression levels. Appendix B gives details on how we choose Box-Cox transformation based on a QQ-plot, which is a common tool to assess normality of data.

To make gene clusters tighter and more isolated, gene-profile scaling is recommended after Box-Cox transformation so that expression levels of each gene have mean zero and variance one. After Box-Cox transformation and gene-profile scaling, the within-cluster gene profile variation tends to be smaller than the between-cluster gene profile variation, and the marginal distribution of expression levels for non-differentially expressed genes would be close to the standard Normal distribution $N(0, 1)$.
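The two preprocessing steps can be sketched as below. This is a minimal illustration under stated assumptions: the Box-Cox parameter `lam` is taken as given (the paper chooses it from a QQ-plot, Appendix B), the variance is computed with the n - 1 denominator, and all function names are hypothetical.

```python
import math
import statistics


def box_cox(y, lam):
    """Box-Cox power transformation of a positive expression level y."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)  # limiting case: natural logarithm
    return (y ** lam - 1.0) / lam


def scale_profile(profile):
    """Center one gene profile to mean 0 and scale it to sample variance 1."""
    mean = statistics.mean(profile)
    sd = statistics.stdev(profile)  # n - 1 denominator (an assumption)
    if sd == 0:
        raise ValueError("a constant profile cannot be scaled")
    return [(v - mean) / sd for v in profile]


def transform_profile(raw_profile, lam):
    """Apply the same Box-Cox transformation to every level, then scale."""
    return scale_profile([box_cox(y, lam) for y in raw_profile])
```

The same `lam` would be used for all genes, matching the statement that one Box-Cox transformation is applied to all expression levels.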

From Model (1), a gene selection method can be derived. The $i$-th gene is assigned to one of the three clusters based on its posterior probability

$$
\Pr(\text{gene } i \in \text{cluster } k \mid x_i) = \frac{\pi_k f_k(x_i|\theta_k)}{\pi_1 f_1(x_i|\theta_1) + \pi_2 f_2(x_i|\theta_2) + \pi_3 f_3(x_i|\theta_3)}, \qquad k = 1, 2, 3, \tag{5}
$$

where $x_i$ is the (transformed) profile for the $i$-th gene, $i = 1, \ldots, p$. Specifically, we classify gene $i$ with profile $x_i$ to cluster $C_{k_0}$ if the posterior probability that the $i$-th gene belongs to cluster $C_{k_0}$ given $x_i$ is the largest, i.e.,

$$
k_0 = \arg\max_{k = 1, 2, 3} \Pr(\text{gene } i \in C_k \mid x_i). \tag{6}
$$
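The posterior probabilities (5) and the assignment rule (6) can be computed from the log mixture proportions and log component densities, as in the following sketch. The log-sum-exp shift is a numerical safeguard added for this example, and the function names are hypothetical.

```python
import math


def posterior_probabilities(log_pi, log_f):
    """Posterior cluster probabilities from log pi_k and log f_k(x_i)."""
    log_joint = [lp + lf for lp, lf in zip(log_pi, log_f)]
    shift = max(log_joint)  # log-sum-exp shift to avoid underflow
    weights = [math.exp(v - shift) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


def assign_cluster(log_pi, log_f):
    """Return k0 in {1, 2, 3}: the cluster with the largest posterior."""
    post = posterior_probabilities(log_pi, log_f)
    return 1 + max(range(len(post)), key=lambda k: post[k])
```

A gene is then declared differentially expressed when `assign_cluster` returns 1 or 3.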

The estimates of $\theta_k$, $k = 1, 2, 3$, can be obtained via the EM algorithm (Dempster et al., 1977). More details are shown in Appendix C. It is well known that the EM algorithm may be sensitive to the choice of initial values of model parameters, so several different initial values are often used, and the final result that maximizes the likelihood function is chosen. We use two initial sets of model parameters. One is based on gene-wise t-tests; the other is based on a model-based clustering (Mclust) algorithm (Fraley and Raftery, 1999). Details are given in Appendix D. The stopping criterion is given in Appendix E.

Several error rates have been proposed to assess the performance of a gene selection method from different perspectives, including FDR (the percentage of nondifferentially expressed genes among selected genes), FNDR (the percentage of differentially expressed genes among unselected genes), FPR (the percentage of selected genes among nondifferentially expressed genes), and FNR (the percentage of unselected genes among differentially expressed genes). It is challenging to estimate these error rates for real data sets, because whether a gene is differentially expressed is usually unknown. One approach to estimating these error rates for real microarray data sets is the model-based approach, which first models the distribution of gene profiles, test statistics, or transformed p-values, then derives the theoretical formulae for the error rates, and finally plugs the estimates of the unknown parameters into the formulae (e.g., Efron et al., 2001; Efron, 2004; Pawitan et al., 2005; McLachlan et al., 2004, 2006). Based on Model (1) and the gene selection criterion (6), the formulae of FDR, FNDR, FPR, and FNR for MMD can be derived as:

$$
\begin{aligned}
\mathrm{FDR} &= \Pr\left(X \in C_2 \mid \pi_2 f_2(X) < \max_{k \neq 2} \pi_k f_k(X)\right), \\
\mathrm{FNDR} &= \Pr\left(X \notin C_2 \mid \pi_2 f_2(X) = \max_{k = 1, 2, 3} \pi_k f_k(X)\right), \\
\mathrm{FPR} &= \Pr\left(\pi_2 f_2(X) < \max_{k \neq 2} \pi_k f_k(X) \mid X \in C_2\right), \\
\mathrm{FNR} &= \Pr\left(\pi_2 f_2(X) = \max_{k = 1, 2, 3} \pi_k f_k(X) \mid X \notin C_2\right),
\end{aligned}
$$

where $C_2$ denotes the cluster of nondifferentially expressed genes. While the above probabilities are analytically intractable, we can estimate these probabilities based on the results from McLachlan et al. (2004):

$$
\begin{aligned}
\widehat{\mathrm{FDR}} &= \frac{\sum_{i=1}^{p} W_2(x_i)\, I_{[\pi_2 f_2(x_i|\theta_2) < \max_{k \neq 2} \pi_k f_k(x_i|\theta_k)]}}{\sum_{i=1}^{p} I_{[\pi_2 f_2(x_i|\theta_2) < \max_{k \neq 2} \pi_k f_k(x_i|\theta_k)]}}, \\
\widehat{\mathrm{FNDR}} &= \frac{\sum_{i=1}^{p} [1 - W_2(x_i)]\, I_{[\pi_2 f_2(x_i|\theta_2) = \max_{k = 1, 2, 3} \pi_k f_k(x_i|\theta_k)]}}{\sum_{i=1}^{p} I_{[\pi_2 f_2(x_i|\theta_2) = \max_{k = 1, 2, 3} \pi_k f_k(x_i|\theta_k)]}}, \\
\widehat{\mathrm{FPR}} &= \frac{\sum_{i=1}^{p} W_2(x_i)\, I_{[\pi_2 f_2(x_i|\theta_2) < \max_{k \neq 2} \pi_k f_k(x_i|\theta_k)]}}{\sum_{i=1}^{p} W_2(x_i)}, \\
\widehat{\mathrm{FNR}} &= \frac{\sum_{i=1}^{p} [1 - W_2(x_i)]\, I_{[\pi_2 f_2(x_i|\theta_2) = \max_{k = 1, 2, 3} \pi_k f_k(x_i|\theta_k)]}}{\sum_{i=1}^{p} [1 - W_2(x_i)]},
\end{aligned} \tag{7}
$$

where $I_A$ is the indicator function, which is one if event $A$ is true and zero otherwise. The function $W_2(x_i)$ is defined as $W_2(x_i) = \pi_2 f_2(x_i|\theta_2)/f(x_i|\theta_1, \theta_2, \theta_3)$.
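The estimators in (7) depend on the data only through the posterior probabilities, so they can be sketched as a single pass over the genes. In the illustration below, each gene is represented by a hypothetical triple `(w1, w2, w3)` of posterior probabilities, with `w2` playing the role of $W_2(x_i)$; the function name and the NaN convention for empty denominators are assumptions of this example.

```python
def estimate_error_rates(posteriors):
    """Estimate FDR, FNDR, FPR and FNR from per-gene posterior triples.

    posteriors: list of (w1, w2, w3); w2 is the posterior probability of
    the non-differentially expressed cluster.
    """
    sel_w2 = 0.0    # sum of w2 over selected genes
    n_sel = 0       # number of selected genes
    unsel_de = 0.0  # sum of (1 - w2) over unselected genes
    n_unsel = 0     # number of unselected genes
    for w1, w2, w3 in posteriors:
        if w2 < max(w1, w3):  # gene selected as differentially expressed
            sel_w2 += w2
            n_sel += 1
        else:                 # cluster 2 attains the maximum
            unsel_de += 1.0 - w2
            n_unsel += 1
    total_w2 = sum(w[1] for w in posteriors)
    total_de = len(posteriors) - total_w2

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {
        "FDR": ratio(sel_w2, n_sel),
        "FNDR": ratio(unsel_de, n_unsel),
        "FPR": ratio(sel_w2, total_w2),
        "FNR": ratio(unsel_de, total_de),
    }
```

Comparing `w2` with `max(w1, w3)` is equivalent to comparing $\pi_2 f_2$ with $\max_{k \neq 2} \pi_k f_k$, because the three posteriors share the same denominator.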

3 Example

In this section we use a publicly available real microarray data set, initially studied in Golub et al. (1999), to illustrate MMD. The Golub data set we use consists of 3,051 gene profiles over 11 acute myeloid leukemia tissue samples (denoted as AML and regarded as abnormal in MMD) and 27 acute lymphoblastic leukemia samples (denoted as ALL and regarded as normal in MMD), pre-processed by the method described in Dudoit et al. (2002). The upper panel of Figure 1 shows the histograms of expression levels for AML samples and ALL samples. The histograms in the lower panel show that the Box-Cox transformation and gene-profile scaling transformed the non-normal Golub data close to Normal.

The parameter estimates by MMD are: $\pi_1 = 0.158$, $\pi_2 = 0.611$, $\pi_3 = 0.231$, $\mu_{c1} = 0.715$, $\sigma^2_{c1} = 1.024$, $\rho_{c1} = -0.056$, $\mu_{n1} = -0.291$, $\sigma^2_{n1} = 0.660$, $\rho_{n1} = -0.028$, $\mu_2 = 0.000$, $\sigma^2_2 = 0.974$, $\rho_2 = -0.027$, $\mu_{c3} = -0.654$, $\sigma^2_{c3} = 0.573$, $\rho_{c3} = -0.018$, $\mu_{n3} = 0.266$, $\sigma^2_{n3} = 0.892$, $\rho_{n3} = -0.030$. The estimated error rates are FDR = 0.089, FNDR = 0.054, FPR = 0.057, and FNR = 0.085. The estimated marginal correlations are close to zero, indicating that tissue samples of the transformed Golub data set might be marginally independent. The density estimates obtained by assuming $\rho_{c1} = \rho_{n1} = \rho_2 = \rho_{c3} = \rho_{n3} = 0$ fit the histograms well (see the lower panel of Figure 1). Table 1 shows the mean posterior probabilities of belonging to classes 1, 2, or 3 in each posterior class, indicating that the overlap between cluster 1 and cluster 3 is small and that there is some overlap between cluster 1 and cluster 2 and between cluster 2 and cluster 3.

[Figure 1 panels. Upper: "AML" and "ALL". Lower: "Histogram of gene expression levels for AML cases imposed with estimated density (case)" and "Histogram of gene expression levels for ALL cases imposed with estimated density (control)", with density curves labeled overall, component 1, component 2, and component 3. Axes in all panels: expression level (horizontal), density (vertical).]

Figure 1: Histograms of expression levels for AML samples and ALL samples. Upper panel: Pre-processing was done as described in Dudoit et al. (2002); Lower panel: After Dudoit et al.’s pre-processing, Box-Cox transformation and gene profile scaling.

To study the performance of MMD relative to other gene selection methods, we compare MMD with the empirical Bayesian method (denoted as EB), gene selection methods based on GG, LNN, eGG, and eLNN models, and two methods we used to generate initial values of MMD model parameters: one is a gene-wise T-test (denoted as gwTtest), and the other is based on Mclust (Fraley and Raftery, 1999). We designated this revised Mclust as rMclust.

We perform Box-Cox transformation and gene-profile scaling before applying MMD, gwTtest, and rMclust. We apply EB to both untransformed and transformed data, and designate the two approaches as EB and EB.t, respectively. We apply GG, eGG, LNN, and eLNN directly to the original data set. For this real data set, we follow the example of Efron et al. (2001), using 0.90 as the cutoff for posterior probability when applying EB and EB.t.

Table 1: Mean posterior probability of belonging to classes 1, 2, or 3 in each posterior class

                     To class 1   To class 2   To class 3
posterior class 1    0.909        0.091        0.000
posterior class 2    0.023        0.946        0.031
posterior class 3    0.000        0.088        0.912

The estimated proportions of differentially expressed genes are: 0.389 (MMD), 0.332 (GG), 0.301 (eGG), 0.315 (LNN), 0.335 (eLNN), 0.318 (EB), 0.365 (EB.t), 0.352 (gwTtest), 0.352 (rMclust). The large estimated proportions (0.3 to 0.4) are probably because (a) acute leukemias are complex diseases, hence many genes might be involved; and (b) there are “borderline” genes, the cluster memberships of which are not clear-cut (see also the Discussion Section). It is interesting that (1) MMD identified more genes as differentially expressed than other methods did; (2) more genes were identified as differentially expressed by methods applied to transformed data (MMD, EB.t, gwTtest, rMclust) than by methods applied to untransformed data (GG, eGG, LNN, eLNN, EB). Because we do not know which genes are truly differentially expressed for the Golub data, it is hard to say which method performed best. In the next section, we will compare these gene selection methods via a simulation study in which gene cluster memberships are known.

4 Simulation

In this section we consider three simulation scenarios based on three mixture models: MMD, GG, and LNN. The model parameter settings are based on the parameter estimates from the Golub data set using MMD, GG, and LNN models. For each scenario, we generate 100 simulated data sets. Each simulated data set contains 3,200 gene profiles for 10 abnormal tissue samples and 10 normal tissue samples.

In scenario I, data sets are generated based on MMD with parameters $\pi_1 = 0.158$, $\pi_2 = 0.611$, $\pi_3 = 0.231$, $\mu_{c1} = 0.715$, $\sigma^2_{c1} = 1.024$, $\rho_{c1} = 0$, $\mu_{n1} = -0.291$, $\sigma^2_{n1} = 0.660$, $\rho_{n1} = 0$, $\mu_2 = 0.000$, $\sigma^2_2 = 0.974$, $\rho_2 = 0$, $\mu_{c3} = -0.654$, $\sigma^2_{c3} = 0.573$, $\rho_{c3} = 0$, $\mu_{n3} = 0.266$, $\sigma^2_{n3} = 0.892$, $\rho_{n3} = 0$.

In scenario II, data sets are generated from a mixture of three-component LNN distributions. Its hyperparameter values are based on the estimated hyperparameters of Kendziorski et al.’s (2003) LNN model ($\ln(Y_{ij}) \mid \mu_i \sim N(\mu_i, v^2)$, $\mu_i \sim N(\mu_0, \tau_0^2)$, where $Y_{ij}$ is the expression level of the $i$-th gene for the $j$-th tissue sample) for the Golub data. The estimates are $\mu_0 = 0.86$, $v^2 = 0.10$, $\tau_0^2 = 0.06$, $p = 0.32$, where $p$ is the estimated mixing proportion for differentially expressed genes. The corresponding marginal mean, variance, and correlations are $\mu = 0.86$, $\sigma^2 = 0.16$, and $\rho = 0.375$ (see Appendix F). For the three-component LNN model, we set marginal variances at 0.16, within-tissue-type marginal correlation at 0.375 for all three components, and marginal means at $\mu_{c1} = 1.26$, $\mu_{n1} = 0.66$, $\mu_2 = 0.86$, $\mu_{c3} = 0.46$, $\mu_{n3} = 1.06$.
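The marginal moments quoted for the LNN model can be reproduced with a few lines. Appendix F is not part of this section, so the formulas below are a sketch based on the standard decomposition for a Normal-Normal hierarchy on the log scale, assuming the log expression levels of a gene are conditionally independent given $\mu_i$; the function name is hypothetical.

```python
def lnn_marginal_moments(mu0, v2, tau0_sq):
    """Marginal mean, variance and correlation of ln(Y_ij) under LNN.

    Assumes ln(Y_ij) | mu_i ~ N(mu_i, v2) independently over j and
    mu_i ~ N(mu0, tau0_sq).
    """
    mean = mu0
    variance = v2 + tau0_sq           # within-gene plus between-gene variance
    correlation = tau0_sq / variance  # shared mu_i induces the correlation
    return mean, variance, correlation


# Hyperparameter estimates reported for the Golub data.
lnn_mean, lnn_var, lnn_rho = lnn_marginal_moments(0.86, 0.10, 0.06)
```

With these inputs the function returns approximately 0.86, 0.16 and 0.375, in agreement with the values stated above.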

In scenario III, data sets are generated from a mixture of three-component GG distributions. Its hyperparameter values are based on the estimated hyperparameters of Newton et al.’s (2001) GG model ($Y_{ij} \mid \tau_i^{-1} \sim \Gamma(\alpha, \tau_i^{-1})$, $\tau_i \sim \Gamma(\xi, 1/\nu)$, where $\alpha$ and $\xi$ are the shape parameters, and $\tau$ and $\nu$ are the rate parameters of the Gamma distributions) for the Golub data. The estimates are $\alpha = 17.91$, $\xi = 10.53$, $\nu = 1.36$, $p = 0.33$, where $p$ is the estimated mixing proportion for differentially expressed genes. The corresponding marginal mean, variance, and correlations are $\mu = 2.56$, $\sigma^2 = 1.17$, and $\rho = 0.65$ (see Appendix F). For the three-component GG model, we set marginal variances at 1.17, within-tissue-type marginal correlation at 0.65 for all three components, and marginal means at $\mu_{c1} = 3.56$, $\mu_{n1} = 1.94$, $\mu_2 = 2.56$, $\mu_{c3} = 1.56$, $\mu_{n3} = 3.18$. For both scenarios II and III, we set $\pi_1 = 0.165$, $\pi_2 = 0.67$, $\pi_3 = 0.165$, and $\mu_{c1} - \mu_{n1} = \mu_{n3} - \mu_{c3} = 1.5\sigma$.
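The GG marginal moments can be checked in the same spirit. The sketch below assumes the shape-rate reading $Y_{ij} \mid \tau_i \sim$ Gamma(shape $\alpha$, rate $\tau_i$) with $\tau_i \sim$ Gamma(shape $\xi$, rate $\nu$) and conditional independence across tissue samples; this reading is an assumption of the example, chosen because it reproduces the reported values, and the function name is hypothetical.

```python
def gg_marginal_moments(alpha, xi, nu):
    """Marginal mean, variance and correlation of Y_ij under a GG model.

    Assumes Y_ij | tau_i ~ Gamma(shape=alpha, rate=tau_i) independently
    over j, and tau_i ~ Gamma(shape=xi, rate=nu) with xi > 2.
    """
    if xi <= 2:
        raise ValueError("the marginal variance needs xi > 2")
    mean = alpha * nu / (xi - 1.0)
    variance = (
        alpha * nu ** 2 * (alpha + xi - 1.0) / ((xi - 1.0) ** 2 * (xi - 2.0))
    )
    correlation = alpha / (alpha + xi - 1.0)  # does not depend on nu
    return mean, variance, correlation


# Hyperparameter estimates reported for the Golub data.
gg_mean, gg_var, gg_rho = gg_marginal_moments(17.91, 10.53, 1.36)
```

Under this reading the correlation does not involve $\nu$, while the mean and variance scale with $\nu$ and $\nu^2$, respectively.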

Because we know the true gene membership for simulated data sets, we estimate error rates by directly comparing the true gene membership with the gene membership obtained by gene selection methods. For example, FNR is estimated by the ratio of the number of unselected genes among differentially expressed genes to the total number of differentially expressed genes for a simulated data set. We also measure the degree of agreement between the true gene-cluster membership and the gene-cluster membership estimated by a gene selection method based on five agreement indices investigated in Milligan (1986): Rand index, Hubert and Arabie’s adjusted Rand index (HA), Morey and Agresti’s adjusted Rand index (MA), Fowlkes and Mallows’s index (FM), and Jaccard index, where HA and MA take chance agreement into account. For perfect agreement, the values of these five agreement indices are equal to one. We estimate the error rates and agreement indices by averaging them over the 100 simulated data sets in each scenario. Wald-type 95% confidence intervals (CIs) can also be constructed based on the error rates for the 100 simulated data sets for each scenario.
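Four of the five agreement indices, together with the empirical FNR, can be computed from pair counts as sketched below. This illustration uses the usual pair-counting definitions of the Rand, HA, FM and Jaccard indices; the MA index is omitted, and the function names and the handling of degenerate partitions are assumptions of this example. At least two genes are required.

```python
import math
from collections import Counter


def pair_counts(true_labels, pred_labels):
    """Counts of gene pairs placed together by both, by truth, by prediction."""
    def comb2(k):
        return k * (k - 1) // 2

    joint = Counter(zip(true_labels, pred_labels))
    same_both = sum(comb2(c) for c in joint.values())
    same_true = sum(comb2(c) for c in Counter(true_labels).values())
    same_pred = sum(comb2(c) for c in Counter(pred_labels).values())
    return same_both, same_true, same_pred, comb2(len(true_labels))


def agreement_indices(true_labels, pred_labels):
    """Rand, Hubert-Arabie adjusted Rand (HA), FM and Jaccard indices."""
    a, s_true, s_pred, total = pair_counts(true_labels, pred_labels)
    rand = (total + 2 * a - s_true - s_pred) / total
    expected = s_true * s_pred / total  # chance-expected number of agreeing pairs
    max_index = 0.5 * (s_true + s_pred)
    ha = (a - expected) / (max_index - expected) if max_index != expected else 1.0
    fm = a / math.sqrt(s_true * s_pred) if s_true and s_pred else 0.0
    union = s_true + s_pred - a
    jaccard = a / union if union else 1.0
    return {"Rand": rand, "HA": ha, "FM": fm, "Jaccard": jaccard}


def empirical_fnr(is_de, is_selected):
    """Share of truly differentially expressed genes that were not selected."""
    missed = sum(1 for d, s in zip(is_de, is_selected) if d and not s)
    return missed / sum(1 for d in is_de if d)
```

Averaging these quantities over the simulated data sets of a scenario gives the summary values described above.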

Since data sets in Scenario I are generated directly from the MMD model, we do not perform Box-Cox transformation and gene-profile scaling before applying MMD, gwTtest, and rMclust to data sets in Scenario I. When applying EB and EB.t, cutoffs of 0.5, 0.6, 0.7, 0.8, 0.9, and 0.95 for posterior probability are used. When applying GG, eGG, LNN, and eLNN, FDR cutoffs of 0.01, 0.05, 0.1, 0.15, and 0.2 are used.
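One way a Scenario I data set could be generated is sketched below, using the scenario I parameter values listed earlier in this section. This is an illustration rather than the authors' simulation code: the shared-factor construction only covers non-negative correlations (sufficient here, since all correlations are zero), and the function names and the seed are hypothetical.

```python
import math
import random


def draw_cs_normal(rng, n, mu, sigma2, rho):
    """Draw n equally correlated Normal values (requires 0 <= rho < 1)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("this construction needs 0 <= rho < 1")
    sd = math.sqrt(sigma2)
    shared = rng.gauss(0.0, 1.0)  # factor shared by all n values
    return [
        mu + sd * (math.sqrt(rho) * shared
                   + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
        for _ in range(n)
    ]


def simulate_mmd(rng, n_genes, m_c, m_n, pi, theta1, theta2, theta3):
    """Return (profiles, labels) drawn from a three-component MMD."""
    profiles, labels = [], []
    for _ in range(n_genes):
        k = rng.choices([1, 2, 3], weights=pi)[0]  # true cluster of the gene
        if k == 2:
            x = draw_cs_normal(rng, m_c + m_n, *theta2)
        else:
            th = theta1 if k == 1 else theta3
            # Abnormal samples first, then normal samples, drawn independently.
            x = draw_cs_normal(rng, m_c, *th[:3]) + draw_cs_normal(rng, m_n, *th[3:])
        profiles.append(x)
        labels.append(k)
    return profiles, labels


rng = random.Random(2008)  # arbitrary seed for reproducibility
profiles, labels = simulate_mmd(
    rng, 3200, 10, 10, [0.158, 0.611, 0.231],
    (0.715, 1.024, 0.0, -0.291, 0.660, 0.0),
    (0.000, 0.974, 0.0),
    (-0.654, 0.573, 0.0, 0.266, 0.892, 0.0),
)
```

Repeating the call 100 times with different seeds would mimic the 100 data sets per scenario; the returned labels provide the true memberships needed for the error rates and agreement indices.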

The estimated error rates, agreement indices, and their Wald-type 95% CIs are summarized in Figures 2-7. For all three scenarios, MMD has the highest values for the estimated agreement indices. Although its 95% CIs are usually wider than those of other gene selection methods, in most cases they do not overlap with those of other methods. In terms of error rates, MMD has small (< 0.2) estimated FDR, FNDR, FPR, and FNR, while each of the other methods has at least one estimated error rate over 0.2 for at least one scenario.

Like that of MMD, the performance of gwTtest, EB, and EB.t is quite stable across the three scenarios. The performance of gwTtest is close to that of MMD for scenario I, but is worse than that of EB and EB.t for scenarios II and III. The performances of EB and EB.t depend on cutoff values. As the cutoff increases, the estimated agreement indices, FNDR, and FNR of EB and EB.t increase for all three scenarios, while the estimated FDR and FPR decrease, except that the estimated agreement indices first increase and then decrease in scenario I. The similarity in performance of EB and EB.t indicates that the empirical Bayesian method is quite robust to the departure from the normality assumption.

For data sets generated in scenario I, rMclust performs well. However, for data sets generated in scenarios II and III, it performs poorly. This is probably because Mclust is designed to find patterns in gene profiles, and the patterns it detects do not necessarily match the clusters of differentially expressed genes and non-differentially expressed genes (see also the Discussion section).

Like EB and EB.t, the performance of GG, eGG, LNN, and eLNN depends on the cutoff values. For scenario I, GG, eGG, LNN, and eLNN with FDR cutoff ≥ 0.1 identified almost all genes as differentially expressed. Even for FDR cutoffs 0.01 and 0.05, their performance was worse than that of the other methods. For scenarios II and III, the agreement indices, FNDR, and FNR for GG, eGG, LNN, and eLNN decrease as the cutoff increases, while FDR and FPR increase, except that Rand, FM, and Jaccard of eGG for scenario II first decrease, then increase. For scenarios II and III, the performances of GG, eGG, LNN, and eLNN with cutoff 0.01 are close to that of MMD and are better than those of gwTtest, rMclust, EB, and EB.t.

It is important to evaluate the proposed method when the inter-gene variance increases ($\tau_0^2$ in the 3-component LNN model or $\nu$ in the 3-component GG model). We tried two different $\tau_0$ values (0.1 and 1) for the three-component LNN model. The marginal correlations are 0.5 and 0.91, respectively. We kept the marginal means unchanged. We also tried two different $\nu$ values (10 and 100) for the three-component GG model. The marginal variances changed to 63.44 and 6343.732, respectively. The marginal correlations were unchanged. To make sure $\mu_{c1} - \mu_{n1} = \mu_{n3} - \mu_{c3} = 1.5\sigma$, we set $\mu_{c1} = 24$, $\mu_{n1} = 12.05$, $\mu_2 = 18.79$, $\mu_{c3} = 12.82$, $\mu_{n3} = 24.77$ for $\nu = 10$, and $\mu_{c1} = 247$, $\mu_{n1} = 127.53$, $\mu_2 = 187.93$, $\mu_{c3} = 128.20$, $\mu_{n3} = 247.67$ for $\nu = 100$. The patterns of estimated error rates and agreement indices (not shown) are the same as those in Figures 2-7.

In general, the proposed MMD consistently performed best for the simulated data sets in these three scenarios, because (1) it has the highest estimated agreement indices for all three scenarios; and (2) its estimated error rates are smaller than 0.2 for all three scenarios, while the other methods have at least one estimated error rate over 0.2 for at least one scenario.

[Figure 2 image. Panel titles, top to bottom: data generated from MMD; data generated from mixture of 3-component LNN distributions; data generated from mixture of 3-component GG distributions. Axes: error rates (FDR, FNDR, FPR, FNR); degree of agreement (Rand, HA, MA, FM, Jaccard). Legend: MMD, gwTtest, rMclust, GG(0.01), GG(0.05), GG(0.1), GG(0.15), GG(0.2).]

Figure 2: Barplots for comparing MMD, gwTtest, rMclust, and GG via estimated error rates (left panel) and estimated agreement indices (right panel). Vertical line segments indicate Wald-type 95% CIs. The smaller the error rates and the larger the agreement indices, the better the gene selection method.

[Figure 3 image. Axes: error rates (FDR, FNDR, FPR, FNR); degree of agreement. Legend: MMD, gwTtest, rMclust, eGG(0.01), eGG(0.05), eGG(0.1), eGG(0.15), eGG(0.2).]

Figure 3: Barplots for comparing MMD, gwTtest, rMclust, and eGG via estimated error rates (left panel) and estimated agreement indices (right panel). Vertical line segments indicate Wald-type 95% CIs. The smaller the error rates and the larger the agreement indices, the better the gene selection method.

[Figure 4 image. Axes: error rates (FDR, FNDR, FPR, FNR); degree of agreement. Legend: MMD, gwTtest, rMclust, LNN(0.01), LNN(0.05), LNN(0.1), LNN(0.15), LNN(0.2).]

Figure 4: Barplots for comparing MMD, gwTtest, rMclust, and LNN via estimated error rates (left panel) and estimated agreement indices (right panel). Vertical line segments indicate Wald-type 95% CIs. The smaller the error rates and the larger the agreement indices, the better the gene selection method.

[Figure 5 image. Axes: error rates (FDR, FNDR, FPR, FNR); degree of agreement. Legend: MMD, gwTtest, rMclust, eLNN(0.01), eLNN(0.05), eLNN(0.1), eLNN(0.15), eLNN(0.2).]

Figure 5: Barplots for comparing MMD, gwTtest, rMclust, and eLNN via estimated error rates (left panel) and estimated agreement indices (right panel). Vertical line segments indicate Wald-type 95% CIs. The smaller the error rates and the larger the agreement indices, the better the gene selection method.

[Figure 6 image. Axes: error rates (FDR, FNDR, FPR, FNR); degree of agreement. Legend: MMD, gwTtest, rMclust, EB(0.5), EB(0.6), EB(0.7), EB(0.8), EB(0.9), EB(0.95).]

Figure 6: Barplots for comparing MMD, gwTtest, rMclust, and EB via estimated error rates (left panel) and estimated agreement indices (right panel). Vertical line segments indicate Wald-type 95% CIs. The smaller the error rates and the larger the agreement indices, the better the gene selection method.

[Figure 7 image. Axes: error rates (FDR, FNDR, FPR, FNR); degree of agreement. Legend: MMD, gwTtest, rMclust, EB.t(0.5), EB.t(0.6), EB.t(0.7), EB.t(0.8), EB.t(0.9), EB.t(0.95).]

Figure 7: Barplots for comparing MMD, gwTtest, rMclust, and EB.t via estimated error rates (left panel) and estimated agreement indices (right panel). Vertical line segments indicate Wald-type 95% CIs. The smaller the error rates and the larger the agreement indices, the better the gene selection method.

5 Discussion

We propose a new mixture model (MMD) designed to directly characterize the marginal distributions of gene profiles, grouping genes into three clusters: genes over-expressed in abnormal tissues, genes non-differentially expressed, and genes under-expressed in abnormal tissues. Combined with appropriate data transformation, MMD can be used to approximate many mixture models imposed on microarray data. MMD makes efficient use of the structural information of microarray data and has interpretable model parameters. MMD does not involve multiple testing, because no hypothesis testing is performed. MMD performs well on the publicly available Golub microarray data set and, in terms of the five agreement indices (Rand, HA, MA, FM, Jaccard), consistently has the best performance among the gene selection methods we considered on the simulated data sets generated from three different mixture models.

Choosing an appropriate method of data transformation is important to the success of MMD. Hence we use a combination of the Box-Cox transformation and gene-profile scaling if the histogram of all expression levels for either type of tissue samples looks skewed. For the real data set and simulated data sets studied in this article, this combined transformation works well. This confirms the conclusion of Yeung et al. (2001) that suitably chosen transformations seem to result in reasonable fits of microarray data by multivariate Normal distributions. Further research is required to investigate when the combined transformation works well and what alternatives are available when it does not. For example, we expect that MMD with the combined transformation might not work well for heavy-tailed data and/or data containing outliers. In these cases, data sharpening techniques (e.g., Choi and Hall, 1999; Wang et al., 2007) may be used. For example, the eGG and eLNN models proposed by Lo and Gottardo (2007) relax the assumption of a constant coefficient of variation across genes required by the GG and LNN models, by imposing a prior distribution on the rate parameter of the GG model and the variance of the LNN model. Hence we expect more variability in the marginal distributions of the eGG and eLNN models. In fact, it can be shown that the marginal kurtosis of the eLNN model is $3/(\alpha - 2)$, where a random expression level $Y_{ij}$ in the eLNN model has the following hierarchical structure: $Y_{ij} \mid (\mu_i, \tau_i^{-1}) \sim N(\mu_i, \tau_i^{-1})$, $\mu_i \mid \tau_i^{-1} \sim N(\mu_0, k\tau_i^{-1})$, $\tau_i \sim \Gamma(\alpha, 1/\beta)$. If $\alpha$ is close to 2, then the marginal kurtosis will be very large, hence the tail of the marginal distribution will be heavy. In this case, we might apply data sharpening techniques before applying MMD.

Model (1) requires the assumption that gene profiles are marginally independent of each other, as assumed by other similar models (e.g., Efron et al., 2001; Lo and Gottardo, 2007). However, gene profiles may be marginally correlated in real situations, as many genes may function together in the same pathway. We will investigate this issue in future research.

We expect that in some cases, classification of some genes will not be straightforward. The magnitude of separation between any two of the three clusters can be evaluated by separation indices (e.g., Qiu and Joe, 2006). Moreover, based on the posterior probabilities produced by MMD, it might be useful to divide the genes further into 5 or 6 clusters such as: a) strongly overexpressed, b) strongly underexpressed, c) borderline overexpressed, d) borderline underexpressed, e) not expressed, and f) borderline not expressed.

Finally, we would like to point out that MMD is different from typical gene clustering methods such as Mclust (Yeung et al., 2001), in that MMD aims to detect three specific gene clusters (genes over-expressed in abnormal tissue samples, genes non-differentially expressed, genes under-expressed in abnormal tissue samples), while typical gene clustering aims to detect any possible patterns of clustering in the data. The number of clusters is usually unknown and needs to be estimated in typical gene clustering. Even if we specify the number of clusters as three for typical gene clustering, the patterns detected do not necessarily match the clusters of differentially expressed genes and non-differentially expressed genes. Moreover, the gene clusters detected by typical gene clustering methods do not necessarily have the same structures for the marginal mean vectors and covariance matrices as those shown in Formulae 2 and 3.

Appendix

A Structures of marginal mean vectors and covariance matrices for Bayesian hierarchical models

We first use the eLNN model (Lo and Gottardo, 2007) to illustrate the marginal mean-vector and covariance-matrix structures of the Bayesian hierarchical models.

Let $\mathbf{Y}_g = (Y_{g,1}, Y_{g,2}, \ldots, Y_{g,m_c}, Y_{g,m_c+1}, Y_{g,m_c+2}, \ldots, Y_{g,m_c+m_n})^T$, an $m \times 1$ vector, be the gene profile for the g-th gene over $m$ tissue samples ($m = m_c + m_n$, where $m_c$ is the number of abnormal tissue samples and $m_n$ the number of normal tissue samples). Without loss of generality, we assume that the first $m_c$ tissue samples are abnormal and that the remaining $m_n$ tissue samples are normal.


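The eLNN model assumes that 100π% of genes are differentially expressed and 100(1−π)% of genes are non-differentially expressed. For the g-th differentially expressed gene, its profile can be characterized by the following Bayesian hierarchical model:

$$
\begin{aligned}
\ln(Y_{g,\ell}) \mid \mu_{g,c}, \tau_{g,c}^{-1} &\sim N(\mu_{g,c}, \tau_{g,c}^{-1}), \quad \ell = 1, \ldots, m_c \\
\mu_{g,c} \mid \tau_{g,c}^{-1} &\sim N(\mu_0, k\tau_{g,c}^{-1}) \\
\tau_{g,c} &\sim \mathrm{Gamma}(\alpha, 1/\beta) \\
\ln(Y_{g,\ell'}) \mid \mu_{g,n}, \tau_{g,n}^{-1} &\sim N(\mu_{g,n}, \tau_{g,n}^{-1}), \quad \ell' = m_c + 1, \ldots, m_c + m_n \\
\mu_{g,n} \mid \tau_{g,n}^{-1} &\sim N(\mu_0, k\tau_{g,n}^{-1}) \\
\tau_{g,n} &\sim \mathrm{Gamma}(\alpha, 1/\beta),
\end{aligned}
\tag{8}
$$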

where α is the shape parameter and β is the rate parameter, or equivalently, 1/β is the scale parameter. For the g-th non-differentially expressed gene, its profile can be characterized by the following Bayesian hierarchical model:

$$
\begin{aligned}
\ln(Y_{g,\ell}) \mid \mu_g, \tau_g^{-1} &\sim N(\mu_g, \tau_g^{-1}), \quad \ell = 1, \ldots, m_c + m_n \\
\mu_g \mid \tau_g^{-1} &\sim N(\mu_0, k\tau_g^{-1}) \\
\tau_g &\sim \mathrm{Gamma}(\alpha, 1/\beta)
\end{aligned}
\tag{9}
$$

Here we assume that, given higher-level hyperparameters, the current-level parameters or data are conditionally independent.

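Based on the smoothing formulae

$$
\begin{aligned}
E(Y) &= E[E(Y \mid \theta)], \\
\mathrm{Var}(Y) &= E[\mathrm{Var}(Y \mid \theta)] + \mathrm{Var}[E(Y \mid \theta)], \\
\mathrm{Cov}(Y_1, Y_2) &= E[\mathrm{Cov}(Y_1, Y_2 \mid \theta)] + \mathrm{Cov}[E(Y_1 \mid \theta), E(Y_2 \mid \theta)],
\end{aligned}
\tag{10}
$$

where $Y$, $Y_1$, $Y_2$, and $\theta$ are random variables, we can get the marginal mean and variance of $\ln(Y_{g,\ell})$, and the covariance and correlation between $\ln(Y_{g,\ell})$ and $\ln(Y_{g,\ell'})$, $\ell \neq \ell'$, within each tissue type:

$$
\begin{aligned}
E(\ln(Y_{g,\ell})) &= \mu_0, \\
\mathrm{Var}(\ln(Y_{g,\ell})) &= \frac{(k+1)\beta}{\alpha - 1}, \\
\mathrm{Cov}(\ln(Y_{g,\ell}), \ln(Y_{g,\ell'})) &= \frac{k\beta}{\alpha - 1}, \\
\mathrm{cor}(\ln(Y_{g,\ell}), \ln(Y_{g,\ell'})) &= \frac{k}{k+1}, \quad \ell \neq \ell'.
\end{aligned}
\tag{11}
$$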
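The moments in (11) can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis: it draws two samples of the same tissue type per gene from the eLNN hierarchy on the log scale and compares sample moments with the closed forms. The parameter values and function names are arbitrary choices made for illustration.

```python
import random


def simulate_elnn_pairs(n_genes, mu0, k, alpha, beta, seed=0):
    """Draw (ln Y_g1, ln Y_g2) for n_genes genes, same tissue type."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_genes):
        # tau ~ Gamma(shape alpha, scale 1/beta), i.e. rate beta
        tau = rng.gammavariate(alpha, 1.0 / beta)
        sd = (1.0 / tau) ** 0.5
        # mu | tau ~ N(mu0, k / tau)
        mu = rng.gauss(mu0, (k ** 0.5) * sd)
        # two conditionally independent log-expression levels
        pairs.append((rng.gauss(mu, sd), rng.gauss(mu, sd)))
    return pairs


def sample_moments(pairs):
    n = len(pairs)
    m1 = sum(u for u, _ in pairs) / n
    m2 = sum(v for _, v in pairs) / n
    var1 = sum((u - m1) ** 2 for u, _ in pairs) / (n - 1)
    var2 = sum((v - m2) ** 2 for _, v in pairs) / (n - 1)
    cov = sum((u - m1) * (v - m2) for u, v in pairs) / (n - 1)
    return m1, var1, cov, cov / (var1 * var2) ** 0.5


# Illustrative (assumed) hyperparameters
mu0, k, alpha, beta = 5.0, 1.5, 6.0, 5.0
mean_hat, var_hat, cov_hat, cor_hat = sample_moments(
    simulate_elnn_pairs(100_000, mu0, k, alpha, beta)
)
var_theory = (k + 1) * beta / (alpha - 1)   # 2.5
cov_theory = k * beta / (alpha - 1)         # 1.5
cor_theory = k / (k + 1)                    # 0.6
```

With these values the simulated variance, covariance, and correlation should land close to 2.5, 1.5, and 0.6, up to Monte Carlo error.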




For non-differentially expressed genes, the above formulae for marginal covariance and correlation are still true for two tissue samples of different tissue types. We can show (see below) that for differentially expressed genes, the marginal covariance and correlation across tissue types are both zero.

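Hence the marginal mean vector and covariance matrix of the ln-transformed profile of the g-th differentially expressed gene are

$$
E(\ln(\mathbf{Y}_g)) = \begin{pmatrix} \mu_0 \mathbf{1}_{m_c} \\ \mu_0 \mathbf{1}_{m_n} \end{pmatrix}, \qquad
\mathrm{Cov}(\ln(\mathbf{Y}_g)) = \begin{pmatrix} \sigma_0^2 R_c & 0 \\ 0 & \sigma_0^2 R_n \end{pmatrix}
\tag{12}
$$

where

$$
\begin{aligned}
\sigma_0^2 &= \mathrm{Var}(\ln(Y_{g,\ell})) = \frac{(k+1)\beta}{\alpha - 1}, \\
R_c &= (1 - \rho_c)\left[I_{m_c} + \frac{\rho_c}{1 - \rho_c}\mathbf{1}_{m_c}\mathbf{1}_{m_c}^T\right], \\
R_n &= (1 - \rho_n)\left[I_{m_n} + \frac{\rho_n}{1 - \rho_n}\mathbf{1}_{m_n}\mathbf{1}_{m_n}^T\right], \\
\rho_c &= \mathrm{cor}(\ln(Y_{g,\ell}), \ln(Y_{g,\ell'})) = \frac{k}{k+1}, \quad \ell, \ell' = 1, \ldots, m_c, \ \ell \neq \ell', \\
\rho_n &= \mathrm{cor}(\ln(Y_{g,\ell}), \ln(Y_{g,\ell'})) = \frac{k}{k+1}, \quad \ell, \ell' = m_c + 1, \ldots, m_c + m_n, \ \ell \neq \ell'.
\end{aligned}
\tag{13}
$$

The notation $\mathbf{1}_{m_c}$ represents an $m_c \times 1$ vector with all elements equal to one. $I_{m_c}$ is the $m_c \times m_c$ identity matrix.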

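The marginal mean vector and covariance matrix of the ln-transformed profile of the g-th non-differentially expressed gene are:

$$
E(\ln(\mathbf{Y}_g)) = \mu_0 \mathbf{1}_{m_c+m_n}, \qquad \mathrm{Cov}(\ln(\mathbf{Y}_g)) = \sigma_0^2 R
\tag{14}
$$

where

$$
\begin{aligned}
R &= (1 - \rho)\left[I_{m_c+m_n} + \frac{\rho}{1 - \rho}\mathbf{1}_{m_c+m_n}\mathbf{1}_{m_c+m_n}^T\right], \\
\rho &= \mathrm{cor}(\ln(Y_{g,\ell}), \ln(Y_{g,\ell'})) = \frac{k}{k+1}, \quad \ell, \ell' = 1, \ldots, m_c + m_n, \ \ell \neq \ell'.
\end{aligned}
\tag{15}
$$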
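To make the two covariance structures concrete, the sketch below builds them as plain nested lists: a block-diagonal matrix with one compound-symmetric block per tissue type for differentially expressed genes, and a single compound-symmetric matrix for non-differentially expressed genes. The function names are hypothetical, and the eLNN expressions for the variance and correlation are taken from Formulae (11) and (13).

```python
def compound_symmetric(m, sigma2, rho):
    """sigma2 * R with R = (1 - rho) * I + rho * 1 1^T (m x m)."""
    return [[sigma2 * (1.0 if i == j else rho) for j in range(m)]
            for i in range(m)]


def de_covariance(mc, mn, sigma2, rho_c, rho_n):
    """Block-diagonal covariance for a differentially expressed gene."""
    top = compound_symmetric(mc, sigma2, rho_c)
    bottom = compound_symmetric(mn, sigma2, rho_n)
    cov = [row + [0.0] * mn for row in top]        # zero cross-tissue block
    cov += [[0.0] * mc + row for row in bottom]
    return cov


def elnn_sigma2_rho(k, alpha, beta):
    """Marginal variance and within-tissue correlation under eLNN."""
    return (k + 1) * beta / (alpha - 1), k / (k + 1)


# Example with assumed values k = 1, alpha = 3, beta = 2
sigma2, rho = elnn_sigma2_rho(k=1.0, alpha=3.0, beta=2.0)
cov_de = de_covariance(2, 1, sigma2, rho, rho)      # m_c = 2, m_n = 1
cov_non_de = compound_symmetric(3, sigma2, rho)     # m_c + m_n = 3
```

The only difference between the two matrices is the cross-tissue block, which is zero for differentially expressed genes and equal to the common covariance otherwise.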

Now we show results for arbitrary 3-level hierarchical models. The results can be extended to models with more than three levels.

Without loss of generality, let $y_{ij}$ be the expression level of the i-th non-differentially expressed gene for the j-th abnormal tissue sample. Suppose the hierarchical distribution of $y_{ij}$ is $y_{ij} \mid \delta_{1i} \sim f_1(y \mid \delta_{1i}, \eta_1)$, $\delta_{1i} \mid \delta_{2i} \sim f_2(\delta \mid \delta_{2i}, \eta_2)$, $\delta_{2i} \mid \eta_0 \sim f_3(\delta \mid \eta_0)$, where $\eta_0$, $\eta_1$, and $\eta_2$ are the vectors of fixed hyperparameters.

Denote $h_1(\delta_{1i}, \eta_1) = E(y_{ij} \mid \delta_{1i}, \eta_1)$ and $h_2(\delta_{1i}, \eta_1) = \mathrm{Var}(y_{ij} \mid \delta_{1i}, \eta_1)$. Based on the smoothing formulae (10), we get $E(y_{ij}) = E[h_1(\delta_{1i}, \eta_1)]$ and $\mathrm{Var}(y_{ij}) = \mathrm{Var}[h_1(\delta_{1i}, \eta_1)] + E[h_2(\delta_{1i}, \eta_1)]$. That is, the marginal expectation $E(y_{ij})$ and the marginal variance $\mathrm{Var}(y_{ij})$ do not depend on the subscript $j$, which indexes the tissue sample. Similarly, we can show that the covariance $\mathrm{Cov}(y_{ij}, y_{ik})$ is the same for all $j \neq k$. In fact, because $y_{ij}$ and $y_{ik}$ are assumed to be conditionally independent given $\delta_{1i}$, $\mathrm{Cov}(y_{ij}, y_{ik}) = \mathrm{Cov}[h_1(\delta_{1i}, \eta_1), h_1(\delta_{1i}, \eta_1)] = \mathrm{Var}[h_1(\delta_{1i}, \eta_1)]$. By iteratively applying the smoothing formulae, one can further show that $E(y_{ij})$, $\mathrm{Var}(y_{ij})$, and $\mathrm{Cov}(y_{ij}, y_{ik})$ depend only on the hyperparameters $\eta_0$, $\eta_1$, $\eta_2$ and do not depend on the subscripts $i$ and $j$.

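Now we want to show that for differentially expressed genes, the between-tissue-type marginal covariance $\mathrm{Cov}(y_{ij,c}, y_{ik,n}) = 0$, where $y_{ij,c}$ is the expression level of the j-th abnormal tissue sample for gene $i$, and $y_{ik,n}$ is the expression level of the k-th normal tissue sample for gene $i$. Denote

$$
\begin{aligned}
h_c(\delta_{1i,c}, \eta_{1,c}) &= E(y_{ij,c} \mid \delta_{1i,c}, \eta_{1,c}), &
h_n(\delta_{1i,n}, \eta_{1,n}) &= E(y_{ik,n} \mid \delta_{1i,n}, \eta_{1,n}), \\
g_c(\delta_{2i,c}, \eta_{1,c}, \eta_{2,c}) &= E(h_c(\delta_{1i,c}, \eta_{1,c}) \mid \delta_{2i,c}, \eta_{2,c}), &
g_n(\delta_{2i,n}, \eta_{1,n}, \eta_{2,n}) &= E(h_n(\delta_{1i,n}, \eta_{1,n}) \mid \delta_{2i,n}, \eta_{2,n}), \\
f_c(\eta_{0,c}, \eta_{1,c}, \eta_{2,c}) &= E(g_c(\delta_{2i,c}, \eta_{1,c}, \eta_{2,c}) \mid \eta_{0,c}), &
f_n(\eta_{0,n}, \eta_{1,n}, \eta_{2,n}) &= E(g_n(\delta_{2i,n}, \eta_{1,n}, \eta_{2,n}) \mid \eta_{0,n}).
\end{aligned}
$$

Then

$$
\begin{aligned}
\mathrm{Cov}(y_{ij,c}, y_{ik,n})
&= \mathrm{Cov}\left[E(y_{ij,c} \mid \delta_{1i,c}, \eta_{1,c}),\ E(y_{ik,n} \mid \delta_{1i,n}, \eta_{1,n})\right] \\
&\quad + E\left[\mathrm{Cov}(y_{ij,c}, y_{ik,n} \mid \delta_{1i,c}, \eta_{1,c}, \delta_{1i,n}, \eta_{1,n})\right] \\
&= \mathrm{Cov}\left[h_c(\delta_{1i,c}, \eta_{1,c}),\ h_n(\delta_{1i,n}, \eta_{1,n})\right] + 0 \\
&= \mathrm{Cov}\left[E(h_c(\delta_{1i,c}, \eta_{1,c}) \mid \delta_{2i,c}, \eta_{2,c}),\ E(h_n(\delta_{1i,n}, \eta_{1,n}) \mid \delta_{2i,n}, \eta_{2,n})\right] \\
&= \mathrm{Cov}\left[E(g_c(\delta_{2i,c}, \eta_{1,c}, \eta_{2,c}) \mid \eta_{0,c}),\ E(g_n(\delta_{2i,n}, \eta_{1,n}, \eta_{2,n}) \mid \eta_{0,n})\right] \\
&\quad + E\left[\mathrm{Cov}(g_c(\delta_{2i,c}, \eta_{1,c}, \eta_{2,c}),\ g_n(\delta_{2i,n}, \eta_{1,n}, \eta_{2,n}) \mid \eta_{0,n})\right] \\
&= \mathrm{Cov}\left[f_c(\eta_{0,c}, \eta_{1,c}, \eta_{2,c}),\ f_n(\eta_{0,n}, \eta_{1,n}, \eta_{2,n})\right] + 0 \\
&= 0.
\end{aligned}
\tag{16}
$$

The last step holds because the hyperparameters $\eta_0$, $\eta_1$, $\eta_2$ are fixed, not random. Here we assume that, given higher-level hyperparameters, current-level parameters are conditionally independent.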




B Box-Cox transformation

The Box-Cox transformation is defined as $y = (x^{\lambda} - 1)/\lambda$ if $\lambda \neq 0$ and $y = \ln(x)$ if $\lambda = 0$, where $\lambda$ is an unknown parameter. For a given dimension $\ell$ (tissue sample), we choose the $\lambda_\ell$ maximizing the correlation between the quantiles of the Box-Cox transformed gene expression levels and the theoretical quantiles of a Normal distribution with the same mean and variance as those of the Box-Cox transformed gene expression levels. We then apply the Box-Cox transformation with $\lambda = \sum_{\ell=1}^{m} \lambda_\ell / m$, where $m$ is the total number of tissue samples, to all gene expression levels.
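The selection rule for λ can be sketched as a grid search. The sketch below makes several assumptions that the text does not specify: the grid of candidate λ values, the plotting positions (i + 0.5)/n used for the theoretical quantiles, and all function names. Because the Pearson correlation is unchanged by shifting and rescaling one of its arguments, standard Normal quantiles give the same correlation as quantiles of a Normal with matched mean and variance.

```python
from math import exp, log
from statistics import NormalDist


def box_cox(x, lam):
    """Box-Cox transform of positive values x for a given lambda."""
    return [log(v) if lam == 0 else (v ** lam - 1.0) / lam for v in x]


def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def normal_qq_correlation(y):
    """Correlation between sorted data and Normal theoretical quantiles."""
    n = len(y)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return pearson(sorted(y), theoretical)


def best_lambda(x, grid):
    """Lambda on the grid that maximizes the Q-Q correlation for one sample."""
    return max(grid, key=lambda lam: normal_qq_correlation(box_cox(x, lam)))


def pooled_lambda(samples, grid):
    """Average of the per-sample lambdas, applied to all expression levels."""
    lams = [best_lambda(col, grid) for col in samples]
    return sum(lams) / len(lams)


# Toy check: exactly log-normal values should select lambda = 0
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
toy = [exp(NormalDist().inv_cdf((i + 0.5) / 50)) for i in range(50)]
lam_hat = pooled_lambda([toy, toy], grid)
```

In practice a finer grid or a one-dimensional optimizer would replace the coarse grid used here.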

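C Parameter estimation via the EM algorithm

The fully categorized data can be represented as (Section 4.3, Titterington et al., 1995) $\{y_i, i = 1, \ldots, p\} = \{(x_i^T, z_i^T)^T : i = 1, \ldots, p\}$, where $p$ is the number of genes, $x_i$ is an $m \times 1$ vector, $m = m_c + m_n$, $m_c$ is the number of abnormal tissue samples, $m_n$ is the number of normal tissue samples, $z_i = (z_{i1}, z_{i2}, z_{i3})^T$, and $z_{ij} = 1$ if $x_i$ is in the j-th gene cluster, $z_{ij} = 0$ otherwise. The likelihood corresponding to $(y_1, \ldots, y_p)$ can then be written in the form

$$
g(y_1, \ldots, y_p \mid \Psi) = \prod_{i=1}^{p} \prod_{j=1}^{3} \pi_j^{z_{ij}} f_j(x_i \mid \theta_j)^{z_{ij}}
\tag{17}
$$

with natural logarithm

$$
\ell_0(\Psi) = \sum_{i=1}^{p} z_i^T V(\pi) + \sum_{i=1}^{p} z_i^T U_i(\theta),
\tag{18}
$$

where

$$
\begin{aligned}
V(\pi) &= (\ln(\pi_1), \ln(\pi_2), \ln(\pi_3))^T, \\
U_i(\theta) &= (\ln(f_1(x_i \mid \theta_1)), \ln(f_2(x_i \mid \theta_2)), \ln(f_3(x_i \mid \theta_3)))^T, \\
\theta_1 &= (\mu_{c1}, \sigma_{c1}^2, \rho_{c1}, \mu_{n1}, \sigma_{n1}^2, \rho_{n1}), \\
\theta_2 &= (\mu_2, \sigma_2^2, \rho_2), \\
\theta_3 &= (\mu_{c3}, \sigma_{c3}^2, \rho_{c3}, \mu_{n3}, \sigma_{n3}^2, \rho_{n3}), \\
\Psi &= (\pi_1, \pi_2, \pi_3, \mu_{c1}, \sigma_{c1}^2, \rho_{c1}, \mu_{n1}, \sigma_{n1}^2, \rho_{n1}, \mu_2, \sigma_2^2, \rho_2, \mu_{c3}, \sigma_{c3}^2, \rho_{c3}, \mu_{n3}, \sigma_{n3}^2, \rho_{n3}).
\end{aligned}
\tag{19}
$$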

The EM algorithm generates, from some initial approximation $\Psi^{(0)}$, a sequence $\{\Psi^{(t)}\}$ of estimates. Each iteration consists of the following two steps:

E step: Evaluate $Q(\Psi, \Psi^{(t)}) = E[\ln(g(y \mid \Psi)) \mid x, \Psi^{(t)}]$.

M step: Find $\Psi = \Psi^{(t+1)}$ to maximize $Q(\Psi, \Psi^{(t)})$, with constraint $\pi_1 + \pi_2 + \pi_3 = 1$.

To maximize $Q(\Psi, \Psi^{(t)})$ with constraint $\pi_1 + \pi_2 + \pi_3 = 1$, we use the Lagrange method. Define

$$
h(\Psi) = Q(\Psi, \Psi^{(t)}) + \lambda(\pi_1 + \pi_2 + \pi_3 - 1).
\tag{20}
$$

We can obtain

$$
\pi_j^{(t+1)} = \sum_{i=1}^{p} w_{ij}(\Psi^{(t)}) / p,
\tag{21}
$$

where

$$
w_{ij}(\Psi^{(t)}) = \pi_j^{(t)} f_j(x_i \mid \theta_j^{(t)}) \big/ \left[f(x_i \mid \theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)})\right].
\tag{22}
$$
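The step from (20) to (21) can be written out as follows. This is a sketch that assumes $Q$ takes the usual expected complete-data form implied by (18), with each $z_{ij}$ replaced by its conditional expectation $w_{ij}(\Psi^{(t)})$, and that uses $\sum_{j} w_{ij}(\Psi^{(t)}) = 1$ for every gene $i$.

```latex
\begin{align*}
Q(\Psi,\Psi^{(t)})
  &= \sum_{i=1}^{p}\sum_{j=1}^{3} w_{ij}(\Psi^{(t)})
     \left[\ln \pi_j + \ln f_j(x_i \mid \theta_j)\right], \\
\frac{\partial h(\Psi)}{\partial \pi_j}
  &= \frac{1}{\pi_j}\sum_{i=1}^{p} w_{ij}(\Psi^{(t)}) + \lambda = 0
  \;\Longrightarrow\;
  \pi_j = -\frac{1}{\lambda}\sum_{i=1}^{p} w_{ij}(\Psi^{(t)}), \\
1 = \sum_{j=1}^{3}\pi_j
  &= -\frac{1}{\lambda}\sum_{i=1}^{p}\sum_{j=1}^{3} w_{ij}(\Psi^{(t)})
   = -\frac{p}{\lambda}
  \;\Longrightarrow\; \lambda = -p, \\
\pi_j^{(t+1)}
  &= \frac{1}{p}\sum_{i=1}^{p} w_{ij}(\Psi^{(t)}).
\end{align*}
```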

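In the following, we derive the ln-likelihood for non-differentially expressed gene profiles. A similar derivation can be applied to gene profiles in the other two gene-profile clusters.

Suppose an $m \times 1$ random vector $X$ has the multivariate Normal distribution $N(\mu, \Sigma)$, where $\mu = \mu_2 \mathbf{1}_m$, $\Sigma = \sigma_2^2 (1 - \rho_2) R_0$, $R_0 = I_m + \frac{\rho_2}{1 - \rho_2}\mathbf{1}_m \mathbf{1}_m^T$, and $\mathbf{1}_m$ is an $m \times 1$ vector with all elements equal to one. Denote $a(x, \mu_2) = x - \mu_2 \mathbf{1}_m$. Based on the results from the Matrix Cookbook¹, $(A + BC)^{-1} = A^{-1} - A^{-1}B\left[I + CA^{-1}B\right]^{-1}CA^{-1}$ and $\left|I + uv^T\right| = 1 + u^T v$, the ln-density function is

$$
\begin{aligned}
\ln[f_2(x)] = {} & -\frac{m}{2}\ln(2\pi) - \frac{m}{2}\ln(\sigma_2^2) - \frac{1}{2}\left\{(m - 1)\ln(1 - \rho_2) + \ln[1 + (m - 1)\rho_2]\right\} \\
& - \frac{[a(x, \mu_2)]^T [a(x, \mu_2)] - \frac{\rho_2}{1 + (m - 1)\rho_2}\left[a(x, \mu_2)^T \mathbf{1}_m\right]^2}{2\sigma_2^2 (1 - \rho_2)}.
\end{aligned}
\tag{23}
$$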
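Formula (23) avoids any matrix inversion, which the following minimal sketch illustrates. The function name is hypothetical. For $m = 2$ the result can be compared with the textbook bivariate Normal log-density, and for $\rho_2 = 0$ with a sum of independent univariate Normal log-densities.

```python
from math import log, pi


def log_density_cs(x, mu, sigma2, rho):
    """ln-density of N(mu * 1, Sigma) with compound-symmetric Sigma.

    Sigma has sigma2 on the diagonal and sigma2 * rho off the diagonal;
    requires -1/(m-1) < rho < 1.
    """
    m = len(x)
    a = [v - mu for v in x]                  # a(x, mu) = x - mu * 1
    sum_sq = sum(v * v for v in a)           # a^T a
    total = sum(a)                           # a^T 1
    log_det_part = (m - 1) * log(1 - rho) + log(1 + (m - 1) * rho)
    quad = (sum_sq - rho / (1 + (m - 1) * rho) * total ** 2) \
        / (sigma2 * (1 - rho))
    return (-0.5 * m * log(2 * pi) - 0.5 * m * log(sigma2)
            - 0.5 * log_det_part - 0.5 * quad)


# Example: a 2-sample profile with assumed parameter values
value = log_density_cs([1.0, 2.5], mu=1.5, sigma2=2.0, rho=0.3)
```

Each evaluation costs O(m) operations, compared with O(m^3) for a generic inverse and determinant.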

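In the M-step of the EM algorithm, we need to calculate

$$
\begin{aligned}
\partial h(\Psi)/\partial \mu_2 &= \sum_{i=1}^{p} w_{i2}(\Psi^{(t)})\, \partial \ln[f_2(x_i \mid \theta_2)]/\partial \mu_2, \\
\partial h(\Psi)/\partial \sigma_2^2 &= \sum_{i=1}^{p} w_{i2}(\Psi^{(t)})\, \partial \ln[f_2(x_i \mid \theta_2)]/\partial \sigma_2^2, \\
\partial h(\Psi)/\partial \rho_2 &= \sum_{i=1}^{p} w_{i2}(\Psi^{(t)})\, \partial \ln[f_2(x_i \mid \theta_2)]/\partial \rho_2.
\end{aligned}
$$

Setting these derivatives to zero, we can obtain

$$
\mu_2 = \frac{\sum_{i=1}^{p} w_{i2}\,(x_i^T \mathbf{1}_m)}{m \sum_{i=1}^{p} w_{i2}}, \qquad
\sigma_2^2 = \frac{d}{mb}, \qquad
\rho_2 = \frac{e - d}{(m - 1)d},
$$

where $d = \sum_{i=1}^{p} w_{i2}\,[a(x_i, \mu_2)]^T [a(x_i, \mu_2)]$, $e = \sum_{i=1}^{p} w_{i2}\left[[a(x_i, \mu_2)]^T \mathbf{1}_m\right]^2$, $b = \sum_{i=1}^{p} w_{i2}$, and $a(x_i, \mu_2) = x_i - \mu_2 \mathbf{1}_m$.

¹ http://matrixcookbook.com/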
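The closed-form updates for the non-differentially expressed cluster translate directly into code. The sketch below is illustrative only: the function name is hypothetical, `xs` holds one profile per gene, `w` holds the corresponding weights $w_{i2}$, and the profile length is read from the data.

```python
def m_step_non_de(xs, w):
    """Weighted updates of (mu_2, sigma_2^2, rho_2) for the middle cluster.

    xs: list of gene profiles (each a list of m expression levels)
    w:  list of weights w_i2, one per gene
    """
    m = len(xs[0])
    b = sum(w)
    mu = sum(wi * sum(x) for wi, x in zip(w, xs)) / (m * b)
    # d = sum_i w_i2 * a_i^T a_i
    d = sum(wi * sum((v - mu) ** 2 for v in x) for wi, x in zip(w, xs))
    # e = sum_i w_i2 * (a_i^T 1)^2
    e = sum(wi * sum(v - mu for v in x) ** 2 for wi, x in zip(w, xs))
    sigma2 = d / (m * b)
    rho = (e - d) / ((m - 1) * d)
    return mu, sigma2, rho


# Two genes, two samples each, equal weights
mu2, sigma2_2, rho2 = m_step_non_de([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

For this toy input the updates are 2.5, 1.25, and 0.6: the pooled mean, the pooled variance about that mean, and the ratio of the cross-sample products to the pooled sum of squares.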

D The two sets of initial values of model parameters for the EM algorithm

To get the first initial set of model parameters, we first perform a two-sample t-test for each gene, then estimate $\mu_2$, $\sigma_2^2$, and $\rho_2$ based on the genes that have p-values greater than 0.05. We estimate $\mu_{c1}$, $\sigma_{c1}^2$, $\rho_{c1}$, $\mu_{n1}$, $\sigma_{n1}^2$, $\rho_{n1}$ based on the genes that have p-values less than 0.05 and positive test statistics. We estimate $\mu_{c3}$, $\sigma_{c3}^2$, $\rho_{c3}$, $\mu_{n3}$, $\sigma_{n3}^2$, $\rho_{n3}$ based on the genes that have p-values less than 0.05 and negative test statistics. To get the second initial set of model parameters, we first get a three-cluster partition of genes via Mclust, then calculate the likelihood function for each permutation of the cluster labels 1, 2, 3. The permutation having the largest value of the likelihood function is chosen as the final three-cluster partition of genes, based on which the initial set of model parameters is estimated. If no permutation produces a partition such that $\mu_{c1} > \mu_{n1}$ and $\mu_{c3} < \mu_{n3}$, the initial set of parameters obtained by the two-sample t-test will be used.

E The stopping criterion of the EM algorithm

In MMD, we stop EM if

$$
\sum_{i=1}^{18} \left|\psi_i^{(t+1)} - \psi_i^{(t)}\right| < \varepsilon
\tag{24}
$$

or the iteration number exceeds the maximum allowed iteration number ITMAX. For the analyses of the real data set and simulated data sets in this article, we set $\varepsilon = 10^{-3}$ and ITMAX = 100.
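As a small illustration of this rule, the check below sums the absolute changes in the parameter vector and compares the sum with ε, stopping as well once the iteration count passes ITMAX. The function name and the argument layout are hypothetical; the default values are the ones stated above.

```python
def em_should_stop(psi_new, psi_old, iteration, eps=1e-3, itmax=100):
    """Return True when the EM iterations should stop.

    psi_new, psi_old: parameter vectors at iterations t+1 and t
    iteration:        current iteration number
    """
    change = sum(abs(a - b) for a, b in zip(psi_new, psi_old))
    return change < eps or iteration > itmax


small_change = em_should_stop([1.0, 2.0], [1.0002, 2.0003], iteration=5)
large_change = em_should_stop([1.0, 2.0], [1.1, 2.0], iteration=5)
too_many = em_should_stop([1.0, 2.0], [1.1, 2.0], iteration=101)
```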

In the analyses of both real and simulated data in this article, the total number of iterations was less than ITMAX = 100. For the real data set, the total numbers of iterations corresponding to the two initial sets of parameters are both 10. Table 2 lists the median and range of the total number of iterations in the analysis of the data sets used in this article.

Table 2: Median and range of total number of iterations for simulated data sets

data / initial method    t-test        Mclust
MMD                      14 [7, 28]    10 [5, 27]
GG                       9 [4, 16]     11 [7, 23]
LNN                      12 [5, 22]    19 [6, 50]

Titterington et al. (pages 88-89, Section 4.3.2, 1995) pointed out that (1) unconditional convergence of the parameter estimates via EM to the maximum likelihood estimate is not guaranteed, which is also true for alternative algorithms such as the Newton-Raphson method and the method of scoring; and (2) the convergence of EM is often “excruciatingly slow” compared to alternative algorithms, such as the Newton-Raphson method. However, EM is usually simple to apply and satisfies the appealing monotonic property (i.e., the likelihoods of interest increase monotonically as the iteration number increases). These are the main reasons that we chose the EM algorithm to obtain parameter estimates.

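It is possible that the values of $f(x_i \mid \Psi^{(t)})$ are very close to zero for some $x_i$. To increase the stability of the numerical computation of $w_{ij}(\Psi^{(t)})$ in Formula (22), we rewrite Formula (22) as

$$
w_{ij}(\Psi^{(t)}) =
\frac{\left[\pi_j^{(t)} f_j(x_i \mid \theta_j^{(t)})\right] \big/ \left[\pi_{j_0}^{(t)} f_{j_0}(x_i \mid \theta_{j_0}^{(t)})\right]}
     {\left[f(x_i \mid \theta_1^{(t)}, \theta_2^{(t)}, \theta_3^{(t)})\right] \big/ \left[\pi_{j_0}^{(t)} f_{j_0}(x_i \mid \theta_{j_0}^{(t)})\right]},
\tag{25}
$$

where

$$
j_0 = \operatorname*{argmin}_{j = 1, 2, 3}\ (x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j).
\tag{26}
$$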
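The rescaling idea can be sketched in the log domain. Note one deliberate difference from Formula (26): instead of picking the reference component $j_0$ by the smallest Mahalanobis distance, this sketch uses the component with the largest value of $\ln \pi_j + \ln f_j$, which is the standard log-sum-exp choice and needs only the log-densities. Function names are hypothetical, and the mixture density is assumed to be $f = \sum_j \pi_j f_j$.

```python
from math import exp, log


def responsibilities(log_pi, log_f):
    """Weights w_ij for one gene from log proportions and log densities."""
    scores = [lp + lf for lp, lf in zip(log_pi, log_f)]
    j0 = max(range(len(scores)), key=scores.__getitem__)  # reference component
    # Ratios pi_j f_j / (pi_j0 f_j0); the largest ratio is exactly 1
    ratios = [exp(s - scores[j0]) for s in scores]
    total = sum(ratios)
    return [r / total for r in ratios]


def update_proportions(all_w):
    """Mixing proportions as the average weight per cluster."""
    p = len(all_w)
    return [sum(w[j] for w in all_w) / p for j in range(len(all_w[0]))]


# Densities this small underflow to 0.0 if exponentiated directly
w = responsibilities([log(1 / 3)] * 3, [-1000.0, -1001.0, -1002.0])
props = update_proportions([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Computing `exp(-1000.0)` directly would return 0.0 and make the naive ratio in Formula (22) undefined, whereas the rescaled version returns well-defined weights.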

F The marginal moments of LNN and GG

The LNN model is defined as $\ln(Y_{ij}) \mid \mu_i \sim N(\mu_i, v^2)$, $\mu_i \sim N(\mu_0, \tau_0^2)$. The GG model is defined as $Y_{ij} \mid \tau_i^{-1} \sim \Gamma(\alpha, \tau_i^{-1})$, $\tau_i \sim \Gamma(\xi, 1/\nu)$, where $\xi$ and $\nu$ are the shape and rate parameters of the gamma distribution, respectively. For the LNN model, the $\ln(Y_{ij})$ are conditionally independent given $\mu_i$; for the GG model, the $Y_{ij}$ are conditionally independent given $\tau_i^{-1}$.

Iteratively applying the smoothing formulae in (10), we get $E(\ln(Y_{ij})) = \mu_0$, $\mathrm{Var}(\ln(Y_{ij})) = v^2 + \tau_0^2$, $\mathrm{Cov}(\ln(Y_{ij}), \ln(Y_{ik})) = \tau_0^2$, and $\mathrm{cor}(\ln(Y_{ij}), \ln(Y_{ik})) = \tau_0^2/(v^2 + \tau_0^2)$ for the LNN model, and $E(Y_{ij}) = \alpha\nu/(\xi - 1)$, $\mathrm{Var}(Y_{ij}) = \alpha\nu^2(\alpha + \xi - 1)/[(\xi - 1)^2(\xi - 2)]$, $\mathrm{Cov}(Y_{ij}, Y_{ik}) = \alpha^2\nu^2/[(\xi - 1)^2(\xi - 2)]$, and $\mathrm{cor}(Y_{ij}, Y_{ik}) = \alpha/(\alpha + \xi - 1)$ for the GG model.
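These closed forms are easy to tabulate. The sketch below wraps them in two hypothetical helper functions and evaluates them at arbitrary illustrative parameter values (not the values used in the simulations). It also shows two properties visible in the formulas: the GG marginal correlation does not depend on $\nu$, and the GG marginal variance scales with $\nu^2$.

```python
def lnn_marginal(mu0, v2, tau02):
    """Marginal moments of ln(Y_ij) under LNN; v2 = v^2, tau02 = tau_0^2."""
    var = v2 + tau02
    return {"mean": mu0, "var": var, "cov": tau02, "cor": tau02 / var}


def gg_marginal(alpha, xi, nu):
    """Marginal moments of Y_ij under GG; finite variance requires xi > 2."""
    denom = (xi - 1) ** 2 * (xi - 2)
    var = alpha * nu ** 2 * (alpha + xi - 1) / denom
    cov = alpha ** 2 * nu ** 2 / denom
    return {"mean": alpha * nu / (xi - 1), "var": var,
            "cov": cov, "cor": cov / var}


lnn = lnn_marginal(mu0=0.0, v2=1.0, tau02=1.0)
gg_small = gg_marginal(alpha=2.0, xi=3.0, nu=10.0)
gg_large = gg_marginal(alpha=2.0, xi=3.0, nu=100.0)
```

Here `gg_large` has 100 times the variance of `gg_small` and the same correlation, which mirrors the behaviour reported for the two ν values in the simulation study.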




Statistical Society. Series B, 26:211–246, 1964.

Broet, P., Richardson, S., and Radvanyi, F. Bayesian hierarchical model for iden-tifying changes in gene expression form microarray experiments. Journal ofComputational Biology, 9:671–683, 2002.

Broet, P., Lewin, A., Richardson, S., Dalmasso, C., and Magdelenat, H. A mixturemodel-based strategy for selecting sets of genes in multiclass response microar-ray experiments. Bioinformatics, 20:2562–2571, 2004.

Choi, E. and Hall, P. Data sharpening as a prelude to density estimation. Biometrika,86(4):941–947, 1999.

Dempster, A., Laird, N., and Rubin, D. Maximum likelihood from incomplete datavia the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38,1977.

Do, K.-A., Muller, P., and Tang, F. A bayesian mixture model for differential geneexpression. Journal of the Royal Statistical Society: Series C (Applied Statistics),54:627–644, 2005.

Draper, N.R. and Cox, D.R. On distributions and their transformation to normality.Journal of the Royal Statistical Society. Series B, 31:472–476, 1969.

Dudoit, S., Fridly, J., and Speed, T. Comparison of discrimination methods forthe classification of tumors using gene expression data. Journal of the AmericanStatistical Association, 97(457):77–87, 2002.

Efron, B. Large-scale simultaneous hypothesis testing: The choice of a null hy-pothesis. Journal of the American Statistical Association, 99:96–104, 2004.

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. Empirical bayes analysisof a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160, 2001.

Fraley, C. and Raftery, A. E. MCLUST: Software for model-based clustering anddiscriminant analysis. Journal of Classification, 16:297–306, 1999.

ReferencesBox, G.E.P. and Cox, D. R. An analysis of transformations. Journal of Royal

26












https://www.researchgate.net/publication/267169927_On_Distributions_and_Their_Transformations_to_Normality?el=1_x_8&enrichId=rgreq-255160bd-d472-4e09-9b86-c31a17470aaa&enrichSource=Y292ZXJQYWdlOzQxOTY5NTM2O0FTOjk5MTU3OTMwOTM4Mzk0QDE0MDA2NTI1MDE4NDQ=

https://www.researchgate.net/publication/267169927_On_Distributions_and_Their_Transformations_to_Normality?el=1_x_8&enrichId=rgreq-255160bd-d472-4e09-9b86-c31a17470aaa&enrichSource=Y292ZXJQYWdlOzQxOTY5NTM2O0FTOjk5MTU3OTMwOTM4Mzk0QDE0MDA2NTI1MDE4NDQ=











https://www.researchgate.net/publication/224839719_An_Analysis_of_Transformations_with_Discussion?el=1_x_8&enrichId=rgreq-255160bd-d472-4e09-9b86-c31a17470aaa&enrichSource=Y292ZXJQYWdlOzQxOTY5NTM2O0FTOjk5MTU3OTMwOTM4Mzk0QDE0MDA2NTI1MDE4NDQ=

https://www.researchgate.net/publication/224839719_An_Analysis_of_Transformations_with_Discussion?el=1_x_8&enrichId=rgreq-255160bd-d472-4e09-9b86-c31a17470aaa&enrichSource=Y292ZXJQYWdlOzQxOTY5NTM2O0FTOjk5MTU3OTMwOTM4Mzk0QDE0MDA2NTI1MDE4NDQ=






He, W. A spline function approach for detecting differentially expressed genes inmicroarray data analysis. Bioinformatics, 20(17):2954–2963, 2004.

Kendziorski, C., Newton, M., Lan, H., and Gould, M. N. On parametric empiricalbayes methods for comparing multiple groups using replicated gene expressionprofiles. Statistics in Medicine, 22:3899–3914, 2003.

Lee, M.-L.T. Analysis of Microarray Gene Expression Data. Kluwer AcademicPublishers, 2004.

Lee, M.-L.T., Kuo, F.C., Whitmore, G.A., and Sklar, J. Importance of replicationin microarray gene expression studies: Statistical methods and evidence fromrepetitive cdna hybridizations. Proceedings of the National Academy of Sciences,97:9834–9839, 2000.

Lee, M.-L.T., Lu, W., Whitmore, G.A., and Beier, D. Models for microarray geneexpression data. Journal of Biopharmaceutical Statistics, 12:1–19, 2002.

Lo, K. and Gottardo, R. Flexible empirical bayes models for differential gene ex-pression. Bioinformatics, 23:328–335, 2007.

McLachlan, G.J., Do, K.-A., and Ambroise, C. Analyzing Microarray Gene Ex-pression Data. Wiley, Hoboken, N.J., 2004.

McLachlan, G.J., Bean, R.W., and Jones, L.B.-T. A simple implementation of a nor-mal mixture approach to differential gene expression in multiclass microarrays.Bioinformatics, 22:1608–1615, 2006.

Milligan, G. W. and Cooper, M. C. A study of the comparability of external criteriafor hierarchical cluster analysis. Multivariate Behavioral Research, 21:441–458,1986.

Newton, M. A., Kendziorski, C. M., Richmond, C. S., Blattner, F. R., and Tsui, K.W. On differential variability of expression ratios: Improving statistical inferenceabout gene expression change from microarray data. Journal of ComputationalBiology, 8:37–52, 2001.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D.,and Lander, E. S. Molecular classification of cancer: Class discovery and classprediction by gene expression monitoring. Science, 286(15):531–537, October1999.

Qiu, W.-L. and Joe, H. Separation index and partial membership for clustering. Computational Statistics & Data Analysis, 50:585–603, 2006.

Titterington, D. M., Smith, A.F.M., and Makov, U.E. Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, Inc., 1995.

Tusher, V.G., Tibshirani, R., and Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America, 98:5116–5121, 2001.

Wang, S., Qiu, W.-L., and Zamar, R. H. Clues: A non-parametric clustering method based on local shrinking. Computational Statistics & Data Analysis, 52:286–298, 2007.

Yeung, K.Y., Fraley, C., Murua, A., Raftery, A.E., and Ruzzo, W.L. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17:977–987, 2001.

Pawitan, Y., Michiels, S., Koscielny, S., Gusnanto, A., and Ploner, A. False discovery rates, sensitivity, and sample size calculation for microarray studies. Bioinformatics, 21:3017–3024, 2005.

Pan, W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 18(4):546–554, 2002.
