A VARIATIONAL BAYES BETA MIXTURE MODEL FOR FEATURE SELECTION IN DNA METHYLATION STUDIES

ZHANYU MA*,‡ and ANDREW E. TESCHENDORFF†,§,¶

*KTH-Royal Institute of Technology, School of Electrical Engineering, SE-100 44, Stockholm, Sweden
†Statistical Genomics Group, Paul O'Gorman Building, UCL Cancer Institute, University College London, 72 Huntley Street, London WC1E 6BT, United Kingdom
‡[email protected]
§a.teschendorff@ucl.ac.uk
¶Corresponding author.

Received 13 September 2012
Revised 21 November 2012
Accepted 4 January 2013
Published 14 March 2013

An increasing number of studies are using beadarrays to measure DNA methylation on a genome-wide basis. The purpose is to identify novel biomarkers in a wide range of complex genetic diseases including cancer. A common difficulty encountered in these studies is distinguishing true biomarkers from false positives. While statistical methods aimed at improving the feature selection step have been developed for gene expression, relatively few methods have been adapted to DNA methylation data, which is naturally beta-distributed. Here we explore and propose an innovative application of a recently developed variational Bayesian beta-mixture model (VBBMM) to the feature selection problem in the context of DNA methylation data generated from a highly popular beadarray technology. We demonstrate that VBBMM offers significant improvements in inference and feature selection in this type of data compared to an Expectation-Maximization (EM) algorithm, at a significantly reduced computational cost. We further demonstrate the added value of VBBMM as a feature selection and prioritization step in the context of identifying prognostic markers in breast cancer. A variational Bayesian approach to feature selection of DNA methylation profiles should thus be of value to any study undergoing large-scale DNA methylation profiling in search of novel biomarkers.

Keywords: Feature selection; beta mixture; DNA methylation; variational Bayes.

Journal of Bioinformatics and Computational Biology, Vol. 11, No. 4 (2013) 1350005 (19 pages)
© Imperial College Press
DOI: 10.1142/S0219720013500054



1. Introduction

There is an urgent need to identify novel biomarkers, e.g. prognostic markers, for complex genetic diseases like cancer.1 However, identifying reliable biomarkers from large-scale genome-wide molecular profiling studies is a notoriously difficult problem.1 From a methods perspective, this problem is known as feature selection. The main aim of any feature selection procedure is to identify those features which are more likely to be truly associated with a phenotype of interest. One of the key difficulties of feature selection in the genomics context is the high-dimensional nature of the data, which typically encompass on the order of $10^4$ to $10^6$ features (e.g. genes or single nucleotide polymorphisms (SNPs)) and may therefore generate a significant number of false positives.2 Using stringent statistical significance measures may also result in unacceptably large false negative rates, especially in the context of quantitative data such as gene expression or DNA methylation.3 These problems are further compounded by the presence of confounding factors, which may artificially inflate or deflate statistical significance levels.2,4 Therefore, statistical approaches that aim to extract meaningful features while filtering out false positives have received considerable attention.5–12 One of the most popular methods in the gene expression field has been to filter features based on variance, on the assumption that features exhibiting low variability are more likely to represent noise.5 Others have advocated a semi-supervised approach in which features are first selected using a supervised algorithm and then further selected based on an unsupervised dimensional reduction method such as principal component analysis (PCA) or nonnegative matrix factorization (NMF).6 Subsequently, it was realized that similar improvements in feature selection could be achieved by studying higher-order statistical moments (e.g. skewness or kurtosis) of the molecular profiles (specifically, gene expression profiles).7,8,13 Indeed, novel clinical subtypes and associated biomarkers in prostate and breast cancer were identified using these more advanced feature selection methods.13,14 These novel molecular subclasses and biomarkers in prostate and breast cancer are now well established,15,16 which attests to the power and potential clinical impact that such feature selection methods can have.

Studying higher-order statistical moments (e.g. kurtosis) of molecular profiles (i.e. the expression profile of a gene across a set of samples, or a CpG methylation profile) is not equivalent, but is similar, to the problem of identifying structure in the molecular profile of a given feature.7 Intuitively, a feature exhibiting a striking bi-modality (hence a non-Gaussian distribution) may be of more interest than a feature which exhibits a highly variable but Gaussian profile, especially if the bi-modality is correlated with a phenotype of interest. Indeed, the bi-modality of such a feature is more likely to describe genuine biology and to represent a feature that has not been corrupted by biological noise or technical factors.7 We previously demonstrated this idea of performing feature selection by studying the structure of individual molecular profiles, together with a proof of concept, in the gene expression context7 using a variational Bayesian Gaussian Mixture Model.17


The main purpose of this manuscript is to explore the analogous problem of feature selection in the context of DNA methylation data. DNA methylation is an epigenetic mark, a covalent modification of DNA, which normally happens at CpG dinucleotides, and which plays an important role in cellular differentiation processes and in complex genetic disease.18–24 Indeed, DNA methylation markers have been proposed as early detection, diagnostic and prognostic markers in a wide range of different diseases including cancer.23 Catalyzing this increased interest in epigenomics are significant advances in beadarray technology that now allow routine measurement of DNA methylation at thousands of CpG dinucleotides.25,26 These beadarrays quantify DNA methylation in terms of a β-value, which represents the relative proportion of methylation at the CpG site, thus taking values between 0 (unmethylated) and 1 (fully methylated).25 Although some studies have considered using the logit-transform $y = \log_2[\beta/(1-\beta)]$ instead,27 owing to its more homoscedastic nature, it was shown in Zhuang et al.28 that the logit-basis can lead to worse inference as it can aggravate the effects of outliers (i.e. β values close to 0 or 1): from a biological perspective an outlier at β = 0.999 is not more interesting than one at β = 0.9, yet on the logit scale they would be widely separated. Although normalization and clustering methods designed for beta-valued DNA methylation data have recently been investigated,29–36 there is still a significant shortage of feature selection methods.28,37
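To make the outlier argument concrete, the following minimal sketch evaluates the base-2 logit at the two β values quoted above. The function name `logit2` is ours and purely illustrative; nothing here is taken from the cited studies.

```python
import math


def logit2(beta):
    """Base-2 logit of a methylation value: y = log2(beta / (1 - beta))."""
    return math.log2(beta / (1.0 - beta))


# The two outlier values discussed in the text.
gap_beta = 0.999 - 0.9                     # separation on the beta scale
gap_logit = logit2(0.999) - logit2(0.9)    # separation on the logit scale

print(f"beta-scale gap:  {gap_beta:.3f}")
print(f"logit-scale gap: {gap_logit:.2f}")
```

A gap of 0.099 on the β scale becomes a gap of roughly 6.8 units on the logit scale, which is why values near 0 or 1 can dominate an analysis carried out in the logit basis.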

Thus, the second purpose of this manuscript is to explore the application of a recently developed Variational Bayes beta-mixture model (VBBMM)38 to DNA methylation data. To the best of our knowledge, this is the first application of a VBBMM to this type of data. To assess VBBMM on this data, we first benchmark its performance against an analogous EM+BIC algorithm.36,39,40 Although the advantages of using a variational Bayesian approach over EM+BIC are well understood,41,42 it is important to investigate the relative performance of these methods in novel contexts. To perform the comparison between methods, we focus on DNA methylation data where effect sizes are small so as to provide a more challenging scenario for the algorithms. Specifically, we use DNA methylation data from whole blood samples from ovarian cancer patients and age-matched healthy controls, where the differences in DNA methylation between cases and controls are driven by relatively small changes in blood cell type composition, as demonstrated by us previously.43 Clearly, in the opposite extreme case where effect sizes are fairly large, for instance when comparing normal to cancer epithelial tissue, both types of algorithm are expected to yield similar results. The restriction to small effect sizes is also of particular interest since the evidence so far points towards epidemiological and disease risk DNA methylation markers of relatively small effect sizes.23,24,43,44

This manuscript is organized as follows. In Sec. 2 we describe the DNA methylation data sets and review the VBBMM. In Sec. 3 we first compare VBBMM to EM+BIC in the context of DNA methylation data, and demonstrate the improved sensitivity and positive predictive value that VBBMM offers over EM+BIC. We subsequently apply VBBMM to the problem of feature selection in large-scale DNA methylation studies and demonstrate its added value in the context of identifying prognostic markers in breast cancer. Section 4 presents our conclusions.

2. Data and Methods

2.1. The Illumina Infinium DNAm assay

All DNA methylation data sets were generated using Illumina's Infinium Human Methylation 27k Beadchips25 and have already been presented elsewhere.43,44 The Beadchips interrogate the methylation status of approximately 27,000 CpGs. In this work we used the normalized data as described in Refs. 43 and 44. Let $i$ denote the CpG and $j$ the sample. The normalized methylation values of the CpGs follow an approximately β-valued distribution, with β constrained to lie between 0 (unmethylated locus) and 1 (methylated). This follows from the definition of β as the ratio of methylated to combined intensity values, i.e.

$$\beta_{ij} = \frac{M_{ij}}{U_{ij} + M_{ij} + e}, \tag{1}$$

where $U_{ij}$ and $M_{ij}$ are the unmethylated and methylated intensity values of the probe (averaged over bead replicates) and $e$ is a small correction term to regularize probes of low total signal intensity (i.e. probes with $U_{ij} + M_{ij} \approx 0$ after background subtraction). Thus, our data matrices $X_{ij}$ are such that $X_{ij} = \beta_{ij}$, where $\beta_{ij}$ is the normalized methylation value as given above.
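As a small illustration of Eq. (1), the sketch below computes β from a pair of probe intensities. The default offset of 100 and the intensity values are assumptions made for this example; the text only states that $e$ is a small correction term, and the helper name `beta_value` is ours.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Eq. (1): beta = M / (U + M + e).

    `offset` plays the role of e; its default is an assumed value,
    not one taken from the paper.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("intensities must be non-negative")
    return methylated / (unmethylated + methylated + offset)


# Hypothetical (M, U) intensities: mostly methylated, mostly unmethylated,
# and a probe with no signal after background subtraction.
probes = [(9000.0, 500.0), (300.0, 8000.0), (0.0, 0.0)]
betas = [beta_value(m, u) for m, u in probes]
print([round(b, 3) for b in betas])
```

The offset keeps the ratio defined for the zero-signal probe and pulls low-intensity probes towards 0 instead of letting noise produce arbitrary values.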

2.2. The data

Data Set 1: DNAm of whole blood samples from ovarian cancer patients before treatment and age-matched healthy controls

We consider a DNAm data set over 25,642 CpGs consisting of 261 whole blood samples, 113 of these from women with ovarian cancer (cases) and 148 from age-matched healthy women (controls).43 We previously showed that many CpGs are differentially methylated between cases and controls, and that there was an enrichment for differentially methylated CpGs mapping to markers of lymphocytes and granulocytes (a total of 138 CpGs) (Supp. Table S1),45 reflecting an increase in the granulocyte to lymphocyte ratio in the presence of ovarian cancer.43 Because the DNA methylation changes reflect changes in blood cell type composition, the associated effect sizes are small, making it an ideal scenario in which to evaluate feature selection methods.

Data Set 2: DNAm of breast cancer tissue samples

This Illumina 27k DNAm data set is defined over 24,589 CpGs and 113 breast cancer tissue samples.46 Of the 113 patients, 59 died of the disease or disease-related causes (overall survival) and 54 remained alive until end of study or were lost to follow-up.

Data Set 3: DNAm of breast cancer tissue samples

An independent Illumina 27k DNAm data set over 27,578 CpGs and 103 breast cancer tissue samples.47 Since survival information was not available for these samples, we used relapse-free survival as a surrogate. Of the 103 patients, relapse information was available for 82, of whom 18 relapsed and 64 did not.

2.3. The variational Bayes Beta Mixture Model (VBBMM)

2.3.1. The variational Bayesian beta-mixture model

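Background to the variational Bayes method can be found elsewhere.41,48 Here we briefly review the VBBMM, full details of which are described in Ma et al.38 The probability density function of the beta distribution is

$$\mathrm{Beta}(x; u, v) = \frac{1}{\mathrm{beta}(u, v)}\, x^{u-1} (1-x)^{v-1}, \quad u, v > 0, \tag{2}$$

where $\mathrm{beta}(u, v)$ is the beta function, $\mathrm{beta}(u, v) = \Gamma(u)\Gamma(v)/\Gamma(u+v)$, and $\Gamma(\cdot)$ is the gamma function defined as $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$. The shape of the beta distribution depends on the two shape parameters $u, v$. Assuming a mixture model and a set of i.i.d. observations $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the likelihood is given as

$$f(\mathbf{X}; \boldsymbol{\Pi}, \mathbf{U}, \mathbf{V}) = \prod_{n=1}^{N} f(\mathbf{x}_n; \boldsymbol{\Pi}, \mathbf{U}, \mathbf{V}), \tag{3}$$

with

$$f(\mathbf{x}; \boldsymbol{\Pi}, \mathbf{U}, \mathbf{V}) = \sum_{i=1}^{I} \pi_i\, \mathrm{Beta}(\mathbf{x}; \mathbf{u}_i, \mathbf{v}_i) \tag{4}$$

$$= \sum_{i=1}^{I} \pi_i \prod_{l=1}^{L} \mathrm{Beta}(x_l; u_{li}, v_{li}), \tag{5}$$

and where $\mathbf{x} = \{x_1, \ldots, x_L\}$, $\boldsymbol{\Pi} = \{\pi_1, \ldots, \pi_I\}$, $\mathbf{U} = \{\mathbf{u}_1, \ldots, \mathbf{u}_I\}$ and $\mathbf{V} = \{\mathbf{v}_1, \ldots, \mathbf{v}_I\}$. $\{\mathbf{u}_i, \mathbf{v}_i\}$ denote the parameter vectors of the $i$th mixture component, and $u_{li}, v_{li}$ are the (scalar) parameters of the beta distribution for element $x_l$.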
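The density and likelihood above can be evaluated numerically as in the following minimal sketch for the scalar case ($L = 1$), which is the setting of a single CpG profile. The function names are ours, and working on the log scale with a log-sum-exp step is an implementation choice for numerical stability rather than something prescribed by the model.

```python
import math


def beta_logpdf(x, u, v):
    """Log of the beta density in Eq. (2), for 0 < x < 1 and u, v > 0."""
    log_norm = math.lgamma(u + v) - math.lgamma(u) - math.lgamma(v)
    return log_norm + (u - 1.0) * math.log(x) + (v - 1.0) * math.log(1.0 - x)


def mixture_loglik(xs, weights, us, vs):
    """Log of the mixture likelihood in Eqs. (3)-(4) for scalar observations."""
    total = 0.0
    for x in xs:
        # Per-component log(pi_i) + log Beta(x; u_i, v_i).
        terms = [math.log(w) + beta_logpdf(x, u, v)
                 for w, u, v in zip(weights, us, vs)]
        # Log-sum-exp over components avoids underflow.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


# Hypothetical two-component mixture: one low and one high methylation mode.
xs = [0.08, 0.12, 0.85, 0.91]
print(mixture_loglik(xs, [0.5, 0.5], [2.0, 9.0], [9.0, 2.0]))
```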

In order to perform the Bayesian analysis one seeks a conjugate prior for the beta distribution. It can be shown that the conjugate prior is

$$f(u, v) = \frac{1}{C(\nu_0, \alpha_0, \beta_0)} \left[\frac{\Gamma(u+v)}{\Gamma(u)\Gamma(v)}\right]^{\nu_0} e^{-\alpha_0 (u-1)}\, e^{-\beta_0 (v-1)}, \tag{6}$$

where $\nu_0$, $\alpha_0$, $\beta_0$ are free positive parameters and $C(\nu_0, \alpha_0, \beta_0)$ is a normalization factor such that $\int_0^\infty \int_0^\infty f(u, v)\, du\, dv = 1$. Indeed, this leads to the posterior (with $N$ i.i.d. scalar observations $X = \{x_1, \ldots, x_N\}$)

$$f(u, v \mid X) = \frac{f(X \mid u, v)\, f(u, v)}{\int_0^\infty \int_0^\infty f(X \mid u, v)\, f(u, v)\, du\, dv} \tag{7}$$

$$= \frac{1}{C(\nu_N, \alpha_N, \beta_N)} \left[\frac{\Gamma(u+v)}{\Gamma(u)\Gamma(v)}\right]^{\nu_N} e^{-\alpha_N (u-1)}\, e^{-\beta_N (v-1)}, \tag{8}$$

where $\nu_N = \nu_0 + N$, $\alpha_N = \alpha_0 - \sum_{n=1}^{N} \ln x_n$ and $\beta_N = \beta_0 - \sum_{n=1}^{N} \ln(1 - x_n)$.

However, this expression is analytically intractable. In Ref. 38 a variational solution was proposed by approximating the conjugate prior as

$$f(u, v) \approx f(u)\, f(v), \tag{9}$$

where

$$f(u; \mu, \alpha) = \frac{\alpha^{\mu}}{\Gamma(\mu)}\, u^{\mu-1} e^{-\alpha u}, \tag{10}$$

$$f(v; \nu, \beta) = \frac{\beta^{\nu}}{\Gamma(\nu)}\, v^{\nu-1} e^{-\beta v}. \tag{11}$$

The same form of approximation then applies to the posterior distribution as

$$f(u, v \mid X) \approx f(u \mid X)\, f(v \mid X). \tag{12}$$
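The hyperparameter updates of the exact conjugate posterior in Eq. (8) are simple sums over the data, as the following sketch shows. The function and variable names (`nu0`, `alpha0`, `beta0`) are ours and stand for the three prior hyperparameters; the prior values used in the example are arbitrary. Note that this only computes the hyperparameters: the normalization factor, which is what makes the exact posterior intractable, is not evaluated here, and the variational updates of Ref. 38 are not reproduced.

```python
import math


def posterior_hyperparameters(xs, nu0, alpha0, beta0):
    """Hyperparameters of the conjugate posterior in Eq. (8).

    nu_N    = nu0 + N
    alpha_N = alpha0 - sum(ln x_n)
    beta_N  = beta0  - sum(ln(1 - x_n))
    """
    nu_n = nu0 + len(xs)
    alpha_n = alpha0 - sum(math.log(x) for x in xs)
    beta_n = beta0 - sum(math.log(1.0 - x) for x in xs)
    return nu_n, alpha_n, beta_n


# Hypothetical methylation values for one CpG and an arbitrary prior.
xs = [0.10, 0.15, 0.20, 0.12]
print(posterior_hyperparameters(xs, nu0=1.0, alpha0=1.0, beta0=1.0))
```

Because every $x_n$ lies strictly between 0 and 1, both log sums are negative, so $\alpha_N$ and $\beta_N$ can only grow as data accumulate.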

Next, a hierarchical model for Bayesian estimation can be constructed following the principles of graphical models.49 For each observation $\mathbf{x}_n$, let the corresponding $\mathbf{z}_n = [z_{n1}, \ldots, z_{nI}]^T$ be the indicator vector with one element equal to 1 and the rest equal to 0. Denoting $\mathbf{Z} = \{\mathbf{z}_1, \ldots, \mathbf{z}_N\}$ and assuming the indicator vectors are independent given the mixing coefficients, the conditional distribution of $\mathbf{Z}$ given $\boldsymbol{\Pi}$ is

$$f(\mathbf{Z} \mid \boldsymbol{\Pi}) = \prod_{n=1}^{N} \prod_{i=1}^{I} \pi_i^{z_{ni}}. \tag{13}$$

Introducing the Dirichlet distribution as the prior distribution of the mixing coefficients, the probability function of $\boldsymbol{\Pi}$ can be written as

$$f(\boldsymbol{\Pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \mathbf{c}) = C(\mathbf{c}) \prod_{i=1}^{I} \pi_i^{c_i - 1}, \tag{14}$$

where $C(\mathbf{c}) = \Gamma(\hat{c}) / [\Gamma(c_1) \cdots \Gamma(c_I)]$ and $\hat{c} = \sum_{i=1}^{I} c_i$.
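The hierarchical structure just described can be read as a generative recipe: draw the mixing coefficients from the Dirichlet prior, draw a component indicator for each observation, then draw the observation from that component's beta distribution. The sketch below simulates this for the scalar case with fixed shape parameters. All numerical values and the function name are illustrative assumptions; in the full model the shape parameters also carry gamma priors, which this sketch does not sample.

```python
import random


def sample_beta_mixture(n, c, us, vs, seed=0):
    """Simulate n scalar observations from a beta mixture.

    c       : Dirichlet concentration parameters (one per component)
    us, vs  : fixed beta shape parameters per component
    Returns (mixing coefficients, component labels, observations).
    """
    rng = random.Random(seed)
    # A Dirichlet draw is a vector of gamma variates normalised to sum to 1.
    gammas = [rng.gammavariate(ci, 1.0) for ci in c]
    total = sum(gammas)
    pi = [g / total for g in gammas]
    labels, xs = [], []
    for _ in range(n):
        # The indicator z_n picks exactly one component, as in Eq. (13).
        i = rng.choices(range(len(pi)), weights=pi)[0]
        labels.append(i)
        xs.append(rng.betavariate(us[i], vs[i]))
    return pi, labels, xs


pi, labels, xs = sample_beta_mixture(200, c=[1.0, 1.0], us=[2.0, 8.0], vs=[8.0, 2.0], seed=1)
print([round(p, 3) for p in pi], labels[:10])
```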


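Finally, the logarithm of the full joint density function of the data $\mathbf{X}$ and all the i.i.d. latent variables $\mathcal{Z} = \{\mathbf{U}, \mathbf{V}, \boldsymbol{\Pi}, \mathbf{Z}\}$ is given by38

\begin{align}
\mathcal{L}(\mathbf{X}, \mathcal{Z}) &= \ln f(\mathbf{X}, \mathbf{Z}, \mathbf{U}, \mathbf{V}, \boldsymbol{\Pi}) \tag{15} \\
&= \mathrm{const.} + \sum_{n=1}^{N} \sum_{i=1}^{I} z_{ni} \Big\{ \ln \pi_i + \sum_{l=1}^{L} \ln \frac{\Gamma(u_{li} + v_{li})}{\Gamma(u_{li})\Gamma(v_{li})} \tag{16} \\
&\quad + \sum_{l=1}^{L} \big[ (u_{li} - 1) \ln x_{ln} + (v_{li} - 1) \ln(1 - x_{ln}) \big] \Big\} \tag{17} \\
&\quad + \sum_{l=1}^{L} \sum_{i=1}^{I} \big[ (\mu_{li} - 1) \ln u_{li} - \alpha_{li} u_{li} \big] \tag{18} \\
&\quad + \sum_{l=1}^{L} \sum_{i=1}^{I} \big[ (\nu_{li} - 1) \ln v_{li} - \beta_{li} v_{li} \big] + \sum_{i=1}^{I} (c_i - 1) \ln \pi_i. \tag{19}
\end{align}

The specific update rules for the parameters can be found in Ma et al.38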

2.3.2. Algorithm comparison

We benchmark the variational Bayesian beta mixture model against an analogous beta mixture model implementation that uses the Expectation-Maximization (EM) algorithm for fitting and the Bayesian Information Criterion (BIC) for model selection.50
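For orientation, BIC-based model selection can be sketched as follows. We use the standard definition $\mathrm{BIC} = -2 \ln \hat{L} + k \ln N$ and assume that a univariate mixture with $I$ beta components has $k = 3I - 1$ free parameters (two shape parameters per component plus $I - 1$ free mixing weights). That parameter count, the function names and the log-likelihood values in the example are our assumptions, not details taken from the benchmarked implementation.

```python
import math


def bic(loglik, n_components, n_obs):
    """BIC = -2 ln L + k ln N, with k = 3I - 1 for a univariate beta mixture."""
    k = 3 * n_components - 1
    return -2.0 * loglik + k * math.log(n_obs)


def select_by_bic(logliks, n_obs):
    """Pick the number of components with the lowest BIC.

    logliks maps a number of components to the maximised log-likelihood
    obtained from a separate EM fit (the EM fitting itself is not shown).
    """
    return min(logliks, key=lambda i: bic(logliks[i], i, n_obs))


# Hypothetical maximised log-likelihoods for 1 to 3 components, N = 261.
modest_gain = {1: 100.0, 2: 103.0, 3: 104.0}
large_gain = {1: 100.0, 2: 120.0, 3: 121.0}
print(select_by_bic(modest_gain, 261), select_by_bic(large_gain, 261))
```

With 261 samples each extra component costs about $3 \ln 261 \approx 16.7$ BIC units, so a second cluster is only selected when it raises the log-likelihood by more than about 8.3. This is one way to see why a BIC-penalized fit may miss structure when effect sizes are small.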

3. Experimental Results

3.1. Improved sensitivity of the variational Bayesian BMM on DNA methylation data

To assess the VBBMM we benchmarked it against an analogous EM+BIC beta mixture model. In our first test we compared the two algorithms in their ability to detect biological structure in DNA methylation profiles (i.e. bi-modality or multi-modality which correlates with a biological phenotype). To this end we studied DNA methylation profiles of whole blood samples from 113 ovarian cancer cases and 148 age-matched healthy controls (DataSet1, Methods), focusing on a subset of 138 CpGs which map to genes marking lymphocytes and granulocytes (Supp. Table S1),43 the two main cell constituents of whole blood. Since the granulocyte to lymphocyte ratio is increased in the blood of ovarian cancer patients,48,51,52 the CpGs associated with these genes should be differentially methylated. From the point of view of unsupervised clustering, the DNA methylation profiles of each of these CpGs should exhibit structure, i.e. the optimal model should be one with at least two clusters, with the clusters correlating with the case/control phenotype. Thus, we ran the VB and EM+BIC algorithms separately on each of the 138 CpG methylation profiles and recorded the optimal number of clusters. Using the VB mixture model, 120 of these 138 CpGs (i.e. 87%) exhibited structure, in stark contrast to the EM+BIC model where only 29 (21%) did (Table 1). All but three of the CpGs that showed structure under the EM+BIC model also did so under the VB approach. In contrast, 94 CpGs showed structure only under the VB model.

Although the selected CpGs should show structure on biological grounds, one still needs to demonstrate that the inferred clustering structure is of biological relevance, and in particular that the CpGs identified to have structure only under the VB model are more correlated with the phenotype of interest than those identified under EM+BIC. Thus, we asked how well the specific clusters, inferred using the two algorithms, correlated with case/control status. To evaluate the concordance between the clustering output and a binary phenotype one needs a correlative measure which can handle clustering solutions of more than two clusters. The adjusted Rand Index (ARI)53,54 has been used extensively for this purpose (see e.g. Ref. 55 for the rationale of using the Rand Index). The ARI can be viewed as a Rand Index corrected for random chance, with values further away from 0 reflecting stronger statistical significance.54 The ARI analysis showed that the clusters inferred using the VB approach were indeed more strongly associated with case/control status (Fig. 1(a)). A typical example of a DNA methylation profile where the VB algorithm predicted structure but EM+BIC did not confirmed that the inferred clusters correlate significantly with the phenotype of interest (Fig. 1(b)). Thus, we can conclude that not only does the VB model identify more structure in DNA methylation data, thus potentially allowing for improved feature selection, but that the inferred clusters themselves are more strongly associated with the biological phenotype.
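For readers unfamiliar with the ARI, the sketch below computes it from pair counts using the standard chance-corrected formula. It accepts any number of clusters on either side, so a three-cluster solution can be compared directly with a binary case/control label. The function name and the convention of returning 0 in the fully degenerate case are our choices for this illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 0.0
    # Pairs of samples placed together in both, in a only, and in b only.
    sum_ab = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:
        return 0.0   # our convention when both labelings are trivial
    return (sum_ab - expected) / (maximum - expected)


# Hypothetical example: three inferred clusters against case (C) / control (N).
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
status = ["N", "N", "N", "N", "C", "C", "C", "C"]
print(round(adjusted_rand_index(clusters, status), 3))
```

A one-cluster solution always yields an ARI of exactly 0 against a two-class phenotype, which matches the treatment of structureless profiles as having ARI = 0.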

3.2. The variational Bayesian BMM improves the positive predictive value

To further compare the algorithms, we adopted a discovery/test set partition strategy. The data were split into a 50% discovery and a 50% test set, and features were selected from the discovery set using either EM+BIC or VB. A total of 50 different discovery/test set partitions were considered. In line with our previous results, the VB algorithm offered substantially improved power for detecting CpGs with clustering structure in their methylation profiles (Fig. 2(a)) and the clustering itself was also more strongly associated with the phenotype of interest (Figs. 2(b) and 2(c)). The improvement of VB over EM+BIC was more substantial here, owing to the smaller sample size of the discovery set. Importantly, those CpGs selected to exhibit structure and significant Rand Index values under the VB model in the discovery sets also exhibited much higher ARI values in the corresponding test sets, compared to those features selected under the EM+BIC model (Fig. 2(d)). This indicates that the VB model improves reproducibility and is far superior to EM+BIC in identifying the most relevant features. Indeed, many of the CpGs predicted to exhibit clustering under EM+BIC in the discovery set were not replicated in the evaluation set.

Table 1. Distribution of the 138 CpGs in DataSet1 by the optimal number of clusters in their DNA methylation profiles, as estimated using the EM+BIC and VB algorithms. In both cases the maximum number of clusters considered was 6.

Clusters   1     2     3     4     5     6
EM+BIC     109   27    2     0     0     0
VB         18    27    51    33    9     0
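The repeated discovery/test partitioning used in this section can be sketched as follows. The text does not say whether the partitions were stratified by case/control status, so this sketch uses plain random splits; the function name, the seed and the use of sample indices are assumptions made for illustration.

```python
import random


def discovery_test_splits(n_samples, n_splits=50, seed=0):
    """Yield (discovery, test) index lists for repeated random 50/50 splits."""
    rng = random.Random(seed)
    half = n_samples // 2
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The two halves are disjoint and together cover every sample.
        yield sorted(indices[:half]), sorted(indices[half:])


# DataSet1 has 261 samples, so each split gives 130 discovery and 131 test samples.
splits = list(discovery_test_splits(261, n_splits=50, seed=0))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Features would then be selected on each discovery half and their ARI evaluated on the matching test half, giving 50 values per algorithm for the boxplots.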

3.3. Improved positive predictive value on feature selection from all 27k CpGs

So far, our analysis has focused on 138 CpGs which mark lymphocytes and granulocytes and which therefore should be discriminatory of cancer/normal status, as explained in Ref. 43. To show that the improved positive predictive value of VB over EM+BIC is independent of this prior selection of CpGs, we next considered all 25,642 CpGs on the Illumina Infinium Beadchip. The data set was split into two mutually exclusive partitions of 130 (74 healthy and 56 cases) and 131 (74 healthy and 57 cases) samples. Because of the high computational cost associated with running EM+BIC individually on each of 25,642 CpGs, we used t-test P-values to first rank and select 1000 features in each partition. Of the two separate 1000-CpG lists, 537 CpGs overlapped. Running EM+BIC and VB separately on each of these two lists of 1000 features, we next selected the CpGs exhibiting clustering structure. Only 14 of the 537 overlapping CpGs exhibited structure in both partitions under EM+BIC, in stark contrast to 428 overlapping CpGs under the VB model. Comparing the adjusted Rand Indices of these subsets of CpGs in the corresponding test set partition further showed that these were higher for the CpGs selected under the VB model (Fig. 3). In all cases, the ARI values were significantly higher than random, demonstrating once again that the inferred clusters correlate significantly with cancer/normal status.

Fig. 1. (a) Adjusted Rand Index (ARI, y-axis) compared between CpGs exhibiting structure under EM+BIC (BIC), CpGs exhibiting structure only under the VB model (OnlyVB) and CpGs exhibiting structure under the VB model (VB). The number of CpGs in each class is given in brackets. The P-values are from a Wilcoxon rank sum test comparing each of the OnlyVB and VB categories to the BIC category. (b) An example DNA methylation profile of a CpG for which the VB model inferred structure but the EM+BIC model did not. The clusters inferred using the VB model are shown in different colors. Squares denote controls (N), diamonds denote cases (C). The distribution of cases and controls in each cluster is given together with the associated Fisher-test P-value.
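The pre-filtering step of Sec. 3.3 (rank by t-test within each partition, keep the top features, then intersect the two lists) can be sketched as below. We assume a pooled-variance two-sample t-statistic; since every feature in a partition shares the same group sizes and hence the same degrees of freedom, ranking by the absolute t-statistic gives the same order as ranking by P-value. The text does not specify which t-test variant was used, and the feature identifiers and values are made up.

```python
import math


def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance (0.0 if there is no variance)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((a - mx) ** 2 for a in x) + sum((b - my) ** 2 for b in y)
    pooled_var = ss / (nx + ny - 2)
    if pooled_var == 0.0:
        return 0.0
    return (mx - my) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))


def top_k(features, cases, controls, k):
    """Return the k feature ids with the largest absolute t-statistic.

    features maps a feature id to its list of values across samples;
    cases and controls are lists of sample indices.
    """
    score = {
        fid: abs(pooled_t([v[i] for i in cases], [v[i] for i in controls]))
        for fid, v in features.items()
    }
    return set(sorted(score, key=score.get, reverse=True)[:k])


# Hypothetical toy data: four samples, the first two being cases.
toy = {"cpg_a": [0.1, 0.2, 0.8, 0.9], "cpg_b": [0.5, 0.6, 0.5, 0.6]}
selected = top_k(toy, cases=[0, 1], controls=[2, 3], k=1)
print(selected)
```

Applying `top_k` to each partition and intersecting the two resulting sets with `&` gives the overlapping candidate list on which the mixture models are then run.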

3.4. Application to identifying prognostic markers

To demonstrate the practical utility of the VBBMM in an omic context, we considered the problem of identifying prognostic DNA methylation markers. Prognostic DNA methylation markers have been identified in many cancers.46,47,56,57 As with blood-based diagnostic markers, the expected effect sizes of prognostic markers are small; however, unlike diagnostic markers, we would expect a much smaller number of DNAm markers to correlate with clinical outcome.46,47 Thus, this represents a challenging scenario which may benefit from application of a clustering algorithm in the feature selection process. Indeed, we posited that identification of prognostic markers may benefit from an additional clustering step, similar to the improvements we noted previously in the gene expression context.7

Fig. 2. (a) Boxplots comparing the fraction of the 138 CpGs that exhibit structure under the EM+BIC (BIC) and VB algorithms. The boxplots show the distributions of these fractions across 50 different discovery sets. Since the 138 CpGs were selected to exhibit structure associated with cancer/normal status, we denote this fraction as the power obtained by the feature selection algorithm (Power). (b) Boxplots of the adjusted Rand Index (ARI) values for the 138 CpGs, averaged over the 50 discovery sets. (c) Boxplots of the mean ARI values over features exhibiting clustering in the discovery set. Here, each boxplot shows the distribution of this mean ARI across the 50 distinct discovery sets. (d) Boxplots of the ARI values, averaged over selected features (i.e. those exhibiting clustering in the discovery set), as evaluated in the corresponding evaluation/test set. Boxplots represent data over the 50 distinct discovery-test set partitions. In panels (a), (c) and (d) the P-value is from an unpaired two-tailed Wilcoxon rank sum test, in panel (b) from a paired two-tailed rank sum test.

Because of the computational cost of running a beta mixture model for on the order of $10^4$ features, we here adopted the following three-step feature selection strategy:

(1) First, an initial feature selection is performed using standard statistics and standard corrections for multiple testing. This yields an initial candidate list.

(2) Second, on this candidate feature list, we run the VBBMM algorithm to identify those features exhibiting structure which is compatible (i.e. correlated) with the phenotype of interest.

(3) Third, a new statistic based on the inferred structure (the ARI) is introduced to rerank the candidate features. We note that this procedure penalizes structureless features and places them at the bottom of the list. This yields a new final ranked list of candidate biomarkers.

Our hypothesis is that steps 2 and 3 improve the ranking of the features, promoting true positives to the top of the list, while penalizing and eliminating false positives, thus allowing more robust biomarkers to be identified. To determine the robustness of the candidate biomarkers we use an independent validation set.
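The reranking in the final step of the strategy above amounts to a sort on the ARI, as in this minimal sketch. The candidate identifiers, statistics and ARI values are hypothetical, and breaking ties by the absolute value of the initial statistic is our assumption; the text only states that candidates are reranked by ARI and that structureless features end up at the bottom.

```python
def rerank_by_ari(candidates):
    """Rerank candidate features by ARI, highest first.

    candidates is a list of (feature_id, abs_initial_statistic, ari) tuples
    produced by the initial selection and the mixture-model step.
    Ties in ARI are broken by the initial statistic (our assumption).
    """
    return sorted(candidates, key=lambda c: (-c[2], -c[1]))


# Hypothetical candidates: cpg_A has the strongest initial statistic but no structure.
candidates = [
    ("cpg_A", 4.1, 0.00),
    ("cpg_B", 3.2, 0.35),
    ("cpg_C", 3.9, 0.12),
    ("cpg_D", 3.0, 0.00),
]
for feature_id, stat, ari in rerank_by_ari(candidates):
    print(feature_id, stat, ari)
```

In this toy example the feature with the strongest initial statistic drops below both features with non-zero ARI, which is the penalization of structureless features that the strategy relies on.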

To test this idea, we used the breast cancer samples of DataSet2 (113 samples) as a discovery set, ranking all 24,589 CpGs according to a Cox regression (with overall survival as endpoint). As a candidate feature list, we selected the top 634 ranked CpGs at an estimated false discovery rate (FDR) < 0.25. Thus, up to about a quarter of the 634 CpGs (roughly 160) are expected to be false positives. Next, we applied the VB algorithm to each of these 634 CpGs, computed their ARI, and finally prioritized them according to their ARI value. We verified that many of the top 100 reranked CpGs had statistically significant or marginally significant ARI values (Supp. Fig. 1). A total of 129 of the 634 CpGs were deemed structureless (ARI = 0) by the VB algorithm. Thus, we compared the group of 100 highly reranked CpGs (i.e. highest ARI) to the 100 lowest reranked ones (i.e. with ARI = 0), to determine which subset validated better in an independent data set (DataSet3). Absolute Cox statistics were significantly higher for the top 100 ARI-reranked CpGs compared to those with ARI = 0 or those highly ranked only by Cox statistics (Fig. 4).

Fig. 3. Boxplots of adjusted Rand Index (ARI) values for CpGs, selected by t-tests and EM+BIC (indicated as BIC) or VB from one partition, as evaluated in the mutually exclusive partition. In the EM+BIC case there were only 14 overlapping CpGs exhibiting structure in each partition, while in the VB case there were 428. In the VB case, we also plot 428 "null" ARI values obtained by taking the 95% quantile from 100 randomizations of the phenotype labels. P-values given are from a Wilcoxon rank sum test comparing the distribution of ARI values between BIC and VB.

As another benchmark, we also compared the absolute Cox statistics in the test set against those of 100 CpGs selected using a sparse NMF58 in the discovery set. NMF was applied in a semi-supervised context, analogous to the semi-supervised PCA algorithm of Bair et al.6 Specifically, NMF was applied to the data matrix of the 634 top Cox-ranked CpGs (FDR < 0.25), followed by selection of the 100 CpGs with the strongest weights in the basis NMF component with the largest absolute Cox statistic in the discovery set. We observed that the top 100 ARI-reranked CpGs also had higher Cox statistics in the test set than those selected via NMF (Fig. 4).

The prognostic CpGs highly ranked under the VB model also showed a stronger
level of consistency than those with zero ARI values (Fig. 5 & SuppFig. 2). In fact,
the reranking induced by the ARI identified 28 hypomethylated prognostic CpGs
among the top ranked features (Fig. 5 & SuppFig. 2), twice as many as among those with
ARI = 0. In contrast, if CpGs were ranked only according to their Cox-statistics, we
observed that this ranking did not influence the cross-validation accuracy (Fig. 5 &
SuppFig. 2). Importantly, twice as many CpGs validated among the top 100 VB/ARI
reranked CpGs than among the top 100 Cox-ranked ones (Fig. 5 & SuppFig. 2),
supporting the view that the structural inference step can improve the prioritization
of relevant features in discovery/training sets. This result was robust to a more

Fig. 4. Comparison of the absolute Cox-statistics (Wald z-statistics) in the test set (DataSet3) of the top 100 CpGs ranked according to the Cox-statistic (Cox) in the discovery set (DataSet2) against the top 100 CpGs ranked according to the combined Cox+ARI strategy (Cox + high ARI) (in the discovery set), 100 CpGs with zero ARI (ARI = 0), and finally also 100 CpGs selected from the NMF basis component correlating strongest with survival in the discovery set (NMF1). Wilcoxon rank sum test P-values between (Cox + high ARI) and the other two classes, as well as between (Cox + high ARI) and NMF, are given.


stringent significance threshold used in the test set (SuppFig. 3). We also observed
that feature selection using the combined Cox+ARI strategy was superior to that
provided by the top (NMF1) and top two (NMF2) NMF components (Fig. 5).

Of note, among the 28 validated hypomethylated CpGs (SuppTab. 2), 4 mapped
to genes (ATP2B3, SLC25A31, NOS3, ITPR2) involved in the KEGG Calcium
Signaling Pathway, a 15-fold enrichment (Odds Ratio = 15.3, Fisher's exact test
P = 0.0002). Interestingly, calcium signaling is required for activation of the
Epithelial to Mesenchymal transition (EMT) pathway and the associated silencing of
the E-Cadherin gene,59 and so, given that activation of EMT and low expression of
E-Cadherin are hallmarks of poor prognosis in breast cancer,14,60,61 the observed
hypomethylation of genes in the calcium signaling pathway in poor outcome breast
cancers is consistent with their overexpression and the observed overactivation of
EMT in these cancers.
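
For readers wishing to reproduce this kind of enrichment analysis, the sketch below shows how an odds ratio and a one-sided Fisher's exact test P-value can be obtained from a 2x2 table. It is an illustration under assumptions: the function names are hypothetical, the one-sided (upper-tail) form is assumed since the sidedness of the test is not stated here, and the counts in the toy example are invented for illustration and are not the counts underlying the result quoted above.

```python
from math import comb


def hypergeom_upper_tail(k, n_drawn, n_success, n_total):
    """P(X >= k) for a hypergeometric draw without replacement.

    n_total   : size of the background (e.g. all annotated genes)
    n_success : background items in the pathway
    n_drawn   : size of the selected list
    k         : selected items that fall in the pathway
    This equals the one-sided Fisher's exact test P-value for enrichment.
    """
    upper = min(n_drawn, n_success)
    denom = comb(n_total, n_drawn)
    tail = sum(
        comb(n_success, i) * comb(n_total - n_success, n_drawn - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / denom


def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Toy counts (purely illustrative): 100 background genes, 10 in the pathway,
# 5 selected genes of which 3 are in the pathway.
p_toy = hypergeom_upper_tail(3, 5, 10, 100)
or_toy = odds_ratio(3, 2, 7, 88)  # in/out of pathway, selected vs not selected
```

Exact integer arithmetic via math.comb avoids the rounding problems that factorial-based formulas encounter for large backgrounds.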

4. Discussion and Conclusions

We have here proposed a novel feature selection algorithm which is specific to DNA
methylation data generated with Illumina beadarrays. The algorithm is based on the
hypothesis that features with bi- or multi-modal DNA methylation profiles and for
which the structure is correlated to a phenotype of interest are more likely to be true
positives. As such, they are more likely to exhibit larger absolute statistics in
independent data. We have verified this in the context of cancer diagnostic markers in
whole blood and prognostic markers in breast cancer, both challenging scenarios in
which effect sizes are small.

Fig. 5. Barplots showing the number of CpGs, declared as significantly associated with clinical outcome in the breast cancer training set, and which are also significant in the test set. These numbers are shown for those hypermethylated (hyperM) and hypomethylated (hypoM) in poor prognosis breast cancers separately, and across six different feature selection strategies. (1) Cox+ARI: Top 100 CpGs ranked according to Cox statistic and then reranked by ARI. (2) Cox+(ARI = 0): 100 CpGs ranked high by Cox-statistic but with ARI = 0. (3) Cox-high: Top 100 CpGs ranked highest by Cox-statistic regardless of ARI. (4) Cox-low: 100 CpGs ranked high by Cox-statistic but only marginally significant. (5) NMF1: 100 CpGs with largest weights in the basis NMF component with the most significant Cox-statistic. (6) NMF2: 100 CpGs with largest weights in the two basis NMF components with the most significant Cox-statistics (top 50 CpGs selected from each).


We have also seen that incorporating a clustering inference step in a univariate
fashion to prioritize ranked features outperformed feature selection done via a popular
multivariate dimensional reduction method (sparse NMF). Although we recently
demonstrated the power of NMF as an unsupervised dimensional reduction method,28
it is important to realize that NMF is not designed to identify individual features since
these are not easily inferred from the estimated NMF basis vectors. Thus, NMF, being
a multivariate dimensional reduction method, lacks the plasticity and hence power
required to identify all the features correlating with the phenotype of interest.

To infer structure in the DNA methylation profiles we used the VBBMM model.
The advantages of using a variational Bayes approach over EM+BIC are well
documented,41,48 and here we have confirmed, in the novel context of DNA methylation
data, that VBBMM significantly outperforms EM+BIC in terms of sensitivity
and positive predictive value. Importantly, these improvements are obtained
at a reduced computational cost. For instance, with EM+BIC, it took about 10
minutes to run the algorithm for one feature, with up to six mixture components and
using 10 different initializations on an Intel Core i7-2720QM CPU at
2.20 GHz. In contrast, using VB, it took only about 20 to 30 seconds to run one
feature, with a maximum of six components and a total of 10 different runs/initializations.
Thus, the VB framework speeds up the analysis by more than a factor of 10,
which is an important consideration if it is to be applied as a feature selection step in
large omic data sets. For instance, upcoming epigenome-wide association studies
(EWAS) are using methylation beadchips with over 100,000 features.62 Parallelizing
the VB computation on an 8-core workstation would therefore take approximately four
days, or if access to a 30–40 node cluster is possible, the computation would take less
than a day. In contrast, using EM+BIC it would take 10 days even on a 30–40 node
cluster. Another important consideration is that the variational inference framework
may allow for further substantial speed enhancements (by factors of about 10), through
use of information-geometric optimization methods.63,64 Hence, with VBBMM, even
a 100,000 feature data matrix would be manageable with an 8-core workstation. We
can conclude therefore that variational Bayes methods make the application of
beta-mixture models practical, in contrast to the EM+BIC framework which makes the
associated lengthy computations far less manageable.
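
The VB runtime estimates above follow from simple arithmetic, which the sketch below makes explicit. It is a back-of-envelope check only: it assumes perfectly parallel scaling with one feature per worker at a time, and the function name wall_clock_days and the mid-range value of 25 seconds per feature are assumptions introduced here.

```python
SECONDS_PER_DAY = 86400.0


def wall_clock_days(n_features, seconds_per_feature, n_workers):
    """Idealized wall-clock time in days, assuming perfect parallel scaling."""
    return n_features * seconds_per_feature / n_workers / SECONDS_PER_DAY


n_features = 100_000  # array size quoted in the text

# VB at a mid-range 25 s per feature (the text reports about 20 to 30 s).
vb_workstation = wall_clock_days(n_features, 25, 8)   # 8-core workstation
vb_cluster = wall_clock_days(n_features, 25, 35)      # mid-sized 30-40 node cluster

# Per-feature speed-up of VB over EM+BIC (10 min = 600 s per feature).
speedup_low = 600 / 30
speedup_high = 600 / 20

print(f"VB, 8 cores:   {vb_workstation:.1f} days")
print(f"VB, 35 nodes:  {vb_cluster:.1f} days")
print(f"speed-up:      {speedup_low:.0f}x to {speedup_high:.0f}x")
```

Under these assumptions the per-feature timings correspond to a 20- to 30-fold speed-up, and the workstation estimate comes out at roughly three and a half days, in line with the "approximately four days" quoted above.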

It is important to also point out that in this work we have only explored the
VBBMM as a feature selection step, i.e. inferring structure in one-dimensional DNA
methylation profiles to identify true positives more reliably. One may also wish to
apply the VBBMM to cluster samples over more than one feature/dimension.
However, analytically, it is not yet possible to fully incorporate the covariance
structure of the features in the inference procedure, which thus precludes application
to clustering over more than one dimension. We leave this interesting and challenging
question for a future investigation.

In summary, given that DNA methylation biomarkers for improved prognosis
and/or early detection of cancers are likely to be characterized by small effect
sizes,23,43,44 it is important to have powerful statistical algorithms in place that can


help discern true from false positives. Thus, the variational Bayes beta mixture
model presented here should be of interest to any study embarking on DNA methylation
profiling including upcoming EWAS.23

Note Added

Legends to Supplementary Data

Supplementary Table S1

The list of 138 CpGs mapping to genes that are overexpressed/underexpressed in
granulocytes and lymphocytes. We provide the CpG probe identifier, the Entrez ID,
the Gene Symbol and whether the gene is over- or underexpressed in granulocytes/lymphocytes.

Supplementary Table S2

The list of 30 validated prognostic CpGs in breast cancer (28 hypomethylated in poor

prognosis and 2 hypermethylated in poor prognosis). We provide the CpG ID, the

Entrez ID, the gene symbol, the hazard ratio (HR), associated Cox z-statistic, P-value

and number of samples with clinical annotation in both discovery and test sets.

Supplementary Figure S1

Observed adjusted Rand Index values for the top 100 CpGs ranked according to the
Cox-statistic and adjusted Rand Index in the discovery set (DataSet2). The null
distributions of the adjusted Rand Index values are shown as boxplots in black and
were estimated from 1000 Monte Carlo runs (randomising phenotype labels). The
observed values are colored in red if the adjusted Rand Index value is significant at a
nominal P < 0.05 level, or in green if less significant. For the top 100 CpGs, all
adjusted Rand Index values had P-values less than 0.12.
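
A null distribution of this kind can be generated by permuting the phenotype labels, as in the minimal sketch below. It is an assumption-laden illustration rather than the code behind the figure: the function names, the seed handling and the add-one P-value estimator are choices made here, and a compact ARI is included only so that the block runs by itself (it assumes at least two samples).

```python
import random
from collections import Counter
from math import comb


def ari(a, b):
    """Adjusted Rand Index between two equal-length label sequences."""
    total = comb(len(a), 2)  # assumes at least two samples
    index = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 0.0  # degenerate case, set to 0 by assumption
    return (index - expected) / (max_index - expected)


def permutation_p_value(clusters, phenotype, n_perm=1000, seed=0):
    """Observed ARI and its Monte Carlo P-value under label randomization."""
    rng = random.Random(seed)
    observed = ari(clusters, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize phenotype labels in place
        if ari(clusters, shuffled) >= observed:
            hits += 1
    # Add-one estimator keeps the P-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

With 1000 randomizations the smallest attainable P-value under this estimator is 1/1001, so finer resolution would require more Monte Carlo runs.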

Supplementary Figure S2

(A–B) Scatterplots of Cox-statistics (z-stat) in discovery set (DataSet2) (x-axis)
against those in the validation set (DataSet3) (y-axis). (A) shows the statistics for
the top 100 ranked CpGs according to a Cox-regression and the adjusted Rand Index
from the VB clustering in the discovery set. (B) shows the statistics for the 100 CpGs
bottom ranked by the adjusted Rand Index in discovery set. (C–D) As (A–B), but
panel (C) shows the scatterplot for the top 100 CpGs ranked only according to the
Cox-statistic in the discovery set, while panel (D) shows the scatterplot for the
bottom 100 Cox-ranked features. In all panels, the number of CpGs passing statistical
significance (Cox P-value P < 0.1) in the test set is indicated in black.
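
The counting of validated CpGs in these panels can be sketched as follows, assuming the Cox-statistics are Wald z-statistics with a standard normal null. The function names are hypothetical, and the requirement that the sign of the statistic agree between discovery and test sets is an assumption made here (exposed as the same_direction flag), not a rule stated in the legend.

```python
from math import erfc, sqrt


def wald_p_value(z):
    """Two-sided P-value of a Wald z-statistic under a standard normal null."""
    return erfc(abs(z) / sqrt(2.0))


def count_validated(z_discovery, z_test, alpha=0.1, same_direction=True):
    """Count features whose test-set statistic passes the P-value threshold.

    same_direction=True additionally requires the test-set sign to match the
    discovery-set sign (an assumption of this sketch).
    """
    count = 0
    for z_d, z_t in zip(z_discovery, z_test):
        significant = wald_p_value(z_t) < alpha
        concordant = (z_d * z_t > 0) or not same_direction
        if significant and concordant:
            count += 1
    return count


# Toy z-statistics for three hypothetical CpGs (illustrative values only).
z_disc = [2.5, -3.0, 2.1]
z_test = [1.8, -0.5, -2.2]
n_validated = count_validated(z_disc, z_test, alpha=0.1)
```

Setting alpha=0.05 gives the more stringent threshold used for the corresponding sensitivity analysis.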

Supplementary Figure S3

Same as Supplementary Figure S2, but now using a Cox P-value threshold of
P < 0.05 in the test set.


Acknowledgments

AET is supported by a Heller Research Fellowship. ZM is partly supported by internal
KTH funding. We wish to thank Martin Widschwendter for useful discussions.

References

1. Sawyers CL, The cancer biomarker problem, Nature 452:548–552, 2008.
2. Leek JT, Storey JD, Capturing heterogeneity in gene expression studies by surrogate variable analysis, PLoS Genet 3:1724–1735, 2007.
3. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A, False discovery rate, sensitivity and sample size for microarray studies, Bioinformatics 21:3017–3024, 2005.
4. Leek JT, Storey JD, A general framework for multiple testing dependence, Proc Natl Acad Sci USA 105:18718–18723, 2008.
5. Bourgon R, Gentleman R, Huber W, Independent filtering increases detection power for high-throughput experiments, Proc Natl Acad Sci USA 107:9546–9551, 2010.
6. Bair E, Tibshirani R, Semi-supervised methods to predict patient survival from gene expression data, PLoS Biol 2:E108, 2004.
7. Teschendorff AE, Naderi A, Barbosa-Morais NL, Caldas C, Pack: Profile analysis using clustering and kurtosis to find molecular classifiers in cancer, Bioinformatics 22:2269–2275, 2006.
8. Li L, Chaudhuri A, Chant J, Tang Z, Padge: Analysis of heterogeneous patterns of differential gene expression, Physiol Genomics 32:154–159, 2007.
9. Wang J, Wen S, Symmans WF, Pusztai L, Coombes KR, The bimodality index: A criterion for discovering and ranking bimodal signatures from cancer gene expression profiling data, Cancer Inform 7:199–216, 2009.
10. Hellwig B, Hengstler JG, Schmidt M, Gehrmann MC, Schormann W, Rahnenführer J, Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes, BMC Bioinformatics 11:276, 2010.
11. Bessarabova M, Kirillov E, Shi W, Bugrim A, Nikolsky Y, Nikolskaya T, Bimodal gene expression patterns in breast cancer, BMC Genomics 11:S8, 2010.
12. Mpindi JP, Sara H, Haapa-Paananen S, Kilpinen S, Pisto T, Bucher E, Ojala K, Iljin K, Vainio P, Björkman M, Gupta S, Kohonen P, Nees M, Kallioniemi O, Gti: A novel algorithm for identifying outlier gene expression profiles from integrated microarray datasets, PLoS One 6:e17259, 2011.
13. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun XW, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM, Recurrent fusion of tmprss2 and ets transcription factor genes in prostate cancer, Science 310:644–648, 2005.
14. Teschendorff AE, Journe M, Absil PA, Sepulchre R, Caldas C, Elucidating the altered transcriptional programs in breast cancer using independent component analysis, PLoS Comput Biol 3:e161, 2007.
15. Colombo PE, Milanezi F, Weigelt B, Reis-Filho JS, Microarrays in the 2010s: The contribution of microarray-based gene expression profiling to breast cancer classification, prognostication and prediction, Breast Cancer Res 13:212, 2011.
16. Mosquera JM, Mehra R, Regan MM, Perner S, Genega EM, Bueti G, Shah RB, Gaston S, Tomlins SA, Wei JT, Kearney MC, Johnson LA, Tang JM, Chinnaiyan AM, Rubin MA, Sanda MG, Prevalence of tmprss2-erg fusion prostate cancer among men undergoing prostate biopsy in the united states, Clin Cancer Res 15:4706–4711, 2009.


17. Teschendorff AE, Wang Y, Barbosa-Morais NL, Brenton JD, Caldas C, A variational bayesian mixture modelling framework for cluster analysis of gene-expression data, Bioinformatics 21:3025–3033, 2005.
18. Baylin SB, Ohm JE, Epigenetic gene silencing in cancer - A mechanism for early oncogenic pathway addiction? Nat Rev Cancer 6:107–116, 2006.
19. Feinberg AP, Ohlsson R, Henikoff S, The epigenetic progenitor origin of human cancer, Nat Rev Genet 7:21–33, 2006.
20. Jones PA, Baylin SB, The epigenomics of cancer, Cell 128:683–692, 2007.
21. Petronis A, Epigenetics as a unifying principle in the aetiology of complex traits and diseases, Nature 465:721–727, 2010.
22. Feinberg AP, Epigenomics reveals a functional genome anatomy and a new approach to common disease, Nat Biotechnol 28:1049–1052, 2010.
23. Rakyan VK, Down TA, Balding DJ, Beck S, Epigenome-wide association studies for common human diseases, Nat Rev Genet 12:529–541, 2011.
24. Teschendorff AE, Jones A, Fiegl H, Sargent A, Zhuang JJ, Kitchener HC, Widschwendter M, Epigenetic variability in cells of normal cytology is associated with the risk of future morphological transformation, Genome Med 4:24, 2012.
25. Bibikova M, Fan JB, Genome-wide dna methylation profiling, Wiley Interdiscip Rev Syst Biol Med 2:210–223, 2010.
26. Sandoval J, Heyn H, Moran S, Serra-Musach J, Pujana MA, Bibikova M, Esteller M, Validation of a dna methylation microarray for 450,000 cpg sites in the human genome, Epigenetics 6:692–702, 2011.
27. Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, Lin SM, Comparison of beta-value and m-value methods for quantifying methylation levels by microarray analysis, BMC Bioinformatics 11:587, 2010.
28. Zhuang J, Widschwendter M, Teschendorff AE, A comparison of feature selection and classification methods in dna methylation studies using the illumina 27k platform, BMC Bioinformatics 13:59, 2012.
29. Barfield RT, Kilaru V, Smith AK, Conneely KN, CpGassoc: An R function for analysis of DNA methylation microarray data, Bioinformatics 28(9):1280–1, 2012.
30. Kilaru V, Barfield RT, Schroeder JW, Smith AK, Conneely KN, Methlab: A graphical user interface package for the analysis of array-based dna methylation data, Epigenetics 7:225–229, 2012.
31. Laurila K, Oster B, Andersen CL, Lamy P, Orntoft T, Yli-Harja O, Wiuf C, A beta-mixture model for dimensionality reduction, sample classification and analysis, BMC Bioinformatics 12:215, 2011.
32. Koestler DC, Marsit CJ, Christensen BC, Karagas MR, Bueno R, Sugarbaker DJ, Kelsey KT, Houseman EA, Semi-supervised recursively partitioned mixture models for identifying cancer subtypes, Bioinformatics 26:2578–2585, 2010.
33. Kuan PF, Wang S, Zhou X, Chu H, A statistical framework for illumina dna methylation arrays, Bioinformatics 26:2849–2855, 2010.
34. Houseman EA, Christensen BC, Karagas MR, Wrensch MR, Nelson HH, Wiemels JL, Zheng S, Wiencke JK, Kelsey KT, Marsit CJ, Copy number variation has little impact on bead-array-based measures of dna methylation, Bioinformatics 25:1999–2005, 2009.
35. Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zheng S, Wiencke JK, Kelsey KT, Model-based clustering of dna methylation array data: A recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions, BMC Bioinformatics 9:365, 2008.
36. Ji Y, Wu C, Liu P, Wang J, Coombes KR, Applications of beta-mixture models in bioinformatics, Bioinformatics 21:2118–2122, 2005.


37. Sun H, Wang S, Penalized logistic regression for high-dimensional dna methylation data with case-control studies, Bioinformatics 28:1368–75, 2012.
38. Ma Z, Leijon A, Bayesian estimation of beta mixture models with variational inference, IEEE Trans Pattern Anal Machine Intel 33(11):2160–2173, 2011.
39. Dempster AP, Laird NM, Rubin DB, Maximum likelihood from incomplete data via the em algorithm, J Roy Stat Soc B 39:1–38, 1977.
40. Schwarz G, Estimating the dimension of a model, Ann Stat 6:461–464, 1978.
41. Attias H, Inferring parameters and structure of latent variable models by variational bayes, Proc 15th Conf Uncertainty in Artificial Intelligence, pp. 21–30, 1999.
42. MacKay DJ, Developments in probabilistic modelling with neural networks-ensemble learning, Neural Networks: Artificial Intelligence and Industrial Applications. Proc 3rd Annual Symp on Neural Networks, Springer, Nijmegen, pp. 191–198, 1995.
43. Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Gayther SA, Apostolidou S, Jones A, Lechner M, Beck S, Jacobs IJ, Widschwendter M, An epigenetic signature in peripheral blood predicts active ovarian cancer, PLoS One 4:e8274, 2009.
44. Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Weisenberger DJ, Shen H, Campan M, Noushmehr H, Bell CG, Maxwell AP, Savage DA, Mueller-Holzner E, Marth C, Kocjan G, Gayther SA, Jones A, Beck S, Wagner W, Laird PW, Jacobs IJ, Widschwendter M, Age-dependent dna methylation of genes that are suppressed in stem cells is a hallmark of cancer, Genome Res 20:440–446, 2010.
45. Palmer C, Diehn M, Alizadeh AA, Brown PO, Cell-type specific gene expression profiles of leukocytes in human peripheral blood, BMC Genomics 7:115, 2006.
46. Zhuang J, Jones A, Lee SH, Ng E, Fiegl H, Zikan M, Cibula D, Sargent A, Salvesen HB, Jacobs IJ, Kitchener HC, Teschendorff AE, Widschwendter M, The dynamics and prognostic potential of dna methylation changes at stem cell gene loci in women's cancer, PLoS Genet 8:e1002517, 2012.
47. Fackler MJ, Umbricht CB, Williams D, Argani P, Cruz LA, Merino VF, Teo WW, Zhang Z, Huang P, Visvananthan K, Marks J, Ethier S, Gray JW, Wolff AC, Cope LM, Sukumar S, Genome-wide methylation analysis identifies genes specific to breast cancer hormone receptor status and risk of recurrence, Cancer Res 71:6195–6207, 2011.
48. Bishop CM, Pattern Recognition and Machine Learning, Springer, New York, 2006.
49. Jordan MI, Learning in Graphical Models, MIT Press, Boston, 1999.
50. Ma Z, Leijon A, Beta mixture models and the application to image classification, Proc Int Conf Image Processing, pp. 2045–2048, 2009.
51. Satomi A, Murakami S, Ishida K, Mastuki M, Hashimoto T, Sonoda M, Significance of increased neutrophils in patients with advanced colorectal cancer, Acta Oncol 34(1):69–73, 1995.
52. Yamanaka T, Matsumoto S, Teramukai S, Ishiwata R, Nagai Y, Fukushima M, The baseline ratio of neutrophils to lymphocytes is associated with patient prognosis in advanced gastric cancer, Oncology 73(3–4):215–220, 2007.
53. Rand WM, Objective criteria for the evaluation of clustering methods, J American Stat Assoc 66(336):846–850, 1971.
54. Hubert L, Comparing partitions, J Classif 2:193–218, 1985.
55. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL, Model-based clustering and data transformations for gene expression data, Bioinformatics 17:977–987, 2001.
56. TCGA, Comprehensive genomic characterization defines human glioblastoma genes and core pathways, Nature 455:1061–1068, 2008.
57. TCGA, Integrated genomic analyses of ovarian carcinoma, Nature 474:609–615, 2011.
58. Gaujoux R, Seoighe C, A flexible r package for nonnegative matrix factorization, BMC Bioinformatics 11:367, 2010.


59. Wu CH, Tang SC, Wang PH, Lee H, Ko JL, Nickel-induced epithelial-mesenchymal transition by reactive oxygen species generation and e-cadherin promoter hypermethylation, J Biol Chem 287:25292–25302, 2012.
60. Creighton CJ, Li X, Landis M, Dixon JM, Neumeister VM, Sjolund A, Rimm DL, Wong H, Rodriguez A, Herschkowitz JI, Fan C, Zhang X, He X, Pavlick A, Gutierrez MC, Renshaw L, Larionov AA, Faratian D, Hilsenbeck SG, Perou CM, Lewis MT, Rosen JM, Chang JC, Residual breast cancers after conventional therapy display mesenchymal as well as tumor-initiating features, Proc Natl Acad Sci USA 106:13820–13825, 2009.
61. Creighton CJ, Chang JC, Rosen JM, Epithelial-mesenchymal transition (emt) in tumor-initiating cells and its clinical implications in breast cancer, J Mammary Gland Biol Neoplasia 15:253–260, 2010.
62. Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, Delano D, Zhang L, Schroth GP, Gunderson KL, Fan JB, Shen R, High density dna methylation array with single cpg site resolution, Genomics 98:288–295, 2011.
63. Girolami M, Calderhead B, Riemann manifold langevin and hamiltonian monte carlo methods, J Royal Stat Society: Series B (Statistical Methodology) 73(2):123–214, 2011.
64. Hensman J, Rattray M, Lawrence ND, Fast variational inference in the conjugate exponential family, Arxiv preprint arXiv:1206.5162, 2012.

Zhanyu Ma received his M.Eng. degree in Signal and Information
Processing from BUPT (Beijing University of Posts and Telecommunications),
China, and his Ph.D. degree in Electrical
Engineering from KTH (Royal Institute of Technology), Sweden,
in 2007 and 2011, respectively. Since 2012, he has been a postdoctoral
researcher in the School of Electrical Engineering, KTH, Sweden.
His research interests include statistical modeling and machine
learning related topics with a focus on applications in speech
processing, image processing, and bioinformatics.

Andrew E Teschendorff received his B.Sc. in Mathematical
Physics from Edinburgh University and his Ph.D. in Theoretical
Particle Physics from the University of Cambridge, UK. He now
leads the Statistical Cancer Genomics group at the UCL Cancer
Institute, University College London, UK. His research interests
include statistical genomics and epigenomics with a focus on
applications to cancer, as well as network physics.
