Biclustering with heterogeneous variance

Guanhua Chen, Patrick F. Sullivan, and Michael R. Kosorok

Departments of Biostatistics, Genetics, Psychiatry, and Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599

Edited by Xiaotong Shen, University of Minnesota, Minneapolis, MN, and accepted by the Editorial Board June 4, 2013 (received for review March 7, 2013)
In cancer research, as in all of medicine, it is important to classify patients into etiologically and therapeutically relevant subtypes to improve diagnosis and treatment. One way to do this is to use clustering methods to find subgroups of homogeneous individuals based on genetic profiles, together with heuristic clinical analysis. A notable drawback of existing clustering methods is that they ignore the possibility that the variance of gene expression profile measurements can be heterogeneous across subgroups, and methods that do not consider heterogeneity of variance can lead to inaccurate subgroup prediction. Research has shown that hypervariability is a common feature among cancer subtypes. In this paper, we present a statistical approach that can capture both mean and variance structure in genetic data. We demonstrate the strength of our method in both synthetic data and in two cancer data sets. In particular, our method confirms the hypervariability of methylation level in cancer patients, and it detects clearer subgroup patterns in lung cancer data.
Clustering is an important type of unsupervised learning algorithm for data exploration. Successful examples include K-means clustering and hierarchical clustering, both of which are widely used in biological research to find cancer subtypes and to stratify patients. These and other traditional clustering algorithms depend on distances calculated using all of the features. For example, individuals can be clustered into homogeneous groups by minimizing the sum of within-cluster sums of squares (the Euclidean distances) of their gene expression profiles. Unfortunately, this strategy is ineffective when only a subset of features is informative. This phenomenon can be demonstrated by K-means clustering (1) results for a toy example using only the variables that determine the underlying true clusters, compared with using all variables (which include many uninformative variables). As can be seen in Fig. 1, clustering performance is poor when all variables are used in the clustering algorithm (2).

To solve this problem, sparse clustering methods have been proposed to allow clustering decisions to depend on only a subset of feature variables (the property of sparsity). Prominent sparse clustering methods include sparse principal component analysis (PCA) (3-5) and Sparse K-means (2), among others (6). However, sparse clustering still fails if the true sparsity is a local rather than a global phenomenon (6). More specifically, different subsets of features can be informative for some samples but not all samples; in other words, sparsity exists in both features and samples jointly. Biclustering methods are a potential solution to this problem, and further generalize the sparsity principle by considering samples and features as exchangeable concepts to handle local sparsity (6, 7). For example, gene expression data can be represented as a matrix with genes as columns and subjects as rows (with various and possibly unknown diseases or tissue types). Traditional methods will either cluster the rows (as done, for example, in microarray research, where researchers want to find subpopulation structure among subjects to identify possible common disease status) or cluster the columns (as done, for example, in gene clustering research, where genes are of interest and the goal is to predict the biological function of novel genes from the function of other well-studied genes within the same clusters). In contrast, biclustering involves clustering rows and columns simultaneously to account for the interaction of row
and column sparsity. This local sparsity perspective provides an intuition for using sparse singular value decomposition (SSVD) algorithms for biclustering (8-11). SSVD assumes that the signal in the data matrix can be represented by a low-rank matrix X ≈ UDV^T = Σ_{i=1}^{r} d_i u_i v_i^T, with X ∈ ℜ^{n×p}. U = [u_1, u_2, ..., u_r] ∈ ℜ^{n×r} and V = [v_1, v_2, ..., v_r] ∈ ℜ^{p×r} contain the left and right sparse singular vectors and are orthonormal with only a few nonzero elements (corresponding to local sparsity). D ∈ ℜ^{r×r} is diagonal (with diagonal elements d_1, d_2, ..., d_r), with r ≤ rank(X). The outer product of each pair of sparse singular vectors (u_i v_i^T, i = 1, 2, ..., r) designates two biclusters, corresponding to positive and negative elements, respectively.

A common assumption of existing SSVD biclustering methods
is that the observed data can be decomposed into a signal matrix plus a fully exchangeable random noise matrix:

X = Ξ + Φ,   [1]

where X is the observed data, Ξ = (ξ_ij) is an n × p matrix representing the signal, and Φ = (φ_ij) is an n × p random noise/residual matrix with independent and identically distributed (i.i.d.) entries (10, 12, 13). A method based on model 1 is proposed in ref. 9, which minimizes the sum of the Frobenius norm of X − Ξ̂ and a penalty function with variable selection, such as the ℓ1 norm (14) or smoothly clipped absolute deviation (15). A similar loss-plus-penalty minimization approach can be seen in ref. 11. A different method for SSVD, in ref. 10, employs iterative-thresholding QR decomposition to estimate Ξ̂. We refer to ref. 9 as LSHM (for Lee, Shen, Huang, and Marron) and to ref. 10 as fast iterative thresholding for SSVD (FIT-SSVD), and compare these approaches to our method. An alternative, more direct approach is based on a mixture model (16, 17). For example, ref. 17 defines a bicluster as a submatrix with a large positive or negative mean. Although these approaches have proven successful in some settings, they are limited by their focus on approximating only the mean signal. In addition, the explicit homogeneous residual variance assumption is too restrictive in many applications.

Author contributions: G.C., P.F.S., and M.R.K. designed research; G.C., P.F.S., and M.R.K. performed research; G.C. and M.R.K. contributed new reagents/analytic tools; G.C. analyzed data; and G.C., P.F.S., and M.R.K. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission. X.S. is a guest editor invited by the Editorial Board.
Freely available online through the PNAS open access option.
To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1304376110/-/DCSupplemental.

To our knowledge, the only extension of the traditional model given in [1] is the generalized PCA approach (18), which assumes that if the random noise matrix were stacked into a vector, vec(Φ), it would have mean 0 and variance R⁻¹ ⊗ Q⁻¹, where R⁻¹ is the common covariance structure of the random variables within the same column, and Q⁻¹ is the common covariance structure of the random variables within the same row. This approach is especially suited to denoising NMR data, for which there is a natural covariance structure of the form given above (18). Drawbacks of the generalized PCA method, however, are
that it remains focused on mean signal approximation and that the structure of R⁻¹ and Q⁻¹ must be explicitly known in advance.

In this paper, we present a biclustering framework based on SSVD called heterogeneous sparse singular value decomposition (HSSVD). This method can detect both mean biclusters and variance biclusters in the presence of unknown heterogeneous residual variance. We apply our method, as well as competing approaches, to two cancer data sets, one with methylation data and the other with gene expression data. Our method delivers more distinct genetic profile pattern detection and is able to confirm the biological findings originally made for each of the data sets. We also apply our method and the competing approaches to synthetic data to compare their performance quantitatively. We demonstrate that our proposed method is robust, location- and scale-invariant, and computationally feasible.
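Before turning to the data examples, the SSVD building block described above can be caricatured in a few lines. This sketch is not the LSHM or FIT-SSVD algorithm: it simply takes the leading singular pair of a noisy matrix and hard-thresholds small entries (the signal height of 3, the half-of-maximum threshold rule, and the matrix sizes are all arbitrary choices for illustration). The nonzero entries of the thresholded u1 and v1 designate the rows and columns of a detected bicluster.

```python
import numpy as np

rng = np.random.default_rng(1)

# Rank-1 mean signal: a 20 x 10 bicluster of height 3 inside 100 x 50 noise.
u = np.zeros(100); u[:20] = 1.0
v = np.zeros(50);  v[:10] = 1.0
X = 3.0 * np.outer(u, v) + rng.normal(0.0, 1.0, (100, 50))

# Leading singular pair, then hard-threshold small entries to mimic sparsity.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
u1, v1 = U[:, 0], Vt[0]
u1 = np.where(np.abs(u1) > 0.5 * np.abs(u1).max(), u1, 0.0)
v1 = np.where(np.abs(v1) > 0.5 * np.abs(v1).max(), v1, 0.0)

rows = np.nonzero(u1)[0]   # detected bicluster rows
cols = np.nonzero(v1)[0]   # detected bicluster columns
```

Real SSVD methods replace the crude threshold with penalized estimation (ℓ1 or smoothly clipped absolute deviation) and iterate over several layers, but the sparsity-through-thresholded-singular-vectors idea is the same.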
Application to Cancer Data
Hypervariability of Methylation in Cancer. We demonstrate the capability of variance bicluster detection with methylation data from cancer versus normal patients (19). The experiments were conducted with a custom nucleotide-specific Illumina bead array to increase the precision of DNA methylation measurements on previously identified cancer-specific differentially methylated regions (cDMRs) in colon cancer (20). The data set (GEO accession: GSE29505) consists of 290 samples, including cancer samples (colon, breast, lung, thyroid, and Wilms' tumor cancers) and matched normal samples. Each sample had 384 methylation probes covering 151 cDMRs. The authors of the primary report concluded that cancer samples had hypervariability in these cDMRs across all cancer types (19).

First, we wish to verify that HSSVD can provide a good mean signal approximation of methylation. In this data set, all of the probes measuring methylation are placed in the cDMRs identified in colon cancer patients. As a result, we would expect mean methylation levels to differ between colon cancer samples and the matched normal samples. Under this assumption, we require the biclustering methods to capture this mean structure before investigating the information gained from variance structure estimation. Note that the numerical range of the methylation level is between 0 and 1. Hence, we applied the logit transformation to the original data before further biclustering analysis. We compare three methods, HSSVD, FIT-SSVD, and LSHM, all based on SVD. Only colon cancer samples and their matched normal samples are used for this particular analysis. In Fig. 2, we can see from the hierarchical clustering analysis that the majority of colon cancer samples (labeled blue in the sidebar) are grouped together and most of the cDMRs are differentially expressed in colon tumor samples compared with normal samples. The conclusion is the same for all three methods compared, including our proposed HSSVD method.

Second, our proposed HSSVD method confirms the most important finding in ref. 19: that cancer samples tended to have hypervariability in methylation level regardless of tumor subtype. We compared the mean approximation and variance approximation results of HSSVD. All samples were used in this analysis. The variance approximation of HSSVD (Fig. 3A) shows that nearly all normal samples have low variance compared with cancer samples, and this pattern is consistent across all cDMRs. Notably, our method provides additional information beyond the conclusion of ref. 19. Specifically, our variance approximation suggests that some cancer samples are not characterized by hypervariability in methylation level for certain cDMRs. More precisely, some cDMRs for a few cancer samples (surrounded by normal samples) are predicted to have low variance (lower left part of Fig. 3A). Our method also highlights the cDMRs with the greatest contrast in variance between cancer and normal samples. The corresponding cDMRs with high contrast variance (especially some of the first and middle columns of Fig. 3A) warrant further study for biological and clinical relevance. We also want to emphasize that the analysis in ref. 19 relies on the disease status information, whereas for HSSVD the disease status is used only for result interpretation. Note that most cancer patients cluster together under hierarchical clustering of the variance approximation from HSSVD. In contrast, clustering the mean approximation from HSSVD in Fig. 3B fails to reveal such a pattern. This indicates that most cancer samples may have hypervariability of methylation as a common feature, whereas their mean-level methylation varies from sample to sample. Hence, identifying variance biclusters can provide potential new insight into cancer epigenesis.

Fig. 1. The data set contains two clusters determined by two variables, X1 and X2, such that points around (1, 1) and (−1, −1) naturally form clusters. There are 200 observations (100 per cluster) and 1,002 variables (X1, X2, and 1,000 random noise variables). We plot the data in the 2D space of X1 and X2. Panels showing the true cluster labels, the labels predicted by clustering on X1 and X2 only, and the labels predicted by clustering on all variables are laid out from left to right. The predicted labels match the true labels only when X1 and X2 alone are used for clustering; performance is much worse when all variables are used.
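The toy example of Fig. 1 can be reconstructed in a few lines. This is an illustrative sketch, not the authors' code: the within-cluster SD of 0.3, the plain Lloyd's-algorithm K-means with farthest-point initialization, and the seed are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two clusters around (1, 1) and (-1, -1) in (X1, X2), plus 1,000 pure-noise
# variables, as in Fig. 1. The within-cluster SD of 0.3 is an assumption.
n_per = 100
signal = np.vstack([rng.normal(1.0, 0.3, (n_per, 2)),
                    rng.normal(-1.0, 0.3, (n_per, 2))])
noise = rng.normal(0.0, 1.0, (2 * n_per, 1000))
X = np.hstack([signal, noise])
truth = np.repeat([0, 1], n_per)

def kmeans(X, k=2, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point init."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def accuracy(pred, truth):
    # Cluster labels are arbitrary, so score the better of the two matchings.
    agree = (pred == truth).mean()
    return max(agree, 1.0 - agree)

acc_informative = accuracy(kmeans(X[:, :2]), truth)  # X1, X2 only
acc_all = accuracy(kmeans(X), truth)                 # all 1,002 variables
```

With only the two informative variables, the labels recover the truth; with all 1,002 variables, the Euclidean distances are dominated by the noise coordinates and accuracy typically degrades, mirroring the left-to-right panels of Fig. 1.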
Gene Expression in Lung Cancer. Some biological settings, in contrast with the methylation example above, do not express variance heterogeneity. Usually, the presence or absence of such heterogeneity is not known in advance for a given research data set. Thus, it is important to verify that the proposed approach remains effective in either case for discovering mean-only biclusters. We now demonstrate that even in settings without variance heterogeneity, HSSVD can better identify discriminative biclusters for different cancer subtypes than other methods, including FIT-SSVD (10), LSHM (9), and traditional SVD. We use a lung cancer data set that has been studied in the statistics literature (9, 10, 17). The samples are a subset of patients (21) having lung cancer, with gene expression measured by the Affymetrix 95av2 GeneChip (22). The data set contains the expression levels of 12,625 genes for 56 patients, each having one of four disease subtypes: normal lung (20 samples), pulmonary carcinoid tumors (13 samples), colon metastases (17 samples), and small-cell carcinoma (6 samples).

The performance of the different methods is evaluated based on the pattern difference of subtypes in the mean approximations. For all methods, we set the rank of the mean signal matrix equal to 3, to maintain consistency with the ranks used in FIT-SSVD (10) and LSHM (9). Further, we use the measurement "support" to evaluate the sparsity of the estimated gene signal (10). Support is the cardinality of the nonzero elements in the right and left singular vectors across the three layers (i.e., support is an integer that cannot exceed the data dimension). Smaller support values suggest a sparser model. Table 1 shows that HSSVD, FIT-SSVD, and LSHM yield similar levels of sparsity in the gene signal, whereas SVD is not sparse, as expected. Fig. 4 shows checkerboard plots of the rank-three approximations by the four methods. Patients are placed on the vertical axis, and the patient order is the same for all images. Patients within the same subtype are stacked together, and different subtypes are separated by white lines. Within each image, genes are laid on the horizontal axis and are ordered by the value of v2 (10). We can see a clear block structure in both the FIT-SSVD and HSSVD methods, indicating biclustering. The block structure suggests that we can discriminate the four cancer subtypes using either the FIT-SSVD or the HSSVD method, whereas LSHM and SVD are unable to achieve such separation among subtypes.
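The "support" measure can be computed directly from estimated singular vectors. A small helper illustrates the definition; the toy vectors below stand in for real sparse estimates:

```python
import numpy as np

def union_support(vectors):
    """Number of coordinates that are nonzero in at least one of the given
    (sparse) singular vectors: the cardinality reported as 'support'."""
    M = np.column_stack(vectors)
    return int(np.any(M != 0, axis=1).sum())

# Toy sparse right singular vectors for three layers; only genes 0 and 2
# carry a nonzero loading in some layer, so the gene support is 2.
v1 = np.array([0.8, 0.0, 0.6, 0.0])
v2 = np.array([0.0, 0.0, 1.0, 0.0])
v3 = np.array([0.5, 0.0, 0.0, 0.0])
gene_support = union_support([v1, v2, v3])
```

Applying the same helper to the left singular vectors gives the sample support; by construction, neither count can exceed the corresponding data dimension.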
Fig. 2. Mean approximation of colon cancer and the matched normal samples. From left to right, the methods are HSSVD, FIT-SSVD, and LSHM. Colon cancer samples are labeled in blue, and matched normal samples are labeled in pink in the sidebar. Genes and samples are ordered by hierarchical clustering. Colon cancer patients are clustered together, which indicates that the mean approximations of these three methods achieve the expected signal structure.

Fig. 3. HSSVD approximation results for all samples. (A) Variance approximation; (B) mean approximation. Blue represents cancer samples, and pink represents normal samples in the sidebar. Genes and samples are ordered by hierarchical clustering. Red represents large values, and green represents small values. Only the variance approximation can discriminate between cancer and normal samples. More importantly, within the same gene, the heatmap for the variance approximation indicates that cancer patients have larger variance than normal individuals. This result matches the conclusion in ref. 19. In addition, the cDMRs with the greatest contrast in variance across cancer and normal samples are highlighted by the variance approximation, whereas the original paper does not provide such information.

Simulation Study
To evaluate the performance of HSSVD quantitatively, we conducted a simulation study. We compared HSSVD with the most relevant existing biclustering methods, FIT-SSVD and LSHM (9, 10). HSSVD includes a rank estimation component, whereas the other methods do not automatically include this. For this reason, we use a fixed oracle rank (at the true value) for the non-HSSVD methods. For comparison, we also evaluate HSSVD with fixed oracle rank (HSSVD-O).

The performance of these methods on simulated data was evaluated on four criteria. The first criterion is "sparsity of estimation," defined as the ratio between the size of the correctly identified background cluster and the size of the true background cluster. The second criterion is "biclustering detection rate," defined as the ratio of the intersection of the estimated bicluster and the true bicluster over their union (also known as the Jaccard index). For the first two criteria, larger values indicate better performance. The third and fourth criteria are "overall matrix approximation errors" for mean and variance biclusters, consisting of the scaled recovery error for the low-rank mean signal matrix Ξ̃ = Ξ + bJ, computed via

L_mean(Ξ̃, Ξ̂) = ||Ξ̂ − Ξ̃||²_F / ||Ξ̃||²_F,

and the scaled recovery error for the low-rank variance signal matrix log(Σ̃) = log(Σ) + log(ρ²J), computed via

L_var(log Σ̃, log Σ̂) = ||log(Σ̂^{1/2}) − log(Σ̃^{1/2})||²_F / ||log(Σ̃^{1/2})||²_F,

with ||·||_F being the Frobenius norm.

The simulated data comprise a 1000 × 100 matrix with independent entries. The background entries follow a normal distribution with mean 1 and SD 2. We denote this distribution N(1, 2²), where N(a, b²) represents a normal random variable with mean a and SD b. There are five nonoverlapping rectangular biclusters: bicluster 1, bicluster 2, and bicluster 5
are mean clusters, bicluster 3 is a mean and small-variance cluster, and bicluster 4 is a large-variance cluster. More precisely, bicluster 1 (size 100 × 20) is generated from N(7, 2²), bicluster 2 (size 100 × 10) from N(−5, 2²), bicluster 3 (size 100 × 10) from N(7, 0.4²), bicluster 4 (size 100 × 20) from N(1, 8²), and bicluster 5 (size 100 × 20) from N(6.8, 2²). The biclustering results are shown in Table 2: HSSVD and HSSVD-O can detect both mean and variance biclusters, whereas FIT-SSVD-O and LSHM-O can detect only mean biclusters (where "O" stands for oracle input bicluster number). For mean bicluster detection, all methods performed well, with biclustering detection rates all greater than 0.7. For variance bicluster detection, HSSVD and HSSVD-O deliver a similar biclustering detection rate. On average, the computation time of LSHM-O is about 30 times that of HSSVD and 60 times that of FIT-SSVD-O.

Both FIT-SSVD and LSHM are provided with the oracle rank as input. We also evaluated an automated-rank version of these methods but found the performance worse than that of the corresponding oracle-rank version (results not shown). Note that the input data are standardized element-wise to mean 0 and SD 1 for FIT-SSVD-O and LSHM-O. Although this step is not mentioned in the original papers (9, 10), this simple procedure is critical for accurate mean bicluster detection. From Table 2, we can see that HSSVD-O provides the best overall performance, while HSSVD is close to the best; however, in practice, the oracle rank is unknown. For this reason, HSSVD is the only fully automated approach among those considered that delivers robust mean and variance detection in the presence of unknown heterogeneous residual variance.
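The detection rate and recovery errors above are straightforward to code. The following sketch shows both; the bicluster index sets are made-up inputs for illustration, not results from any method:

```python
import numpy as np

def jaccard(rows_a, cols_a, rows_b, cols_b, shape):
    """Biclustering detection rate: |A intersect B| / |A union B| over cells."""
    A = np.zeros(shape, dtype=bool); A[np.ix_(list(rows_a), list(cols_a))] = True
    B = np.zeros(shape, dtype=bool); B[np.ix_(list(rows_b), list(cols_b))] = True
    return np.logical_and(A, B).sum() / np.logical_or(A, B).sum()

def scaled_recovery_error(est, target):
    """||est - target||_F^2 / ||target||_F^2, the form of L_mean and L_var."""
    return np.linalg.norm(est - target) ** 2 / np.linalg.norm(target) ** 2

# An estimated bicluster half-overlapping the true one on a 1000 x 100 grid:
# intersection 50 x 20 cells, union 150 x 20 cells, so the rate is 1/3.
rate = jaccard(range(0, 100), range(0, 20),
               range(50, 150), range(0, 20), (1000, 100))
```

For the variance criterion, the same `scaled_recovery_error` would be applied to the element-wise logs of the square-root variance matrices, as in the L_var formula above.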
Conclusion and DiscussionIn this paper, we introduced HSSVD, a statistical framework and itsimplementation, to detect biclusters with potentially heterogeneous
HSSVD FIT-SSVD
LSHM SVD
case
s
case
sca
ses
case
s
genes
genes genes
genes
Fig. 4. Checkerboard plots for four methods. We plot the rank-three approximation for each method. Within each image, samples are laid in rows, andgenes are in columns. We order the samples by subtype for all images (top to bottom: carcinoid, colon, normal, and small cell), and different subtypes areseparated by white lines. Genes are sorted by the estimated second right singular vector ðu2Þ, and we only included genes that are in the support (defined inTable 1). Across all methods, the HSSVD and FIT-SSVD methods provide the clearest block structure reflecting biclusters.
Table 1. Cardinality of union support of the first three singularvectors for different methods applied on lung cancer data
variances. Compared with existing methods, HSSVD is both scaleinvariant and rotation invariant (as the quantity for scaling is thesame for all matrix entries and does not vary by row or column).HSSVD also has the advantage of working on the log scale(Materials and Methods) in estimating the variance components: thelog scale makes detection of low-variance (less than 1) biclusterspossible, and any traditional SSVDmethod can be naturally used inour variance detection steps. This method confirms the existence ofmethylation hypervariability in the methylation data example. Al-though we use the FIT-SSVDmethod in our implementation, otherlow-rank matrix approximation methods are applicable. Moreover,the software implementing our proposed approach was compu-tationally comparable to the other approaches we evaluated.A potential shortcoming of SVD-based methods is their in-
ability to detect overlapping biclusters. We investigate thisproblem in the first paragraph of SI Materials and Methods. Weshow that our method can serve as a denoising process foroverlapping bicluster detection. In particular, we can first applythe HSSVD method on the raw data to obtain the mean ap-proximation. Then we can apply a suitable approach, such as thewidely used plaid model (16, 23), on the mean approximation todetect overlapping biclusters. This combined procedure improveson the performance of the plaid model when the overlappingbiclusters have heterogeneous variance. Hence, our methodremains useful in the present of overlapping biclusters.Another potential issue for HSSVD is the question of whether
a low-rank mean approximation plus a low-rank variance ap-proximation could be alternatively represented by a higher-rankmean approximation. In other words, is it possible to detectvariance biclusters through mean biclusters only, even thoughthe mean clusters that form the variance clusters would bepseudomean clusters? A detailed discussion of this issue can befound in the second paragraph of SI Materials and Methods. Ourconclusion is that the variance detection step in HSSVD isnecessary for the following two reasons: First, pseudomeanbiclusters are completely unable to capture small variancebiclusters. Second, although pseudomean biclusters are able tocapture some structure from large variance biclusters, suchstructure is much less accurate than that provided by HSSVD,and can be confounded with one or more true mean biclusters.Although HSSVD works well in practice, there are a number
of open questions that are important to address in future studies. For example, it would be worthwhile to modify the method to allow nonnegative matrix approximations to better handle count data such as next-generation sequencing data (RNA-seq). Additionally, the ability to incorporate data from multiple "omic" platforms is becoming increasingly important in current biomedical research, and it would be useful to extend this work to simultaneous analysis of methylation, gene expression, and microRNA data.
Materials and Methods

Model Assumptions for HSSVD. We define biclusters as subsets of the data matrix which have the same mean and variance. We assume that there exists a dominant null cluster in which all elements have a common mean and variance, and that all other biclusters are restricted to rectangular structures which have either a distinct mean or a distinct variance compared with the null cluster. We can also express our model in the framework of a random effects model wherein
X = Ξ + ρ(Σ × Φ) + bJ, [2]
where X and Ξ are the same structures given in the traditional model 1, and where we require Φ, an n × p matrix, to have i.i.d. random components with mean 0 and variance 1. Moreover, the "×" in [2] is defined element-wise; see the next section for details. Added components in the model include Σ = (σij), an n × p matrix representing the heterogeneous variance signal; Jn×p, an n × p matrix with all values equal to 1; ρ, a finite positive number serving as a common scale factor; and b, a finite number serving as a common location factor. We also make the sparsity assumption that the majority of (ξij) values are 0 and the majority of (σij) values are 1. Further, just as we assumed for the mean structure Ξ, we also assume that the variance structure Σ is low rank.
From the definitions, the traditional model 1 is a special case of our model 2, with b = 0, Σ = J, and ρ = 1. The presence of b and ρ in the model allows the corresponding method to be scale invariant, while the presence of Σ enables us to incorporate heterogeneous variance signals.
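As a concrete illustration, data from model 2 can be simulated directly. The numpy sketch below generates a matrix with one mean bicluster and one high-variance bicluster against a null background; the block locations, signal sizes, and Gaussian noise distribution are our own illustrative choices, not specifications from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 80

# Sparse mean signal Xi: one mean bicluster, zero elsewhere.
Xi = np.zeros((n, p))
Xi[:20, :15] = 2.0

# Variance signal Sigma: mostly 1, with one inflated-variance bicluster.
Sigma = np.ones((n, p))
Sigma[50:80, 40:70] = 3.0

rho, b = 1.5, 0.5                    # common scale and location factors
Phi = rng.standard_normal((n, p))    # i.i.d. mean-0, variance-1 components

# Model [2]: X = Xi + rho * (Sigma x Phi) + b * J, with "x" element-wise.
X = Xi + rho * Sigma * Phi + b * np.ones((n, p))
```

On the null-cluster cells (where Xi is 0 and Sigma is 1), the entries of X have mean b and standard deviation rho, which is what the background estimation step of HSSVD later recovers.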
HSSVD Method. We propose HSSVD based on model 2, with a hierarchical structure for signal recovery. First, we properly scale the matrix elements to minimize false detection of pseudomean biclusters, which can arise as artifacts of high-variance clusters; this motivates the quadratic rescaling step in the procedure. We then detect mean biclusters based on the scaled data, and subsequently detect variance biclusters based on the logarithm of the squared residual data after subtracting out the mean biclusters. The quadratic rescaling step works well in practice, as shown in the simulation studies and data analysis. The pseudocode for the algorithm is provided as follows:
1. Input step: Input the raw data matrix Xorigin. Standardize Xorigin (treating each cell as i.i.d.) to have mean 0 and variance 1. Denote the overall mean of Xorigin as μ and the overall SD as σ, and let the standardized matrix be defined as X = (Xorigin − μJ)/σ.
2. Quadratic rescaling: Apply SSVD on X² − J to obtain the approximation matrix U.
3. Mean search: Let Y = X / √(U + J − cJ), where c is a small nonpositive constant chosen to ensure that √(U + J − cJ) exists. Then, apply SSVD on Y to obtain the approximation matrix Ỹ.
4. Variance search: Let Zorigin = log(X − Ỹ × √(U + J − cJ))², center Zorigin to have mean 0, and denote the centered version as Z. Perform SSVD on Z to obtain the approximation matrix Z̃.
5. Background estimation: Let P = (pij) denote the n × p matrix of indicators of whether the corresponding cells belong to the background cluster, with pij = 1 if both Ỹij = 0 and Z̃ij = 0, and pij = 0 otherwise. Based on the assumption that most elements of the matrix should be in the null cluster, we can estimate b with 1′(Xorigin × P)1 / (1′P1) and ρ² with 1′(Xorigin × P − bP)²1 / (1′P1 − 1), where 1 is a vector with all elements equal to 1.

Table 2. Comparison of four methods in the simulation study. Lmean and Lvar measure the difference between the approximated signal and the true signal, so smaller is better. For the other measures of accuracy of bicluster detection, larger is better. The rows BLK1 to BLK5 report the bicluster detection rate for each bicluster; "-O" indicates that the oracle rank is provided.

Chen et al. PNAS | July 23, 2013 | vol. 110 | no. 30 | 12257
6. Scale back: Define P1 = (pij), with pij = 1 if Ỹij = 0 and pij = 0 otherwise. Similarly, define P2 = (pij), with pij = 1 if Z̃ij = 0 and pij = 0 otherwise. The mean (Ξ + bJ) approximation is computed as σ(Ỹ × √(U + J − cJ)) + μ(J − P1) + bP1, and the variance (ρ²Σ²) approximation is computed as [ρ²P2 + σ²(J − P2)] × exp(Z̃).
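The six steps above can be sketched end to end in Python. Since FIT-SSVD is not reproduced here, a plain truncated SVD with hard thresholding stands in for the SSVD subroutine; this substitute, the default rank of 1, the threshold rule, and the particular choice of c are all our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ssvd_approx(M, rank=1, thresh=None):
    """Stand-in for SSVD: truncated SVD followed by hard thresholding.
    (The paper uses FIT-SSVD; this is only an illustrative substitute.)"""
    Um, d, Vt = np.linalg.svd(M, full_matrices=False)
    A = (Um[:, :rank] * d[:rank]) @ Vt[:rank]
    if thresh is None:
        thresh = 0.1 * np.abs(A).max()   # heuristic sparsity threshold
    return np.where(np.abs(A) > thresh, A, 0.0)

def hssvd_sketch(X_origin, rank=1):
    n, p = X_origin.shape
    J = np.ones((n, p))
    # Step 1, input: standardize to overall mean 0 and variance 1.
    mu, sigma = X_origin.mean(), X_origin.std()
    X = (X_origin - mu * J) / sigma
    # Step 2, quadratic rescaling: SSVD on X^2 - J.
    U = ssvd_approx(X ** 2 - J, rank)
    c = min(0.0, U.min() + 1.0 - 1e-6)   # small nonpositive c so the sqrt exists (heuristic)
    W = np.sqrt(U + J - c * J)           # working variance-level estimate
    # Step 3, mean search on the rescaled data Y = X / W.
    Y_hat = ssvd_approx(X / W, rank)
    # Step 4, variance search on the centered log squared residuals.
    Z_origin = np.log((X - Y_hat * W) ** 2 + 1e-12)
    Z_hat = ssvd_approx(Z_origin - Z_origin.mean(), rank)
    # Step 5, background estimation from cells flagged by neither signal.
    P = (Y_hat == 0) & (Z_hat == 0)
    if P.any():
        b, rho2 = X_origin[P].mean(), X_origin[P].var()
    else:                                # fallback if no cell looks null
        b, rho2 = mu, sigma ** 2
    # Step 6, scale back to the location and scale of the original data.
    P1, P2 = (Y_hat == 0), (Z_hat == 0)
    mean_approx = sigma * (Y_hat * W) + mu * (J - P1) + b * P1
    var_approx = (rho2 * P2 + sigma ** 2 * (J - P2)) * np.exp(Z_hat)
    return mean_approx, var_approx
```

The hierarchy is visible in the data flow: the mean search consumes the rescaled data from step 2, and the variance search consumes the residuals from the mean search.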
The operators ×, /, exp( ), log( ), min( ), and √( ) used above are defined element-wise when applied to matrices, e.g., Un×p × Vn×p = (uij vij). In all steps involving SSVD, we implement the FIT-SSVD method (10). We use FIT-SSVD because it is computationally fast and has similar or superior performance compared with other competing methods under the homogeneous variance assumption (10). The matrix √(U + J − cJ) provides a working variance-level estimate of the data and makes our method more robust. Note that the reason for working on the log scale for the variance detection is twofold. First, working on the log scale makes detection of a deflated-variance (less than 1) bicluster possible. Intuitively, as variance measures deviation from the mean, we can work on the squared residuals to find the variance structure. In the deflated-variance bicluster setting, if the mean structure is estimated correctly, the residuals within the bicluster are close to zero; because SSVD-based methods shrink small nonzero elements to zero to achieve sparsity, working on the squared residuals directly would cause them to miss the low-variance structure. Second, to use the well-established SSVD machinery in the variance detection step we need to work on the log scale. To see this, we can rewrite the equation in [2] as log(X − Ξ − bJ)² = log(Σ²) + log(ρ²Φ²), which is similar to the model in [1]. Consequently, we can apply any method which is applicable to [1] in our variance detection step if we work on the log scale and Σ is low rank. We also point out that results obtained directly from FIT-SSVD are relative to the location and scale of the background cluster. In addition, we have scaled the data in the "input step." To provide a correct mean and variance approximation of the original data, we need the "scale back" step. Assuming that the detection of the null cluster is close to the truth, the pooled mean and variance estimates based on elements exclusively from the identified null cluster (b and ρ) are more accurate than estimates based on all elements of the matrix (μ and σ). As a result, we use the more comprehensive formulas proposed in the scale back step.
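The deflated-variance point can be checked numerically. In this small demo (the block location and the 0.1 standard deviation are arbitrary choices for illustration), the low-variance block is nearly indistinguishable from sparsity on the raw squared-residual scale, but carries a strong signal on the log scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 30
resid = rng.standard_normal((n, p))   # residuals after removing the mean
resid[:10, :8] *= 0.1                 # deflated-variance (sd = 0.1) bicluster

sq = resid ** 2
# On the raw squared scale the low-variance block sits near 0, so
# sparsity-inducing shrinkage would zero it out and the bicluster is lost.
low_raw = sq[:10, :8].mean()          # close to 0.01

# On the log scale the same block sits far below the background level,
# so it shows up as a strong (negative) signal that SSVD can pick up.
logsq = np.log(sq)
gap = logsq[10:, 8:].mean() - logsq[:10, :8].mean()   # roughly -2*log(0.1)
```

The background-minus-block gap on the log scale is approximately −2 log(0.1) ≈ 4.6, well above the noise level, whereas on the raw squared scale the block mean is within shrinkage distance of zero.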
The FIT-SSVD method, as well as any other SVD-based method, requires an approximation of the rank of the matrix (which is essentially the number of true biclusters) as input. We adapt the bicross-validation (BCV) method of ref. 24 for rank estimation, and we notice that in some cases the rank is underestimated. For this reason, we introduce additional steps following a BCV estimation of rank k: First, we approximate the data with a sparse matrix Xk+1 (rank = k + 1), where Xk+1 = Σ_{j=1}^{k+1} dj uj vj′. Define the proportion of variance explained by the top-i-rank sparse matrix as Ri = Σ_{j=1}^{i} dj² / Σ_{j=1}^{k+1} dj² (25). Ri is between 0 and 1 and is increasing in i, and we believe that the redundant components of the sparse matrix should not contribute much to the total variance. The final rank estimate for HSSVD is the smallest integer r which satisfies Rr > 0.95, with 1 ≤ r ≤ k + 1. Note that FIT-SSVD (10) used a modified BCV method for rank estimation; however, the authors require that most rows (the whole row) and most columns (the whole column) are sparse, which appears to be too restrictive. In practice, this assumption is violated if the data are block diagonal or have certain other commonly assumed data structures. For this reason, we use the original BCV method as our starting point.
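The rank-selection rule follows directly from the definition of Ri. The sketch below (the function name and interface are our own) takes the singular values d1, …, dk+1 of the rank-(k + 1) sparse approximation and returns the smallest r with Rr above the threshold:

```python
import numpy as np

def hssvd_rank(d, k, threshold=0.95):
    """Given singular values d of a rank-(k+1) sparse approximation,
    where k comes from BCV, return the smallest r with R_r > threshold."""
    d2 = np.asarray(d[:k + 1], dtype=float) ** 2
    R = np.cumsum(d2) / d2.sum()              # R_i: variance explained by top i ranks
    return int(np.argmax(R > threshold)) + 1  # first index where R_i exceeds threshold
```

Because R_{k+1} = 1 by construction, the rule always returns some r with 1 ≤ r ≤ k + 1; a dominant leading singular value yields r = 1, while nearly equal singular values push r toward k + 1.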
ACKNOWLEDGMENTS. The authors thank the editor and two referees for helpful comments. The authors also thank Dr. Dan Yang for sharing part of her code. This work was supported in part by Grant P01 CA142538 from the National Institutes of Health.
1. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning (Springer, New York), Vol 1, pp 460–462.
2. Witten DM, Tibshirani R (2010) A framework for feature selection in clustering. J Am Stat Assoc 105(490):713–726.
3. Ma Z (2013) Sparse principal component analysis and iterative thresholding. Ann Statist 41(2):772–801.
4. Shen H, Huang J (2008) Sparse principal component analysis via regularized low rank matrix approximation. J Multivariate Anal 99(6):1015–1034.
5. Zou H, Hastie T, Tibshirani R (2006) Sparse principal component analysis. J Comput Graph Statist 15(2):265–286.
6. Kriegel HP, Kröger P, Zimek A (2009) Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 3(1):1–58.
7. Cheng Y, Church GM (2000) Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 8:93–103.
8. Busygin S (2008) Biclustering in data mining. Comput Oper Res 35(9):2964–2987.
9. Lee M, Shen H, Huang JZ, Marron JS (2010) Biclustering via sparse singular value decomposition. Biometrics 66(4):1087–1095.
10. Yang D, Ma Z, Buja A (2011) A sparse SVD method for high-dimensional data. arXiv:1112.2433.
11. Witten DM, Tibshirani R, Hastie T (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3):515–534.
12. Hoff PD (2006) Model averaging and dimension selection for the singular value decomposition. J Am Stat Assoc 102(478):674–685.
13. Johnstone IM, Lu AY (2009) On consistency and sparsity for principal components analysis in high dimensions. J Am Stat Assoc 104(486):682–693.
14. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58(1):267–288.
15. Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360.
16. Lazzeroni L, Owen A (2002) Plaid models for gene expression data. Statist Sinica 12:61–86.
17. Shabalin AA, Weigman VJ, Perou CM, Nobel AB (2009) Finding large average submatrices in high dimensional data. Ann Appl Stat 3(3):985–1012.
18. Allen GI, Grosenick L, Taylor J (2011) A generalized least squares matrix decomposition. arXiv:1102.3074.
19. Hansen KD, et al. (2011) Increased methylation variation in epigenetic domains across cancer types. Nat Genet 43(8):768–775.
20. Irizarry RA, et al. (2009) The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet 41(2):178–186.
21. Liu Y, Hayes DN, Nobel A, Marron JS (2008) Statistical significance of clustering for high-dimension, low-sample size data. J Am Stat Assoc 103(483):1281–1293.
22. Bhattacharjee A, et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98(24):13790–13795.
23. Turner H, Bailey T, Krzanowski W (2005) Improved biclustering of microarray data demonstrated through systematic performance tests. Comput Stat Data Anal 48(2):235–254.
24. Owen AB, Perry PO (2009) Bi-cross-validation of the SVD and the nonnegative matrix factorization. Ann Appl Stat 3(2):564–594.
25. Allen GI, Maletić-Savatić M (2011) Sparse non-negative generalized PCA with applications to metabolomics. Bioinformatics 27(21):3029–3035.
12258 | www.pnas.org/cgi/doi/10.1073/pnas.1304376110 Chen et al.