Mining Massive Amounts of Genomic Data: A
Semiparametric Topic Modeling Approach
Ethan X. Fang* Min-Dian Li† Michael I. Jordan‡ Han Liu§
January 1, 2015
Abstract
Characterizing the functional relevance of transcription factors (TFs) in different biological
contexts is pivotal in systems biology. Given the massive amount of genomic data, computational
identification of TFs is often necessary to generate new hypotheses for experimentalists.
In this paper, we use large gene expression and chromatin immunoprecipitation (ChIP) data
corpuses to conduct high-throughput TF-biological context association analysis. This work
makes two contributions: (i) From a methodological perspective, we propose a unified topic
modeling framework for exploring and analyzing large and complex genomic datasets. Under
this framework, we develop new statistical optimization algorithms and semiparametric theoretical
analysis which are also applicable to a variety of large-scale data analyses. (ii) From a
scientific perspective, our method provides an informative list of new discoveries in biology. Our
data-driven analysis of 38 TFs in 68 tumor biological contexts identifies functional signatures
of epigenetic regulators, such as SUZ12 and SET-DB1, and nuclear receptors, in many tumor
types. In particular, the TF signature of SUZ12 is present in a broad range of tumor types,
suggesting the important role of SUZ12-mediated histone methylation in tumor biology.
1 Introduction
A fundamental goal of systems biology and functional genomics is to understand global regulation
of gene expression. Transcription factors (TFs), cofactors, and epigenetic proteins represent major
regulators of gene expression, the disturbance of which contributes to the pathogenesis of a plethora
of human diseases, including cancer (Darnell, 2002; Arrowsmith et al., 2012). A major approach
* Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA;
e-mail: [email protected]
† Department of Cellular and Molecular Physiology, Section of Comparative Medicine and Program in Integrative
Cell Signaling and Neurobiology of Metabolism, Yale University School of Medicine, New Haven, CT 06520, USA;
e-mail: [email protected]
‡ Department of EECS and Statistics, University of California, Berkeley, CA 94720, USA; e-mail: [email protected]
§ Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA;
e-mail: [email protected]
Figure 1: (A) Our method integrates datasets arising from gene expression and ChIPx. (B) We assess whether top
target genes (red) have significant overlap with topic genes (purple). (C) We systematically explore the associations
between biological contexts and transcription factors. The current state of the art is that only a small proportion
(red) of the joint ChIPx and expression data in human has been investigated; we analyze the unexplored area (grey)
in order to guide biologists in the design of new experiments.
several nuclear receptors, e.g., HNF4A, and RXRA, exhibit significant relevance in a wide spectrum
of tumor types.
2 Data and Methodology
We exploit a gene expression dataset (McCall et al., 2011) consisting of n = 13,182 samples of
M = 2,631 biological contexts generated from Affymetrix Human 133A (GPL96) arrays. The data
were downloaded from GEO, then preprocessed and normalized using frozen-RMA (McCall et al., 2010) to
reduce batch effects. For each probeset, we standardize its expression values to have zero mean and
unit standard deviation across all array samples. The data contain 20,248 probes, corresponding
to d = 12,704 genes.
2.1 Data Modeling
The gene expression data $X \in \mathbb{R}^{n \times d}$ are highly heterogeneous since they are collected from multiple
biological contexts and labs. Such heterogeneity invalidates the classical Gaussian model and
motivates us to adopt a more flexible model based on the transelliptical distribution (Han and Liu,
2012).
A random vector $X = (X_1, \ldots, X_d)^T \in \mathbb{R}^d$ follows a transelliptical distribution, denoted
$X \sim TE(\mu, \Sigma; Z, f_1, \ldots, f_d)$, if there exist monotone univariate functions $f_1, \ldots, f_d : \mathbb{R} \to \mathbb{R}$ such that the
transformed data $f(X) = (f_1(X_1), \ldots, f_d(X_d))^T$ follows an elliptical distribution with mean $\mu$ and
covariance matrix $\Sigma$. More details regarding this distribution are provided in Appendix A.
To model the heterogeneity of the gene expression data $X$, we assume the expression data
from the $m$-th biological context are generated from a transelliptical random vector $X_m$. This
results in a transelliptical mixture model, i.e., each gene expression sample is generated from
$X \sim \sum_{m=1}^{M} \pi_m X_m \in \mathbb{R}^d$, where $M$ is the total number of biological contexts and
$\sum_{m=1}^{M} \pi_m = 1$.
The transelliptical mixture model has a natural hierarchical interpretation (Liu et al., 2012).
Specifically, for each biological context $m$, we assume that there exists a latent Gaussian random
vector $Y_m \sim N_d(\mu_m, \Sigma_m)$. As shown in Figure 2, the Gaussian random vector can be converted
into an elliptical random vector $Z_m \sim EC_d(g, \mu_m, \Sigma_m)$ via a global stochastic scaling factor $\xi_m$.
Compared to the Gaussian distribution, elliptical distributions are powerful at modeling heavy-tailed
distributions with possibly nontrivial tail dependency. However, elliptical distributions are still
restrictive since they must be symmetric. The elliptical random vector can be further converted into
a possibly asymmetric transelliptical random vector through marginal monotone transformations.
The transelliptical model is semiparametric since it contains both finite-dimensional parameters
(the mean and covariance matrix) and infinite-dimensional parameters (the stochastic scaling
variable and marginal transformations). Such a semiparametric architecture naturally addresses the
heterogeneity issue in modeling the expression data. For the purposes of statistical inference, we
treat the stochastic scaling factor $\xi_m$ and marginal transformations as nuisance parameters and
directly infer the latent means $\mu_m$ and covariance matrices $\Sigma_m$. We define $Y \sim \sum_{m=1}^{M} \pi_m Y_m$
to be the latent Gaussian mixture random vector associated with $X$.
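The three layers of this hierarchy (Gaussian, elliptical, transelliptical) can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: the function name `sample_transelliptical_mixture`, the chi-square scaling with `df` degrees of freedom (which yields a multivariate t as the elliptical layer), and the specific monotone transformations are all assumptions chosen for illustration. The transformations here play the role of the inverses of the marginal functions in the definition above.

```python
import numpy as np

def sample_transelliptical_mixture(n, pis, mus, Sigmas, transforms, df=5.0, seed=0):
    """Draw n samples from a toy transelliptical mixture (illustrative only).

    pis        : mixture weights pi_m, summing to 1
    mus, Sigmas: latent Gaussian means and covariance matrices, one per context
    transforms : one monotone elementwise function per context (hypothetical choice)
    df         : degrees of freedom of the chi-square scaling (assumption)
    """
    rng = np.random.default_rng(seed)
    d = len(mus[0])
    labels = rng.choice(len(pis), size=n, p=pis)   # context of each sample
    X = np.empty((n, d))
    for i, m in enumerate(labels):
        # Gaussian layer: Y_m ~ N(mu_m, Sigma_m)
        y = rng.multivariate_normal(mus[m], Sigmas[m])
        # Elliptical layer: rescale the deviation from the mean by a random scalar
        xi = np.sqrt(df / rng.chisquare(df))
        z = mus[m] + xi * (y - mus[m])
        # Transelliptical layer: monotone marginal transformation
        X[i] = transforms[m](z)
    return X, labels

pis = [0.5, 0.3, 0.2]
mus = [np.zeros(3), np.ones(3), -np.ones(3)]
Sigmas = [np.eye(3), 2.0 * np.eye(3), 0.5 * np.eye(3)]
transforms = [np.exp, np.arctan, lambda z: z ** 3]   # all monotone increasing
X, labels = sample_transelliptical_mixture(200, pis, mus, Sigmas, transforms)
```

In this sketch the scaling factor and the transformations are exactly the nuisance components described above: changing them alters the observed data but not the latent means and covariance matrices.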
2.2 Transelliptical Topic Model
We assume the gene expression data $X \in \mathbb{R}^{n \times d}$ can be summarized by a small number of “topic”
vectors $v_1, v_2, \ldots, v_T \in \mathbb{R}^d$ with $T \ll n$. This general approach has been used in many applications,
including text mining (Blei et al., 2003; Mimno, 2012), social media analysis (Purushotham et al.,
2012), image processing (Wang et al., 2009) and others (Bakalov et al., 2012; Yao et al., 2009; Shalit
et al., 2013). In particular, motivated by the approach to topic modeling based on the singular
value decomposition (Deerwester et al., 1990), we define the topics of the transelliptical mixture
random vector $X$ to be the leading eigenvectors of the latent mean-adjusted covariance matrix
$S = \Sigma + \mu\mu^T$, where $\Sigma$ and $\mu$ are the covariance matrix and mean of the latent Gaussian mixture
random vector $Y$, i.e.,
$$S = \mathrm{Cov}(Y) + E(Y)E(Y^T) = \sum_{m=1}^{M} \pi_m S_m, \tag{1}$$
where each $S_m = \Sigma_m + \mu_m \mu_m^T$.
[Figure 2 graphic: Gaussian (light tail), elliptical (heavy tail), and transelliptical (asymmetric) layers, shown for the example contexts Breast Cancer, Ewing Sarcoma, and Kidney Cancer drawn from a multinomial mixture distribution.]
Figure 2: The hierarchical structure of a transelliptical mixture distribution. Each biological context
$m$ has an underlying normal distribution $Y_m \sim N(\mu_m, \Sigma_m)$. Each $Y_m$ is transformed to an elliptical
random vector and then to a transelliptical random vector. The observed data are generated from
the transelliptical random vector.
The first term $\mathrm{Cov}(Y)$ captures population-level variability, and the second term $E(Y)E(Y^T)$
captures location information. Recall that for a positive semidefinite matrix $S \in \mathbb{R}^{d \times d}$, we can
write $S = \sum_{i=1}^{d} \lambda_i v_i v_i^T$, where $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_d \ge 0$ are the eigenvalues of $S$ and $v_i$ are the
corresponding eigenvectors, such that the best rank-$k$ approximation of $S$ is $\sum_{i=1}^{k} \lambda_i v_i v_i^T$ for all
$1 \le k \le d$ (Trefethen and Bau III, 1997). Thus, the leading topics provide a latent representation that
summarizes important aspects of the first- and second-order statistical structure of the distribution
of $X$. We additionally assume that the topics $v_1, v_2, \ldots, v_T \in \mathbb{R}^d$ are $s$-sparse; i.e., we assume at
most $s$ of the $d$ elements of each $v_t$ are non-zero, where $s \ll d$. Such sparsity assumptions have
been widely adopted in the latent variable modeling literature as a tool for addressing the curse of
dimensionality; see, e.g., Carvalho et al. (2008) and Wang and Blei (2009). The nonzero components
of the topics represent features which are important in one or more of the $X_m$. To summarize, the
transelliptical topic model is defined as:
Definition 2.1 (Transelliptical topic model). The transelliptical topic model, denoted by $\mathcal{T}(S; M, s)$,
is the set of distributions $X \sim \sum_{m=1}^{M} \pi_m X_m$, where each $X_m \sim TE_d(\mu_m, \Sigma_m; Z, f_1^{(m)}, \ldots, f_d^{(m)})$,
such that $S = \sum_{m=1}^{M} \pi_m (\Sigma_m + \mu_m \mu_m^T)$ and the first $T$ leading eigenvectors of $S$ are $s$-sparse.
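To make the definition concrete, the following sketch builds the mean-adjusted covariance matrix of equation (1) from known latent parameters and extracts its leading eigenvectors. It is a toy illustration under assumed inputs: the helper names `mean_adjusted_covariance` and `leading_topics` and the two-context example are hypothetical, and in practice the latent parameters are unknown and must be estimated.

```python
import numpy as np

def mean_adjusted_covariance(pis, mus, Sigmas):
    """S = sum_m pi_m (Sigma_m + mu_m mu_m^T), as in equation (1)."""
    d = len(mus[0])
    S = np.zeros((d, d))
    for pi_m, mu_m, Sigma_m in zip(pis, mus, Sigmas):
        mu_m = np.asarray(mu_m, dtype=float)
        S += pi_m * (np.asarray(Sigma_m, dtype=float) + np.outer(mu_m, mu_m))
    return S

def leading_topics(S, T):
    """Top-T eigenvalues and eigenvectors (as columns) of a symmetric matrix S."""
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:T]         # indices of the T largest
    return eigvals[order], eigvecs[:, order]

# Toy example with two contexts in d = 4 dimensions (values are made up).
pis = [0.6, 0.4]
mus = [np.array([2.0, 0.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])]
Sigmas = [0.1 * np.eye(4), 0.1 * np.eye(4)]

S = mean_adjusted_covariance(pis, mus, Sigmas)    # diag(2.5, 0.5, 0.1, 0.1)
lams, V = leading_topics(S, T=2)

# Rank-2 approximation built from the two leading topics.
S_rank2 = V @ np.diag(lams) @ V.T
approx_error = np.linalg.norm(S - S_rank2, 2)     # equals the third eigenvalue
```

Here each leading eigenvector has a single nonzero entry, so the topics are 1-sparse, and the spectral-norm error of the rank-2 approximation equals the largest discarded eigenvalue.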
Since transelliptical distributions can be heavy-tailed or asymmetric, we exploit a combination
of rank correlation (Han and Liu, 2012) and an M-estimator proposed by Catoni (2012) to estimate
the mean-adjusted covariance matrix S. For parameter estimation, we adopt the truncated power
(TPower) method (Yuan and Zhang, 2013) initialized by a semidefinite program that is known as
the Fantope Projection and Selection (FPS) method (Vu et al., 2013). More details regarding these
estimators can be found in Appendix B.
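The core iteration of a truncated power method can be sketched as follows: multiply by the matrix, keep the $s$ largest-magnitude entries, and renormalize. This is a simplified illustration, not the procedure of Appendix B: the function names are hypothetical, the matrix is a synthetic spiked example rather than a rank-based estimate, and the dense starting vector stands in for the FPS initialization, which is not implemented here.

```python
import numpy as np

def truncate(x, s):
    """Keep the s largest-magnitude entries of x, zero the rest, renormalize."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-s:]
    out[keep] = x[keep]
    return out / np.linalg.norm(out)

def tpower(S, s, v0, n_iter=200, tol=1e-10):
    """Truncated power iteration for one s-sparse leading eigenvector of S."""
    v = np.asarray(v0, dtype=float)
    v = v / np.linalg.norm(v)                 # the start is normalized, not truncated
    for _ in range(n_iter):
        v_new = truncate(S @ v, s)            # power step followed by truncation
        if np.linalg.norm(v_new - v) < tol:
            return v_new
        v = v_new
    return v

# Synthetic example: S = 5 v* v*^T + I with a 3-sparse v* in d = 10 dimensions.
d, s = 10, 3
v_star = np.zeros(d)
v_star[:s] = 1.0 / np.sqrt(s)
S = 5.0 * np.outer(v_star, v_star) + np.eye(d)

v_hat = tpower(S, s, v0=np.ones(d))           # dense start (assumption, not FPS)
```

On this noiseless example the iteration recovers the support of the sparse eigenvector after one step. With estimated matrices the quality of the initialization matters, which is presumably why the text pairs TPower with a convex initializer.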
We now present a theorem which shows that our proposed method achieves the minimax optimal
rate of convergence, $O_P(\sqrt{(s \log d)/n})$, for estimating the sparse topic vectors.
Theorem 2.2. Let $X \sim \mathcal{T}(S; M, s)$. We assume the first $T$ eigenvalues of $S$, $\lambda_1, \ldots, \lambda_T$, have a
smallest spectral gap such that $\lambda_t - \lambda_{t+1} \ge C_d$ for all $t = 1, \ldots, T-1$ and $C_d > 0$. Denote the
estimated topics by $\hat{v}_1, \ldots, \hat{v}_T$. Under the “sign sub-Gaussian condition” (Han and Liu, 2013), with
a suitable choice of tuning parameters, with probability at least $1 - O(d^{-1})$, we have
$$\|\hat{v}_t - v_t\|_2 \le C \cdot \sqrt{\frac{s \log d}{n}}, \tag{2}$$
for some constant $C$.
Note that when $X$ follows a Gaussian or elliptical mixture distribution, the topics are the leading
eigenvectors of $E(XX^T)$. To connect our topic model with existing work, suppose we have $T$ topics
$[v_1, \ldots, v_T] = W \in \mathbb{R}^{d \times T}$, where each column $v_t \in \mathbb{R}^d$ is a topic. We assume that the observed data
matrix $X \in \mathbb{R}^{n \times d}$ is generated through some random combination of the topics $v_1, \ldots, v_T$; i.e., we
assume that $X^T = WA$, where the random matrix $A \in \mathbb{R}^{T \times n}$ is generated
from some unknown distribution. In Deerwester et al. (1990), a singular value decomposition of
the observed data matrix, $X^T = UDV^T$, is conducted, such that if the columns of $U$ are viewed
as the topics, then $A = DV^T$ can be viewed as a random combination matrix. It is easily seen
that if $d$ is fixed and $n \to \infty$, the columns of $U$ converge to the leading eigenvectors of $E(XX^T)$
asymptotically. Thus, our definition of topics can be viewed as a generalization of that of Deerwester
et al. (1990).
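The link between the singular value decomposition and the eigenvectors of the second-moment matrix can be checked numerically. The sketch below uses synthetic data under assumed dimensions and scales (all values are made up): it generates $X^T = WA$ with orthonormal topics, then compares the left singular vectors of $X^T$ with the leading eigenvectors of the empirical second-moment matrix $X^T X / n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 500, 6, 2

# Orthonormal topic matrix W (d x T) and random combination matrix A (T x n).
W = np.linalg.qr(rng.normal(size=(d, T)))[0]
A = np.diag([5.0, 2.0]) @ rng.normal(size=(T, n))
Xt = W @ A                                          # X^T = W A, shape (d, n)

# SVD of the data matrix: X^T = U diag(D) V^T.
U, D, Vt = np.linalg.svd(Xt, full_matrices=False)

# Leading eigenvectors of the empirical second moment X^T X / n (d x d).
second_moment = Xt @ Xt.T / n
eigvals, eigvecs = np.linalg.eigh(second_moment)
order = np.argsort(eigvals)[::-1][:T]
top_vals, top_vecs = eigvals[order], eigvecs[:, order]

# Agreement up to sign: |<u_t, e_t>| should be 1 for each leading pair.
alignment = np.abs(np.sum(U[:, :T] * top_vecs, axis=0))

# The combination matrix implied by the SVD is diag(D) V^T.
A_svd = np.diag(D[:T]) @ Vt[:T]
```

The left singular vectors coincide with the leading eigenvectors up to sign, the eigenvalues equal the squared singular values divided by $n$, and $U$ multiplied by $DV^T$ reproduces the data matrix.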
Our topic modeling framework, based on a transelliptical mixture distribution, is nongenerative,
in distinction to the bulk of the literature on topic modeling, which focuses on generative models
(Blei et al., 2003; Mimno, 2012). Our topics are defined in the latent space and the transformations
to the observed data are treated as nuisance parameters; however, the topics in the latent space
can be viewed as informative summaries of the distribution of the random vector X.
2.3 TROPIC for TF-Biological Context Analysis
We now introduce the TROPIC method for conducting TF-biological context analysis. Given a
transcription factor and a biological context, we first identify the biological context’s feature genes
using the estimated topics from the gene expression data. Next, we exploit the ChIPx data to
identify the top target genes of the TF. We then test if the feature genes of the biological context
and the top target genes of the TF have significant overlap. If so, we conclude that the feature
genes and target genes significantly match, and the TF is deemed functionally significant in the
biological context.
In more detail, let $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T$ be the estimated topics from the gene expression profiles $X$. We
let $\hat{v}^{(m)}$ denote the leading eigenvector of the estimated latent mean-adjusted covariance matrix
of $X_m$, which can also be viewed as the leading “topic” of the $m$-th biological context. We can
view $\hat{v}^{(m)}$ as encoding summary information for the $m$-th biological context. However, the sample
size of the $m$-th biological context is possibly very small, which makes the
estimate $\hat{v}^{(m)}$ unstable. To resolve this problem, we regress $\hat{v}^{(m)}$ on the population topics $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T$
to identify a subset $\mathcal{S}_m = \{\hat{v}_{m_1}, \ldots, \hat{v}_{m_K}\}$ which explains the greatest fraction of the variability.
We then construct a binary feature vector $v_m^{(g)}$, where $v_m^{(g)}(i) = 1$ if there exists some $k$ such that
$\hat{v}_{m_k}(i) \neq 0$, and $v_m^{(g)}(i) = 0$ otherwise, where $v(i)$ denotes the $i$-th component of $v$.
We further construct a binary target gene vector $u_j^{(m)}$ corresponding to the $j$-th TF. The
elements of $u_j^{(m)}$ corresponding to the top target genes of the $j$-th TF are set to 1. To find these genes, we
first use CisGenome (Ji et al., 2008) to perform peak detection on the ChIPx data of the $j$-th
TF, and then use ChIPXpress (Wu and Ji, 2013) to identify the top target genes of the TF.
We then test whether $v_m^{(g)}$ and $u_j^{(m)}$ have significant overlap. If so, we conclude that the feature genes
of the $m$-th biological context significantly match the target genes of the $j$-th TF, and infer that
the regulation of the $j$-th TF is functionally important in the $m$-th biological context. A more
detailed presentation of the protocol can be found in Appendix C.
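The overlap test between the binary feature vector and the binary target gene vector can be sketched with a one-sided hypergeometric test. This is an assumption for illustration: the text defers the exact test to Appendix C, and the hypergeometric test is simply one common way to score the overlap of two gene sets. The function name `overlap_pvalue` and the toy vectors are hypothetical.

```python
from math import comb

def overlap_pvalue(feature, target):
    """Upper-tail hypergeometric P-value for the overlap of two binary vectors.

    feature, target: equal-length sequences of 0/1 indicators over the same genes.
    Returns (overlap size, P(overlap >= observed) under random gene selection).
    """
    if len(feature) != len(target):
        raise ValueError("vectors must index the same genes")
    d = len(feature)                       # total number of genes
    K = sum(feature)                       # number of feature genes
    n = sum(target)                        # number of top target genes
    k = sum(f and t for f, t in zip(feature, target))   # observed overlap
    total = comb(d, n)
    tail = sum(comb(K, j) * comb(d - K, n - j) for j in range(k, min(K, n) + 1))
    return k, tail / total

# Toy example with d = 20 genes: 8 feature genes, 6 target genes, all 6 shared.
feature = [1] * 8 + [0] * 12
target = [1] * 6 + [0] * 14
k, p = overlap_pvalue(feature, target)

# Disjoint gene sets give no evidence of association.
k0, p0 = overlap_pvalue([1] * 5 + [0] * 15, [0] * 15 + [1] * 5)
```

In the first toy case the P-value is 28/38760, since only C(8, 6) = 28 of the C(20, 6) = 38760 possible target sets fall entirely inside the feature genes. P-values computed this way across many TF and context pairs would still need adjustment for multiple comparisons.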
3 Results and Discussions
We apply the TROPIC method to analyze the association between 38 TFs and a total of 68
tumor-related biological contexts, each of which has a sample size greater than
20. In this section, we discuss several important biological findings that arise from this analysis.
3.1 TROPIC Reliably Predicts TF Signature in a Conserved Cohort of Tumor
Types with ChIPx Data from Different Sources
To test the hypothesis that the adaptively selected target genes from the ChIPx data represent
the major targets of a TF, we use the TROPIC method to examine the association between the major
targets of MYC and 68 sources of tumors, using ChIPx data from 6 sources. The ChIPx
datasets differ in originating laboratory and cell type. As shown in Figure 3A, ChIPx data for
MYC predict a conserved cohort of tumor types (14/18), suggesting that our selection criteria faithfully
preserve the major targets of MYC regardless of the origin of the data. In particular, as shown in
Figure 3B, ChIPx data from three different cell types predict the MYC signature in 12 tumors shared
by all cell types. The cell types chosen are originally from umbilical vein endothelium (HUVEC),
lymphoblastoid tumor (GM12878), myelogenous leukemia (K562), and cervical malignant tumor
(HeLa), which have distinct cellular physiology. Experimental variance is another concern for
extrapolating TF function to a new biological context. We compare the outcomes of TROPIC
from two laboratories and find that K562 cell-derived and GM12878 cell-derived ChIPx data predict
the MYC signature in highly overlapping cohorts of tumors, 12/14 and 14/16, respectively, as shown
in Figure 3C. Together, the results indicate that our selection criteria for processing ChIPx data can
reliably predict TF signatures in new biological contexts.
[Figure 3 graphic. Panel A: heatmap of ChIP sources (Yale: HeLa, UTA: HUVEC, UTA: GM12878, UTA: K562, Yale: GM12878, Yale: K562) against significant biological contexts. Panel B: Venn diagrams across cell types (UTA HUVEC, UTA GM12878, UTA K562; Yale HeLa, Yale GM12878, Yale K562). Panel C: Venn diagrams across laboratories (Yale K562 and UTA K562; UTA GM12878 and Yale GM12878).]
Figure 3: TROPIC predicts the TF signature in a conserved cohort of tumor types with ChIPx
data from different sources. (A) The diagram that shows significant biological contexts from 68
tumors for MYC. The horizontal panel shows significant biological contexts. The vertical panel
shows sources of ChIPx data for MYC. The red color indicates an adjusted P-value < 0.05. (B)
A Venn diagram of the number of significant biological contexts for ChIPx data from different cell
types. (C) A Venn diagram of the number of significant biological contexts for ChIPx data from
different laboratories.
3.2 TROPIC Predicts TF Signature in a Bigger Cohort of Tumor Types than
ChIP-PED
ChIP-PED is an alternative method to predict TF signatures in biological contexts where ChIPx
data are not available. To estimate the accuracy of our TROPIC method, we choose ChIPx data
for MYC and SET-DB1, which represent a TF and an epigenetic protein, and apply the TROPIC
method to predict associations of TFs with 68 tumors. Note that throughout the paper, we use
the FDR method (Benjamini and Hochberg, 1995) to adjust the P-values for multiple comparisons.
However, for a fair comparison in Figure 4, we adjust the P-values of the two methods using
Bonferroni’s method, as ChIP-PED does. Using a Bonferroni-adjusted P-value of 0.05 as the
threshold, the tumor types predicted by ChIP-PED overlap significantly
with those predicted by the TROPIC method, as shown in Figure 4. In particular, the TROPIC
method predicts MYC signature in seven tumors (Figure 4A, ChIPx source: UTA GM12878 without
MCF7) whereas ChIP-PED predicts MYC signature in a sub-cohort of four tumors. Two types
of lymphoma and K562 cell line are predicted by both methods, which is supported by previous
studies (Li et al., 2003; Slack and Gascoyne, 2011). Melanoma is another common tumor type
affected by MYC (Zhuang et al., 2008; Leonetti et al., 1996), which is predicted by our method.
Similarly, ChIP-PED predicts the SET-DB1 signature in a sub-cohort of 6 tumors out of the 11 predicted
by the TROPIC method, as shown in Figure 4B. Both the TROPIC and ChIP-PED methods predict
melanoma as a significant biological context, which is consistent with a recent study (Ceol et al.,
2011). The difference is likely due to an additional assumption made by ChIP-PED, namely
that the target genes and the TF both have significantly high or low expression.
Meanwhile, our TROPIC method sets no threshold value for the expression level of TFs and does
not match the expression level of target genes to the expression level of TFs. It is reasonable that
altered expression of TF contributes to changes in its target genes, especially given that tumor
cells are known to show increased activity of oncogenic TFs (Darnell, 2002). However, increased
activity of TFs is not necessarily associated with increased level of expression. It is known that
chromosomal translocations and point mutations in oncogenic TFs, cofactors, or epigenetic proteins
can contribute to increased activity of TFs. In addition, decreased activity of TFs, cofactors, or
epigenetic proteins can be counted as features of the biological context by the TROPIC method,
so long as the inactivation leads to a dramatic change in target genes. This extends the power of
TROPIC to predict TF signatures in biological contexts that have inactivated TFs, as commonly
observed in chromosomal translocations and truncations. In summary, the TROPIC method can
predict the TF signature regardless of the expression level and the activation status of the protein,
and thus provides a bigger cohort of tumor types for a specific TF.
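The two multiple-comparison adjustments mentioned above, the Benjamini-Hochberg FDR procedure and Bonferroni's method, behave differently at the same 0.05 threshold. The following sketch implements the standard textbook versions of both on a made-up list of P-values. The numbers are purely illustrative and are not taken from the analysis.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each P-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values (step-up FDR procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending P-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.012, 0.041, 0.2]   # hypothetical raw P-values
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

n_sig_bonf = sum(p < 0.05 for p in bonf)
n_sig_bh = sum(p < 0.05 for p in bh)
```

On this toy input Bonferroni declares two tests significant and Benjamini-Hochberg declares three, which illustrates why the comparison with ChIP-PED is made under a single shared adjustment.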
3.3 TROPIC Predicts Novel Biological Contexts in Tumors
To test whether the TROPIC method is applicable to other regulators of gene expression, we
further apply the transelliptical topic modeling framework to context-specific analysis of ChIPx
data comprising 38 TFs, cofactors, and epigenetic proteins, and gene expression of 68 tumor types.
3.3.1 Epigenetic Regulators are Relevant to Many Tumor Types
Epigenetic control of gene expression is emerging as a crucial contributor to tumorigenesis and
metastasis (Suva et al., 2013). Histone methylation is an important and widespread
epigenetic mechanism. Emerging evidence indicates that deregulation of histone methylation contributes
to tumor formation (Martin and Zhang, 2005; Greer and Shi, 2012; Dawson and Kouzarides, 2012;
Chi et al., 2010). We include several epigenetic regulators in the TROPIC analysis and present the
results in Figure 5.
[Figure 4 graphic: heatmaps of biological contexts (columns) against method (rows). Panel A: MYC, with rows TROPIC and ChIP-PED. Panel B: SET-DB1, with rows TROPIC 1, ChIP-PED 1, TROPIC 2, and ChIP-PED 2.]
Figure 4: Comparison between TROPIC and ChIP-PED. (A) Diagram that shows significant
biological contexts from 68 tumors for MYC computed from TROPIC and ChIP-PED. The red square
indicates an adjusted P-value < 0.05. (B) Diagram that shows significant biological contexts from
68 tumors for SET-DB1 computed from TROPIC and ChIP-PED, where the first two rows indicate
the results from one ChIPx dataset, and the last two rows show the results from another ChIPx
dataset. The red square indicates an adjusted P-value < 0.05.
SUZ12: Multiple subunits of polycomb repressive complex 2 (PRC2), which trimethylates
histone 3 lysine 27, are either mutated or dysregulated in different tumors (Sparmann and van
Lohuizen, 2006). SUZ12 is a core subunit of PRC2. Previous studies report altered expression levels
of PRC2/SUZ12 in a wide range of human primary tumors, such as T cell acute lymphoblastic
leukemia (T-ALL) (Ntziachristos et al., 2012), ovarian (Li et al., 2012, 2007), metastatic prostate
(Yu et al., 2007), lung (Martín-Pérez et al., 2010), melanoma (Martín-Pérez et al., 2010), and brain and
glial tumors (Crea et al., 2010). To test whether the SUZ12 signature is present in tumors, we apply
the TROPIC method to analyze SUZ12 in human tumor samples. The results indicate that the SUZ12
signature is present in 48 out of 68 tumor samples (70.59%), including most of the reported tumor
types, as shown in Figure 5. Genetic manipulation of SUZ12 alters tumor proliferation
in the context of ovarian cancer and mantle cell lymphoma (Li et al., 2012; Martín-Pérez et al.,
2010). However, whether the function of SUZ12 in other tumor types is significant is largely
unknown. In addition, traditional screening via expression profiling, somatic mutation mapping, and
knockdown underestimates the functional relevance of TFs and transcriptional regulators. It has
been reported that portions of SUZ12 are commonly fused to the JAZF1 gene in normal and neoplastic
endometrial cells (Li et al., 2007, 2008). JAZF1-SUZ12 contributes to tumorigenesis independently
of the expression and sequence of the SUZ12 gene but exhibits a TF signature that can be identified by
the TROPIC method (Figure 5, see ovarian tumor: endometroid). Despite a large body of evidence
supporting PRC2/SUZ12 as an oncoprotein, a recent study shows that PRC2/SUZ12 acts as a tumor
suppressor in T-ALL (Ntziachristos et al., 2012). Our results identify the SUZ12 signature in T-ALL,
suggesting that the TROPIC method focuses on the functional significance of TFs regardless of
the positive or negative role played by TFs. Together, our data suggest that SUZ12 is an important
regulator of gene expression in a broad range of tumor types.
[Figure 5 graphic: heatmap of ChIP proteins (rows: SUZ12, JUND, SETDB1, EP300, GABP, NFKB, FOS, JUN, IRF4, ESR1, RXRA, HNF4A, PAX5; grouped as epigenetic proteins, Hippo pathway, nuclear receptor, and oncogenic TF) against biological contexts (columns).]
Figure 5: Results of TROPIC on 13 TFs on 68 tumor-related biological contexts. The red square
indicates an adjusted P-value < 0.05.
SET-DB1: SET-DB1 is another epigenetic regulator that methylates the histone 3 lysine 9 residue
into mono-, di-, and tri-methylated forms (Greer and Shi, 2012). Originally discovered in fruit flies,
mammalian SET-DB1 is involved in the maintenance of embryonic stem cells by repressing the
expression of developmental regulators (Bilodeau et al., 2009). A recent study reports that SET-DB1
is amplified in melanoma and accelerates tumor onset (Ceol et al., 2011). The same
study also finds that the copy number of SET-DB1 is increased in breast, liver, lung, and ovarian
tumors. An increased copy number of a gene does not necessarily lead to increased activity,
but a significant representation of the TF signature supports a tumor-relevant role for that gene.
To verify whether the SET-DB1 signature is present in tumor samples, we include SET-DB1 in the
TROPIC analysis. The results show that the SET-DB1 signature is present in 20 sources of tumors,
including melanoma, breast, and lung tumors, as shown in Figure 5. In particular, melanoma and
Wilms tumors are predicted by both TROPIC and ChIP-PED to be significant biological contexts
for SET-DB1 (Figure 4B, first two rows). Whether SET-DB1 is involved in tumorigenesis in Wilms
tumor awaits further study. In addition to confirming the reported tumor types, the data
suggest that SET-DB1 is an important regulator in several types of blood and solid tumors.
3.3.2 Ets Family Protein GABP is Significantly Associated with Leukemia
GABP: Hippo tumor suppressor signaling is a conserved molecular pathway for the control of
organ size and has been implicated in cancer (Harvey et al., 2013; Halder and Johnson, 2011). The Hippo
pathway gauges organ size by restricting both cell growth and cell proliferation, as well as
inducing cell death. Dysregulation of Hippo signaling is observed in a broad range of human
cancers; however, somatic or germline mutations in the Hippo pathway are uncommon (Harvey and
Tapon, 2007). Recently, GA-binding protein (GABP), a member of the ETS transcription factor family,
has been found to drive the expression of YAP (Wu et al., 2013b), the effector TF of the Hippo pathway.
Loss of GABP down-regulates the level of YAP, resulting in a block at the G1/S phase of the cell cycle
and increased cell death, which establishes GABP as an important regulator of the Hippo pathway. We
test whether the GABP signature is associated with tumors using the TROPIC method. The results show
that the GABP signature is present in 19 sources of tumors as shown in Figure 5, including melanoma,
breast, lung, and prostate tumors, with an enrichment of lymphoblastic leukemia (8/19, 42.11%). It
has been reported that the Hippo-YAP pathway is deregulated in solid tumors (breast,
lung, colorectal, and liver) (Halder and Johnson, 2011). Our analysis suggests that GABP is a
contributing pathogenic factor in breast and lung tumors through the Hippo-YAP pathway.
3.3.3 Classic Oncogenic TFs are Implicated in Many Tumor Types
NF-κB and AP-1: Historically, transcription factors such as NF-κB (RELA) and AP-1 (FOS,
JUN, JUND, etc.) are among the first cohort of oncogenes. These TFs are master regulators in cell
proliferation, differentiation, survival, stress response, and inflammation, most of which represent
hallmarks of tumor cells (Li and Yang, 2011; Piette et al., 1997; Shaulian and Karin, 2002; Hanahan
and Weinberg, 2011). A large body of studies has implicated NF-κB (RELA)
and AP-1 in lymphoma and leukemia (Eferl and Wagner, 2003; Rayet and Gelinas, 1999). We
test whether our method can reveal lymphoma and leukemia as the significant biological contexts
for those proteins. Importantly, chromosomal amplification, over-expression and rearrangement of
these genes contribute to tumorigenesis, which is likely to be filtered out by existing methods. We
apply the TROPIC method to NF-κB (RELA), FOS, JUN, and JUND. The results show that many
biological contexts of blood tumors are significant for these TFs whereas IRF4 is not significant in
most of the tumors except in myeloma and T-cell acute lymphoblastic leukemia (T-ALL) (Figure
5) as reported previously (Yoshida et al., 1999). These results demonstrate the high credibility of
TROPIC in predicting biological contexts.
ESR1: Estrogen receptor 1 (ESR1 or estrogen receptor alpha) is a classic steroid nuclear
receptor that is activated by the hormone estrogen, which regulates behavior
and physiology. ESR1-deficient mice are sterile with incomplete development of sex organs (Ogawa
et al., 1998; Dupont et al., 2000). ESR1 also acts in other tissues, such as bone and adipose
tissue (Heine et al., 2000; Nakamura et al., 2007). It has been reported that estrogen promotes
apoptosis of osteoblasts by ESR1 and induction of FAS death ligand (Nakamura et al., 2007).
ESR1-deficient mice are obese with increased number and size of adipose tissue (Heine et al.,
2000). ESR1 is involved in the pathogenesis of breast cancer and endometrial cancer. Expression
of ESR1 is widely used as a prognostic marker for breast cancer (Knight et al., 1977; Gruvberger
et al., 2001). The wide spectrum of physiological functions of ESR1 suggests that its pathogenic role extends
beyond the realm of reproductive tissue-derived cancers. To test this hypothesis, we run the TROPIC
analysis with ChIPx data for ESR1 and find that ESR1 is significantly associated with 30 out of 68
tumor-related biological contexts (Figure 5). As expected, breast cancer and ovarian cancer exhibit the
ESR1 signature. Surprisingly, many types of B-cell (acute or chronic) lymphoblastic leukemia are
significantly associated with ESR1. It is known that estrogen promotes proliferation and survival
of B cells (Grimaldi et al., 2002; Thurmond et al., 2000). The ESR1 pathway may therefore contribute to the
pathogenesis of B-cell lymphoma and leukemia by increasing cell proliferation and survival.
PAX5: Paired box protein 5 (PAX5) is a transcription factor in B cell development and has been
implicated in several types of lymphoma (Shaffer et al., 2002). PAX5 activates a transcriptional
program of various B-cell-specific genes, which is required for directing bone-marrow progenitor
cells to differentiate into B cells (Morrison et al., 1998b). Urbanek et al. (1994) reported that loss
of PAX5 in mice leads to a complete arrest of B cell development at an early precursor stage. PAX5
is also important in the late stage of B cell differentiation. De-regulation of PAX5 is commonly observed
in several types of lymphoma in the form of chromosomal translocation. A t(9;14)(p13;q32)
chromosomal translocation brings the potent Eμ enhancer of the IgH gene (a gene expressed in
mature B cells) into close proximity of the PAX5 promoter and results in increased expression of
PAX5 in late B-cell differentiation (Busslinger et al., 1996; Iida et al., 1996; Morrison et al., 1998a).
As expected, the significant biological contexts of PAX5 include B-cell lymphoma and leukemia
(i.e. B-ALL and B-CLL) (Figure 5). PAX5 is not only a master regulator of B cell biology, but also
an important pattern organizer in the development of central nervous system and genital tracts
(Urbanek et al., 1997; Bouchard et al., 2000). However, whether PAX5 is implicated in other types
of cancers is not known. Our TROPIC analysis shows that PAX5 is associated significantly with
28 out of 68 tumor biological contexts (Figure 5), including solid tumors from brain (brain and
glial cells) and reproductive organs (ovary and bladder). These observations indicate that the
tumorigenic role of PAX5 extends beyond the realm of B-cell lymphoma.
3.3.4 Nuclear Receptors RXRA and HNF4A are Broadly Implicated in Tumors
Nuclear receptors represent a superfamily of ligand-activated transcription factors that modulate
cell growth, differentiation, survival and metabolism (Mangelsdorf et al., 1995; Evans, 1988). The
ligands for nuclear receptors include hormones and metabolites, ranging from retinoic acids (RAs),
vitamin D, and steroid hormones to lipid species. Retinoid X receptor A (RXRA) recognizes 9-cis
retinoic acid (9-cis RA) and heterodimerizes with other nuclear receptors to modulate cellular
function. RAs are widely explored as therapeutics for both blood and solid tumors (Altucci et al.,
2007). The presence of an RXRA signature in a tumor would help gauge the plausibility of
RA-based therapy for that specific tumor. We examine the significant tumor contexts for RXRA
and find that more than 65% of tumor contexts (47/68) are relevant to RXRA (Figure 5). This
highlights the important role of RXRA biology in those tumors. Similarly, another nuclear receptor,
hepatocyte nuclear factor 4 A (HNF4A), is significantly associated with a broad spectrum of tumors
(33/68, 48.53%), including melanoma, leukemia, breast, cervical, lung, and ovarian tumors (Figure
5). HNF4A has long been regarded as a critical regulator of metabolism and contributes to the
pathogenesis of type I diabetes. It is well known that there is a strong link between diabetes and
cancer (Gullo et al., 1994; Vigneri et al., 2009). Our results suggest that HNF4A may be a genetic
link between diabetes and cancer.
3.4 The Estimate of the True Positive Rate
To evaluate the quality of our results, we randomly select 100 of the TF-biological context pairs
that our method identifies as functionally important. Next, we search the existing literature to determine
whether the connection in each TF-biological context pair has been experimentally verified. In total, we find that 48/100 pairs
have been explicitly verified by biologists. Furthermore, 78/100 pairs have been mentioned in
the literature. Thus, very conservatively, the true positive rate of our results is 48-78%. This
provides strong evidence that our method can guide biologists to conduct experiments
more efficiently.
4 Discussion
We present a semiparametric topic modeling framework to conduct high-throughput TF-biological
context association analysis. Our approach addresses several key challenges in Big Data analysis, including
high dimensionality, distributional complexity, and data heterogeneity. Theoretically, our method
guarantees a nearly optimal rate of convergence across a wide family of possibly heavy-tailed distributions.
Practically, our method is computationally simple and robust to very noisy data. Considering
the limited supply of ChIPx data and the massively expanding pool of gene expression profiles,
TROPIC has the potential to assist in the construction of the global regulatory networks of large
numbers of genes.
One drawback of our method compared with ChIP-PED is that our method does not reveal
the detailed regulation pattern of a TF in different biological contexts (e.g., whether this TF acts as
an activator or repressor). A natural way to address the issue is to further divide the feature genes
and target genes into different groups according to the signs of their correlations, and consider the
topics in different groups of genes.
Acknowledgement
Han Liu is supported by NSF Grants III-1116730 and NSF III-1332109, NIH R01MH102339, NIH
R01GM083084, and NIH R01HG06841, and FDA HHSF223201000072C. The authors are also grateful
for the hospitality of the Simons Institute for the Theory of Computing at UC Berkeley. Min-Dian Li
is supported by a scholarship from the CSC-Yale World Scholars Program and the Glenn/AFAR
Scholarship for Research in the Biology of Aging.
Appendix
A Elliptical and Transelliptical Models
In this section, we briefly review the transelliptical distribution (Han and Liu, 2012) and discuss
its relationship with the other distribution families.
We start with some notation. For a vector $u = (u_1, \ldots, u_d)^T \in \mathbb{R}^d$, the $\ell_0$, $\ell_p$ and $\ell_\infty$ vector
norms are defined as $\|u\|_0 := \mathrm{card}(\mathrm{supp}(u))$, $\|u\|_p := \big(\sum_{j=1}^d |u_j|^p\big)^{1/p}$ and $\|u\|_\infty := \max_{1 \le j \le d} |u_j|$.
For a matrix $A = [a_{jk}] \in \mathbb{R}^{d \times d}$, the $\ell_{\max}$-norm is defined as $\|A\|_{\max} := \max_{1 \le j,k \le d} |a_{jk}|$. Let
$S^{d-1} := \{u \in \mathbb{R}^d : \|u\|_2 = 1\}$ be the $d$-dimensional unit sphere. For any two vectors $a, b \in \mathbb{R}^d$ and two
square matrices $A, B \in \mathbb{R}^{d \times d}$, we denote their inner products by $\langle a, b \rangle := a^T b$ and $\langle A, B \rangle :=
\mathrm{Tr}(A^T B)$ respectively. Throughout the Appendix, we use a generic constant $C$ whose value may
vary from line to line.
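The notation above can be checked numerically with a few lines of code. The following is a minimal sketch using plain Python lists; the function names (`l0_norm`, `lp_norm`, `linf_norm`, `max_norm`, `vec_inner`, `mat_inner`) are illustrative choices and do not come from the paper.

```python
def l0_norm(u):
    """card(supp(u)): the number of nonzero entries of u."""
    return sum(1 for x in u if x != 0)

def lp_norm(u, p):
    """(sum_j |u_j|^p)^(1/p), for p >= 1."""
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def linf_norm(u):
    """max_j |u_j|."""
    return max(abs(x) for x in u)

def max_norm(A):
    """Entrywise maximum max_{j,k} |a_jk| of a matrix given as a list of rows."""
    return max(abs(a) for row in A for a in row)

def vec_inner(a, b):
    """<a, b> = a^T b."""
    return sum(x * y for x, y in zip(a, b))

def mat_inner(A, B):
    """<A, B> = Tr(A^T B), which equals the sum of entrywise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

u = [3.0, 0.0, -4.0]
A = [[1.0, -2.0], [0.5, 0.0]]
B = [[2.0, 1.0], [4.0, 3.0]]
```

Here `u` has two nonzero entries, so its $\ell_0$ "norm" is 2 while its $\ell_2$ norm is 5; the sparsity assumption on the leading eigenvectors used later in the Appendix is stated in terms of the former.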
The transelliptical model is a semiparametric distribution family in which the nonparametric
components provide modeling flexibility, while the parametric components encode the important
information we can estimate efficiently. Before describing the transelliptical distribution, we briefly
overview the elliptical model, which can be viewed as a subfamily of the transelliptical model.
Recall that a random vector $X = (X_1, X_2, \ldots, X_d)^T \in \mathbb{R}^d$ is continuous if the marginal distributions
of $X_1, \ldots, X_d$ are all continuous, and we say $X$ possesses a density if $X$ is absolutely continuous
with respect to Lebesgue measure. The elliptical distribution is defined below.
Definition A.1. A random vector $X \in \mathbb{R}^d$ (assuming its density exists) follows an elliptical
distribution if its density is of the following form:
$$f(x) = c\,|\Sigma|^{-1/2}\, g\big((x - \mu)^T \Sigma^{-1} (x - \mu)\big), \tag{3}$$
where $\mu \in \mathbb{R}^d$; $\Sigma \in \mathbb{R}^{d \times d}$ is positive definite; $g : \mathbb{R}_+ \to \mathbb{R}_+$ is a univariate function on $[0, \infty)$; and $c$
is a normalization constant. We denote $X \sim EC_d(\mu, \Sigma, g)$.
Remark A.2. In general, we say a random vector $X \in \mathbb{R}^d$ follows an elliptical distribution if it can
be represented as $X \overset{d}{=} \mu + \xi A U$, where $\mu \in \mathbb{R}^d$; $A \in \mathbb{R}^{d \times p}$ with $A A^T = \Sigma$, $p \le d$ and $p = \mathrm{rank}(\Sigma)$;
$\xi \ge 0$ is a random variable independent of $U$; and $U \in S^{p-1}$ is uniformly distributed on the unit sphere
in $\mathbb{R}^p$. Note that $X$ does not necessarily possess a density, as $\xi$ does not always possess a
density and $\Sigma$ is only assumed to be positive semidefinite and might not be of full rank. In this
paper, we restrict our discussion to elliptical distributions which possess densities.
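The stochastic representation in Remark A.2 translates directly into a sampling recipe. The sketch below is an illustration rather than part of the method: the function names and the example values of `mu` and `A` are hypothetical, and the particular choice of $\xi$ (the square root of a chi-square variable with $p$ degrees of freedom, which yields the Gaussian special case) is just one convenient instance.

```python
import math
import random

def uniform_on_sphere(p, rng):
    """Draw U uniformly on the unit sphere of R^p by normalizing a standard Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in g))
        if norm > 0.0:
            return [x / norm for x in g]

def sample_elliptical(mu, A, draw_xi, rng):
    """Return one draw of X = mu + xi * A U, where A is a d x p list of rows."""
    p = len(A[0])
    u = uniform_on_sphere(p, rng)
    xi = draw_xi(rng)  # nonnegative radial variable, drawn independently of U
    return [m + xi * sum(a * uk for a, uk in zip(row, u)) for m, row in zip(mu, A)]

def chi_xi(rng, p=2):
    """xi with xi^2 chi-square(p): this makes X Gaussian with mean mu and covariance A A^T."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p)))

rng = random.Random(0)
mu = [1.0, -1.0]               # hypothetical location
A = [[2.0, 0.0], [1.0, 1.0]]   # hypothetical factor; Sigma = A A^T = [[4, 2], [2, 2]]
x = sample_elliptical(mu, A, chi_xi, rng)
```

Replacing `chi_xi` with a heavier-tailed radial law changes the tails of $X$ while leaving the elliptical contours determined by $\Sigma$ unchanged, which is the flexibility the elliptical family is meant to provide.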
Assume that a random vector $X \in \mathbb{R}^d$ possesses a density and a covariance matrix (i.e., the
second moments of $X$ are finite). The next proposition (Anderson and Fang, 1990) characterizes
the relationship between the matrix $\Sigma$ and the covariance matrix of $X$.
Proposition A.3. If a random vector $X \in \mathbb{R}^d$ follows an elliptical distribution possessing a density
as defined in (3), then the matrix $\Sigma \in \mathbb{R}^{d \times d}$ in (3) is a scatter matrix of $X$, i.e., $\Sigma$ is proportional
to the covariance matrix of $X$.
The next proposition provides a condition for (µ,Σ, g) to be identifiable for X.
Proposition A.4. Let $X = (X_1, \ldots, X_d)^T \in \mathbb{R}^d$ be a random vector. If $X \sim EC_d(\mu, \Sigma, g)$ is
continuous and possesses a density, then (i) $\Sigma_{jj} > 0$ for all $j \in \{1, \ldots, d\}$; (ii) $(\mu, \Sigma, g)$ is identifiable
for $X$ under the constraint that $\Sigma_{jj} = \mathrm{Var}(X_j)$ for all $j \in \{1, \ldots, d\}$.
In the sequel, we adopt the identifiability condition that $\mathrm{Var}(X_j) = \Sigma_{jj}$ for all $j \in \{1, \ldots, d\}$. In
order to model more complex distributions, Han and Liu (2012) extend the elliptical family to the
more flexible transelliptical family.
Definition A.5. (Transelliptical Distribution). A continuous random vector $X = (X_1, X_2, \ldots, X_d)^T$
follows a transelliptical distribution, denoted by $X \sim TE_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$, if there exist monotone
univariate functions $f_1, \ldots, f_d$ such that
$$(f_1(X_1), \ldots, f_d(X_d))^T \overset{d}{=} Z \sim EC_d(\mu, \Sigma, g). \tag{4}$$
We further assume that each $f_j(\cdot)$ preserves the marginal mean and variance of $X_j$, i.e., $E(X_j) =
E(Z_j)$ and $\mathrm{Var}(X_j) = \mathrm{Var}(Z_j)$. Such an identifiability condition is motivated by the “normal reference
rule” (i.e., the model should reduce to a Gaussian model if the data are actually Gaussian).
We call the matrix $\Sigma$ the latent covariance matrix of $X$.
Note that the definition of transelliptical distribution is slightly different from the original
definition in Han and Liu (2012), as we impose a different identifiability condition. Namely, the
aim of Han and Liu (2012) is to conduct scale-invariant PCA on the latent correlation matrix of X.
Thus, the identifiability condition in Han and Liu (2012) is that $\mu = 0$ and the diagonal components
of $\Sigma$ are all 1’s. While this form of identifiability provides ease of estimation, it loses the marginal
location and scale information. Thus we assume that $E(X_j) = E(Z_j)$ and $\mathrm{Var}(X_j) = \mathrm{Var}(Z_j)$.
B Estimating Leading Topics
The leading topics of the transelliptical topic model can be estimated using a combination of sparse
semidefinite programming and algorithmic statistics.
Let $X \sim \sum_{m=1}^{M} \pi_m X_m$ where $X_m \sim TE_d(\mu_m, \Sigma_m; Z_m, f^{(m)}_1, \ldots, f^{(m)}_d)$. To conduct transelliptical
topic analysis, we first need to estimate each $\mu_m$ and $\Sigma_m$ in order to estimate the pooled
mean-adjusted covariance matrix. As the transelliptical family contains heavy-tailed and asym-
metric distributions, classical sample mean and covariance matrices do not achieve the desired rate
of convergence and new estimation procedures are needed.
B.1 Estimating the Latent Means
For $X \sim TE_d(\mu_m, \Sigma_m; Z_m, f_1, \ldots, f_d)$, we exploit an M-estimator proposed by Catoni (2012) to
estimate the mean of $X$. Let $\mu = (\mu_1, \ldots, \mu_d)^T$. Given $n$ independent samples $x_1, \ldots, x_n$ of $X$, where
each $x_i = (x_{i1}, \ldots, x_{id})^T$, we estimate $\mu_j$ using the marginal data $x_{1j}, \ldots, x_{nj}$.
The estimator is defined as follows. Suppose we want to estimate the mean of a random variable
$Z$. Let $z_1, \ldots, z_n$ be $n$ independent realizations of $Z$ and $\psi : \mathbb{R} \to \mathbb{R}$ be a continuous and strictly
increasing function satisfying $-\log(1 - x + x^2/2) \le \psi(x) \le \log(1 + x + x^2/2)$.
The estimator for the mean of $Z$ is defined as the unique value $\hat{\mu}$ such that
$$\sum_{i=1}^{n} \psi\big(\alpha_\delta (z_i - \hat{\mu})\big) = 0, \tag{5}$$
where $\delta$ and $\alpha_\delta$ are two parameters chosen adaptively from the data. For the choices of $\psi$, $\delta$ and
$\alpha_\delta$, see Catoni (2012) for more detailed discussions.
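Because $\psi$ is strictly increasing, the left-hand side of (5) is strictly decreasing in the unknown mean, so the root can be found by bisection. The following is a minimal sketch under stated assumptions: it takes $\psi$ to be the widest function allowed by the two-sided bound above, and the tuning rule in `simple_alpha` is a simplified stand-in chosen for illustration, not the adaptive rule of Catoni (2012). All function names are hypothetical.

```python
import math

def psi(x):
    """Upper envelope for x >= 0 and lower envelope for x < 0 of the bound on psi."""
    if x >= 0.0:
        return math.log(1.0 + x + 0.5 * x * x)
    return -math.log(1.0 - x + 0.5 * x * x)

def catoni_mean(z, alpha, tol=1e-10, max_iter=200):
    """Solve sum_i psi(alpha * (z_i - mu)) = 0 for mu by bisection on [min(z), max(z)]."""
    def score(mu):
        return sum(psi(alpha * (zi - mu)) for zi in z)
    lo, hi = min(z), max(z)  # score(lo) >= 0 >= score(hi), and score is decreasing
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simple_alpha(n, variance_bound, delta):
    """Assumed simplified tuning: alpha = sqrt(2 log(1/delta) / (n * v))."""
    return math.sqrt(2.0 * math.log(1.0 / delta) / (n * variance_bound))

data = [1.0, 2.0, 3.0, 4.0, 1000.0]   # one gross outlier
estimate = catoni_mean(data, alpha=1.0)
```

On this toy sample the ordinary sample mean is 202, whereas the estimate stays below 10, because $\psi$ grows only logarithmically and thus limits the influence of the outlier.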
For $n$ samples $x_1, x_2, \ldots, x_n$ independently drawn from a random vector $X \in \mathbb{R}^d$, let $E(X) =
(\mu_1, \ldots, \mu_d)^T$. Choosing $\delta = d^{-2}/2$, we exploit the estimator in (5) to estimate the marginal means
$\mu_1, \ldots, \mu_d$. Theoretically, Catoni (2012) shows that, with probability at least $1 - O(d^{-1})$, we have
$$\max_{1 \le j \le d} |\hat{\mu}_j - \mu_j| \le C \sqrt{\frac{\log d}{n}}. \tag{6}$$
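The probability level in (6) can be traced back to the choice of $\delta$ through a union bound. The sketch below assumes that the single-coordinate guarantee of Catoni (2012) holds with probability at least $1 - 2\delta$ and has the form shown in the first line, where $v$ is an upper bound on the marginal variances and $C_0$ is an absolute constant. Both this form and the symbols $v$ and $C_0$ are assumptions made here for illustration; the original reference should be consulted for the exact statement.

```latex
\begin{align*}
&\text{For each fixed } j:\quad
\Pr\!\left( |\hat{\mu}_j - \mu_j| > C_0 \sqrt{\frac{v \log(1/\delta)}{n}} \right) \le 2\delta, \\
&\text{Union bound over } j = 1, \ldots, d:\quad
\Pr\!\left( \max_{1 \le j \le d} |\hat{\mu}_j - \mu_j| > C_0 \sqrt{\frac{v \log(1/\delta)}{n}} \right) \le 2 d \delta, \\
&\text{With } \delta = d^{-2}/2:\quad
2 d \delta = d^{-1}, \qquad
\log(1/\delta) = 2 \log d + \log 2 \le 3 \log d \quad (d \ge 2).
\end{align*}
```

Under these assumptions the deviation bound takes the form $C \sqrt{\log d / n}$ with $C = C_0 \sqrt{3v}$, and the failure probability is at most $d^{-1}$.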
B.2 Estimating Latent Covariance Matrices
In order to estimate the pooled covariance matrix of X, we need to estimate the latent covariance
matrix Σm of each Xm.
For a transelliptically distributed random vector $X \sim TE_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$, it is easy to see
that the sample covariance matrix is not a consistent estimator of $\Sigma$ due to the transformations
$f_1, \ldots, f_d$. It has been demonstrated in Han and Liu (2012) that we can efficiently estimate the latent
correlation matrix, i.e., the correlation matrix of the latent Gaussian random vector $Y$ associated
with $X$. More specifically, we make use of the Kendall tau correlation matrix as defined below.
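Definition B.1. The sample Kendall tau correlation matrix $\hat{\mathbf{C}} = [\hat{\rho}_{jk}] \in \mathbb{R}^{d \times d}$ is defined as
$$\hat{\rho}_{jk} = \sin\Big(\frac{\pi}{2} \hat{\tau}_{jk}\Big), \tag{7}$$
for all $j, k \in \{1, \ldots, d\}$, where
$$\hat{\tau}_{jk} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} \mathrm{sign}\big((x_{ij} - x_{i'j})(x_{ik} - x_{i'k})\big)$$
if $j \ne k$, and $\hat{\tau}_{jk} = 1$ otherwise.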
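Definition B.1 can be implemented directly from the pairwise sign statistic. The following is a minimal sketch with an $O(n^2 d^2)$ double loop, intended to mirror the formula rather than to be efficient; the function names and the toy data are illustrative.

```python
import math

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def kendall_tau(xj, xk):
    """2 / (n (n - 1)) times the sum over pairs i < i' of sign((x_ij - x_i'j)(x_ik - x_i'k))."""
    n = len(xj)
    total = 0
    for i in range(n):
        for ip in range(i + 1, n):
            total += sign((xj[i] - xj[ip]) * (xk[i] - xk[ip]))
    return 2.0 * total / (n * (n - 1))

def kendall_correlation_matrix(samples):
    """samples: list of n rows of length d. Returns the d x d matrix of sin(pi/2 * tau_jk)."""
    d = len(samples[0])
    cols = [[row[j] for row in samples] for j in range(d)]
    C = [[1.0] * d for _ in range(d)]  # the diagonal is 1 by definition
    for j in range(d):
        for k in range(j + 1, d):
            rho = math.sin(0.5 * math.pi * kendall_tau(cols[j], cols[k]))
            C[j][k] = C[k][j] = rho
    return C

# Column 1 is the cube of column 0 (increasing); column 2 decreases as column 0 increases.
samples = [[1.0, 1.0, 5.0], [2.0, 8.0, 4.0], [3.0, 27.0, 0.0], [4.0, 64.0, -1.0]]
C_hat = kendall_correlation_matrix(samples)
```

Since the statistic depends on the data only through the ordering within each coordinate, applying any strictly increasing transformation to a column leaves `C_hat` unchanged, which is why it is suited to the unknown monotone functions $f_1, \ldots, f_d$ in the transelliptical model.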
The next proposition from Han and Liu (2012) shows that the Kendall tau correlation matrix
enjoys a parametric rate of convergence in the high-dimensional setting with respect to the $\ell_{\max}$ norm.
Proposition B.2. Let $x_1, \ldots, x_n$ be $n$ independent samples of a random vector $X \in \mathbb{R}^d$ following
a transelliptical distribution $X \sim TE_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$. Let $\mathbf{C}$ be the latent correlation matrix,
i.e., the correlation matrix of the latent Gaussian random vector $Y$ associated with $X$, and let
$\hat{\mathbf{C}}$ be the Kendall tau correlation matrix introduced in (7). We have, with probability at least
$1 - O(d^{-1})$,
$$\|\hat{\mathbf{C}} - \mathbf{C}\|_{\max} \le C \sqrt{\frac{\log d}{n}}, \tag{8}$$
where $C$ is a generic constant which does not depend on $d$ and $n$.
In our application, we need to estimate the latent covariance matrix. A direct approach is to
use the relationship between the correlation matrix $\mathbf{C} = [\rho_{jk}]$ and the covariance matrix $\Sigma$, namely
$\Sigma_{jk} = \rho_{jk} \sigma_j \sigma_k$, where each $\sigma_j$ is the marginal standard deviation of $X_j$.
Next, we construct an estimator for the marginal standard deviations based on (5). Given
$n$ samples $z_1, \ldots, z_n$ independently drawn from a random variable $Z$, we first estimate the marginal
mean by the estimator defined in (5). Then, we use the same estimator to estimate the mean of
$Z^2$ using $z_1^2, \ldots, z_n^2$. Denoting the estimated means of $Z$ and $Z^2$ by $\hat{\mu}$ and $\hat{M}$ respectively, we
construct an estimator of the standard deviation of $Z$ by
$$\hat{\sigma} := \sqrt{\max\{\hat{M} - \hat{\mu}^2,\ \varepsilon\}}, \quad \text{where } \varepsilon > 0 \text{ is a small positive number.} \tag{9}$$
Denote the estimated marginal standard deviations of $X_1, \ldots, X_d$ by $\hat{\sigma}_1, \ldots, \hat{\sigma}_d$. It is easy to see that,
with probability at least $1 - O(d^{-1})$,
$$|\hat{\sigma}_j - \sigma_j| \le C \sqrt{\log d / n} \quad \text{for all } j. \tag{10}$$
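Combining the M-estimator for the standard deviations with the Kendall tau correlation matrix
defined in (7) gives us a covariance matrix estimator $\hat{\Sigma} = [\hat{\Sigma}_{jk}] \in \mathbb{R}^{d \times d}$, where
$$\hat{\Sigma}_{jk} = \hat{\sigma}_j \hat{\sigma}_k \hat{\rho}_{jk}, \quad \text{for all } 1 \le j, k \le d, \tag{11}$$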
where $\hat{\rho}_{jk}$ is the Kendall tau correlation defined in (7). We will show in the next sections that this
estimator of covariance matrix enjoys a parametric rate of convergence in the family of transelliptical
distributions and is robust in more complex settings.
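The plug-in standard deviation in (9) and the assembly step in (11) can be sketched as follows. This is an illustration under explicit assumptions: `mean_estimator` is a hypothetical hook for the M-estimator in (5), and the demo passes the ordinary sample mean (`statistics.fmean`) as a stand-in only to keep the example short. The correlation matrix `corr_hat` is supplied by hand rather than computed from (7).

```python
import math
from statistics import fmean

def robust_std(z, mean_estimator=fmean, eps=1e-8):
    """sqrt(max(M_hat - mu_hat^2, eps)), where M_hat estimates the mean of Z^2."""
    mu_hat = mean_estimator(z)
    m_hat = mean_estimator([zi * zi for zi in z])
    # The floor eps guards against a zero or negative variance estimate.
    return math.sqrt(max(m_hat - mu_hat * mu_hat, eps))

def covariance_from_parts(sigma_hat, corr_hat):
    """Entries sigma_j * sigma_k * rho_jk, i.e. D C D with D = diag(sigma_hat)."""
    d = len(sigma_hat)
    return [[sigma_hat[j] * sigma_hat[k] * corr_hat[j][k] for k in range(d)]
            for j in range(d)]

# Stand-in marginal samples and a hand-picked correlation matrix, for illustration only.
sigma_hat = [robust_std([1.0, 2.0, 3.0, 4.0]), robust_std([2.0, 2.0, 2.0])]
corr_hat = [[1.0, 0.5], [0.5, 1.0]]
Sigma_hat = covariance_from_parts(sigma_hat, corr_hat)
```

The second coordinate in the demo is constant, so its variance estimate is zero and the floor $\varepsilon$ becomes active; this is the situation the maximum in (9) is designed to handle.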
After estimating the latent mean-adjusted covariance matrix and mean of each $X_m$, we estimate
the pooled mean-adjusted covariance matrix $\mathbf{S}$. Suppose that for each $m = 1, \ldots, M$, we have $s_m$
samples of $X_m$. Let $S = \sum_{m=1}^{M} s_m$. The estimator for the pooled latent mean-adjusted covariance
matrix $\mathbf{S}$ is constructed as
$$\hat{\mathbf{S}} = \sum_{m=1}^{M} \frac{s_m}{S} \big(\hat{\Sigma}_m + \hat{\mu}_m \hat{\mu}_m^T\big), \tag{12}$$
where the $\hat{\Sigma}_m$ and $\hat{\mu}_m$ are estimated by (11) and (5) respectively, and $\hat{\mu} = \sum_{m=1}^{M} \frac{s_m}{S} \hat{\mu}_m$. It follows
immediately that, with probability at least $1 - O(d^{-1})$,
$$\|\hat{\mathbf{S}} - \mathbf{S}\|_{\max} \le C \sqrt{\log d / n}. \tag{13}$$
B.3 Estimating Leading Topics
As we have discussed in Section 2.2, the topics of the random vector $X \sim \sum_{m=1}^{M} \pi_m X_m$, where each
$X_m \sim TE_d(\mu_m, \Sigma_m; Z, f^{(m)}_1, \ldots, f^{(m)}_d)$, are defined as the leading eigenvectors of the pooled latent
mean-adjusted covariance matrix $\mathbf{S}$ defined in (1). We further assume that the leading eigenvectors
$v_1, \ldots, v_T$ are $s$-sparse, i.e., $\|v_t\|_0 \le s$ for each $t = 1, \ldots, T$.
Given the estimators $\hat{\Sigma}_m$ of the latent covariance matrices defined in (11), we first analyze
the concentration of the spectral norm of $\hat{\Sigma}_m - \Sigma_m$, where $\Sigma_m$ is the covariance matrix of the latent
Gaussian mixture random vector $Y_m$ associated with $X_m$.
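Theorem B.3. Given $n$ i.i.d. samples $x_1, \ldots, x_n$ of a random vector $X \in \mathbb{R}^d$ that follows a
transelliptical distribution, i.e., $X \sim TE_d(\mu, \Sigma; Z, f_1, \ldots, f_d)$, let $\hat{\sigma} = (\hat{\sigma}_1, \ldots, \hat{\sigma}_d)^T$ be the estimated
marginal standard deviations derived from Catoni’s estimator, let $\hat{\mathbf{C}} = [\hat{\rho}_{jk}] \in \mathbb{R}^{d \times d}$ be the estimated
Kendall tau correlation matrix defined in (7), and let $\hat{D} = \mathrm{diag}(\hat{\sigma})$ and $\hat{\Sigma} = \hat{D} \hat{\mathbf{C}} \hat{D}$. Under the “sign
sub-Gaussian condition” (Han and Liu, 2013), we have, with probability at least $1 - O(d^{-1})$,
$$\|\hat{\Sigma} - \Sigma\|_2 \le C \sqrt{\frac{d \log d}{n}}, \tag{14}$$
where $C$ is a constant.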
Furthermore, let $\eta(\hat{\Sigma} - \Sigma, s) = \sup_{v \in S^{d-1} \cap B_0(s)} v^T (\hat{\Sigma} - \Sigma) v$, with probability at least $1 - O(d^{-1})$,