Multi-task Quantile Regression under the Transnormal
Model
Jianqing Fan, Lingzhou Xue and Hui Zou
Princeton University, Pennsylvania State University and University of Minnesota
Abstract
We consider estimating multi-task quantile regression under the transnormal model, with a
focus on the high-dimensional setting. We derive a surprisingly simple closed-form solution
through rank-based covariance regularization. In particular, we propose the rank-based
$\ell_1$ penalization with positive definite constraints for estimating sparse covariance
matrices, and the rank-based banded Cholesky decomposition regularization for estimating
banded precision matrices. By taking advantage of the alternating direction method of
multipliers, a nearest correlation matrix projection is introduced that inherits the sampling
properties of the unprojected rank correlation matrix. Our work combines the strengths of
quantile regression and rank-based covariance regularization to simultaneously deal with
nonlinearity and nonnormality in high-dimensional regression. Furthermore, the proposed
method strikes a good balance between robustness and efficiency, achieves the "oracle"-like
convergence rate, and provides a provable prediction interval in the high-dimensional setting.
The finite-sample performance of the proposed rank-based method is examined in simulation
studies and demonstrated in a real application to protein mass spectroscopy data.
Key Words: Copula model; Optimal transformation; Rank correlation; Cholesky decompo-
sition; Quantile regression; Prediction interval; Alternating direction method of multipliers.
1 Introduction
Consider a multi-task high-dimensional learning paradigm in which independent variables
$z = (z_1, \ldots, z_{p_0})'$ are used to simultaneously predict multiple response variables
$y = (y_1, \ldots, y_{q_0})'$,
where both dimensions $p_0$ and $q_0$ can be of a larger order of magnitude than the sample
size $n$. In this work, we are interested in estimating optimal transformations
$t = (t_1, \ldots, t_{q_0})'$ such that $t(z) = (t_1(z), \ldots, t_{q_0}(z))'$ optimally predicts $y$.
Namely, we shall solve for the optimal transformations from
\[
\min_{t:\,\mathbb{R}^{p_0} \mapsto \mathbb{R}^{q_0}} \ \sum_{j=1}^{q_0} E\big[L(y_j - t_j(z))\big],
\]
where $L(\cdot)$ is a convex loss function. If $z$ and $y$ have a joint normal distribution, it is
appropriate to specify $L(\cdot)$ as the squared loss, i.e., $L(u) = u^2$. With the squared loss,
normality yields a neat connection between the optimal transformations and ordinary least
squares: nice properties such as linearity and homoscedasticity hold for the optimal
transformations, and thus the optimal transformations can be easily solved for by ordinary
least squares.
However, observed data are often skewed or heavy-tailed, and rarely normally distributed
in real-world applications. Transformations, such as the celebrated Box-Cox transformation,
are commonly used in practice to achieve normality in regression analysis. Under the classical
low-dimensional setting, estimating transformations for regression has received considerable
attention in the statistical literature. On the one hand, parametric methods have been proposed by
focusing on the parametric families of transformations, for example, Box & Tidwell (1962),
Box & Cox (1964), and others. On the other hand, nonparametric estimation of regression
transformations has also been studied, for instance, projection pursuit regression (Friedman &
Stuetzle 1981), alternating conditional expectation (Breiman & Friedman 1985), additivity
and variance stabilization (Stone 1985, Tibshirani 1988), among others. Although these
methods work well under the classical setting, it is nontrivial to extend them to estimate
optimal transformations in high dimensions. Under the high-dimensional setting, such para-
metric and nonparametric methods would suffer from the curse of dimensionality. Therefore,
there are significant demands for relaxing normality when estimating optimal transforma-
tions for high-dimensional regression.
As a nice combination of flexibility and interpretability, Gaussian copulas have generated
a lot of interest in statistics and econometrics. The semiparametric Gaussian copula model
is deemed a favorable alternative to the Gaussian model in several high-dimensional
statistical problems, including linear discriminant analysis (Lin & Jeon 2003, Mai & Zou
2012), quadratic discriminant analysis (Fan et al. 2013), graphical modeling (Liu et al.
2009, 2012, Xue & Zou 2012), covariance matrix estimation (Xue & Zou 2014), and prin-
cipal component analysis (Han & Liu 2012). The semiparametric Gaussian copula model
provides a semiparametric generalization of the Gaussian model by assuming the existence
of univariate monotone transformations $f = (f_1, \ldots, f_p)$ for $x = (x_1, \ldots, x_p)'$ such that
$f(x) = (f_1(x_1), \ldots, f_p(x_p))' \sim N_p(\mu^\star, \Sigma^\star)$. Throughout this paper, we follow Lin & Jeon
(2003) and call this copula model the transnormal model; it is also called the nonparanormal
model in Liu et al. (2009). In this work, we will show the power of the transnormal
model in estimating optimal transformations for high-dimensional regression modeling.
We now suppose that $x = (z', y')'$ consists of $z = (x_1, \ldots, x_{p_0})'$ and
$y = (x_{p_0+1}, \ldots, x_{p_0+q_0})'$ with $p = p_0 + q_0$. The transnormal model entails the existence of
monotone transformations $f = (g, h) = (f_1, \ldots, f_{p_0}, f_{p_0+1}, \ldots, f_{p_0+q_0})$ such that
\[
f(x) = \begin{pmatrix} g(z) \\ h(y) \end{pmatrix} \sim N_p\left(\mu^\star = \begin{pmatrix} \mu^\star_z \\ \mu^\star_y \end{pmatrix},\ \Sigma^\star = \begin{pmatrix} \Sigma^\star_{zz} & \Sigma^\star_{zy} \\ \Sigma^\star_{yz} & \Sigma^\star_{yy} \end{pmatrix}\right), \tag{1}
\]
where we may assume that $\mu^\star = 0$ and $\Sigma^\star$ is a correlation matrix. Note that marginal
normality is achieved by the transformations, so the transnormal model essentially assumes
joint normality of the marginally normal-transformed variables. The transnormal model strikes a
better balance between model robustness and interpretability than the normal model. In this
work, our aim is to estimate transformations of predictors, t(z) : Rp0 7→ Rq0 , to optimally
predict multiple response variables y under the transnormal model (1).
Given the underlying transformations $g$ and $h$, we can easily derive the explicit conditional
distribution of $h(y)$ given $g(z)$, namely
\[
h(y) \mid g(z) \sim N_{q_0}\Big(\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\, g(z),\ \Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\Big). \tag{2}
\]
But, since g and h are unknown, it is challenging to obtain the explicit conditional distri-
bution of y given z. Unlike the optimal transformations under the Gaussian model, nice
properties such as linearity and homoscedasticity do not hold for the optimal transforma-
tions under the transnormal model. In the presence of nonlinear transformations g and h,
it is important to take into account the coordinatewise nonlinearity among the conditional
distributions of y given z. We follow the same spirit of quantile regression (Koenker 2005)
to effectively deal with such nonlinearity.
Quantile regression was first introduced in the seminal paper of Koenker & Bassett (1978),
and it has since received much attention in topics such as survival analysis
(Koenker & Geling 2001), time series analysis (Koenker & Xiao 2006), growth chart analysis
(Wei & He 2006, Wei et al. 2006), microarray analysis (Wang & He 2007), variable selection
(Zou & Yuan 2008, Bradic et al. 2011), among others. Denote by $\rho_\tau(u) = u \cdot (\tau - I_{u \le 0})$
the check loss function (Koenker & Bassett 1978). Then we consider multi-task quantile
regression:
\[
\min_{t:\,\mathbb{R}^{p_0} \mapsto \mathbb{R}^{q_0}} \ \sum_{j=1}^{q_0} E\big[\rho_\tau(y_j - t_j(z))\big]. \tag{3}
\]
The use of the $\ell_1$ loss in prediction was recommended in Friedman (2001, 2002). Our work
shares a similar philosophy with Friedman (2001, 2002), and includes the $\ell_1$ loss as a special
case. In fact, problem (3) with $\tau = \frac{1}{2}$ reduces to median regression, namely,
\[
\min_{t:\,\mathbb{R}^{p_0} \mapsto \mathbb{R}^{q_0}} \ \sum_{j=1}^{q_0} E\Big[\rho_{\tau=\frac12}(y_j - t_j(z))\Big]
= \frac{1}{2}\min_{t:\,\mathbb{R}^{p_0} \mapsto \mathbb{R}^{q_0}} E\big[\|y - t(z)\|_{\ell_1}\big].
\]
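To make this connection concrete, the following short numerical sketch (our own illustration in Python; the sample, grid and variable names are not from the paper) checks that minimizing the empirical check loss over a constant fit recovers the sample $\tau$-quantile, and that $\tau = \frac12$ corresponds to the $\ell_1$ (median) fit.

import numpy as np

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - 1{u <= 0})
    return u * (tau - (u <= 0))

rng = np.random.default_rng(0)
y = rng.standard_normal(1000) ** 3            # a skewed, heavy-tailed sample
tau = 0.75

# minimize the empirical check loss over a constant fit t on a fine grid
grid = np.linspace(y.min(), y.max(), 2001)
risk = np.array([check_loss(y - t, tau).mean() for t in grid])
print(grid[np.argmin(risk)], np.quantile(y, tau))   # the two values nearly agree

With tau = 0.5 the check loss is half the absolute loss, so the minimizer is the sample median, in line with the median regression display above.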
In this paper, we will show that with the aid of rank-based covariance regularization, the
optimal transformations in multi-task quantile regression can be efficiently estimated under
the transnormal model. Although estimating optimal transformations is very difficult in
general, we show, perhaps surprisingly, that a closed-form solution for the optimal
transformations can be derived from (2) without using any smoothing techniques as in
Friedman & Stuetzle (1981), Breiman & Friedman (1985), Stone (1985) or Tibshirani (1988).
The key ingredient of our proposed method is the positive definite regularized estimation of
large covariance matrices under the transnormal model. We introduce two novel rank-based
covariance regularization methods to deal with two popular covariance structures: the
rank-based positive definite $\ell_1$ penalization for estimating sparse covariance matrices, and
the rank-based banded Cholesky decomposition regularization for estimating banded inverse
covariance matrices. Our proposed rank-based covariance regularization critically depends
on a correlation matrix that retains the desired sampling properties of the adjusted
Spearman's or Kendall's rank correlation matrix (Kendall 1948, Liu et al. 2012, Xue & Zou 2012).
The aforementioned correlation matrix is not necessarily positive definite (Devlin et al.
1975). We therefore propose a new nearest correlation matrix projection that inherits the
required sampling properties of the adjusted Spearman's or Kendall's rank correlation matrix
and can be solved by an efficient alternating direction method of multipliers. By combining
the strengths of quantile regression modeling and rank-based covariance regularization,
we can simultaneously address the issues of nonlinearity, non-normality and high dimensionality
in estimating optimal transformations for high-dimensional regression. In particular, our
proposed method achieves the "oracle"-like convergence rate and provides a provable prediction
interval in the high-dimensional setting, where the dimension can be of a nearly exponential
order of the sample size.
The rest of this paper is organized as follows. We first present the methodological details
of optimal prediction in multi-task quantile regression in Section 2. Section 3 establishes
the theoretical properties of our proposed method under the transnormal model. Section 4
contains simulation results and a real application to analyze the protein mass spectroscopy
data. Technical proofs are presented in the appendices.
2 Multi-task quantile regression: model and method
This section presents the complete methodological details for solving the optimal predictions
in multi-task quantile regression, i.e., $\min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} \sum_{j=1}^{q_0} E[\rho_\tau(y_j - t_j(z))]$, where $y$ and
$z$ jointly follow a transnormal model. The transnormal family of distributions allows us
to obtain a closed-form solution of the optimal predictions for multi-task quantile regression,
which is appealing and powerful for simultaneously dealing with non-normality and high
dimensionality. The transnormal model retains the nice interpretation of the normal model,
and enables us to make good use of normal theory. Moreover, the monotone
transformation is easy to handle for quantile estimation.
2.1 The closed-form solution
Let $Q_\tau(y_j|z)$ be the $\tau$-th quantile of the conditional distribution of $y_j$ given $z$, which
is the analytical solution to $\min_{t_j:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}} E[\rho_\tau(y_j - t_j(z))]$. Denote by $Q_\tau(y|z)$ the $\tau$-th
equicoordinate quantile of the conditional distribution of $y$ given $z$, namely,
$Q_\tau(y|z) = (Q_\tau(y_1|z), \ldots, Q_\tau(y_{q_0}|z))'$. Thus, the $\tau$-th equicoordinate conditional quantile
$Q_\tau(y|z)$ is the exact solution to (3), i.e.,
\[
Q_\tau(y|z) = \arg\min_{t:\,\mathbb{R}^{p_0}\mapsto\mathbb{R}^{q_0}} \ \sum_{j=1}^{q_0} E\big[\rho_\tau(y_j - t_j(z))\big].
\]
By using the fact that $h$ and $g$ are monotone under the transnormal model, we have
\[
Q_\tau(y|z) = Q_\tau(y|g(z)) = h^{-1}\big(Q_\tau(h(y)|g(z))\big).
\]
Since $h$ is monotonically nondecreasing, it now follows from (2) that
\[
Q_\tau(y|z) = h^{-1}\Big(\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\, g(z) + \Phi^{-1}(\tau)\, \mathrm{vdiag}\big(\Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\big)\Big), \tag{4}
\]
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution and
$\mathrm{vdiag}(A)$ denotes the vector formed by the diagonal elements of $A$. Therefore, the
semiparametric Gaussian copula model enables us to obtain the closed-form solution to
multi-task quantile regression.
Moreover, the closed-form solution (4) can be used to construct prediction intervals for
predicting $y$ given $z$ in high dimensions. To be more specific, we can obtain the closed-form
$100(1-\tau)\%$ prediction interval as
\[
\big[Q_{\tau/2}(y|z),\ Q_{1-\tau/2}(y|z)\big].
\]
For different values of $\tau$, we need only adjust the value of $\Phi^{-1}(\tau)$. This is another appealing
feature of the transnormal model.
The closed-form solution $Q_\tau(y|z)$ uses the true covariance matrix and transformations,
and thus it is not a feasible estimator. To utilize (4), we need to estimate the covariance
matrix $\Sigma^\star$ and the transformation functions $f$. It turns out that these two tasks are
relatively easy under the transnormal model. Section 2.2 describes how to estimate the
transformation functions $f$, and Section 2.3 shows how to estimate the structured covariance
matrix in the high-dimensional setting. With the estimated transformations $\hat f = (\hat g, \hat h)$ and
a structured covariance matrix estimator $\hat\Sigma$, we can derive the following plug-in estimator:
\[
\hat Q_\tau(y|z) = \hat h^{-1}\Big(\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\, \hat g(z) + \Phi^{-1}(\tau)\, \mathrm{vdiag}\big(\hat\Sigma_{yy} - \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}\big)\Big). \tag{5}
\]
Given the plug-in estimator (5), we further estimate the $100(1-\tau)\%$ prediction interval as
$\big[\hat Q_{\tau/2}(y|z),\ \hat Q_{1-\tau/2}(y|z)\big]$.
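As a computational companion to (5), here is a minimal Python sketch of the plug-in estimator. It assumes the ingredients of Sections 2.2-2.3 are already available (an estimated correlation matrix with the $z$-block ordered first, estimated monotone transforms for the predictors, and estimated inverse transforms for the responses); the function and variable names are ours rather than the paper's.

import numpy as np
from scipy.stats import norm

def plugin_quantile(z_new, tau, Sigma_hat, g_hat, h_inv_hat, p0):
    """Plug-in estimator (5): tau-th equicoordinate conditional quantile of y given z."""
    Szz, Szy = Sigma_hat[:p0, :p0], Sigma_hat[:p0, p0:]
    Syz, Syy = Sigma_hat[p0:, :p0], Sigma_hat[p0:, p0:]
    gz = np.array([g_hat[j](z_new[j]) for j in range(p0)])      # g_hat(z)
    cond_mean = Syz @ np.linalg.solve(Szz, gz)                   # Sigma_yz Sigma_zz^{-1} g_hat(z)
    cond_disp = np.diag(Syy - Syz @ np.linalg.solve(Szz, Szy))   # vdiag of the conditional covariance, as in (5)
    L = cond_mean + norm.ppf(tau) * cond_disp                    # the argument of h_hat^{-1} in (5)
    return np.array([h_inv_hat[j](L[j]) for j in range(len(L))])

# 100(1 - tau)% prediction interval, coordinatewise:
# lower = plugin_quantile(z_new, tau / 2, Sigma_hat, g_hat, h_inv_hat, p0)
# upper = plugin_quantile(z_new, 1 - tau / 2, Sigma_hat, g_hat, h_inv_hat, p0)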
Remark 1. When $\tau = \frac{1}{2}$, the problem reduces to estimating the optimal transformations in
multi-task median regression. By using the simple fact that $\Phi^{-1}(\frac{1}{2}) = 0$, the solution can be
further simplified as
\[
Q_{\tau=\frac12}(y|z) = h^{-1}\big(\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\, g(z)\big).
\]
Remark 2. Compared to multi-task mean regression, multi-task quantile regression (including
median regression) is much more robust against outliers in the measurements, in addition to
delivering the closed-form solution (4) under the transnormal model. In contrast, by solving
ordinary least squares, multi-task mean regression targets
\[
E(y|z) = E(y|g(z)) = E\big(h^{-1}(h(y))\,\big|\,g(z)\big).
\]
But unlike quantile regression, this cannot be simplified further unless $h$ is linear.
Remark 3. The nonlinearity of the transformations $h(\cdot)$ in (4) makes the difference of
conditional quantiles at different values of $\tau$ depend on $z$, and thus models the effect of
heteroscedasticity. In contrast, Wu et al. (2010), Zhu et al. (2012) and Fan et al. (2013) model
the heteroscedastic effect using single-index or semiparametric quantile regression by imposing
the model $y = \mu(z'\beta) + \sigma(z'\beta)\cdot\varepsilon$, where $\sigma(z'\beta)\cdot\varepsilon$ is a heteroscedastic
error and $\varepsilon$ is independent of $z$. Unlike the closed-form expression (4), it is difficult to
employ semiparametric quantile regression to simultaneously deal with nonlinearity and high
dimensionality.
2.2 Estimation of transformation
Note that $f_j(x_j) \sim N(0, 1)$ under the transnormal model for any $j$. Hence, the cumulative
distribution function of $x_j$ admits the form
\[
F_j(x_j) = \Phi(f_j(x_j)), \quad \text{or} \quad f_j(x_j) = \Phi^{-1}(F_j(x_j)). \tag{6}
\]
Equation (6) suggests a simple estimator of the transformation function $f_j$. Let $\hat F_j(\cdot)$ be the
empirical estimator of $F_j(\cdot)$, i.e., $\hat F_j(u) = \frac{1}{n}\sum_{i=1}^n I_{x_{ij} \le u}$. Define the Winsorized empirical
distribution function $\tilde F_j(\cdot)$ as
\[
\tilde F_j(u) = \delta_n \cdot I_{\hat F_j(u) < \delta_n} + \hat F_j(u) \cdot I_{\delta_n \le \hat F_j(u) \le 1-\delta_n} + (1-\delta_n) \cdot I_{\hat F_j(u) > 1-\delta_n}, \tag{7}
\]
where $\delta_n$ is the Winsorization parameter, which avoids infinite values and achieves a better
bias-variance tradeoff. Following Mai & Zou (2012), we specify the Winsorization parameter as
$\delta_n = \frac{1}{n^2}$, which facilitates both the theoretical analysis and the practical performance.
We can then estimate the transformation functions $f$ in the transnormal model as
\[
\hat f = (\hat f_1, \ldots, \hat f_p) = (\Phi^{-1} \circ \tilde F_1, \ldots, \Phi^{-1} \circ \tilde F_p).
\]
Note that these estimators are nondecreasing and can be substituted into (4) to estimate
the multi-task quantile regression function.
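The estimator in (6)-(7) is simple enough to state directly in code. The following Python sketch is our own illustration: it builds the Winsorized empirical distribution function with $\delta_n = 1/n^2$ and composes it with $\Phi^{-1}$. An empirical-quantile inverse is included as well; it is one natural choice for the $\hat h^{-1}$ needed in (5), although the paper does not prescribe a specific implementation.

import numpy as np
from scipy.stats import norm

def estimate_transform(x_col):
    """f_hat_j = Phi^{-1} composed with the Winsorized empirical CDF of (6)-(7)."""
    n = len(x_col)
    delta_n = 1.0 / n ** 2                      # Winsorization parameter, delta_n = 1/n^2
    x_sorted = np.sort(x_col)

    def f_hat(u):
        ecdf = np.searchsorted(x_sorted, u, side="right") / n   # empirical CDF F_hat_j(u)
        return norm.ppf(np.clip(ecdf, delta_n, 1.0 - delta_n))  # Winsorize, then apply Phi^{-1}
    return f_hat

def estimate_inverse_transform(x_col):
    """One simple choice for h_hat_j^{-1}(t) = F_j^{-1}(Phi(t)): the empirical quantile."""
    return lambda t: np.quantile(x_col, norm.cdf(t))

# Columnwise estimates from an n x p data matrix X:
# f_hats = [estimate_transform(X[:, j]) for j in range(X.shape[1])]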
2.3 Estimation of correlation matrix
We present two covariance regularization methods based on i.i.d. transnormal data $x_1, \ldots, x_n$:
the rank-based positive definite $\ell_1$ penalization for estimating sparse covariance matrices, and
the rank-based banded Cholesky decomposition regularization for estimating banded precision
matrices. Both estimators are guaranteed to be positive definite, and they can be used in
(5) to estimate the optimal transformations in multi-task quantile regression.
2.3.1 Positive definite sparse correlation matrix
Sparse covariance matrices are widely used in many applications where variables are permu-
tation invariant. By truncating small entries to zero, thresholding (Bickel & Levina 2008b,
Rothman et al. 2009, Fan et al. 2013) is a powerful approach to encourage (conditional) spar-
sity in estimating large covariance matrices. However, the resulting estimator may not be
positive definite in practice. Xue et al. (2012) proposed a computationally efficient positive-definite
$\ell_1$-penalized covariance estimator to address the indefiniteness issue of thresholding.
Now we extend Xue et al. (2012) to the transnormal model. First, we introduce the
"oracle" positive-definite $\ell_1$-penalized estimator using the "oracle" transformations $f$, i.e.,
\[
\hat\Sigma^o_{\ell_1} = \arg\min_{\Sigma} \ \frac{1}{2}\|\Sigma - \hat R^o\|_F^2 + \lambda\|\Sigma\|_{1,\mathrm{off}}
\quad \text{subject to } \operatorname{diag}(\Sigma) = 1;\ \Sigma \succeq \varepsilon I,
\]
where $\hat R^o$ is the "oracle" sample correlation matrix of the "oracle" data $f(x_1), \ldots, f(x_n)$,
and $\|\Sigma\|_{1,\mathrm{off}}$ is the $\ell_1$ norm of all off-diagonal elements of $\Sigma$.
Motivated by $\hat\Sigma^o_{\ell_1}$, we can use the adjusted Spearman's or Kendall's rank correlation
matrix (Kendall 1948) to derive a correlation matrix that is comparable with $\hat R^o$. For ease
of presentation, we focus on the adjusted Spearman's rank correlation matrix throughout
this paper, since the same analysis can be adapted to the adjusted Kendall's rank correlation
matrix. Let $r_j = (r_{1j}, r_{2j}, \ldots, r_{nj})'$ be the ranks of $(x_{1j}, x_{2j}, \ldots, x_{nj})'$. Denote by
$\hat r_{jl} = \mathrm{corr}(r_j, r_l)$ the Spearman's rank correlation, and by $\hat r^s_{jl} = 2\sin(\frac{\pi}{6}\hat r_{jl})$ the adjusted
Spearman's rank correlation. It is well known that $\hat r^s_{jl}$ corrects the bias of $\hat r_{jl}$ (Kendall 1948).
Now we consider the rank correlation matrix $\hat R^s = (\hat r^s_{jl})_{p\times p}$, which does not require
estimating any transformation function. The rank-based positive definite $\ell_1$ penalization is
then as follows:
\[
\hat\Sigma^s_{\ell_1} = \arg\min_{\Sigma} \ \frac{1}{2}\|\Sigma - \hat R^s\|_F^2 + \lambda\|\Sigma\|_{1,\mathrm{off}}
\quad \text{subject to } \operatorname{diag}(\Sigma) = 1;\ \Sigma \succeq \varepsilon I.
\]
Note that $\hat\Sigma^s_{\ell_1}$ is guaranteed to be positive definite, and it can be efficiently
solved by the alternating direction method of multipliers of Xue et al. (2012).
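The only data-dependent input of the rank-based $\ell_1$ penalization is the adjusted Spearman rank correlation matrix $\hat R^s$, which can be computed in a few lines. The Python sketch below is ours; the penalized problem itself would then be solved with the positive-definite $\ell_1$-penalized ADMM solver of Xue et al. (2012), which we do not reproduce here.

import numpy as np
from scipy.stats import rankdata

def adjusted_spearman(X):
    """Adjusted Spearman rank correlation matrix: r^s_{jl} = 2 sin(pi/6 * r_{jl})."""
    ranks = np.apply_along_axis(rankdata, 0, X)      # columnwise ranks r_j
    r = np.corrcoef(ranks, rowvar=False)             # Spearman's rank correlations r_{jl}
    R_s = 2.0 * np.sin(np.pi / 6.0 * r)              # bias correction of Kendall (1948)
    np.fill_diagonal(R_s, 1.0)
    return R_s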
2.3.2 Banded Cholesky decomposition regularization
When an ordering exists among variables, the bandable structure is commonly used in esti-
mating large covariance matrices. Banding (Bickel & Levina 2008a) and tapering (Cai et al.
2010) were proposed to estimate bandable covariance matrices. However, they are not
guaranteed to be positive definite in practice. With its appealing positive definiteness, banded
Cholesky decomposition regularization has received much attention; see, for example, Wu &
Pourahmadi (2003), Huang et al. (2006), Bickel & Levina (2008a), Levina et al. (2008) and
Rothman et al. (2010). In the sequel we propose the rank-based banded Cholesky
decomposition regularization under the transnormal model.
First we introduce the "oracle" estimator to motivate our proposal. By using the "oracle"
data $f(x_1), \ldots, f(x_n)$, the "oracle" banded Cholesky decomposition regularization estimates
the covariance matrix $\Sigma^\star$ through banding the Cholesky factor of its inverse $\Theta^\star$. Suppose
that $\Theta^\star$ has the Cholesky decomposition $\Theta^\star = (I - A)'D^{-1}(I - A)$, where $D$ is a $p \times p$
diagonal matrix and $A = (a_{jl})_{p\times p}$ is a lower triangular matrix with $a_{11} = \cdots = a_{pp} = 0$.
Due to the fact that $f(x) \sim N_p(0, \Sigma^\star)$, it is easy to obtain that $(I - A)\, f(x) \sim N_p(0, D)$.
Let $A = (a_1, \ldots, a_p)'$. As in Bickel & Levina (2008a), the "oracle" estimator is derived by
regressing $f_j(x_j)$ on its closest $\min\{k, j-1\}$ predecessors, i.e., $\hat a^o_1 = 0$, and for $j = 2, \ldots, p$,
\[
\hat a^o_j = \arg\min_{a_j \in \mathcal{A}_j(k)} \ \frac{1}{n}\sum_{i=1}^n \big(f_j(x_{ij}) - a_j' f(x_i)\big)^2, \tag{8}
\]
where $\mathcal{A}_j(k) = \{(\alpha_1, \ldots, \alpha_p)' : \alpha_l = 0 \text{ if } l < j - k \text{ or } l \ge j\}$. Then $A$ is estimated by the
$k$-banded lower triangular matrix $\hat A^o = (\hat a^o_1, \ldots, \hat a^o_p)'$, and $D$ is estimated by the diagonal
matrix $\hat D^o = \operatorname{diag}(\hat d^o_1, \ldots, \hat d^o_p)$ with $\hat d^o_j$ being the residual variance, i.e.,
\[
\hat d^o_j = \frac{1}{n}\sum_{i=1}^n \big(f_j(x_{ij}) - (\hat a^o_j)' f(x_i)\big)^2. \tag{9}
\]
Therefore, the "oracle" estimator ends up with the positive-definite estimator
\[
\hat\Sigma^o_{\mathrm{chol}} = (I - \hat A^o)^{-1}\,\hat D^o\,[(I - \hat A^o)']^{-1},
\]
which has the $k$-banded precision matrix $\hat\Theta^o_{\mathrm{chol}} = (I - \hat A^o)'(\hat D^o)^{-1}(I - \hat A^o)$.
To mimic this "oracle" estimator, it is very important to observe that the "oracle" sample
covariance matrix $\hat R^o = \frac{1}{n}\sum_{i=1}^n f(x_i)(f(x_i))'$ plays the central role there. To see this point,
we notice that estimating $\hat a^o_j$ and $\hat d^o_j$ only depends on the quadratic term
\[
\frac{1}{n}\sum_{i=1}^n \big(f_j(x_{ij}) - a_j' f(x_i)\big)^2 = a_j'\hat R^o a_j - 2 a_j'\hat r^o_j + \hat r^o_{jj}, \tag{10}
\]
where $\hat R^o = (\hat r^o_{jl})_{p\times p}$ and $\hat r^o_j = (\hat r^o_{j1}, \ldots, \hat r^o_{jp})'$ is its $j$-th row. Thus, we only need a positive
definite correlation matrix estimator that is comparable with $\hat R^o$. The adjusted Spearman's
or Kendall's rank correlation matrix achieves the "oracle"-like exponential rate of convergence,
but it is not guaranteed to be positive definite (Devlin et al. 1975). We employ the
nearest correlation matrix projection that inherits the desired sampling properties of the
adjusted Spearman's or Kendall's rank correlation matrix, namely,
\[
\tilde R^s = \arg\min_{R} \ \|R - \hat R^s\|_{\max} \quad \text{subject to } R \succeq \varepsilon I;\ \operatorname{diag}(R) = 1, \tag{11}
\]
where $\|\cdot\|_{\max}$ is the entrywise $\ell_\infty$ norm and $\varepsilon > 0$ is some arbitrarily small constant satisfying
$\lambda_{\min}(\Sigma^\star) \ge \varepsilon$, say $\varepsilon = 10^{-4}$. Zhao et al. (2012) considered a related matrix projection, but
they used the smooth surrogate function $\|R - \hat R^s\|^{\nu}_{\max} = \max_{\|U\|_1 \le 1}\langle U, R - \hat R^s\rangle - \frac{\nu}{2}\|U\|_F^2$,
which would inevitably introduce unnecessary approximation error. Qi & Sun (2006) solved a
related nearest correlation matrix projection under the Frobenius norm. We use the entrywise
$\ell_\infty$ norm for theoretical considerations, as we now demonstrate. Notice that $\Sigma^\star$ is a feasible
solution to the nearest correlation matrix projection (11). Hence, $\|\tilde R^s - \hat R^s\|_{\max} \le \|\Sigma^\star - \hat R^s\|_{\max}$
holds by definition. By the triangle inequality, $\tilde R^s$ retains almost the same
sampling properties as $\hat R^s$ in terms of the estimation bound, i.e.,
\[
\|\tilde R^s - \Sigma^\star\|_{\max} \le \|\tilde R^s - \hat R^s\|_{\max} + \|\hat R^s - \Sigma^\star\|_{\max} \le 2\|\hat R^s - \Sigma^\star\|_{\max}.
\]
The details of the nearest correlation matrix projection are presented in Appendix A.
Now we propose a feasible regularized rank estimator on the basis of $\tilde R^s$. Let
$\tilde R^s = (\tilde r^s_{jl})_{p\times p}$ and $\tilde r^s_j = (\tilde r^s_{j1}, \ldots, \tilde r^s_{jp})'$. We then substitute $\tilde R^s$ into the quadratic term (10) as
\[
a_j'\tilde R^s a_j - 2 a_j'\tilde r^s_j + \tilde r^s_{jj} = a_j'\tilde R^s a_j - 2 a_j'\tilde r^s_j + 1,
\]
where the fact that $\tilde r^s_{jj} = 1$ is used. Accordingly, we estimate $A$ by
\[
\hat A^s = (\hat a^s_1, \ldots, \hat a^s_p)',
\]
with $\hat a^s_1 = 0$, and for $j = 2, \ldots, p$,
\[
\hat a^s_j = \arg\min_{a_j \in \mathcal{A}_j(k)} \ a_j'\tilde R^s a_j - 2 a_j'\tilde r^s_j + 1,
\]
where $\mathcal{A}_j(k) = \{(\alpha_1, \ldots, \alpha_p)' : \alpha_l = 0 \text{ if } l < j - k \text{ or } l \ge j\}$. In addition, we estimate $D$ by
\[
\hat D^s = \operatorname{diag}(\hat d^s_1, \ldots, \hat d^s_p),
\]
with
\[
\hat d^s_j = (\hat a^s_j)'\tilde R^s \hat a^s_j - 2 (\hat a^s_j)'\tilde r^s_j + 1.
\]
Thus, the rank-based banded Cholesky decomposition regularization yields the estimator
\[
\hat\Sigma^s_{\mathrm{chol}} = (I - \hat A^s)^{-1}\,\hat D^s\,[(I - \hat A^s)']^{-1},
\]
which has the $k$-banded precision matrix $\hat\Theta^s_{\mathrm{chol}} = (I - \hat A^s)'(\hat D^s)^{-1}(I - \hat A^s)$.
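For concreteness, the rank-based banded Cholesky estimator can be assembled as follows once a positive definite rank correlation estimate is available (for example the projection $\tilde R^s$ of Appendix A). The Python sketch below is our own illustration of the displays above; the banded least-squares problem for each row reduces to a small linear system, and plugging the minimizer back into the quadratic form gives $\hat d^s_j = 1 - (\tilde r^s_j)'\hat a^s_j$.

import numpy as np

def banded_cholesky_from_correlation(R, k):
    """Banded Cholesky decomposition regularization given a positive definite correlation matrix R."""
    p = R.shape[0]
    A = np.zeros((p, p))                 # lower triangular matrix of regression coefficients
    d = np.ones(p)                       # innovation variances; d_1 = r_11 = 1
    for j in range(1, p):
        idx = np.arange(max(0, j - k), j)                       # at most k closest predecessors
        a_j = np.linalg.solve(R[np.ix_(idx, idx)], R[idx, j])   # minimizes a'Ra - 2a'r_j + 1
        A[j, idx] = a_j
        d[j] = 1.0 - R[idx, j] @ a_j                            # value of the quadratic form at a_j
    I_A = np.eye(p) - A
    Theta = I_A.T @ np.diag(1.0 / d) @ I_A                      # k-banded precision matrix estimate
    return np.linalg.inv(Theta), Theta                          # (covariance estimate, precision estimate)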
3 Theoretical properties
This section presents the theoretical properties of our proposed methods. We use several matrix
norms: the matrix $\ell_1$ norm $\|U\|_{\ell_1} = \max_j \sum_i |u_{ij}|$, the spectral norm
$\|U\|_{\ell_2} = \lambda_{\max}^{1/2}(U'U)$ and the matrix $\ell_\infty$ norm $\|U\|_{\ell_\infty} = \max_i \sum_j |u_{ij}|$. For a symmetric
matrix, its matrix $\ell_1$ norm coincides with its matrix $\ell_\infty$ norm. We use $c$ or $C$ to denote
constants that do not depend on $n$ or $p$. Throughout this section, we follow Bickel & Levina
(2008a,b) to assume that
\[
\varepsilon_0 \le \lambda_{\min}(\Sigma^\star) \le \lambda_{\max}(\Sigma^\star) \le \frac{1}{\varepsilon_0}.
\]
3.1 A general theory
First of all, we consider any regularized estimator $\hat\Sigma$ satisfying the following condition:

(C1) There exists a regularized estimator $\hat\Sigma$ satisfying the concentration bound
$\|\hat\Sigma - \Sigma^\star\|_{\ell_1} = O_P(\xi_{n,p})$, with $\xi_{n,p} = o((\log n)^{-1/2})$.

To simplify notation, we let $\hat\Sigma = \begin{pmatrix} \hat\Sigma_{zz} & \hat\Sigma_{zy} \\ \hat\Sigma_{yz} & \hat\Sigma_{yy} \end{pmatrix}$. In the following theorem, we show that the
conditional distribution of $h(y)$ given $g(z)$ can be well estimated under Condition (C1).
Theorem 1. Assume the data follow the transnormal model. Suppose that there is $\kappa \in (0, 1)$
such that $n^{\kappa} \gg \log n \log p_0$. Given a regularized estimator satisfying Condition (C1), we
have the following error bounds concerning the estimated conditional distribution of $h(y)$
given $g(z)$ in (2):

(i) (on the conditional mean vector)
\[
\big\|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\,\hat g(z) - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\, g(z)\big\|_{\max}
= O_P\Big(\sqrt{\tfrac{\log n \log p_0}{n^{\kappa}}} + \xi_{n,p}\sqrt{\log n}\Big).
\]

(ii) (on the conditional variance matrix)
\[
\big\|(\hat\Sigma_{yy} - \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}) - (\Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\big\|_{\ell_2} = O_P(\xi_{n,p}).
\]
Next, we show that the plug-in estimator $\hat Q_\tau(y|z)$ is asymptotically as good as the closed-form
solution $Q_\tau(y|z)$ under mild regularity conditions. Let $\psi_j(\cdot)$ be the probability density
function of $x_j$. Define $L = \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z) + \Phi^{-1}(\tau)\,\mathrm{vdiag}(\Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})$ and
$\hat L = \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z) + \Phi^{-1}(\tau)\,\mathrm{vdiag}(\hat\Sigma_{yy} - \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy})$. Denote $L = (L_1, \ldots, L_{q_0})'$.
Theorem 2. Under the conditions of Theorem 1, we have the following error bound concerning
the plug-in estimator in (5):
\[
\|\hat Q_\tau(y|z) - Q_\tau(y|z)\|_{\max}
= O_P\Big(\frac{1}{M}\sqrt{\frac{\log q_0}{n}} + \frac{1}{M}\sqrt{\frac{\log n \log p_0}{n^{\kappa}}} + \frac{\xi_{n,p}}{M}\sqrt{\log n}\Big), \tag{12}
\]
where $M$ is the minimum of $\psi_{p_0+j}(x)$ over
$x \in I_j = [-|2L_j|, |2L_j|] \cup [-f^{-1}_{p_0+j}(|2L_j|), f^{-1}_{p_0+j}(|2L_j|)]$ for any $j = 1, \ldots, q_0$, i.e.,
$M = \min_{j=1,\ldots,q_0}\min_{x \in I_j}\psi_{p_0+j}(x)$.
Theorem 2 immediately implies that the plug-in estimator can be used to construct
provable prediction intervals in high dimensions. For instance, we can use
$\big[\hat Q_{\tau/2}(y|z),\ \hat Q_{1-\tau/2}(y|z)\big]$ to construct the $100(1-\tau)\%$ prediction interval for
predicting $y$ given $z$.
In what follows, we consider two parameter spaces for $\Sigma^\star$ in the transnormal model,
\[
\mathcal{G}_q = \Big\{\Sigma : \max_j \sum_{i \ne j} |a_{ij}|^q \le s_0\Big\},
\qquad
\mathcal{H}_\alpha = \Big\{\Sigma : \max_j \sum_{j < i - k} |a_{ij}| \le c_0 k^{-\alpha}, \ \forall k\Big\}.
\]
These two parameter spaces were studied in Bickel & Levina (2008a,b) and Cai et al. (2010).
In the sequel we show that the proposed rank-based covariance regularization achieves the
"oracle"-like rate of convergence over $\mathcal{G}_q$ and $\mathcal{H}_\alpha$ under the matrix $\ell_1$ norm, respectively.
Therefore, we can obtain the optimal prediction results on the basis of these intermediate
theoretical results about rank-based covariance regularization.
3.2 Positive definite sparse correlation matrix
First we derive the estimation bound for the rank-based positive-definite $\ell_1$ penalization and
its application to estimating the conditional distribution of $h(y)$ given $g(z)$.
Theorem 3. Assume the data follow the transnormal model with $\Sigma^\star \in \mathcal{G}_q$, and also assume
that $n \gg \log p$. Suppose that $\lambda_{\min}(\Sigma^\star) \ge \varepsilon_0 \gg s_0(\log p/n)^{\frac{1-q}{2}} + \varepsilon$. With probability tending to
1, the rank-based positive-definite $\ell_1$ penalization with $\lambda = c(\log p/n)^{1/2}$ achieves the following
upper bound under the matrix $\ell_1$ norm,
\[
\sup_{\Sigma^\star \in \mathcal{G}_q} \big\|\hat\Sigma^s_{\ell_1} - \Sigma^\star\big\|_{\ell_1} \le C \cdot s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}.
\]
Corollary 1. Under the conditions of Theorems 1, 2 and 3, with $\hat\Sigma = \hat\Sigma^s_{\ell_1}$ we have
\[
\sup_{\Sigma^\star \in \mathcal{G}_q} \big\|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\,\hat g(z) - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\, g(z)\big\|_{\max}
= O_P\Big(\sqrt{\tfrac{\log n \log p_0}{n^{\kappa}}} + s_0\Big(\tfrac{\log p}{n}\Big)^{\frac{1-q}{2}}\sqrt{\log n}\Big),
\]
and
\[
\sup_{\Sigma^\star \in \mathcal{G}_q} \big\|(\hat\Sigma_{yy} - \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}) - (\Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\big\|_{\ell_2}
= O_P\Big(s_0\Big(\tfrac{\log p}{n}\Big)^{\frac{1-q}{2}}\Big).
\]
In light of Theorem 3 and Corollary 1, we can derive the convergence rate for the proposed
optimal prediction in the following theorem.
Theorem 4. Under the same conditions as Theorems 1, 2 and 3, with $\hat\Sigma = \hat\Sigma^s_{\ell_1}$ we have
\[
\sup_{\Sigma^\star \in \mathcal{G}_q} \|\hat Q_\tau(y|z) - Q_\tau(y|z)\|_{\max}
= O_P\Big(\frac{1}{M}\sqrt{\frac{\log q_0}{n}} + \frac{1}{M}\sqrt{\frac{\log n \log p_0}{n^{\kappa}}} + \frac{s_0}{M}\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}\sqrt{\log n}\Big).
\]
3.3 Banded Cholesky decomposition regularization
Next we derive the estimation bound for the rank-based banded Cholesky decomposition
regularization and its application to estimating the conditional distribution of $h(y)$ given $g(z)$.
Theorem 5. Assume the data follow the transnormal model with $\Sigma^\star \in \mathcal{H}_\alpha$, and also assume
that $n \gg k^2 \log p$. With probability tending to 1, the rank-based banded Cholesky decomposition
regularization achieves the following upper bound under the matrix $\ell_1$ norm,
\[
\sup_{\Sigma^\star \in \mathcal{H}_\alpha} \big\|\hat\Sigma^s_{\mathrm{chol}} - \Sigma^\star\big\|_{\ell_1} \le Ck\Big(\frac{\log p}{n}\Big)^{1/2} + Ck^{-\alpha}.
\]
If $k = c\cdot\big(\frac{\log p}{n}\big)^{-\frac{1}{2(\alpha+1)}}$ in the rank-based banded Cholesky decomposition regularization, we have
\[
\sup_{\Sigma^\star \in \mathcal{H}_\alpha} \big\|\hat\Sigma^s_{\mathrm{chol}} - \Sigma^\star\big\|_{\ell_1} = O_P\Big(\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\Big).
\]
Remark 5. Bickel & Levina (2008a) studied the banded Cholesky decomposition regu-
larization under the normal model. Their analysis directly applies to the “oracle” banded
Cholesky decomposition regularization. By Theorem 3 of Bickel & Levina (2008a), we have
\[
\sup_{\Sigma^\star \in \mathcal{H}_\alpha} \big\|\hat\Sigma^o_{\mathrm{chol}} - \Sigma^\star\big\|_{\ell_1} = O_P\Big(\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\Big).
\]
Therefore, the rank-based banded Cholesky decomposition regularization achieves the same
convergence rate as the “oracle” counterpart.
Corollary 2. Under the conditions of Theorems 1, 2 and 5, with $\hat\Sigma = \hat\Sigma^s_{\mathrm{chol}}$ we have
\[
\sup_{\Sigma^\star \in \mathcal{H}_\alpha} \big\|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\,\hat g(z) - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\, g(z)\big\|_{\max}
= O_P\Big(\sqrt{\tfrac{\log n \log p_0}{n^{\kappa}}} + \Big(\tfrac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\sqrt{\log n}\Big),
\]
and
\[
\sup_{\Sigma^\star \in \mathcal{H}_\alpha} \big\|(\hat\Sigma_{yy} - \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}) - (\Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\big\|_{\ell_2}
= O_P\Big(\Big(\tfrac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\Big).
\]
In light of Theorem 5 and Corollary 2, we can derive the convergence rate for the proposed
optimal prediction in the following theorem.
Theorem 6. Under the conditions of Theorems 1, 2 and 5, with $\hat\Sigma = \hat\Sigma^s_{\mathrm{chol}}$ we have
\[
\sup_{\Sigma^\star \in \mathcal{H}_\alpha} \big\|\hat Q_\tau(y|z) - Q_\tau(y|z)\big\|_{\max}
= O_P\Big(\frac{1}{M}\sqrt{\frac{\log q_0}{n}} + \frac{1}{M}\sqrt{\frac{\log n \log p_0}{n^{\kappa}}} + \frac{\sqrt{\log n}}{M}\Big(\frac{\log p}{n}\Big)^{\frac{\alpha}{2(\alpha+1)}}\Big).
\]
4 Numerical properties
This section examines the finite-sample performance of the methods proposed in Sections 2
and 3. For space considerations, we focus only on the rank-based banded Cholesky decomposition
regularization and its application to multi-task median regression in the numerical studies.
4.1 Simulation studies
In this simulation study, we numerically compare the "oracle" estimator, the proposed
rank-based estimator and the "naive" estimator. We summarize the notation and details of
the three regularized estimators in Table 1. The "oracle" estimator serves as a benchmark in
the numerical comparison, and the "naive" estimator directly regularizes the sample covariance
of the original (untransformed) data.
Table 1: List of the three regularized estimators in the simulation study.

  Method                                                    Details
  $(\hat\Sigma^o_{\mathrm{chol}}, \hat\Theta^o_{\mathrm{chol}}, \hat t^o_1(z))$      regularizes the "oracle" sample correlation matrix
  $(\hat\Sigma^s_{\mathrm{chol}}, \hat\Theta^s_{\mathrm{chol}}, \hat t^s_1(z))$      regularizes the adjusted Spearman's rank correlation matrix
  $(\hat\Sigma^n_{\mathrm{chol}}, \hat\Theta^n_{\mathrm{chol}}, \hat t^n_1(z))$      regularizes the usual sample correlation matrix
In Models 1-3, we consider three different designs for the inverse covariance matrix $\Omega^\star$:

Model 1: $\Omega^\star = (I - A)'D^{-1}(I - A)$ with $d_{ii} = 0.01$, $a_{i+1,i} = 0.8$, and $d_{ij} = a_{ij} = 0$ otherwise;

Model 2: $\Omega^\star$ with $\omega^\star_{ii} = 1$, $\omega^\star_{i,i\pm1} = 0.5$, $\omega^\star_{i,i\pm2} = 0.25$, and $\omega^\star_{ij} = 0$ otherwise;

Model 3: $\Omega^\star$ with $\omega^\star_{ii} = 1$, $\omega^\star_{i,i\pm1} = 0.4$, $\omega^\star_{i,i\pm2} = \omega^\star_{i,i\pm3} = 0.2$, $\omega^\star_{i,i\pm4} = 0.1$, and $\omega^\star_{ij} = 0$ otherwise.
Models 1-3 are autoregressive (AR) models widely used in time series analysis, and they were
considered by Huang et al. (2006) and Xue & Zou (2012). Based upon the covariance matrix
$\Gamma^\star = (\Omega^\star)^{-1} = (\gamma^\star_{ij})_{p\times p}$, we calculate the true correlation matrix
$\Sigma^\star = (\sigma^\star_{ij})_{p\times p} = \big(\gamma^\star_{ij}/\sqrt{\gamma^\star_{ii}\gamma^\star_{jj}}\big)_{p\times p}$ and also its inverse $\Theta^\star = (\Sigma^\star)^{-1}$.
Then, we generate $n$ independent normal observations from $N_p(0, \Sigma^\star)$ and transform the
normal data into the desired transnormal data $(x_1, \ldots, x_n)$ using the transformation function
\[
[f^{-1}_1, f^{-1}_2, f^{-1}_3, f^{-1}_4, f^{-1}_1, f^{-1}_2, f^{-1}_3, f^{-1}_4, \ldots], \tag{13}
\]
where four monotone transformations are considered: $f_1(x) = x^{1/3}$ (power transformation),
$f_2(x) = \log(\frac{x}{1-x})$ (logit transformation), $f_3(x) = \log(x)$ (logarithmic transformation) and
$f_4(x) = f_1(x)I_{x<-1} + f_2(x)I_{-1\le x\le 1} + (f_3(x-1)+1)I_{x>1}$.
In all cases we let $n = 200$ and $p = 100$, $200$ and $500$. For each estimator, the tuning
parameter is chosen by cross-validation. Estimation accuracy is measured by the matrix
$\ell_1$ norm and $\ell_2$ norm averaged over 100 independent replications.
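For reproducibility, a minimal Python sketch of the data-generating mechanism is given below, using Model 2 as an example; it is our own illustration, and for brevity only two of the four inverse transformations in (13) are written out (the logit and piecewise cases would be handled analogously).

import numpy as np

def model2_precision(p):
    """Model 2: omega_ii = 1, omega_{i,i+-1} = 0.5, omega_{i,i+-2} = 0.25, zero otherwise."""
    Omega = np.eye(p)
    for offset, val in ((1, 0.5), (2, 0.25)):
        for i in range(p - offset):
            Omega[i, i + offset] = Omega[i + offset, i] = val
    return Omega

def transnormal_sample(n, p, rng):
    Gamma = np.linalg.inv(model2_precision(p))                    # Gamma = Omega^{-1}
    scale = 1.0 / np.sqrt(np.diag(Gamma))
    Sigma = Gamma * np.outer(scale, scale)                        # true correlation matrix Sigma
    normal = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # N_p(0, Sigma) data
    X = normal.copy()
    X[:, 0::4] = normal[:, 0::4] ** 3          # f_1^{-1}(u) = u^3, since f_1(x) = x^{1/3}
    X[:, 2::4] = np.exp(normal[:, 2::4])       # f_3^{-1}(u) = exp(u), since f_3(x) = log(x)
    return X, Sigma

# X, Sigma = transnormal_sample(200, 100, np.random.default_rng(0))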
Table 2: Performance of estimating the correlation matrix with $\hat\Sigma^o_{\mathrm{chol}}$, $\hat\Sigma^s_{\mathrm{chol}}$ and $\hat\Sigma^n_{\mathrm{chol}}$.
Estimation accuracy is measured by the matrix $\ell_1$ norm and $\ell_2$ norm. Each metric is averaged
over 100 independent replications, with standard errors in parentheses.

                         Model 1                    Model 2                    Model 3
Method            p=100   p=200   p=500      p=100   p=200   p=500      p=100   p=200   p=500
matrix l1 norm
Sigma^o_chol       1.34    1.60    1.95       0.89    0.91    0.99       1.00    1.04    1.16
                  (0.05)  (0.13)  (0.27)     (0.03)  (0.04)  (0.08)     (0.02)  (0.05)  (0.12)
Sigma^s_chol       1.49    1.68    2.08       0.92    0.99    1.10       1.03    1.06    1.21
                  (0.07)  (0.11)  (0.21)     (0.03)  (0.05)  (0.14)     (0.02)  (0.04)  (0.12)
Sigma^n_chol       3.21    3.49    3.78       1.45    1.48    1.63       1.11    1.22    1.28
                  (0.06)  (0.09)  (0.24)     (0.02)  (0.04)  (0.11)     (0.03)  (0.05)  (0.11)
matrix l2 norm
Sigma^o_chol       0.78    0.95    1.10       0.48    0.49    0.54       0.52    0.54    0.59
                  (0.03)  (0.06)  (0.13)     (0.01)  (0.02)  (0.04)     (0.01)  (0.02)  (0.05)
Sigma^s_chol       0.84    1.00    1.26       0.49    0.53    0.58       0.53    0.55    0.63
                  (0.03)  (0.06)  (0.11)     (0.01)  (0.02)  (0.05)     (0.01)  (0.02)  (0.06)
Sigma^n_chol       1.67    1.73    1.89       0.77    0.80    0.84       0.55    0.60    0.68
                  (0.05)  (0.08)  (0.14)     (0.01)  (0.02)  (0.04)     (0.01)  (0.02)  (0.05)
Tables 2-3 summarize the numerical performance of estimating the correlation matrix and
its inverse for the three banded Cholesky decomposition regularization methods. We can see that
the "naive" covariance regularization performs the worst in the presence of non-normality.
The rank-based covariance regularization effectively deals with the transnormal data and
performs comparably with the "oracle" estimator. This numerical evidence is consistent with
the theoretical results presented in Section 3.
Next, we compare the performance of estimating the optimal transformations for multi-task
median regression. In each of the 100 replications, we simulate another $4n$ independent
transnormal observations $(x_{n+1}, \ldots, x_{5n})$ as the testing dataset. We take the first $p_0 = p/2$
variables of $x$ as the response $y$ and the remaining $q_0 = p/2$ variables of $x$ as the predictor $z$.
Given the three regularized covariance estimators $\hat\Sigma^o_{\mathrm{chol}}$, $\hat\Sigma^s_{\mathrm{chol}}$ and $\hat\Sigma^n_{\mathrm{chol}}$, we examine their
performance in predicting $y_i$ based on $z_i$ for $i = n+1, \ldots, 5n$. Recall that
$t_1(z) = h^{-1}\big(\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\, g(z)\big)$
Table 3: Performance of estimating the inverse correlation matrix with $\hat\Theta^o_{\mathrm{chol}}$, $\hat\Theta^s_{\mathrm{chol}}$ and
$\hat\Theta^n_{\mathrm{chol}}$. Estimation accuracy is measured by the matrix $\ell_1$ norm and $\ell_2$ norm. Each metric is
averaged over 100 independent replications, with standard errors in parentheses.

                         Model 1                    Model 2                    Model 3
Method            p=100   p=200   p=500      p=100   p=200   p=500      p=100   p=200   p=500
matrix l1 norm
Theta^o_chol       4.07    4.72    5.14       1.83    1.95    1.99       2.21    2.29    2.37
                  (0.21)  (0.37)  (0.57)     (0.08)  (0.12)  (0.25)     (0.04)  (0.09)  (0.18)
Theta^s_chol       4.21    4.80    5.28       1.91    1.97    2.02       2.27    2.33    2.41
                  (0.24)  (0.32)  (0.55)     (0.07)  (0.14)  (0.25)     (0.04)  (0.07)  (0.16)
Theta^n_chol       7.18    7.55    8.02       3.29    3.48    3.70       2.81    2.93    3.08
                  (0.22)  (0.35)  (0.54)     (0.06)  (0.14)  (0.29)     (0.03)  (0.10)  (0.19)
matrix l2 norm
Theta^o_chol       3.18    3.41    4.07       1.26    1.36    1.41       1.68    1.74    1.75
                  (0.10)  (0.27)  (0.35)     (0.04)  (0.07)  (0.13)     (0.04)  (0.08)  (0.10)
Theta^s_chol       3.29    3.59    4.10       1.35    1.39    1.45       1.69    1.75    1.77
                  (0.16)  (0.25)  (0.30)     (0.04)  (0.07)  (0.08)     (0.04)  (0.08)  (0.11)
Theta^n_chol       4.98    5.15    5.33       2.64    2.81    2.89       2.29    2.32    2.39
                  (0.18)  (0.25)  (0.32)     (0.04)  (0.11)  (0.18)     (0.03)  (0.07)  (0.18)
provides the closed-form solution for multi-task median regression. Then we can derive the
closed-form solutions with respect to the different regularized covariance estimators as follows:

• the "oracle" estimator: $\hat t^o_1(z) = h^{-1}\big(\hat\Sigma^o_{yz}(\hat\Sigma^o_{zz})^{-1}\, g(z)\big)$;

• the rank-based estimator: $\hat t^s_1(z) = \hat h^{-1}\big(\hat\Sigma^s_{yz}(\hat\Sigma^s_{zz})^{-1}\, \hat g(z)\big)$;

• the "naive" estimator: $\hat t^n_1(z) = \hat\mu_y + \hat\Sigma^n_{yz}(\hat\Sigma^n_{zz})^{-1}\,(z - \hat\mu_z)$.
Table 4 reports their numerical performance. Prediction accuracy is measured by the
difference of prediction errors, which is defined by subtracting the "oracle" prediction error, i.e.,
\[
\mathrm{DPE}(\hat t_1(z)) = \frac{1}{4n}\sum_{i=n+1}^{5n}\|\hat t_1(z_i) - y_i\|_{\ell_1} - \frac{1}{4n}\sum_{i=n+1}^{5n}\|\hat t^o_1(z_i) - y_i\|_{\ell_1}.
\]
Table 4: Relative prediction errors of $\hat t^s_1(z)$ and $\hat t^n_1(z)$ against $\hat t^o_1(z)$. Prediction accuracy is
measured by the $\ell_1$ norm, and the relative prediction error is averaged over 100 independent
replications, with standard errors in parentheses.

                                   Model 1                    Model 2                    Model 3
Method                      p=100   p=200   p=500      p=100   p=200   p=500      p=100   p=200   p=500
t^s_1(z) vs. t^o_1(z)        0.02    0.03    0.08       0.16    0.26    0.66       0.12    0.22    0.56
                            (0.00)  (0.00)  (0.01)     (0.02)  (0.03)  (0.04)     (0.01)  (0.01)  (0.04)
t^n_1(z) vs. t^o_1(z)        0.11    0.19    0.21      13.50   26.07   63.80       7.38   14.70   35.68
                            (0.00)  (0.01)  (0.03)     (0.22)  (0.58)  (2.16)     (0.14)  (0.37)  (1.28)
As summarized in Table 4, the proposed rank-based estimator performs very similarly to its
oracle counterpart, and greatly outperforms the "naive" estimator.
4.2 Application to the protein mass spectroscopy data
Recent advances in high-throughput mass spectroscopy technology have enabled biomedical
researchers to simultaneously analyze thousands of proteins. This subsection illustrates the
power of the proposed rank-based method in an application to the study of prostate cancer
using the protein mass spectroscopy data (Adam et al. 2002), which were previously analyzed
by Levina et al. (2008). This dataset consists of protein mass spectroscopy measurements
for the blood serum samples of 157 healthy people and 167 prostate cancer patients. For each
blood serum sample, the protein mass spectroscopy measures the intensity at ordered
time-of-flight values, which are related to the mass over charge ratio of proteins. In our
analysis, we exclude the measurements with mass over charge ratio below 2000 to avoid
chemical artifacts, and perform the same preprocessing as in Levina et al. (2008) to smooth
the intensity profile. This gives a total of $p = 218$ ordered mass over charge ratio indices for
each sample. We then have the control data $x^{co}_i = (x^{co}_{i,1}, \ldots, x^{co}_{i,218})$ for $i = 1, \ldots, 157$,
and the cancer data $x^{ca}_j = (x^{ca}_{j,1}, \ldots, x^{ca}_{j,218})$ for $j = 1, \ldots, 167$. We refer the readers to
Adam et al. (2002) and Levina et al. (2008) for more details about this dataset and the
preprocessing procedure. Our aim is to use the more stable intensity measurements at the
last 168 mass over charge ratio indices to optimally predict the more volatile intensity
measurements at the first 50 indices.
The analysis of Levina et al. (2008) is based on the multivariate normal covariance
matrix. Hence, we perform normality tests for both the control data and the cancer data.
Table 5 shows the number of rejections of the null hypothesis among the 218 normality tests.
At least 50% of the mass spectroscopy measurements are non-normal at the 0.05 significance
level. Even after the strict Bonferroni correction, the null hypothesis of normality is rejected
at no fewer than 30 indices in the control data and 116 indices in the cancer data. Figure 1
further illustrates the non-normality issue (e.g., heavy tails, skewness) at two indices (109 and
218).
Table 5: Testing for normality. This table shows the number of rejections of the normality
hypothesis at different significance levels among the 218 mass over charge ratio indices.

                significance level   Anderson-Darling   Cramer-von Mises   Kolmogorov-Smirnov
control data    0.05                 158                138                119
                0.05/218             71                 52                 30
cancer data     0.05                 196                177                156
                0.05/218             136                134                116
Figure 1: Illustration of the non-normality issue of the protein mass spectroscopy data:
histograms of intensity at index 109 and index 218, with (A1) and (A2) for the control data
and (B1) and (B2) for the cancer data.
The non-normality of the measurements suggests the use of the transnormal model. Let
$z^{co}$ ($z^{ca}$) and $y^{co}$ ($y^{ca}$) denote the intensity measurements at the last 168 and the first 50
indices of the control (cancer) data, respectively. We divide this dataset into training sets
$(x^{co}_1, \ldots, x^{co}_{120})$
and $(x^{ca}_1, \ldots, x^{ca}_{120})$, and testing sets $(x^{co}_{121}, \ldots, x^{co}_{157})$ and $(x^{ca}_{121}, \ldots, x^{ca}_{167})$. Note that the sample
estimator $\hat t^{\mathrm{sample}}_1(z) = \hat\mu_y + \hat\Sigma^{\mathrm{sample}}_{yz}(\hat\Sigma^{\mathrm{sample}}_{zz})^{-1}(z - \hat\mu_z)$ is infeasible, since the usual sample
correlation matrix $\hat\Sigma^{\mathrm{sample}}_{zz}$ is not invertible. We therefore consider two different methods to
predict $y$: the proposed rank-based estimator $\hat t^s_1(z) = \hat h^{-1}\big(\hat\Sigma^s_{yz}(\hat\Sigma^s_{zz})^{-1}\hat g(z)\big)$ using
the proposed rank-based banded Cholesky decomposition regularization, and the "naive"
estimator $\hat t^n_1(z) = \hat\mu_y + \hat\Sigma^n_{yz}(\hat\Sigma^n_{zz})^{-1}(z - \hat\mu_z)$ using the banded Cholesky decomposition
regularization (Bickel & Levina 2008a). Following Levina et al. (2008), tuning parameters
are chosen via cross-validation on the training data. The difference of prediction errors
at the $j$-th index is computed as
\[
\mathrm{DPE}_j(\hat t_1(z)) = \mathrm{PE}_j(\hat t^s_1(z)) - \mathrm{PE}_j(\hat t^n_1(z)),
\]
where $\mathrm{PE}_j(\hat t_1(z))$ is the prediction error at the $j$-th index, computed by averaging the absolute
prediction error $|(\hat t_1(z_i))_j - y_{ij}|$ over $i = 121, \ldots, 157$ for the control data and over
$i = 121, \ldots, 167$ for the cancer data. The differences of prediction errors are shown in Figure 2.
As demonstrated in Figure 2, the rank-based method outperforms the "naive" estimator at
38 out of 50 indices for the control data and 46 out of 50 indices for the cancer data.
Moreover, we further use $\big[\hat Q_{\tau/2}(y|z),\ \hat Q_{1-\tau/2}(y|z)\big]$ to construct the $100(1-\tau)\%$ prediction
interval. The prediction targets and prediction intervals are averaged over the testing set.
Specifically, we estimate the $100(1-\tau)\%$ prediction interval as
\[
100(1-\tau)\%\ \mathrm{PI} = \Big[\frac{1}{34}\sum_{i=206}^{239}\hat Q_{\tau/2}(y_i|z_i),\ \frac{1}{34}\sum_{i=206}^{239}\hat Q_{1-\tau/2}(y_i|z_i)\Big].
\]
We plot the 95% prediction intervals for both the control data and the cancer data in Figure
3. Figure 3 shows that the estimated prediction intervals cover most of the prediction targets.
This suggests that our proposal estimates both end points of the prediction intervals well in
practice, which is consistent with the asymptotic theory in Theorem 2.
Appendix A: ADMM for nearest matrix projection
The alternating direction method of multipliers (Boyd et al. 2011) has been widely applied
to solving large-scale optimization problems in high-dimensional statistical learning, for ex-
ample, covariance matrix estimation (Bien & Tibshirani 2011, Xue et al. 2012) and graphical
model selection (Ma et al. 2013, Danaher et al. 2013). Now we will design a new alternating
Figure 2: The differences of prediction errors between the existing Cholesky-based method
(Bickel & Levina 2008a) and the proposed rank-based estimate with $\tau = 0.5$, plotted against
the mass/charge ratio index: (A) relative prediction errors for the control data; (B) relative
prediction errors for the cancer data. The solid dots are the mass/charge ratio indices at which
our proposed method outperforms the BL method: 38 out of 50 indices for the control data
and 46 out of 50 indices for the cancer data.
Figure 3: Prediction intervals for predicting protein mass spectroscopy intensities using our
proposed method, plotted against the mass/charge ratio index: (A) 95% prediction interval
for the control data; (B) 95% prediction interval for the cancer data. Each panel shows the
prediction target, the proposed point estimate ($\tau = 0.5$) and the proposed 95% prediction
interval, averaged over the testing set. The solid dots are the mass/charge ratio indices at
which the proposed method outperforms the BL method.
direction method of multipliers to solve the nearest correlation matrix projection (11). To
begin with, we introduce a new variable $S = R - \hat R^s$ to split the positive definite constraint
and the entrywise $\ell_\infty$ norm. Then we obtain the equivalent convex optimization problem
\[
(\tilde R^s, \tilde S) = \arg\min_{(R,S)} \ \|S\|_{\max}
\quad \text{subject to } R \succeq \varepsilon I;\ S = R - \hat R^s;\ \operatorname{diag}(S) = 0;\ S = S'. \tag{14}
\]
Let $\Lambda$ be the Lagrange multiplier associated with $S = R - \hat R^s$ in (14). Next, we introduce
the augmented Lagrangian function
\[
L_\rho(R, S; \Lambda) = \|S\|_{\max} - \langle\Lambda, R - S - \hat R^s\rangle + \frac{1}{2\rho}\|R - S - \hat R^s\|_F^2, \tag{15}
\]
for any given constant $\rho > 0$. For $i = 0, 1, 2, \ldots$, we iteratively solve $(R^{i+1}, S^{i+1})$ through
alternating minimization with respect to $R$ and $S$, i.e.,
\[
(R^{i+1}, S^{i+1}) = \arg\min_{(R,S)} L_\rho(R, S; \Lambda^i),
\]
and then update the Lagrange multiplier $\Lambda$ by
\[
\Lambda^{i+1} = \Lambda^i - \frac{1}{\rho}\big(R^{i+1} - S^{i+1} - \hat R^s\big).
\]
To sum up, the entire algorithm proceeds sequentially as follows until convergence,
\[
\begin{aligned}
\text{$R$ step}:\quad & R^{i+1} = \arg\min_{R \succeq \varepsilon I} L_\rho(R, S^i; \Lambda^i); & (16)\\
\text{$S$ step}:\quad & S^{i+1} = \arg\min_{S = S':\, \operatorname{diag}(S) = 0} L_\rho(R^{i+1}, S; \Lambda^i); & (17)\\
\text{$\Lambda$ step}:\quad & \Lambda^{i+1} = \Lambda^i - \frac{1}{\rho}\big(R^{i+1} - S^{i+1} - \hat R^s\big). & (18)
\end{aligned}
\]
We note that both subproblems (16) and (17) indeed have simple solutions, which we will ex-
plain in the sequel, and thus we are able to avoid solving a sequence of complex subproblems.
In what follows, we point out the explicit solutions to both (16) and (17).
First we consider the $R$ step. Denote by $(U)_+$ the projection of a matrix $U$ onto the
convex set $\{R \succeq \varepsilon I\}$. Suppose that $U$ has the eigen-decomposition $\sum_{i=1}^p \lambda_i u_i u_i'$. Then it is
easy to see that $(U)_+ = \sum_{i=1}^p \max(\lambda_i, \varepsilon)\, u_i u_i'$. The subproblem (16) is then solved as follows,
\[
R^{i+1} = \arg\min_{R \succeq \varepsilon I} \ -\langle\Lambda^i, R\rangle + \frac{1}{2\rho}\|R - S^i - \hat R^s\|_F^2 = \big(S^i + \hat R^s + \rho\Lambda^i\big)_+.
\]
We now consider the $S$ step. Define the two convex sets $\mathcal{S} = \{S : \operatorname{diag}(S) = 0;\ S = S'\}$ and
$\mathcal{T} = \{T : \|T\|_1 \le \rho;\ \operatorname{diag}(T) = 0;\ T = T'\}$. We also define
\[
M^{i+1} = R^{i+1} - \hat R^s - \rho\Lambda^i. \tag{19}
\]
Note that the subproblem in (17) can be equivalently written as
\[
S^{i+1} = \arg\min_{S \in \mathcal{S}} \ \|S\|_{\max} + \langle\Lambda^i, S\rangle + \frac{1}{2\rho}\|R^{i+1} - S - \hat R^s\|_F^2
= \arg\min_{S \in \mathcal{S}} \ \|S\|_{\max} + \frac{1}{2\rho}\|S - M^{i+1}\|_F^2.
\]
Lemma 1 shows that (17) can be solved by a projection onto the entrywise $\ell_1$ ball.
Lemma 1. For any given symmetric matrix $M = (m_{jl})_{p\times p}$, define
\[
\tilde S = \arg\min_{S \in \mathcal{S}} \ \|S\|_{\max} + \frac{1}{2\rho}\|S - M\|_F^2
\quad\text{and}\quad
\tilde T = \arg\min_{T \in \mathcal{T}} \ \frac{1}{2}\|T - M\|_F^2.
\]
Let $\tilde S = (\tilde s_{jl})_{p\times p}$ and $\tilde T = (\tilde t_{jl})_{p\times p}$. Then for any $(j, l)$, we always have
\[
\tilde s_{jl} = (m_{jl} - \tilde t_{jl}) \times I_{j \ne l}.
\]
Now we define
\[
T^{i+1} = \arg\min_{T \in \mathcal{T}} \ \|T - M^{i+1}\|_F^2. \tag{20}
\]
We note that $T^{i+1}$ is essentially the projection of a $\frac{1}{2}p(p-1)$-dimensional vector onto the $\ell_1$
ball in $\mathbb{R}^{p(p-1)/2}$, and it can be efficiently computed by applying the exact projection algorithm
of Duchi et al. (2008) in $O(p^2)$ expected time. Let $T^{i+1} = (t^{i+1}_{jl})_{p\times p}$ and $M^{i+1} = (m^{i+1}_{jl})_{p\times p}$.
Then by Lemma 1, we obtain the desired closed-form solution for (17), i.e.,
\[
S^{i+1} = \big((m^{i+1}_{jl} - t^{i+1}_{jl}) \times I_{j \ne l}\big)_{p\times p}.
\]
Algorithm 1 Proposed alternating direction method of multipliers for obtaining $\tilde R^s$.

1. Initialization: $\rho$, $S^0$ and $\Lambda^0$.
2. Iterative alternating direction augmented Lagrangian step: for the $i$-th iteration,
   2.1 Solve $R^{i+1} = (S^i + \hat R^s + \rho\Lambda^i)_+$;
   2.2 Solve $S^{i+1} = \big((m^{i+1}_{jl} - t^{i+1}_{jl}) \times I_{j \ne l}\big)_{p\times p}$ with $M^{i+1}$ and $T^{i+1}$ as in (19) and (20);
   2.3 Update $\Lambda^{i+1} = \Lambda^i - \frac{1}{\rho}(R^{i+1} - S^{i+1} - \hat R^s)$.
3. Repeat the above cycle until convergence.
The global convergence of Algorithm 1 can be obtained by following Xue et al. (2012).
For space considerations, we omit the technical proof in this paper. The global convergence
guarantees that Algorithm 1 always converges to the optimal solution $(\tilde R^s, \tilde S, \tilde\Lambda)$ of (14)
from any initial value $(S^0, \Lambda^0)$ and for any specified constant $\rho > 0$.
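A compact Python sketch of Algorithm 1 is given below; it is our own illustration rather than the authors' code. The eigenvalue thresholding implements the R step, the l1-ball projection of Duchi et al. (2008) implements the S step via Lemma 1 (the upper-triangle vector is projected onto a ball of radius rho/2 because the entrywise l1 norm counts each symmetric off-diagonal pair twice), and the input R_s stands for the adjusted Spearman matrix.

import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of the vector v onto the l1 ball of the given radius (Duchi et al. 2008)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho_idx = np.nonzero(u * np.arange(1, len(u) + 1) > css - radius)[0][-1]
    theta = (css[rho_idx] - radius) / (rho_idx + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def nearest_correlation_admm(R_s, eps=1e-4, rho=1.0, n_iter=200):
    """Algorithm 1: ADMM for the nearest correlation matrix projection (11)."""
    p = R_s.shape[0]
    S = np.zeros((p, p))
    Lam = np.zeros((p, p))
    iu = np.triu_indices(p, 1)                                # strictly upper triangular indices
    for _ in range(n_iter):
        # R step (16): eigenvalue projection onto {R >= eps I}
        M_R = S + R_s + rho * Lam
        w, V = np.linalg.eigh((M_R + M_R.T) / 2.0)
        R = (V * np.maximum(w, eps)) @ V.T
        # S step (17): via (19), (20) and Lemma 1
        M = R - R_s - rho * Lam
        t_vec = project_l1_ball(M[iu], rho / 2.0)             # T^{i+1}, stored via its upper-triangle entries
        S = np.zeros((p, p))
        S[iu] = M[iu] - t_vec                                  # s_jl = m_jl - t_jl off the diagonal
        S = S + S.T
        # Lambda step (18)
        Lam = Lam - (R - S - R_s) / rho
    return R

In practice one would monitor the primal residual R - S - R_s to decide convergence rather than using a fixed iteration count.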
Appendix B: Technical Proofs
Proof of Lemma 1
Proof. First we note that the dual norm of the entrywise $\ell_1$ norm is the entrywise $\ell_\infty$ norm.
Then it is easy to see that $\|S\|_{\max} = \frac{1}{\rho}\max_{T \in \mathcal{T}}\langle T, S\rangle$. Now we have (after multiplying the
objective by $\rho > 0$, which does not change the minimizer)
\[
\begin{aligned}
\min_{S \in \mathcal{S}} \ \|S\|_{\max} + \frac{1}{2\rho}\|S - M\|_F^2
&= \min_{S \in \mathcal{S}}\max_{T \in \mathcal{T}} \ \langle T, S\rangle + \frac{1}{2}\|S - M\|_F^2\\
&= \max_{T \in \mathcal{T}}\min_{S \in \mathcal{S}} \ \langle T, S\rangle + \frac{1}{2}\|S - M\|_F^2\\
&= \max_{T \in \mathcal{T}}\min_{S \in \mathcal{S}} \ \frac{1}{2}\|S - M + T\|_F^2 - \frac{1}{2}\|T - M\|_F^2 + \frac{1}{2}\|M\|_F^2.
\end{aligned}
\]
Notice that the optimal solution of the inner problem with respect to $S$ is given by
\[
\tilde S = (\tilde s_{jl})_{p\times p} = \big((m_{jl} - t_{jl}) \cdot I_{j \ne l}\big)_{p\times p}.
\]
After we substitute $\tilde S$ into the inner problem with respect to $S$, it suffices to solve the outer
problem with respect to $T$, namely,
\[
\max_{T \in \mathcal{T}} \ -\frac{1}{2}\|T - M\|_F^2 + \frac{1}{2}\|M\|_F^2 = \min_{T \in \mathcal{T}} \ \|T - M\|_F^2.
\]
Now, by definition, $\tilde T = (\tilde t_{jl})_{p\times p}$ gives the optimal solution to the above problem with respect
to $T$. Therefore, $\tilde s_{jl} = (m_{jl} - \tilde t_{jl}) \times I_{j \ne l}$ always holds for the optimal
solutions $\tilde S$ and $\tilde T$. This completes the proof of Lemma 1.
Proof of Theorem 1
Proof. In the first part, we bound $I = \|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z) - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z)\|_{\max}$. To this end,
we bound $I_1 = \|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}$ and $I_2 = \|\hat g(z) - g(z)\|_{\max}$ respectively. Let
$\varphi_1 = \|(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}$ and $\varphi_2 = \|\Sigma^\star_{yz}\|_{\ell_\infty}$ for ease of notation. Notice that
\[
\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}
= \Sigma^\star_{yz}\big(\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\big) + \big(\hat\Sigma_{yz} - \Sigma^\star_{yz}\big)(\Sigma^\star_{zz})^{-1}
+ \big(\hat\Sigma_{yz} - \Sigma^\star_{yz}\big)\big(\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\big).
\]
By using the triangle inequality, we then have
\[
I_1 \le \|\Sigma^\star_{yz}\|_{\ell_\infty}\,\|\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}
+ \|(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}\,\|\hat\Sigma_{yz} - \Sigma^\star_{yz}\|_{\ell_\infty}
+ \|\hat\Sigma_{yz} - \Sigma^\star_{yz}\|_{\ell_\infty}\,\|\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}.
\]
To bound $I_1$, we need to bound $\|\hat\Sigma_{yz} - \Sigma^\star_{yz}\|_{\ell_\infty}$ and $\|\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}$. By definition, it is
easy to see that $\|\hat\Sigma_{yz} - \Sigma^\star_{yz}\|_{\ell_\infty} \le \|\hat\Sigma - \Sigma^\star\|_{\ell_\infty}$. Moreover, we have
\[
\|\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\|_{\ell_\infty} \le C\,\|\hat\Sigma_{zz} - \Sigma^\star_{zz}\|_{\ell_\infty}\,\|(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty} \le C\varphi_1\,\|\hat\Sigma - \Sigma^\star\|_{\ell_\infty}.
\]
Thus we can derive the explicit upper bound for $I_1$ as
\[
I_1 \le C(\varphi_1\varphi_2 + \varphi_1)\,\|\hat\Sigma - \Sigma^\star\|_{\ell_\infty} = C(\varphi_1\varphi_2 + \varphi_1)\,\|\hat\Sigma - \Sigma^\star\|_{\ell_1} = O_P(\xi_{n,p}). \tag{21}
\]
Next, we prove that
\[
I_2 = \|\hat g(z) - g(z)\|_{\max} = O_P\Big(\sqrt{\tfrac{\log n \log p_0}{n^{\kappa}}}\Big). \tag{22}
\]
To this end, note that $g_j(z_j)$ follows the standard normal distribution. Hence we have
\[
\Pr\big(|g_j(z_j)| > \sqrt{(1-\kappa)\log n}\big)
= \int_{\sqrt{(1-\kappa)\log n}}^{+\infty} C\exp(-t^2/2)\,dt
\le \frac{C}{\sqrt{\log n}}\int_{\sqrt{(1-\kappa)\log n}}^{+\infty} \exp(-t^2/2)\,t\,dt
\le \frac{C}{n^{\frac{1-\kappa}{2}}\sqrt{\log n}}.
\]
This immediately implies that
\[
\|g(z)\|_{\max} = O_P(\sqrt{\log n}). \tag{23}
\]
Let $\eta = \sqrt{(1-\kappa)\log n}$. Now, by using Lemma 1 of Mai & Zou (2012), we have
\[
\Pr\Big(\sup_{|g_j(z_j)| \le \eta} |\hat g_j(z_j) - g_j(z_j)| \ge \eta\Big)
\le C\exp\Big(-\frac{c\,n^{1/3}\eta^2}{\log n}\Big) + p_0\exp\Big(-\frac{c\,n^{1/3}}{\log n}\Big).
\]
Thus, we apply the union bound to obtain
\[
\Pr\big(\|\hat g(z) - g(z)\|_{\max} \ge \eta\big)
\le \sum_{j=1}^{p_0}\Pr\big(|g_j(z_j)| > \eta\big) + \sum_{j=1}^{p_0}\Pr\Big(\sup_{|g_j(z_j)| \le \eta} |\hat g_j(z_j) - g_j(z_j)| \ge \eta\Big)
\le \frac{C\,p_0}{n^{\frac{1-\kappa}{2}}\sqrt{\log n}} + C p_0\exp\Big(-\frac{c\,n^{\kappa}\eta^2}{\log n}\Big) + p_0\exp\Big(-\frac{c\,n^{\kappa}}{\log n}\Big),
\]
which immediately implies the upper bound (22). Now, since
\[
\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z) - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z)
= \big(\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\big)\big(\hat g(z) - g(z)\big)
+ \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\big(\hat g(z) - g(z)\big)
+ \big(\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\big)\, g(z),
\]
we can combine (21), (22), (23) and the triangle inequality to bound $I$ as
\[
I \le \|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}\,\|\hat g(z) - g(z)\|_{\max}
+ \|\Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}\,\|\hat g(z) - g(z)\|_{\max}
+ \|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\|_{\ell_\infty}\,\|g(z)\|_{\max}
= O_P\Big(\sqrt{\tfrac{\log n \log p_0}{n^{\kappa}}} + \xi_{n,p}\sqrt{\log n}\Big).
\]
In the second part, we bound $J = \|(\hat\Sigma_{yy} - \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}) - (\Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy})\|_{\ell_2}$. By
the triangle inequality, we have
\[
J \le \|\hat\Sigma_{yy} - \Sigma^\star_{yy}\|_{\ell_2} + \|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\|_{\ell_2} \equiv J_1 + J_2.
\]
Notice that
\[
\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}
= \hat\Sigma_{yz}(\Sigma^\star_{zz})^{-1}\big(\hat\Sigma_{zy} - \Sigma^\star_{zy}\big)
+ \big(\hat\Sigma_{yz} - \Sigma^\star_{yz}\big)(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}
+ \hat\Sigma_{yz}\big(\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1}\big)\hat\Sigma_{zy}.
\]
By the definition of the spectral norm, we have
\[
\|\hat\Sigma - \Sigma^\star\|_{\ell_2}
= \max_{u = (u_z', u_y')':\, \|u\|_{\ell_2} = 1}\left\|\begin{pmatrix}\hat\Sigma_{zz} - \Sigma^\star_{zz} & \hat\Sigma_{zy} - \Sigma^\star_{zy}\\ \hat\Sigma_{yz} - \Sigma^\star_{yz} & \hat\Sigma_{yy} - \Sigma^\star_{yy}\end{pmatrix}\begin{pmatrix}u_z\\ u_y\end{pmatrix}\right\|_{\ell_2}
\ge \max_{u_z:\, \|u_z\|_{\ell_2} = 1}\left\|\begin{pmatrix}\hat\Sigma_{zz} - \Sigma^\star_{zz} & \hat\Sigma_{zy} - \Sigma^\star_{zy}\\ \hat\Sigma_{yz} - \Sigma^\star_{yz} & \hat\Sigma_{yy} - \Sigma^\star_{yy}\end{pmatrix}\begin{pmatrix}u_z\\ 0\end{pmatrix}\right\|_{\ell_2}
\ge \max_{u_z:\, \|u_z\|_{\ell_2} = 1}\big\|(\hat\Sigma_{yz} - \Sigma^\star_{yz})u_z\big\|_{\ell_2}
= \|\hat\Sigma_{yz} - \Sigma^\star_{yz}\|_{\ell_2}.
\]
By similar arguments, we also have $\|\hat\Sigma - \Sigma^\star\|_{\ell_2} \ge \|\hat\Sigma_{zy} - \Sigma^\star_{zy}\|_{\ell_2}$. Given that
$\varepsilon_0 \le \lambda_{\min}(\Sigma^\star) \le \lambda_{\max}(\Sigma^\star) \le \frac{1}{\varepsilon_0}$, we then apply the sub-multiplicative property to obtain
\[
\begin{aligned}
\|\hat\Sigma_{yz}(\Sigma^\star_{zz})^{-1}(\hat\Sigma_{zy} - \Sigma^\star_{zy})\|_{\ell_2} &\le C\,\|\hat\Sigma - \Sigma^\star\|_{\ell_2},\\
\|(\hat\Sigma_{yz} - \Sigma^\star_{yz})(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\|_{\ell_2} &\le C\,\|\hat\Sigma - \Sigma^\star\|_{\ell_2},\\
\|\hat\Sigma_{yz}(\hat\Sigma_{zz}^{-1} - (\Sigma^\star_{zz})^{-1})\hat\Sigma_{zy}\|_{\ell_2} &\le C\,\|\hat\Sigma - \Sigma^\star\|_{\ell_2}.
\end{aligned}
\]
Thus we can combine the above bounds to bound $J_2$ as
\[
J_2 = \|\hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}\|_{\ell_2} \le C\,\|\hat\Sigma - \Sigma^\star\|_{\ell_2}.
\]
In addition, due to the fact that $J_1 = \|\hat\Sigma_{yy} - \Sigma^\star_{yy}\|_{\ell_2} \le \|\hat\Sigma - \Sigma^\star\|_{\ell_2}$, we have
\[
J \le J_1 + J_2 \le C\,\|\hat\Sigma - \Sigma^\star\|_{\ell_2} \le C\,\|\hat\Sigma - \Sigma^\star\|_{\ell_1} = O_P(\xi_{n,p}).
\]
This completes the proof of Theorem 1.
Proof of Theorem 2
Proof. To simplify notation, we define $\mu_{y|z} = \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1} g(z)$,
$\Sigma_{y|z} = \Sigma^\star_{yy} - \Sigma^\star_{yz}(\Sigma^\star_{zz})^{-1}\Sigma^\star_{zy}$, $\hat\mu_{y|z} = \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat g(z)$ and
$\hat\Sigma_{y|z} = \hat\Sigma_{yy} - \hat\Sigma_{yz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}$. Then $L = \mu_{y|z} + \Phi^{-1}(\tau)\,\mathrm{vdiag}(\Sigma_{y|z})$ and
$\hat L = \hat\mu_{y|z} + \Phi^{-1}(\tau)\,\mathrm{vdiag}(\hat\Sigma_{y|z})$. In addition, we introduce $\tilde Q_\tau(y|z) = h^{-1}(\hat L)$.

In order to bound $\hat Q_\tau(y|z) - Q_\tau(y|z)$, we use the triangle inequality to obtain
\[
\|\hat Q_\tau(y|z) - Q_\tau(y|z)\|_{\max}
\le \|\hat Q_\tau(y|z) - \tilde Q_\tau(y|z)\|_{\max} + \|\tilde Q_\tau(y|z) - Q_\tau(y|z)\|_{\max}
= \|\hat h^{-1}(\hat L) - h^{-1}(\hat L)\|_{\max} + \|h^{-1}(\hat L) - h^{-1}(L)\|_{\max}
\equiv I + J.
\]
In the sequel, we derive probability bounds for $I$ and $J$ respectively.

To bound $I$, we first use the triangle inequality again to bound
\[
\|\hat L - L\|_{\max}
\le \|\hat\mu_{y|z} - \mu_{y|z}\|_{\max} + \Phi^{-1}(\tau)\,\|\mathrm{vdiag}(\hat\Sigma_{y|z}) - \mathrm{vdiag}(\Sigma_{y|z})\|_{\max}
\le \|\hat\mu_{y|z} - \mu_{y|z}\|_{\max} + \Phi^{-1}(\tau)\,\|\hat\Sigma_{y|z} - \Sigma_{y|z}\|_{\ell_2}
= O_P\Big(\sqrt{\tfrac{\log n \log p_0}{n^{\kappa}}} + \xi_{n,p}\sqrt{\log n}\Big).
\]
Notice that $\sqrt{n^{-\kappa}\log n\log p_0} + \xi_{n,p}\sqrt{\log n} = o(1) \le |L_j|$. Thus, with probability tending
to one, we have
\[
|\hat L_j| \le 2|L_j|, \quad\text{and hence}\quad \Phi(-2|L_j|) \le \Phi(\hat L_j) \le \Phi(2|L_j|). \tag{24}
\]
In view of (24), we use the probability inequality on page 75 of Serfling (1980) to obtain
\[
\Pr\big(|\hat F^{-1}_{p_0+j}(t) - F^{-1}_{p_0+j}(t)| > \epsilon\big) \le 2\exp(-2n\delta_\epsilon^2),
\]
where $\delta_\epsilon = \min_{t\in[-2|L_j|,\, 2|L_j|]}\big\{F_{p_0+j}(t+\epsilon) - F_{p_0+j}(t),\ F_{p_0+j}(t) - F_{p_0+j}(t-\epsilon)\big\}$, for any
$j = 1, \ldots, q_0$ and $t \in [\Phi(-2|L_j|), \Phi(2|L_j|)]$. Recall that
$\hat f = (\Phi^{-1}\circ\tilde F_1, \ldots, \Phi^{-1}\circ\tilde F_p) = (\hat g, \hat h)$. Note that
$\hat h^{-1}_j(\hat L_j) - h^{-1}_j(\hat L_j) = \hat F^{-1}_{p_0+j}(\Phi(\hat L_j)) - F^{-1}_{p_0+j}(\Phi(\hat L_j))$ is the $j$-th element of
$\hat h^{-1}(\hat L) - h^{-1}(\hat L)$ for $j = 1, \ldots, q_0$. Now we use the union bound to derive
\[
\Pr\big(\|\hat Q_\tau(y|z) - \tilde Q_\tau(y|z)\|_{\max} > \epsilon\big)
\le \sum_{j=1}^{q_0}\Pr\big(|\hat h^{-1}_j(\hat L_j) - h^{-1}_j(\hat L_j)| > \epsilon\big)
\le \sum_{j=1}^{q_0}\Pr\big(|\hat F^{-1}_{p_0+j}(\Phi(\hat L_j)) - F^{-1}_{p_0+j}(\Phi(\hat L_j))| > \epsilon\big)
\le \sum_{j=1}^{q_0}\max_{t\in[\Phi(-2|L_j|),\, \Phi(2|L_j|)]}\Pr\big(|\hat F^{-1}_{p_0+j}(t) - F^{-1}_{p_0+j}(t)| > \epsilon\big)
\le q_0\exp(-2nM^2\epsilon^2),
\]
where we use the fact that $\delta_\epsilon \ge \min_{t\in I_j}\psi_{p_0+j}(t)\,\epsilon \ge M\epsilon$ in the last inequality. Thus, we have
\[
\|\hat Q_\tau(y|z) - \tilde Q_\tau(y|z)\|_{\max} = O_P\Big(\frac{1}{M}\sqrt{\frac{\log q_0}{n}}\Big).
\]
To bound $J$, we apply the mean-value theorem to the $j$-th element of $h^{-1}(\hat L) - h^{-1}(L)$,
that is, $h^{-1}_j(\hat L_j) - h^{-1}_j(L_j) = F^{-1}_{p_0+j}(\Phi(\hat L_j)) - F^{-1}_{p_0+j}(\Phi(L_j))$. Namely,
\[
h^{-1}_j(\hat L_j) - h^{-1}_j(L_j) = (h^{-1}_j)'(\bar L_j)\,(\hat L_j - L_j)
= \frac{\phi(\bar L_j)}{\psi_{p_0+j}\big(F^{-1}_{p_0+j}(\Phi(\bar L_j))\big)}\,(\hat L_j - L_j),
\]
where $\bar L_j$ lies on the line segment between $\hat L_j$ and $L_j$, and in the second equality we use the
fact that $(h^{-1}_j)'(t) = (F^{-1}_{p_0+j}\circ\Phi)'(t) = \phi(t)/\psi_{p_0+j}(F^{-1}_{p_0+j}(\Phi(t)))$. With probability tending to
one, we have
\[
\|\bar L\| \le \|\bar L - L\| + \|L\| \le \|\hat L - L\| + \|L\| \le 2\|L\|.
\]
Then, $\phi(\bar L_j)/\psi_{p_0+j}(F^{-1}_{p_0+j}(\Phi(\bar L_j))) \le \phi(\bar L_j)/(\min_{x\in I_j}\psi_{p_0+j}(x)) \le (M\sqrt{2\pi})^{-1}$. Thus we have
\[
\|\tilde Q_\tau(y|z) - Q_\tau(y|z)\|_{\max} = \|h^{-1}(\hat L) - h^{-1}(L)\|_{\max}
\le \max_{j=1,\ldots,q_0}\frac{\phi(\bar L_j)}{\psi_{p_0+j}\big(F^{-1}_{p_0+j}(\Phi(\bar L_j))\big)}\,\|\hat L - L\|_{\max}
= O_P\Big(\frac{1}{M}\sqrt{\frac{\log n\log p_0}{n^{\kappa}}} + \frac{\xi_{n,p}}{M}\sqrt{\log n}\Big).
\]
Therefore, we obtain the desired bound for $\hat Q_\tau(y|z) - Q_\tau(y|z)$ as
\[
\|\hat Q_\tau(y|z) - Q_\tau(y|z)\|_{\max} \le I + J
= O_P\Big(\frac{1}{M}\sqrt{\frac{\log q_0}{n}} + \frac{1}{M}\sqrt{\frac{\log n\log p_0}{n^{\kappa}}} + \frac{\xi_{n,p}}{M}\sqrt{\log n}\Big).
\]
This completes the proof of Theorem 2.
Proof of Theorem 3
Proof. Define the rank-based soft-thresholding estimator
\[
\hat\Sigma^s_{\mathrm{soft}} = \big(s_\lambda(\hat r^s_{jl})\big)_{p\times p},
\]
where $s_\lambda(\cdot)$ applies the soft-thresholding rule to the off-diagonal elements of $\hat R^s$. Under the
event $\{\|\hat R^s - \Sigma^\star\|_{\max} \le c\,(n^{-1}\log p)^{1/2}\}$, we can follow the same line of proof as in Bickel
& Levina (2008b) to show that
\[
\sup_{\Sigma^\star \in \mathcal{G}_q}\big\|\hat\Sigma^s_{\mathrm{soft}} - \Sigma^\star\big\|_{\ell_1} \le C\cdot s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}.
\]
Given that $\lambda_{\min}(\Sigma^\star) \ge \varepsilon_0 \gg s_0(\log p/n)^{\frac{1-q}{2}} + \varepsilon$, it is easy to see that $\lambda_{\min}(\hat\Sigma^s_{\mathrm{soft}}) \ge \varepsilon$. Thus,
$\hat\Sigma^s_{\mathrm{soft}}$ is the unique solution to the convex rank-based positive-definite $\ell_1$ penalization under
the same event, which immediately implies that
\[
\sup_{\Sigma^\star \in \mathcal{G}_q}\big\|\hat\Sigma^s_{\ell_1} - \Sigma^\star\big\|_{\ell_1} \le C\cdot s_0\Big(\frac{\log p}{n}\Big)^{\frac{1-q}{2}}.
\]
Finally, we cite the entrywise estimation bound for $\hat R^s$ from Xue & Zou (2012),
\[
\Pr\big(\|\hat R^s - \Sigma^\star\|_{\max} \le c\,(n^{-1}\log p)^{1/2}\big) \longrightarrow 1,
\]
as $n$ tends to $\infty$. This completes the proof of Theorem 3.
Proof of Theorem 5
Proof. As demonstrated in Bickel & Levina (2008a), it is sufficient to bound $\|\hat\Theta^s_{\mathrm{chol}} - \Theta^\star\|_{\ell_1}$.
Denote by $B_k(U) = (u_{jl}I_{|j-l|\le k})_{p\times p}$ the $k$-banded version of $U = (u_{jl})_{p\times p}$. Let $A_k = B_k(A)$
be the $k$-banded version of $A$, and let $D_k$ be the corresponding diagonal matrix. Define
$\Theta_k = (I - A_k)'D_k^{-1}(I - A_k)$, and denote by $B_k(\Theta^\star)$ the $k$-banded version of $\Theta^\star$.

We first apply the triangle inequality to obtain
\[
\|\hat\Theta^s_{\mathrm{chol}} - \Theta^\star\|_{\ell_1} \le \|\hat\Theta^s_{\mathrm{chol}} - \Theta_k\|_{\ell_1} + \|\Theta_k - B_k(\Theta^\star)\|_{\ell_1} + \|B_k(\Theta^\star) - \Theta^\star\|_{\ell_1}. \tag{25}
\]
Notice that $\|\Theta_k - B_k(\Theta^\star)\|_{\ell_1}$ in (25) can be bounded as in Bickel & Levina (2008a), i.e.,
\[
\|\Theta_k - B_k(\Theta^\star)\|_{\ell_1} \le Ck^{-\alpha}.
\]
Also, by the definition of $\mathcal{H}_\alpha$, it is easy to see that $\|B_k(\Theta^\star) - \Theta^\star\|_{\ell_1} \le Ck^{-\alpha}$ holds. It is
therefore sufficient to bound $\|\hat\Theta^s_{\mathrm{chol}} - \Theta_k\|_{\ell_1}$ only.

Under the event $\{\|\tilde R^s - \Sigma^\star\|_{\max} \le \zeta\}$ with $k\zeta = o(1)$, we follow (A14)-(A15) in Bickel
& Levina (2008a) to show that $\|\hat A^s - A_k\|_{\max} \le C\zeta$, and likewise follow (A16)-(A17)
in Bickel & Levina (2008a) to show that $\|\hat D^s - D_k\|_{\max} \le Ck^2\zeta^2$. Then, by using the triangle
inequality, we obtain the following upper bound for $\|\hat\Theta^s_{\mathrm{chol}} - \Theta_k\|_{\ell_1}$ under the same event,
\[
\begin{aligned}
\|\hat\Theta^s_{\mathrm{chol}} - \Theta_k\|_{\ell_1}
&= \|(I - \hat A^s)'(\hat D^s)^{-1}(I - \hat A^s) - (I - A_k)'D_k^{-1}(I - A_k)\|_{\ell_1}\\
&\le \|(A_k - \hat A^s)'D_k^{-1}(I - \hat A^s)\|_{\ell_1} + \|(I - A_k)'D_k^{-1}(A_k - \hat A^s)\|_{\ell_1}
+ \|(I - \hat A^s)'\big((\hat D^s)^{-1} - D_k^{-1}\big)(I - \hat A^s)\|_{\ell_1}\\
&\le C\|A_k - \hat A^s\|_{\ell_1} + C\|\hat D^s - D_k\|_{\ell_1},
\end{aligned}
\]
where the last inequality uses the facts that $\|D^{-1}\|_{\ell_1} \le C_1$ and $\|I - A\|_{\ell_1} \le C_2$ by Lemma
A.2 of Bickel & Levina (2008a), together with $k\zeta = o(1)$. Note that
\[
\|\hat D^s - D_k\|_{\ell_1} \le \|\hat D^s - D_k\|_{\max} \le Ck^2\zeta^2
\]
and
\[
\|A_k - \hat A^s\|_{\ell_1} \le k\,\|A_k - \hat A^s\|_{\max} \le Ck\zeta.
\]
By using the fact that $k\zeta = o(1)$ again, we have
\[
\|\hat\Theta^s_{\mathrm{chol}} - \Theta_k\|_{\ell_1} \le Ck\zeta + Ck^2\zeta^2 \le Ck\zeta.
\]
Next we cite an entrywise estimation bound from Xue & Zou (2012): for any positive
quantity $\zeta$ with $\zeta = o(1)$, there exists some fixed constant $c$ such that
\[
\Pr\big(\|\hat R^s - \Sigma^\star\|_{\max} > \zeta\big) \le p^2\exp(-cn\zeta^2).
\]
Since $\tilde R^s$ is the nearest correlation matrix projection of $\hat R^s$, we have
\[
\|\tilde R^s - \Sigma^\star\|_{\max} \le 2\|\hat R^s - \Sigma^\star\|_{\max} \le 2\zeta.
\]
Thus, taking $\zeta = c\,n^{-1/2}\log^{1/2} p$ yields the stated upper bound for $\|\hat\Theta^s_{\mathrm{chol}} - \Theta^\star\|_{\ell_1}$. This
completes the proof of Theorem 5.
Acknowledgement
Jianqing Fan’s research is supported in part by R01GM100474-04 and National Science
Foundation grants DMS-1206464 and DMS-1406266. Hui Zou’s research is supported in
part by grants from the National Science Foundation grant DMS-0846068 and a grant from
the Office of Naval Research. Lingzhou Xue was supported by the National Institutes of
Health grant R01-GM072611-09.
References
Adam, B., Qu, Y., Davis, J., Ward, M., Clements, M., Cazares, L., Semmes, O., Schellham-
mer, P., Yasui, Y., Feng, Z. & Wright, G. (2002), ‘Serum protein fingerprinting coupled
with a pattern-matching algorithm distinguishes prostate cancer from benign prostate
hyperplasia and healthy men’, Cancer Research 62, 3609–3614.
Bickel, P. & Levina, E. (2008a), ‘Regularized estimation of large covariance matrices’, The
Annals of Statistics 36, 199–227.
Bickel, P. & Levina, E. (2008b), ‘Covariance regularization by thresholding’, The Annals of
Statistics 36, 2577–2604.
Bien, J. & Tibshirani, R. (2011), ‘Sparse estimation of a covariance matrix’, Biometrika
98, 807–820.
Box, G. E. P. & Cox, D. R. (1964), ‘An analysis of transformations (with discussion)’, Journal
of the Royal Statistical Society. Series B 26, 211–252.
Box, G. E. P. & Tidwell, P. W. (1962), ‘Transformation of the independent variables’, Tech-
nometrics 4, 531–550.
Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011), ‘Distributed optimization
and statistical learning via the alternating direction method of multipliers’, Foundations
and Trends in Machine Learning 3, 1–122.
Bradic, J., Fan, J. & Wang, W. (2011), ‘Penalized composite quasi-likelihood for ultrahigh
dimensional variable selection’, Journal of the Royal Statistical Society: Series B 73, 325–
349.
Breiman, L. & Friedman, J. (1985), ‘Estimating optimal transformations for multiple regres-
sion and correlation (with discussion)’, Journal of the American Statistical Association
80, 580–598.
Cai, T., Zhang, C. & Zhou, H. (2010), ‘Optimal rates of convergence for covariance matrix
estimation’, The Annals of Statistics 38, 2118–2144.
Danaher, P., Wang, P. & Witten, D. (2013), ‘The joint graphical lasso for inverse covariance
estimation across multiple classes’, Journal of the Royal Statistical Society: Series B, to
appear.
Devlin, S., Gnanadesikan, R. & Kettenring, J. (1975), ‘Robust estimation and outlier detec-
tion with correlation coefficients’, Biometrika 62, 531–545.
Duchi, J., Shalev-Shwartz, S., Singer, Y. & Chandra, T. (2008), ‘Efficient projections onto the
`1-ball for learning in high dimensions’, Proceedings of the 25th International Conference
on Machine Learning, pp. 272–279.
Fan, J., Liao, Y. & Mincheva, M. (2013), ‘Large covariance estimation by thresholding prin-
cipal orthogonal complements (with discussion)’, Journal of the Royal Statistical Society:
Series B 75, 603–680.
Fan, J., Ke, T., Liu, H. & Xia, L. (2013), ‘QUADRO: a supervised dimension reduction method via Rayleigh quotient optimization’, The Annals of Statistics, to appear.
Fan, Y., Hardle, W., Wang, W. & Zhu, L. (2013), ‘Composite quantile regression for the
single-index model’, SFB 649 Discussion Papers, No. 2013-010.
Friedman, J. H. (2001), ‘Greedy function approximation: a gradient boosting machine’, The
Annals of Statistics 29, 1189–1232.
Friedman, J. H. (2002), ‘Stochastic gradient boosting’, Computational Statistics & Data
Analysis 38, 367–378.
Friedman, J. H. & Stuetzle, W. (1981), ‘Projection pursuit regression’, Journal of the Amer-
ican statistical Association 76, 817–823.
Han, F. & Liu, H. (2012), ‘Semiparametric principal component analysis’, Advances in Neural
Information Processing Systems, pp. 171–179.
Huang, J., Liu, N., Pourahmadi, M. & Liu, L. (2006), ‘Covariance matrix selection and
estimation via penalised normal likelihood’, Biometrika 93, 85–98.
Kendall, M. (1948), Rank Correlation Methods, Charles Griffin and Co. Ltd., London.
Koenker, R. (2005), Quantile Regression, Cambridge University Press.
Koenker, R. & Bassett, G. J. (1978), ‘Regression quantiles’, Econometrica 46, 33–50.
Koenker, R. & Geling, O. (2001), ‘Reappraising medfly longevity: a quantile regression
survival analysis’, Journal of the American Statistical Association 96, 458–468.
Koenker, R. & Xiao, Z. (2006), ‘Quantile autoregression (with discussion)’, Journal of the
American Statistical Association 101, 980–990.
Lam, C. & Fan, J. (2009), ‘Sparsistency and rates of convergence in large covariance matrix
estimation’, The Annals of Statistics 37, 4254–4278.
Levina, E., Rothman, A. & Zhu, J. (2008), ‘Sparse estimation of large covariance matrices via a nested Lasso penalty’, The Annals of Applied Statistics 2, 245–263.
Lin, Y. & Jeon, Y. (2003), ‘Discriminant analysis through a semiparametric model’,
Biometrika 90, 379–392.
Liu, H., Han, F., Yuan, M., Lafferty, J. & Wasserman, L. (2012), ‘High-dimensional semiparametric Gaussian copula graphical models’, The Annals of Statistics 40, 2293–2326.
Liu, H., Lafferty, J. & Wasserman, L. (2009), ‘The nonparanormal: semiparametric esti-
mation of high dimensional undirected graphs’, Journal of Machine Learning Research
10, 1–37.
Ma, S., Xue, L. & Zou, H. (2013), ‘Alternating direction methods for latent variable Gaussian graphical model selection’, Neural Computation 25, 2172–2198.
Mai, Q. & Zou, H. (2012), ‘Semiparametric sparse discriminant analysis in ultra-high dimen-
sions’, Manuscript.
Qi, H. & Sun, D. (2006), ‘A quadratically convergent Newton method for computing the
nearest correlation matrix’, SIAM Journal on Matrix Analysis and Applications 28, 360–
385.
Rothman, A. J., Levina, E. & Zhu, J. (2009), ‘Generalized thresholding of large covariance
matrices’, Journal of the American Statistical Association 104, 177–186.
Rothman, A., Levina, E. & Zhu, J. (2010), ‘A new approach to cholesky-based covariance
regularization in high dimensions’, Biometrika 97, 539–550.
Serfling, R. (1980), Approximation Theorems of Mathematical Statistics, John Wiley & Sons.
Stone, C. J. (1985), ‘Additive regression and other nonparametric models’, The Annals of Statistics 13, 689–705.
Tibshirani, R. (1988), ‘Estimating transformations for regression via additivity and variance
stabilization’, Journal of the American Statistical Association 83, 394–405.
Wang, H. & He, X. (2007), ‘Detecting differential expressions in GeneChip microarray stud-
ies: a quantile approach’, Journal of the American Statistical Association 102, 104–112.
Wei, Y. & He, X. (2006), ‘Conditional growth charts’, The Annals of Statistics 34, 2069–
2097.
Wei, Y., Pere, A., Koenker, R. & He, X. (2006), ‘Quantile regression methods for reference
growth charts’, Statistics in Medicine 25, 1369–1382.
Wu, W. & Pourahmadi, M. (2003), ‘Nonparametric estimation of large covariance matrices
of longitudinal data’, Biometrika 90, 831–844.
Wu, T., Yu, K. & Yu, Y. (2010), ‘Single-index quantile regression’, Journal of Multivariate
Analysis 101, 1607–1621.
Xue, L., Ma, S. & Zou, H. (2012), ‘Positive definite `1 penalized estimation of large covariance
matrices’, Journal of the American Statistical Association 107, 1480–1491.
Xue, L. & Zou, H. (2012), ‘Regularized rank estimation of high-dimensional nonparanormal
graphical models’, The Annals of Statistics 40, 2541–2571.
Xue, L. & Zou, H. (2014), ‘Rank-based tapering estimation of bandable correlation matrices’,
Statistica Sinica 24, 83–100.
Yuan, M. & Lin, Y. (2007), ‘Model selection and estimation in the Gaussian graphical model’,
Biometrika 94, 19–35.
Zhao, T., Roeder, K. & Liu, H. (2012), ‘Smooth-projected neighborhood pursuit for high-
dimensional nonparanormal graph estimation’, Advances in Neural Information Processing
Systems, pp. 162–170.
Zhu, L., Huang, M. & Li, R. (2012), ‘Semiparametric quantile regression with high-
dimensional covariates’, Statistica Sinica 22, 1379–1401.
Zou, H. & Yuan, M. (2008), ‘Composite quantile regression and the oracle model selection
theory’, The Annals of Statistics 36, 1108–1126.