The Lasso with Nearly Orthogonal Latin Hypercube
Designs
(February 19, 2013)
Xinwei Deng
Department of Statistics
Virginia Tech
C. Devon Lin
Department of Mathematics and Statistics
Queen’s University, Canada
Peter Z. G. Qian
Department of Statistics
University of Wisconsin-Madison
Abstract
We consider the Lasso problem when the input values need to take multiple lev-
els. In this situation, we propose to use nearly orthogonal Latin hypercube designs,
originally motivated by computer experiments, to significantly enhance the variable
selection accuracy of the Lasso. The use of such designs ensures small column-wise
correlations in variable selection and gives flexibility in identifying nonlinear effects
in the model. Systematic methods for constructing such designs are presented. The
effectiveness of the proposed method is illustrated by several numerical examples.
Keywords: Design of experiments; Nearly orthogonal design; Orthogonal array; Space-
filling design; Variable selection.
1 INTRODUCTION
Experiments with a large number of predictor variables, also called factors in experimental
design, are now widely used. In such an experiment often only a few of the predictor variables
have significant impact on the response. Identifying those significant factors is known as
factor screening in the experimental design literature (Box, Hunter, and Hunter 2005; Yuan,
Joseph, and Lin, 2007). For example, Lin and Sitter (2008) reported an experiment from
Los Alamos National Laboratory where only 5-7 out of 53 predictor variables are important.
Factor screening is a variable selection problem with the advantage that the input values can
be chosen. In the variable selection literature, the Lasso (l1-penalized regression) is a very
popular method (Tibshirani, 1996). This method can be described as follows. Consider the
linear model
y = Σ_{j=1}^p xjβj + ε,  (1)
where x = (x1, . . . , xp) is the vector of p predictor variables. Here y is the response variable,
β = (β1, . . . , βp) is the vector of regression parameters, and ε is a normal random variable
with mean zero and variance σ2. Suppose the model in (1) has a sparse structure in that
only p0 entries of β are non-zero and the active set
A(β) = {j : βj ≠ 0, j = 1, . . . , p}  (2)
has p0 elements. Assume, throughout, the response y is centered so that the model in (1)
does not include an intercept. For a given design matrix X containing n rows, x1, . . . ,xn,
in p predictor variables and a given response vector y = (y1, . . . , yn), the Lasso estimate of
β for the model in (1) is
β̂ = arg min_β [(y − Xβ)ᵀ(y − Xβ) + λ‖β‖1],  (3)

where λ ≥ 0 is a tuning parameter and ‖β‖1 is the l1 norm of β defined by ‖β‖1 = Σ_{i=1}^p |βi|.
Because the l1 norm is singular at the origin, some entries of β̂ can be exactly zero given
a properly chosen λ value. A larger value of λ will result in more zero entries in β̂. The
non-zero entries of β̂ in (3) are used for the recovery of the sparse structure of the model in
(1). Therefore, the active set A(β) in (2) can be estimated by
A(β̂) = {j : β̂j ≠ 0, j = 1, . . . , p}.
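The estimated active set is simple to obtain in practice. A minimal sketch with scikit-learn's Lasso (an assumption of ours, not the authors' software; note scikit-learn minimizes (1/(2n))‖y − Xβ‖² + α‖β‖1, the criterion in (3) up to the rescaling λ = 2nα, and the data here are synthetic for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 64, 20
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]            # sparse truth: p0 = 3 active variables

X = rng.uniform(size=(n, p))
y = X @ beta + 0.1 * rng.standard_normal(n)
X = X - X.mean(axis=0)                 # center the columns
y = y - y.mean()                       # center the response, so no intercept is needed

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1
fit = Lasso(alpha=0.01, fit_intercept=False).fit(X, y)
active = {j for j in range(p) if fit.coef_[j] != 0.0}
print(sorted(active))
```

With this tuning the three truly active variables are recovered, possibly alongside a few false positives.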
The Lasso solution β̂ in (3) might not be perfect in identifying the true active variables. A
false selection of β̂ can be a false positive or a false negative, where a false positive occurs if
j ∈ A(β̂) but j /∈ A(β) and a false negative takes place if j /∈ A(β̂) but j ∈ A(β). Denote
the number of false selections by γ, i.e.
γ = Σ_j I{j ∈ A(β̂) but j ∉ A(β)} + Σ_j I{j ∉ A(β̂) but j ∈ A(β)},  (4)
where I(·) is an indicator function.
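The count γ in (4) is computed directly from the two active sets; a small sketch (the helper name and example vectors are ours):

```python
import numpy as np

def false_selections(beta_true, beta_hat):
    """Number of false selections, gamma in (4):
    false positives plus false negatives."""
    A = set(np.nonzero(beta_true)[0])      # true active set A(beta)
    A_hat = set(np.nonzero(beta_hat)[0])   # estimated active set A(beta_hat)
    fp = len(A_hat - A)                    # selected but truly inactive
    fn = len(A - A_hat)                    # truly active but missed
    return fp + fn

beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
beta_hat  = np.array([1.8, 0.3,  0.0, 0.0, 0.0])
print(false_selections(beta_true, beta_hat))  # -> 2 (one false positive, one false negative)
```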
Statistical properties, computation and extensions of the Lasso method have been actively
studied in recent years (Tibshirani, 2011). In this article, we study the Lasso from an
experimental design perspective. Both two-level designs and multi-level designs are used
widely in practice (Wu and Hamada, 2009) for different situations depending on the specific
need. In some examples, predictors like two choices of cooling materials are limited to two
levels because of the nature or restriction of the experiment. In other examples, predictors
like pressure and temperature are continuous in nature and need to take multiple levels. It
is hard to directly compare multi-level designs and two-level designs for a given problem.
There is some optimality theory associated with two-level designs (Fedorov and Hackl,
1997). It is not clear whether these results apply to the Lasso problem, especially for the
p > n situation.
Here we consider the Lasso under a multi-level design, which is an n×p matrix X in (3).
This is the situation in which each predictor variable needs to take multiple levels. Compared
with a two-level design with the points on the boundary, a multi-level design places points
in the interior and on the boundary of the design space. Multi-level designs can detect some
nonlinear effects in the data (Box and Draper, 1959) and allow certain model inadequacy.
Our motivation is that the structure of X can greatly affect the number of false selections γ in
(4). If the columns of X are generated with a dependent structure, large pairwise correlations
can result. Such collinearity among the predictor variables leads to inaccurate variable
selection and hence a large value of γ. Therefore, it is important
to reduce the linear dependency among columns when generating the design matrix X. A
seemingly better solution is to generate these columns independently using the
simple random sampling method. Although the columns of X generated by this scheme are
stochastically independent, the largest pairwise sample correlations are not guaranteed to
be small especially when the number of predictor variables p is larger than the sample size
n. For illustration, consider a 64× 192 matrix generated from U [0, 1) by using the random
sampling method. Figure 1 depicts the histogram of the sample correlations of the columns
of X, where, in absolute values, 43% of these correlations are larger than 0.1.
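This behavior is easy to reproduce with numpy (the seed here is arbitrary, so the exact percentage varies from realization to realization, but it stays near the 43% reported above):

```python
import numpy as np

rng = np.random.default_rng(2013)
n, p = 64, 192
X = rng.uniform(size=(n, p))            # simple random sampling from U[0, 1)

rho = np.corrcoef(X, rowvar=False)      # 192 x 192 column-wise correlation matrix
upper = rho[np.triu_indices(p, k=1)]    # the p(p-1)/2 distinct pairs
frac = float(np.mean(np.abs(upper) > 0.1))
print(f"{frac:.0%} of the |correlations| exceed 0.1")
```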
Figure 1. Histogram of the column-wise correlations of a 64 × 192 design generated by
random sampling from U[0, 1).
To improve upon the simple random sampling method of generating the design matrix
X in (3), we propose to take X to be a nearly orthogonal Latin hypercube design (NOLHD),
originally motivated by computer experiments (Sacks et al., 1989) and used here in the
new application of the Lasso. Such a design is a Latin hypercube design with guaranteed
small column-wise sample correlations. An NOLHD (i) is space-filling in terms of one-
dimensional stratification, and (ii) pursues low correlations among the columns. Because of
the first property, the least squares estimate of an additive model under a Latin hypercube
design has smaller variance than that with a random sample (Owen 1992, Section 3). The
Lasso solution β̂ in (3) inherits this advantage since the model in (1) is additive. Because
of the second property, the Lasso with an NOLHD performs better than its counterpart
with a random Latin hypercube design that can have large column-wise sample correlations
(McKay, Beckman, and Conover, 1979). Furthermore, the sparsity of the model in (1) and
the concept of an NOLHD are seamlessly connected: when X in (3) is an NOLHD in p factors,
its projection onto any p0 active factors gives a smaller NOLHD in p0 factors.
The remainder of the article is organized as follows. Section 2 presents two systematic
methods of constructing NOLHDs. The first method is taken from Lin, Mukerjee, and Tang
(2009). The second method is a generalization of the method in Lin, Bingham, Sitter, and
Tang (2010). A new criterion to measure the near orthogonality of an NOLHD for the
Lasso is also introduced in this section. Section 3 provides numerical examples to bear out
the advantage of NOLHDs for the Lasso over random Latin hypercube designs and random
samples. We provide some discussion in Section 4, including a comparison with several
popular optimal two-level designs.
2 METHODOLOGY
In this section, we discuss the construction of NOLHDs and define a new criterion to measure
the orthogonality of NOLHDs for the Lasso problem. Here are some notation and definitions.
Let [A B] denote the column-by-column juxtaposition of matrices A and B with the same number of rows. The
Kronecker product of an n× p matrix A = (aij) and an m× q matrix B = (bij) is
A ⊗ B =
⎡ a11B  a12B  …  a1pB ⎤
⎢ a21B  a22B  …  a2pB ⎥
⎢   ⋮      ⋮    ⋱    ⋮  ⎥
⎣ an1B  an2B  …  anpB ⎦ ,

where aijB is an m × q matrix with the (k, l) entry aij bkl. The (i, j)th entry of the p × p
correlation matrix ρ of an n × p matrix X = (xij) is

ρij = Σ_{k=1}^n (xki − x̄i)(xkj − x̄j) / [Σ_k (xki − x̄i)² Σ_k (xkj − x̄j)²]^{1/2},  (5)

which represents the Pearson correlation between columns i and j, with x̄i = n⁻¹ Σ_{k=1}^n xki
and x̄j = n⁻¹ Σ_{k=1}^n xkj. If ρ is an identity matrix, X is orthogonal.
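As a quick sanity check (a sketch using numpy; the small matrices are arbitrary illustrations of ours), `np.kron` matches the block-matrix definition of A ⊗ B, and the correlation formula in (5), written out explicitly, agrees with `np.corrcoef`:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])                  # a 2 x 2 matrix A = (a_ij)
B = np.array([[0, 5],
              [6, 7]])                  # a 2 x 2 matrix B

K = np.kron(A, B)                       # block matrix whose (i, j) block is a_ij * B
assert np.array_equal(K[:2, 2:], 2 * B) # top-right block is a_12 * B = 2B
print(K.shape)                          # an nm x pq matrix: (4, 4)

# the correlation matrix (5) of a design X, computed from first principles
X = np.random.default_rng(1).uniform(size=(8, 3))
Xc = X - X.mean(axis=0)                 # subtract the column means x-bar
ss = (Xc ** 2).sum(axis=0)              # column sums of squares
rho = (Xc.T @ Xc) / np.sqrt(np.outer(ss, ss))
assert np.allclose(rho, np.corrcoef(X, rowvar=False))
```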
Construction of a random Latin hypercube design (RLHD) Z = (zij) on [0, 1)p of n
runs for p factors consists of (i) generating a matrix D = (dij) with columns being random
permutations of 1, . . . , n, and (ii) using
zij = (dij − uij)/n,  i = 1, . . . , n, j = 1, . . . , p,  (6)
where the uij are independent U[0,1) random variables, and the dij and the uij are mutually
independent. To define Z on [a, b)p for general a < b, rescale zij in (6) to
(b− a)zij + a. (7)
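The two-step construction above, with the rescaling in (7), can be sketched in a few lines (the helper name is ours):

```python
import numpy as np

def random_lhd(n, p, a=0.0, b=1.0, seed=None):
    """Random Latin hypercube design of n runs for p factors on [a, b)^p,
    built by steps (i)-(ii) and rescaled as in (7)."""
    rng = np.random.default_rng(seed)
    # (i) each column of D is a random permutation of 1, ..., n
    D = np.column_stack([rng.permutation(np.arange(1, n + 1)) for _ in range(p)])
    # (ii) jitter with independent U[0, 1) variables: z_ij = (d_ij - u_ij) / n
    U = rng.uniform(size=(n, p))
    Z = (D - U) / n
    return (b - a) * Z + a

Z = random_lhd(8, 3, seed=0)
# one-dimensional stratification: each column has exactly one point
# in each of the n = 8 equal slices of [0, 1)
print(np.sort(np.floor(8 * Z).astype(int), axis=0)[:, 0])  # -> [0 1 2 3 4 5 6 7]
```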
An NOLHD is a Latin hypercube design with small column-wise correlations. Though
for some values of n and p such designs with exactly orthogonal columns are available,
they exist more generally under the nearly orthogonal condition even for p ≥ n. Popular
criteria for defining the (near) orthogonality of a matrix include the maximum correlation
ρm = max_{i,j} |ρij| and the root average squared correlation ρave = {Σ_{i<j} ρij² / [p(p − 1)/2]}^{1/2}
(Bingham, Sitter, and Tang, 2009), where ρij is given in (5). For the Lasso problem here,
we introduce a new criterion, called the correlation percentage vector. For an integer q and
a vector t = (t1, . . . , tq) with 1 ≥ t1 > t2 > · · · > tq ≥ 0, the correlation percentage vector of
a design X is
δ(X) = (δt1 , . . . , δtq), (8)
where δtk = {p(p − 1)/2}⁻¹ Σ_{i=1}^p Σ_{j>i} I(|ρij| ≤ tk) for k = 1, . . . , q, and I(·) is an indicator
function. Elements of the vector t are generally chosen to be relatively small, say in the
range of (0.1,0.005), to reflect the near orthogonality of a design. For k = 1, . . . , q, δtk
computes the proportion of the |ρij| not exceeding tk. For two designs X1 and X2 of the
same size, X1 is preferred over X2 if δtk(X1) > δtk(X2) for k = 1, . . . , q. This criterion is
more discriminating than the maximum correlation criterion and the root average squared
correlation criterion. Designs with similar ρm or ρave values can have different δ values.
For example, a 64 × 192 matrix X1 generated by the simple random sampling method and
an NOLHD X2 of the same size constructed in Section 2.1 have ρave = 0.124 and 0.112,
respectively, and thus are indistinguishable. But for (t1, t2, t3, t4) = (0.1, 0.05, 0.01, 0.005),
δ(X1) = (0.562, 0.305, 0.064, 0.033) and δ(X2) = (0.906, 0.894, 0.883, 0.883), indicating the
superiority of X2. Examples in Section 3 will demonstrate that the Lasso under designs with
larger correlation percentage vectors appears to give more accurate variable selection.
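The criterion in (8) is straightforward to compute; a minimal sketch (the function name is ours, and the random design below is only an illustration of the low δ values typical of simple random sampling):

```python
import numpy as np

def delta(X, t=(0.1, 0.05, 0.01, 0.005)):
    """Correlation percentage vector (8): for each threshold t_k, the
    proportion of the p(p-1)/2 pairwise |correlations| not exceeding t_k."""
    p = X.shape[1]
    rho = np.corrcoef(X, rowvar=False)
    abs_upper = np.abs(rho[np.triu_indices(p, k=1)])
    return np.array([np.mean(abs_upper <= tk) for tk in t])

# a 64 x 192 random-sampling design: a large share of |correlations| exceeds 0.1,
# so the entries of delta(X) are far from 1
X = np.random.default_rng(0).uniform(size=(64, 192))
print(delta(X))
```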
Sections 2.1 and 2.2 present two easy-to-implement methods for constructing NOLHDs.
The first is taken from Lin, Mukerjee, and Tang (2009) and the second is a generalization
of the method in Lin et al. (2010). To assist readers unfamiliar with experimental design,
we describe these methods in a self-contained fashion. Other construction methods for
NOLHDs can be found in Ye (1998), Steinberg and Lin (2006), Pang, Liu, and Lin (2009),
and Sun, Liu, and Lin (2009, 2010), among others. Hereafter, let NOLHD(n, p) denote an
NOLHD with n rows and p columns, where the n levels in each column are taken to be