The Lasso with Nearly Orthogonal Latin Hypercube
Designs
(February 19, 2013)
Xinwei Deng
Department of Statistics
Virginia Tech
C. Devon Lin
Department of Mathematics and Statistics
Queen’s University, Canada
Peter Z. G. Qian
Department of Statistics
University of Wisconsin-Madison
Abstract
We consider the Lasso problem when the input values need to take multiple lev-
els. In this situation, we propose to use nearly orthogonal Latin hypercube designs,
originally motivated by computer experiments, to significantly enhance the variable
selection accuracy of the Lasso. The use of such designs ensures small column-wise
correlations in variable selection and gives flexibility in identifying nonlinear effects
in the model. Systematic methods for constructing such designs are presented. The
effectiveness of the proposed method is illustrated by several numerical examples.
Keywords: Design of experiments; Nearly orthogonal design; Orthogonal array; Space-
filling design; Variable selection.
1 INTRODUCTION
Experiments with a large number of predictor variables, also called factors in experimental
design, are now widely used. In such an experiment often only a few of the predictor variables
have significant impact on the response. Identifying those significant factors is known as
factor screening in the experimental design literature (Box, Hunter, and Hunter 2005; Yuan,
Joseph, and Lin, 2007). For example, Lin and Sitter (2008) reported an experiment from
Los Alamos National Laboratory where only 5-7 out of 53 predictor variables are important.
Factor screening is a variable selection problem with the advantage that the input values can
be chosen. In the variable selection literature, the Lasso (l1-penalized regression) is a very
popular method (Tibshirani, 1996). This method can be described as follows. Consider the
linear model
y = Σ_{j=1}^p xjβj + ε,  (1)
where x = (x1, . . . , xp) is the vector of p predictor variables. Here y is the response variable,
β = (β1, . . . , βp) is the vector of regression parameters, and ε is a normal random variable
with mean zero and variance σ2. Suppose the model in (1) has a sparse structure in that
only p0 entries of β are non-zero and the active set
A(β) = {j : βj ≠ 0, j = 1, . . . , p}  (2)
has p0 elements. Assume, throughout, the response y is centered so that the model in (1)
does not include an intercept. For a given design matrix X containing n rows, x1, . . . ,xn,
in p predictor variables and a given response vector y = (y1, . . . , yn), the Lasso estimate of
β for the model in (1) is
β̂ = arg min_β [(y − Xβ)ᵀ(y − Xβ) + λ‖β‖1],  (3)

where λ ≥ 0 is a tuning parameter and ‖β‖1 is the l1 norm of β defined by ‖β‖1 = Σ_{i=1}^p |βi|.
Because the l1 norm is singular at the origin, some entries of β̂ can be exactly zero given
a properly chosen λ value. A larger value of λ will result in more zero entries in β̂. The
non-zero entries of β̂ in (3) are used for the recovery of the sparse structure of the model in
(1). Therefore, the active set A(β) in (2) can be estimated by
A(β̂) = {j : β̂j ≠ 0, j = 1, . . . , p}.
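The estimated active set is simple to obtain in practice. A minimal sketch with scikit-learn's Lasso (an assumption of ours, not the authors' software; note scikit-learn minimizes (1/(2n))‖y − Xβ‖² + α‖β‖1, the criterion in (3) up to the rescaling λ = 2nα, and the data here are synthetic for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 64, 20
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]            # sparse truth: p0 = 3 active variables

X = rng.uniform(size=(n, p))
y = X @ beta + 0.1 * rng.standard_normal(n)
X = X - X.mean(axis=0)                 # center the columns
y = y - y.mean()                       # center the response, so no intercept is needed

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1
fit = Lasso(alpha=0.01, fit_intercept=False).fit(X, y)
active = {j for j in range(p) if fit.coef_[j] != 0.0}
print(sorted(active))
```

With this tuning the three truly active variables are recovered, possibly alongside a few false positives.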
The Lasso solution β̂ in (3) might not be perfect in identifying the true active variables. A
false selection of β̂ can be a false positive or a false negative, where a false positive occurs if
j ∈ A(β̂) but j /∈ A(β) and a false negative takes place if j /∈ A(β̂) but j ∈ A(β). Denote
the number of false selections by γ, i.e.
γ = Σ_j I{j ∈ A(β̂) but j ∉ A(β)} + Σ_j I{j ∉ A(β̂) but j ∈ A(β)},  (4)
where I(·) is an indicator function.
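The count γ in (4) is computed directly from the two active sets; a small sketch (the helper name and example vectors are ours):

```python
import numpy as np

def false_selections(beta_true, beta_hat):
    """Number of false selections, gamma in (4):
    false positives plus false negatives."""
    A = set(np.nonzero(beta_true)[0])      # true active set A(beta)
    A_hat = set(np.nonzero(beta_hat)[0])   # estimated active set A(beta_hat)
    fp = len(A_hat - A)                    # selected but truly inactive
    fn = len(A - A_hat)                    # truly active but missed
    return fp + fn

beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
beta_hat  = np.array([1.8, 0.3,  0.0, 0.0, 0.0])
print(false_selections(beta_true, beta_hat))  # -> 2 (one false positive, one false negative)
```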
Statistical properties, computation and extensions of the Lasso method have been actively
studied in recent years (Tibshirani, 2011). In this article, we study the Lasso from an
experimental design perspective. Both two-level designs and multi-level designs are used
widely in practice (Wu and Hamada, 2009) for different situations depending on the specific
need. In some examples, predictors like two choices of cooling materials are limited to two
levels because of the nature or restriction of the experiment. In other examples, predictors
like pressure and temperature are continuous in nature and need to take multiple levels. It
is hard to directly compare multi-level designs and two-level designs for a given problem.
There is some optimality theory associated with two-level designs (Fedorov and Hackl,
1997). It is not clear whether these results apply to the Lasso problem, especially for the
p > n situation.
Here we consider the Lasso under a multi-level design, which is an n×p matrix X in (3).
This is the situation in which each predictor variable needs to take multiple levels. Compared
with a two-level design with the points on the boundary, a multi-level design places points
in the interior and on the boundary of the design space. Multi-level designs can detect some
nonlinear effects in the data (Box and Draper, 1959) and allow certain model inadequacy.
Our motivation is that the structure of X can greatly affect the number of false selections γ in
(4). If the columns of X are generated with a dependent structure, large pairwise correlations
can result. Such collinearity among the predictor variables leads to inaccurate variable
selection and hence a large value of γ. Therefore, it is important
to reduce the linear dependency among columns when generating the design matrix X. A
seemingly better solution is to generate these columns independently using the
simple random sampling method. Although the columns of X generated by this scheme are
stochastically independent, the largest pairwise sample correlations are not guaranteed to
be small especially when the number of predictor variables p is larger than the sample size
n. For illustration, consider a 64× 192 matrix generated from U [0, 1) by using the random
sampling method. Figure 1 depicts the histogram of the sample correlations of the columns
of X, where, in absolute values, 43% of these correlations are larger than 0.1.
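This behavior is easy to reproduce with numpy (the seed here is arbitrary, so the exact percentage varies from realization to realization, but it stays near the 43% reported above):

```python
import numpy as np

rng = np.random.default_rng(2013)
n, p = 64, 192
X = rng.uniform(size=(n, p))            # simple random sampling from U[0, 1)

rho = np.corrcoef(X, rowvar=False)      # 192 x 192 column-wise correlation matrix
upper = rho[np.triu_indices(p, k=1)]    # the p(p-1)/2 distinct pairs
frac = float(np.mean(np.abs(upper) > 0.1))
print(f"{frac:.0%} of the |correlations| exceed 0.1")
```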
Figure 1. Histogram of the column-wise correlations of a 64 × 192 design generated by
random sampling from U[0, 1).
To improve upon the simple random sampling method of generating the design matrix
X in (3), we propose to take X to be a nearly orthogonal Latin hypercube design (NOLHD),
originally motivated by computer experiments (Sacks et al., 1989) and used here in the
new application of the Lasso. Such a design is a Latin hypercube design with guaranteed
small column-wise sample correlations. An NOLHD (i) is space-filling in terms of one-
dimensional stratification, and (ii) pursues low correlations among the columns. Because of
the first property, the least squares estimate of an additive model under a Latin hypercube
design has smaller variance than that with a random sample (Owen 1992, Section 3). The
Lasso solution β̂ in (3) inherits this advantage since the model in (1) is additive. Because
of the second property, the Lasso with an NOLHD performs better than its counterpart
with a random Latin hypercube design that can have large column-wise sample correlations
(McKay, Beckman, and Conover, 1979). Furthermore, the sparsity of the model in (1) and
the concept of an NOLHD are seamlessly connected: when X in (3) is an NOLHD in p factors,
its projection onto any p0 active factors gives a smaller NOLHD in p0 factors.
The remainder of the article is organized as follows. Section 2 presents two systematic
methods of constructing NOLHDs. The first method is taken from Lin, Mukerjee, and Tang
(2009). The second method is a generalization of the method in Lin, Bingham, Sitter, and
Tang (2010). A new criterion to measure the near orthogonality of an NOLHD for the
Lasso is also introduced in this section. Section 3 provides numerical examples to bear out
the advantage of NOLHDs for the Lasso over random Latin hypercube designs and random
samples. We provide some discussion in Section 4, including a comparison with several
popular optimal two-level designs.
2 METHODOLOGY
In this section, we discuss the construction of NOLHDs and define a new criterion to measure
the orthogonality of NOLHDs for the Lasso problem. Here are some notation and definitions.
Let [A B] denote the column-by-column juxtaposition of matrices A and B with the same number of rows. The
Kronecker product of an n× p matrix A = (aij) and an m× q matrix B = (bij) is
A ⊗ B =
⎡ a11B  a12B  …  a1pB ⎤
⎢ a21B  a22B  …  a2pB ⎥
⎢   ⋮      ⋮    ⋱    ⋮  ⎥
⎣ an1B  an2B  …  anpB ⎦ ,

where aijB is an m × q matrix with the (k, l) entry aij bkl. The (i, j)th entry of the p × p
correlation matrix ρ of an n × p matrix X = (xij) is

ρij = Σ_{k=1}^n (xki − x̄i)(xkj − x̄j) / [Σ_k (xki − x̄i)² Σ_k (xkj − x̄j)²]^{1/2},  (5)

which represents the Pearson correlation between columns i and j, with x̄i = n⁻¹ Σ_{k=1}^n xki
and x̄j = n⁻¹ Σ_{k=1}^n xkj. If ρ is an identity matrix, X is orthogonal.
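As a quick sanity check (a sketch using numpy; the small matrices are arbitrary illustrations of ours), `np.kron` matches the block-matrix definition of A ⊗ B, and the correlation formula in (5), written out explicitly, agrees with `np.corrcoef`:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])                  # a 2 x 2 matrix A = (a_ij)
B = np.array([[0, 5],
              [6, 7]])                  # a 2 x 2 matrix B

K = np.kron(A, B)                       # block matrix whose (i, j) block is a_ij * B
assert np.array_equal(K[:2, 2:], 2 * B) # top-right block is a_12 * B = 2B
print(K.shape)                          # an nm x pq matrix: (4, 4)

# the correlation matrix (5) of a design X, computed from first principles
X = np.random.default_rng(1).uniform(size=(8, 3))
Xc = X - X.mean(axis=0)                 # subtract the column means x-bar
ss = (Xc ** 2).sum(axis=0)              # column sums of squares
rho = (Xc.T @ Xc) / np.sqrt(np.outer(ss, ss))
assert np.allclose(rho, np.corrcoef(X, rowvar=False))
```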
Construction of a random Latin hypercube design (RLHD) Z = (zij) on [0, 1)p of n
runs for p factors consists of (i) generating a matrix D = (dij) with columns being random
permutations of 1, . . . , n, and (ii) using
zij = (dij − uij)/n,  i = 1, . . . , n, j = 1, . . . , p,  (6)
where the uij are independent U[0,1) random variables, and the dij and the uij are mutually
independent. To define Z on [a, b)p for general a < b, rescale zij in (6) to
(b− a)zij + a. (7)
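The two-step construction above, with the rescaling in (7), can be sketched in a few lines (the helper name is ours):

```python
import numpy as np

def random_lhd(n, p, a=0.0, b=1.0, seed=None):
    """Random Latin hypercube design of n runs for p factors on [a, b)^p,
    built by steps (i)-(ii) and rescaled as in (7)."""
    rng = np.random.default_rng(seed)
    # (i) each column of D is a random permutation of 1, ..., n
    D = np.column_stack([rng.permutation(np.arange(1, n + 1)) for _ in range(p)])
    # (ii) jitter with independent U[0, 1) variables: z_ij = (d_ij - u_ij) / n
    U = rng.uniform(size=(n, p))
    Z = (D - U) / n
    return (b - a) * Z + a

Z = random_lhd(8, 3, seed=0)
# one-dimensional stratification: each column has exactly one point
# in each of the n = 8 equal slices of [0, 1)
print(np.sort(np.floor(8 * Z).astype(int), axis=0)[:, 0])  # -> [0 1 2 3 4 5 6 7]
```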
An NOLHD is a Latin hypercube design with small column-wise correlations. Though
for some values of n and p such designs with exactly orthogonal columns are available,
they exist more generally under the nearly orthogonal condition even for p ≥ n. Popular
criteria for defining the (near) orthogonality of a matrix include the maximum correlation
ρm = max_{i,j} |ρij| and the root average squared correlation ρave = {Σ_{i<j} ρij² / [p(p − 1)/2]}^{1/2}
(Bingham, Sitter, and Tang, 2009), where ρij is given in (5). For the Lasso problem here,
we introduce a new criterion, called the correlation percentage vector. For an integer q and
a vector t = (t1, . . . , tq) with 1 ≥ t1 > t2 > · · · > tq ≥ 0, the correlation percentage vector of
a design X is
δ(X) = (δt1 , . . . , δtq), (8)
where δtk = {p(p − 1)/2}⁻¹ Σ_{i=1}^p Σ_{j>i} I(|ρij| ≤ tk) for k = 1, . . . , q, and I(·) is an indicator
function. Elements of the vector t are generally chosen to be relatively small, say in the
range of (0.1,0.005), to reflect the near orthogonality of a design. For k = 1, . . . , q, δtk
computes the proportion of the |ρij| not exceeding tk. For two designs X1 and X2 of the
same size, X1 is preferred over X2 if δtk(X1) > δtk(X2) for k = 1, . . . , q. This criterion is
more discriminating than the maximum correlation criterion and the root average squared
correlation criterion. Designs with similar ρm or ρave values can have different δ values.
For example, a 64 × 192 matrix X1 generated by the simple random sampling method and
an NOLHD X2 of the same size constructed in Section 2.1 have ρave = 0.124 and 0.112,
respectively, and thus are indistinguishable. But for (t1, t2, t3, t4) = (0.1, 0.05, 0.01, 0.005),
δ(X1) = (0.562, 0.305, 0.064, 0.033) and δ(X2) = (0.906, 0.894, 0.883, 0.883), indicating the
superiority of X2. Examples in Section 3 will demonstrate that the Lasso under designs with
larger correlation percentage vectors appears to give more accurate variable selection.
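The criterion in (8) is straightforward to compute; a minimal sketch (the function name is ours, and the random design below is only an illustration of the low δ values typical of simple random sampling):

```python
import numpy as np

def delta(X, t=(0.1, 0.05, 0.01, 0.005)):
    """Correlation percentage vector (8): for each threshold t_k, the
    proportion of the p(p-1)/2 pairwise |correlations| not exceeding t_k."""
    p = X.shape[1]
    rho = np.corrcoef(X, rowvar=False)
    abs_upper = np.abs(rho[np.triu_indices(p, k=1)])
    return np.array([np.mean(abs_upper <= tk) for tk in t])

# a 64 x 192 random-sampling design: a large share of |correlations| exceeds 0.1,
# so the entries of delta(X) are far from 1
X = np.random.default_rng(0).uniform(size=(64, 192))
print(delta(X))
```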
Sections 2.1 and 2.2 present two easy-to-implement methods for constructing NOLHDs.
The first is taken from Lin, Mukerjee, and Tang (2009) and the second is a generalization
of the method in Lin et al. (2010). To assist readers unfamiliar with experimental design,
we describe these methods in a self-contained fashion. Other construction methods for
NOLHDs can be found in Ye (1998), Steinberg and Lin (2006), Pang, Liu, and Lin (2009),
and Sun, Liu, and Lin (2009, 2010), among others. Hereafter, let NOLHD(n, p) denote an
NOLHD with n rows and p columns, where the n levels in each column are taken to be