FACTOR ANALYSIS FOR HIGH-DIMENSIONAL DATA

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF STATISTICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Jingshu Wang
July 2016
4.1 REE survival plots for estimating X: the proportion of samples with REE exceeding the number on the horizontal axis. Figure 4.1a shows all 6000 samples. Figure 4.1b shows only the 3000 simulations of larger matrices of each aspect ratio. Figure 4.1c shows only the 3000 simulations of smaller matrices.
5.1 Compare the performance of nine different approaches (from left to right): naive regression ignoring the confounders (Naive), IRW-SVA, negative control with finite sample correction (NC) in eq. (5.17), negative control with asymptotic oracle variance (NC-ASY) in eq. (5.18), RUV-4, robust regression (LEAPP(RR)), robust regression with calibration (LEAPP(RR-MAD)), LEAPP, and oracle regression which observes the confounders (Oracle). The error bars are one standard deviation over 100 repeated simulations. The three dashed horizontal lines from bottom to top are the nominal significance level, the FDR level and the oracle power, respectively.
Chapter 1
Introduction
Factor analysis is a statistical method that explains a large number of interrelated variables in terms of a potentially small number of unobserved variables. From another point of view, it approximates a matrix-shaped data set by a low-rank matrix via an explicit probabilistic linear model. Factor analysis reduces the complexity of a data set and reveals its underlying structure.
Factor analysis is over a century old. In psychology, the factor model dates back at least to Spearman (1904), who is sometimes credited with the invention of factor analysis. The technique was later also applied to social science, economics, finance and marketing, signal processing, bioinformatics, and so on. The latent factors discovered by factor analysis make the observed variables more understandable. Typically, factor analysis is classified into two types. One is confirmatory factor analysis, which places pre-determined constraints on the factor loadings (for example, the loading of the observed variable V1 on latent factor F1 is 0). The other is exploratory factor analysis, which does not have such constraints.
More recently, factor analysis has also become a widely used dimension reduction tool for analyzing large matrices and high-dimensional data. Factor analysis shares many similarities with low-rank matrix approximation, which has applications in fields such as signal processing, collaborative filtering and personalized learning. Compared with principal component analysis (PCA) or the singular value decomposition (SVD), factor analysis assumes heteroscedastic noise for each variable, which is a more reasonable assumption than constant noise variance in many applications. The challenge is that in those data sets the dimensionality is often comparable to or even larger than the sample size, so new methodology and theoretical analysis are needed to fit the model.
A problem with factor analysis is that it is surprisingly difficult to choose the number of factors. Even in traditional factor analysis problems, which have a small number of variables but a relatively large sample size, there is no widely agreed-upon best-performing method (see for example Peres-Neto et al. (2005)). Classical methods such as hypothesis testing based on likelihood ratios (Lawley, 1956) or methods based on information theoretic criteria (Wax and Kailath, 1985) assume homoscedastic noise and thus do not directly fit the heteroscedastic noise assumption of factor analysis. In addition, since these classical methods assume an asymptotic regime with a growing number of observations and a fixed number of variables, they do not perform well on the large matrices of modern applications, where both dimensions are large. Modern methods developed assuming that both dimensions are large include modified information criteria methods in the econometrics community assuming strong factors, and random matrix theory based methods assuming weak factors and homoscedastic noise.
The rest of this chapter includes a description of the mathematical model and assumptions of
factor analysis, and a discussion of model identifiability.
1.1 Forms of Factor Analysis Model
Let N denote the number of variables and n denote the sample size. Then the observation y_{ij} for the ith variable and the jth sample is assumed to have the following decomposition:

y_{ij} = \sum_{k=1}^{r} l_{ik} f_{kj} + \sigma_i e_{ij}   (1.1)

where E = (e_{ij})_{N \times n} is the noise matrix, F_k = (f_{k1}, f_{k2}, \cdots, f_{kn})^T denotes the kth latent variable and L = (l_{ik})_{N \times r} is called the factor loading matrix. Denote each observed variable as Y_i = (y_{i1}, y_{i2}, \cdots, y_{in})^T and the associated noise as E_i = (e_{i1}, e_{i2}, \cdots, e_{in})^T; then the vector form of (1.1) is

Y_i = \sum_{k=1}^{r} l_{ik} F_k + \sigma_i E_i.   (1.2)
This shows that all the observed variables can be explained by linear combinations of r common factors. Usually, r is much smaller than N, so estimating the latent common factors makes the data more interpretable. Let Y = (y_{ij})_{N \times n} be the data matrix and F = (f_{kj})_{r \times n} be the factor score matrix; then the matrix form of (1.1) is

Y = LF + \Sigma^{1/2} E.   (1.3)

This has the interpretation that the data matrix can be expressed as a low-rank signal matrix X = LF plus noise. Thus, the factor analysis model can be used whenever a low-rank approximation of the data matrix is desired.
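To make the model concrete, here is a minimal simulation sketch of (1.3) in Python; the dimensions, the variance range, and the Gaussian choices are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, r = 100, 50, 3                       # assumed sizes: N variables, n samples, r factors

L = rng.normal(size=(N, r))                # factor loading matrix L, N x r
F = rng.normal(size=(r, n))                # factor score matrix F, r x n
sigma2 = rng.uniform(0.5, 2.0, size=N)     # heteroscedastic noise variances sigma_i^2
E = rng.normal(size=(N, n))                # i.i.d. noise, mean 0, variance 1

X = L @ F                                  # rank-r signal matrix X = LF
Y = X + np.sqrt(sigma2)[:, None] * E       # observed data: Y = LF + Sigma^{1/2} E
```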
1.2 Model Assumptions
As discussed in Anderson and Rubin (1956), the factor score matrix F can be treated as either random or non-random.
1.2.1 Random factor score matrix
Usually, if we think that the columns of Y are randomly and independently drawn from the population, we may prefer to assume that F is random in order to reduce the number of parameters to estimate. Typically the following assumptions are made:

Assumption 1. For a random factor score model

(a) The factor scores F and the noise E are random, while the factor loading matrix L is non-random.

(b) F and E are independent: F \perp E.

(c) For each latent variable k, f_{kj}, j = 1, 2, \cdots, n are i.i.d. with E(f_{kj}) = \mu_k. Also, we assume Cov(F_{\cdot j}) = \Lambda_F where \Lambda_F \in R^{r \times r} is some positive-semidefinite matrix.

(d) For each variable i, the noises e_{i1}, e_{i2}, \cdots, e_{in} are i.i.d. with E(e_{ij}) = a_i and Var(e_{ij}) = 1. Also, we assume \Sigma = diag(\sigma_1^2, \sigma_2^2, \cdots, \sigma_N^2).
Let \alpha_i = \sum_{k=1}^{r} l_{ik}\mu_k + a_i; then

E(y_{ij}) = \alpha_i,  Cov(Y_{\cdot j}) = \Sigma_Y = L\Lambda_F L^T + \Sigma.   (1.4)

An equivalent way to write (1.1) is

y_{ij} = \alpha_i + \sum_{k=1}^{r} l_{ik} f_{kj} + \sigma_i e_{ij}   (1.5)

with the assumptions E(f_{kj}) = 0 and E(e_{ij}) = 0. This form is the one more commonly found in the classical factor analysis literature.
It is also often assumed that the entries of both F and E follow Gaussian distributions. The advantage of Gaussian assumptions is that only the first and second moments of the data then matter in estimation and inference. From (1.4), only \alpha_i and L\Lambda_F L^T + \Sigma are then identifiable. We will discuss the identification of the components in L\Lambda_F L^T + \Sigma in more detail in Section 1.3.
Sometimes it is more reasonable to assume that the factor scores of the individuals (columns of F) are also correlated; for example, the individuals may be time points of a time series or spatial locations. This is a common assumption when factor analysis is applied in economics or spatial analysis (Forni et al., 2000; Wang and Wall, 2003).
1.2.2 Non-random factor score matrix
We may prefer to assume non-random F when the distributional assumptions on F are too complicated, or when estimating the low-rank matrix X = LF is easier than estimating the factors themselves. For example, the low-rank constraint on X can be relaxed to a nuclear norm constraint, which enables good optimization algorithms for solving the model (Chapter 4). Another situation is when the samples are assumed to have an unknown clustering structure with correlated samples within clusters. In this scenario, a random factor score assumption would make the model too complicated to solve, so a non-random factor score model is preferable.
For a non-random F model, both the signal matrix X and the noise covariance matrix \Sigma are parameters. Compared with the random F assumptions, the model now has many more parameters to estimate. However, when r \ll \min(N, n), there is still enough data to compensate for the extra degrees of freedom.
Assumption 2. For a non-random factor score model

(a) The noise matrix E is random, while both the factor loadings L and the factor scores F are non-random.

(b) For each variable i, the noises e_{i1}, e_{i2}, \cdots, e_{in} are i.i.d. with E(e_{ij}) = a_i and Var(e_{ij}) = 1. Also, we assume \Sigma = diag(\sigma_1^2, \sigma_2^2, \cdots, \sigma_N^2).
As with the random factor score model, the non-random factor score model can also be rewritten as

Y = \alpha 1_n^T + LF + \Sigma^{1/2} E   (1.6)

with the additional constraint that F 1_n = 0.
Non-random and random factor score assumptions are closely related to each other. A random factor score model becomes a non-random factor score model when we make inference conditional on F. On the other hand, a non-random factor score model turns into a random factor score model by adding a prior on F (similar to a random effects model in linear regression). We shall see that, in general, the asymptotic results as N, n \to \infty are very similar for random and non-random factor score models.
For both models, extra constraints can be imposed on the factors (either the factor loadings or the factor scores) depending on the application. A typical example is confirmatory factor analysis, where it is assumed that specific entries of the loadings are zero, reflecting the structure of the relationships between observed variables and unobserved factors based on the researchers' knowledge of the problem (Hoyle, 2000; Anderson and Gerbing, 1988). Another popular constraint is sparsity of the factor loadings and/or factor scores, which is commonly used to improve the interpretability and estimability of the factors when analyzing large matrices (Shen and Huang, 2008; Carvalho et al., 2012). Another constraint is non-negativity. For example, in educational research the observed variables can be scores on questions and the latent factors are the underlying concepts. The factor scores are then interpreted as a person's understanding of certain concepts, which is more interpretable if non-negativity is assumed (Martens, 1979; Smaragdis and Brown, 2003).
1.3 Model Identification
Model identification is generally a hard problem for factor analysis and has been discussed for a long time. Here we list several classical results. In this section we assume that, for any of the random variables, the parameters can be identified if and only if they can be identified via the first two moments of the variables.

First, when the number of factors r is unknown, there is an identification problem for r itself: we could always set r = N and \Sigma = 0 and obtain a trivially correct model. To avoid this, we define r as the minimum integer for which the factor model (under either the random or the non-random factor score assumption) holds. Since r = N always provides a correct model, this definition automatically guarantees the existence and uniqueness of r.
We assume normality for all the random variables, and discuss identification of the model parameters for both random and non-random factor score models.
1.3.1 Identification for random factor score model
For random factor score models, more constraints are needed to identify each element of the covariance matrix L\Lambda_F L^T + \Sigma. First, we state a sufficient condition, discussed in Anderson and Rubin (1956), for the identification of \Phi = L\Lambda_F L^T and \Sigma given r.

Theorem 1.3.1. Under Assumption 1, a sufficient condition for identification of \Sigma and \Phi = L\Lambda_F L^T is that if any row of \Phi is deleted, there remain two disjoint subsets of rows of \Phi of rank r.
Even when \Sigma can be uniquely defined, L and \Lambda_F are still not identifiable. Indeed, given any invertible r \times r matrix U, replacing L with \tilde{L} = LU and \Lambda_F with \tilde{\Lambda}_F = U^{-1}\Lambda_F U^{-T} keeps \Sigma_Y unchanged. One common constraint that makes L identifiable up to rotation is to assume \Lambda_F = I_r. This amounts to assuming that the latent factors are uncorrelated with (under the Gaussian assumption, independent of) each other and are normalized. Further restrictions can be added to eliminate the rotational uncertainty. For example, common assumptions are that either L^T L or L^T \Sigma^{-1} L is diagonal with distinct entries, so that L can be uniquely identified via the eigenvalues and eigenvectors of \Phi (if diagonality of L^T L is assumed) or of \Sigma^{-1/2}\Phi\Sigma^{-1/2} (if diagonality of L^T \Sigma^{-1} L is assumed). Usually, the orthogonality and diagonality constraints mentioned above do not represent properties of the actual factors; they are imposed for mathematical convenience.
Assumption 3. \Lambda_F = I_r and either L^T L or L^T \Sigma^{-1} L is diagonal with distinct diagonal entries.
Let us now discuss the identification of L from \Phi under a sparsity assumption (given \Lambda_F = I_r). This is equivalent to the unique determination of U, up to scaling and row/column permutation of the identity matrix, in \tilde{L} = LU. We state a simplified and generalized version of the classical result in Reiersøl (1950). We define the s-sparse family of L (we require s \ge r):

\mathcal{L}(s) = \{ L \in R^{N \times r} : L satisfies conditions (I) and (II) \}.

The conditions (I) and (II) are stated as follows:

(I) L is of rank r and each column of L contains at least s zeros.

(II) For each column m, let L_m be the matrix consisting of all rows of L which have a zero in the mth column. For any m = 1, 2, \cdots, r, L_m is of rank r - 1.
These two conditions require every factor to be sparse, while apart from the sparsity L should still be of full rank. A necessary and sufficient condition for L to be identifiable in \mathcal{L}(s) is then:

Theorem 1.3.2. Under Assumption 1, the normality assumption and the identification conditions in Theorem 1.3.1, a necessary and sufficient condition for L in \mathcal{L}(s) to be identifiable up to scaling and row/column permutation is that if a sub-matrix L^\star \in R^{s \times r} of L has rank r - 1, then it must be a sub-matrix of L_m for some m = 1, 2, \cdots, r.
Remark. The original theorem in Reiersøl (1950) (also stated in Anderson and Rubin (1956)) stated a different result from Theorem 1.3.2. In Reiersøl (1950), s = r and a narrower parameter space \mathcal{L}^\star(r) is assumed, with two further restrictions on L_m: (III) the rank of L_m with any one row deleted is still r - 1, and (IV) the rank of L_m with any other row of L added becomes r. As a consequence, the necessary and sufficient condition changes to requiring that L does not contain any other submatrices satisfying (II)-(IV). Theorem 1.3.2 defines a larger parameter space \mathcal{L}(s) for s = r, which is more meaningful for practical usage. Also, Theorem 1.3.2 generalizes the original result to any sparsity level s. An increase of s weakens the identification condition.
Proof. As discussed, we only need to show that for \tilde{L} = LU with L, \tilde{L} \in \mathcal{L}(s), the condition in the theorem is necessary and sufficient for U to have exactly one non-zero entry in each row and each column.

Sufficiency: Since \tilde{L} has rank r, U must be full rank and L = \tilde{L}U^{-1}. For any given m \in \{1, 2, \cdots, r\}, as the rank of L_m is r - 1, there must exist an s \times r sub-matrix L^\star of L_m that has rank r - 1; then L^\star U \in R^{s \times r} also has rank r - 1. Since L^\star U is a sub-matrix of \tilde{L}, the condition implies that one column, say column j_m, of L^\star U must be all zeros. Let L^\star_{\cdot(m)} be the sub-matrix of L^\star excluding the mth column. Since L^\star_{\cdot(m)} \in R^{s \times (r-1)} has rank r - 1, the entries of the j_m th column of U, except for the one in the mth row, must all be zero.

It is easy to show that if m_1 \ne m_2, then j_{m_1} \ne j_{m_2}; thus U has exactly one non-zero entry in each row and each column. The sufficiency of the condition is proved.
Necessity: If the condition in the theorem is not satisfied, then there exists a sub-matrix L^\star \in R^{s \times r} of L that has rank r - 1 but none of whose columns is all zeros. Thus there exists v \in R^r that has at least two non-zero entries and L^\star v = 0. Without loss of generality, assume that the first entry of v is non-zero. Let V_{-1} = (0 \; I_{r-1})^T \in R^{r \times (r-1)} and U = (v \; V_{-1}) \in R^{r \times r}. Then U has rank r and it is easy to check that LU \in \mathcal{L}(s). Thus L is not identifiable. The necessity of the condition is proved.
1.3.2 Identification for non-random factor score model
For non-random factor score models, we first need constraints to identify the signal matrix X and the noise covariance \Sigma given r, and then constraints to identify F and L in X = LF.

To identify X and \Sigma in the model Y = X + \Sigma^{1/2}E, we need to guarantee that if Y = X' + \Sigma'^{1/2}E' with X' of rank r, \Sigma' diagonal, and E' a random matrix with i.i.d. standard Gaussian entries, then X' = X and \Sigma' = \Sigma. First, if r = N, the model is trivially unidentifiable, so we need r < N. We give a necessary condition for identifying X and \Sigma; we find it hard to give a sufficient condition.
Theorem 1.3.3. Assume r < N . Under Assumption 2 and a known r, a necessary condition for
identifying X and Σ is that if any row of X is removed, the remaining matrix is still of rank r.
Proof. Suppose that there exists a row j such that if the jth row is removed, the remaining matrix X_{(j)\cdot} has rank k < r. Let the matrix remaining after removing the jth row of L be L_{(j)\cdot}; then L_{(j)\cdot} \in R^{(N-1) \times r} also has rank k, so it is degenerate. Thus there exists a non-zero vector v \in R^r with L_{(j)\cdot} v = 0. Since L is full rank, Lv \ne 0, so Lv has exactly one non-zero entry. Let X' = X + Lv E_1^T where E_1 \in R^n is a random vector with i.i.d. standard Gaussian entries that are also independent of E. Then X' is still of rank r, and \Sigma' = \Sigma - Lv v^T L^T is still diagonal (scaling v if necessary so that \Sigma' remains non-negative). Thus X and \Sigma are not identifiable. This proves that the condition is necessary.
After identifying X and \Sigma, similarly to the random factor score case, we can impose further restrictions to identify the decomposition X = LF. The restrictions are similar to Theorem 1.3.1 and Theorem 1.3.2. For the rotation restriction, one can refer to the five scenarios listed in Bai and Li (2012a). For the sparsity restriction, a sparsity assumption on either L or F is sufficient for identification.

Assumption 4. M_F = FF^T/n = I_r, and either L^T L or L^T \Sigma^{-1} L is diagonal with distinct diagonal entries.
Though we have discussed sparse factor assumptions, we will focus on estimating the unrestricted factor analysis model in Chapters 2 to 5. The identification condition for sparse factors in Theorem 1.3.2 is closely related to the model identification of the confounder adjustment models (Corollary 5.1.1) that we will discuss in Chapter 5.
Chapter 2
Background
In this chapter we review some previous methods. We review the use of the maximum likelihood method (MLE) and principal component analysis (PCA) for estimating the factor loadings/signal matrix and the noise variances given r. We discuss them for both the random and non-random factor score models, and we summarize their properties under both the classical and the high-dimensional asymptotic regimes, assuming that the number of factors r is fixed and known. Finally, we review previous methods for estimating r, which can be a much harder problem than estimating the factor loading parameters themselves.
2.1 The maximum likelihood method
2.1.1 Random factor score model
Under Assumption 1 and normality of the random variables, the log-likelihood of the data matrix Y in (1.5) can be written as

L(Y; \alpha, L, \Lambda_F, \Sigma) = -\frac{Nn}{2}\log(2\pi) - \frac{n}{2}\log\det(L\Lambda_F L^T + \Sigma) - \frac{1}{2}\mathrm{tr}\left[(Y^T - 1_n\alpha^T)(L\Lambda_F L^T + \Sigma)^{-1}(Y - \alpha 1_n^T)\right]   (2.1)

where 1_n represents a vector of 1s of length n. It is easy to see immediately that the MLE of \alpha is the sample mean,

\hat{\alpha}_i = \frac{1}{n}\sum_{j=1}^{n} y_{ij}.

Let S = \frac{1}{n}(Y - \hat{\alpha}1_n^T)(Y - \hat{\alpha}1_n^T)^T be the sample covariance; then (2.1) can be rewritten as

L(Y; L, \Lambda_F, \Sigma) = -\frac{Nn}{2}\log(2\pi) - \frac{n}{2}\log\det(L\Lambda_F L^T + \Sigma) - \frac{n}{2}\mathrm{tr}\left[(L\Lambda_F L^T + \Sigma)^{-1}S\right].   (2.2)
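As a sanity check on (2.2), the profile log-likelihood can be evaluated numerically. The sketch below is an illustration under the identification choice \Lambda_F = I_r; the function name and arguments are ours, not the thesis's.

```python
import numpy as np

def profile_loglik(Y, L, sigma2):
    """Profile log-likelihood (2.2) with Lambda_F = I_r and alpha profiled
    out by the row means of Y."""
    N, n = Y.shape
    alpha = Y.mean(axis=1, keepdims=True)
    S = (Y - alpha) @ (Y - alpha).T / n          # sample covariance
    SigmaY = L @ L.T + np.diag(sigma2)           # L Lambda_F L^T + Sigma with Lambda_F = I_r
    _, logdet = np.linalg.slogdet(SigmaY)        # stable log-determinant
    return (-N * n / 2 * np.log(2 * np.pi)
            - n / 2 * logdet
            - n / 2 * np.trace(np.linalg.solve(SigmaY, S)))
```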
Finding the global solution maximizing (2.2) for a given r is generally a hard problem. For the special case \Sigma = \sigma^2 I_N, there is an explicit solution for L, given exactly by the principal components. The result is proved in Anderson and Rubin (1956) using the estimating equations derived by Lawley (1940).
Theorem 2.1.1. Assume that \Sigma = \sigma^2 I_N and Assumption 3 with, in particular, L^T L diagonal. Let the eigenvalue decomposition of S be S = P\Lambda P^T where P \in R^{N \times N} is orthogonal and \Lambda = diag(\lambda_1, \lambda_2, \cdots, \lambda_N) with \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N. Let P_r \in R^{N \times r} be the first r columns of P and \Lambda_r = diag(\lambda_1, \lambda_2, \cdots, \lambda_r). Then the solutions maximizing (2.2) are

\hat{L} = P_r(\Lambda_r - \hat{\sigma}^2 I_r)^{1/2},  \hat{\sigma}^2 = \frac{1}{N-r}\sum_{k=r+1}^{N} \lambda_k.
For a general diagonal \Sigma, it is hard to find a global maximum solution. One popular method is to use the EM algorithm proposed by Rubin and Thayer (1982). Assuming \Lambda_F = I_r, then the joint
Assumption 5 and, in particular, L^T\Sigma^{-1}L is diagonal. Let \hat{L} and \hat{\Sigma} be the MLE estimates from (2.1). When n, N \to \infty, then for each variable i,

\hat{L}_{i\cdot} - L_{i\cdot} = O_p(n^{-1/2}),  \hat{\sigma}_i^2 - \sigma_i^2 = O_p(n^{-1/2})

where the convergence is in probability, and \sqrt{n}(\hat{L}_{i\cdot} - L_{i\cdot}) and \sqrt{n}(\hat{\sigma}_i^2 - \sigma_i^2) have a limiting normal distribution for any given i. For the non-random factor score model,

\sqrt{n}(\hat{L}_{i\cdot} - L_{i\cdot}) \to_d N(0, \sigma_i^2 I_r),  \sqrt{n}(\hat{\sigma}_i^2 - \sigma_i^2) \to_d N(0, (2 + \kappa_i)\sigma_i^4)

where \kappa_i is the excess kurtosis of e_{ij} (for Gaussian noise \kappa_i = 0).

For the random factor score model, the limiting covariance of \sqrt{n}(\hat{\sigma}_i^2 - \sigma_i^2) stays the same, while that of \sqrt{n}(\hat{L}_{i\cdot} - L_{i\cdot}) has a much more complicated form, which can be found in Section F of the appendix of Bai and Li (2012a).
The consistency in Theorem 2.1.3 holds for each i but may not hold uniformly over all i. Uniform consistency over all i can be derived by assuming that the random variables have exponential tails and imposing an extra condition on how n and N approach their limits.
Theorem 2.1.4. Under the assumptions of Theorem 2.1.3, and assuming that the e_{ij} are sub-Gaussian random variables, if (\log N)^2/n \to 0 as n, N \to \infty, then

\max_{i \le N} \|\hat{L}_{i\cdot} - L_{i\cdot}\|_2 = O_p(\sqrt{\log N / n}),  \max_{i \le N} |\hat{\sigma}_i^2 - \sigma_i^2| = O_p(\sqrt{\log N / n}).   (2.8)

For the non-random factor model,

\max_{i = 1, 2, \cdots, N} \left\| \hat{L}_{i\cdot} - L_{i\cdot} - \frac{1}{n}\sum_{j=1}^{n} \sigma_i e_{ij} F_{\cdot j}^T \right\|_2 = o_p(n^{-1/2}).   (2.9)
Proof. The proof is a modification of the proof in Bai and Li (2012a). It is very technical; please see Appendix A.
In Bai and Li (2012a) the authors have also shown the consistency of the estimated factor scores using either (2.6) or (2.7), which have the same limiting distribution.

Theorem 2.1.5. Under the assumptions of Theorem 2.1.3, and assuming that N/n^2 \to 0 as n, N \to \infty, for the non-random factor score model

\sqrt{N}(\hat{F}_{\cdot j} - F_{\cdot j}) \to_d N(0, Q_L^{-1}).

Here Q_L = \lim_{N \to \infty} L^T\Sigma^{-1}L is defined in Assumption 5. For the random factor score model, when N/n \to \gamma > 0 the limiting distribution of \sqrt{N}(\hat{F}_{\cdot j} - F_{\cdot j}) is also Gaussian, with a complicated limiting covariance matrix; the result is stated in Section F of the appendix of Bai and Li (2012a).
We should mention that the results in Bai and Li (2012a), though very impressive, are not fully satisfactory. All the results are based on Assumption 5(e), which is not guaranteed to always hold for the MLE (or QMLE) estimates. However, these are currently the best results we can find. In Bai and Li (2016), the same authors generalized the results to approximate factor models, which allow weak correlations among the noises (and between the noise and the factors for the random factor score model).
2.2 Principal component analysis (PCA)
PCA is a very common technique for dimension reduction. It is closely related to factor analysis, and is often used as a solution (or at least an initial solution) for factor analysis in both classical and high-dimensional data analysis. In this section, we discuss the use of PCA (and its equivalent form, the SVD) in factor analysis and the asymptotic properties of PCA under the factor analysis assumptions. We consider three different asymptotic regimes: the classical regime with N fixed and n \to \infty; the regime with n, N \to \infty and strong factors (\lim_{N\to\infty} L^T L/N is some positive definite matrix) common in econometrics; and the regime with n, N \to \infty and weak factors (L^T L converges to some finite positive definite matrix as N \to \infty) common in random matrix theory. We still assume that r is given as a constant in each of the asymptotic regimes.
2.2.1 Use of PCA in factor analysis
Basically, PCA finds linear combinations of the observed variables that maximize the sample variance. Let S = (Y - \hat{\alpha}1_n^T)(Y - \hat{\alpha}1_n^T)^T/n as defined before, and let the eigenvalue decomposition of S be S = P\Lambda P^T where \Lambda = diag(\lambda_1, \lambda_2, \cdots, \lambda_N) is a diagonal matrix with \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N \ge 0 and P is an orthogonal N \times N matrix. Then the eigenvectors P_{\cdot 1}, \cdots, P_{\cdot N} are called loadings, and the rows of P^T(Y - \hat{\alpha}1_n^T) are called principal components (PCs). For more details on PCA, one can refer to Jolliffe (1986).

One can also derive the PCs and loadings from the singular value decomposition (SVD) of Y - \hat{\alpha}1_n^T. Let Y - \hat{\alpha}1_n^T = \sqrt{n}\hat{U}\hat{D}\hat{V}^T where \hat{U} \in R^{N \times \min(N,n)}, \hat{V} \in R^{n \times \min(N,n)} and \hat{D} = diag(\hat{d}_1, \hat{d}_2, \cdots, \hat{d}_{\min(N,n)}) with \hat{U}^T\hat{U} = \hat{V}^T\hat{V} = I_{\min(N,n)} and \hat{d}_1 \ge \hat{d}_2 \ge \cdots \ge \hat{d}_{\min(N,n)}. It is then clear that the first m loadings are the first m columns of \hat{U} and the first m principal components are the first m columns of \sqrt{n}\hat{V}\hat{D}. Notice the identity \hat{d}_k^2 = \lambda_k for k = 1, 2, \cdots, \min(N,n).
To use PCA in factor analysis, we essentially use the loadings of PCA to estimate the linear space of the factor loadings and the PCs to estimate the linear space of the factor scores. Let P_r \in R^{N \times r} be the first r columns of P (which is the same as \hat{U}_r, the first r columns of \hat{U}), and let \Lambda_r = diag(\lambda_1, \cdots, \lambda_r) and \hat{D}_r = diag(\hat{d}_1, \cdots, \hat{d}_r). Under the identification condition that either \Lambda_F = I_r for the random factor score model or M_F = I_r for the non-random factor score model, and that L^T L is diagonal with decreasing diagonal entries, we have

\hat{L}^{pc} = P_r\Lambda_r^{1/2} = \hat{U}_r\hat{D}_r,  \hat{F}^{pc} = \sqrt{n}\hat{V}_r^T = \hat{D}_r^{-1}\hat{U}_r^T(Y - \hat{\alpha}1_n^T).   (2.10)

To estimate the noise variances, it is common to use

\hat{\Sigma}^{pc} = diag\left(\frac{1}{n}(Y - \hat{\alpha}1_n^T - \hat{L}\hat{F})(Y - \hat{\alpha}1_n^T - \hat{L}\hat{F})^T\right).   (2.11)
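For reference, the PCA estimates (2.10) and (2.11) can be computed from a single SVD; this sketch uses our own function name and follows the \sqrt{n} scaling of the text.

```python
import numpy as np

def pca_factor_estimates(Y, r):
    """PCA estimates (2.10)-(2.11) of loadings, scores and noise variances."""
    N, n = Y.shape
    alpha = Y.mean(axis=1, keepdims=True)
    U, d, Vt = np.linalg.svd((Y - alpha) / np.sqrt(n), full_matrices=False)
    L_pc = U[:, :r] * d[:r]                # \hat L^pc = \hat U_r \hat D_r
    F_pc = np.sqrt(n) * Vt[:r, :]          # \hat F^pc = sqrt(n) \hat V_r^T
    R = Y - alpha - L_pc @ F_pc            # residuals after removing the rank-r fit
    Sigma_pc = np.mean(R * R, axis=1)      # diagonal entries of (2.11)
    return L_pc, F_pc, Sigma_pc
```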
2.2.2 Asymptotic properties
As promised, we discuss three different asymptotic regimes.
N fixed and n→∞
First, consider the asymptotic regime where N is fixed and n \to \infty. As we have discussed, the sample covariance S \to LL^T + \Sigma in this regime, so P consistently estimates the eigenvectors of LL^T + \Sigma. If the diagonal entries of \Sigma are distinct, the linear space spanned by P_r will generally differ from the linear space spanned by L, indicating that \hat{L}^{pc} is not a consistent estimator of L; consequently \hat{F}^{pc} and \hat{\Sigma}^{pc} are also inconsistent.

Notice that \hat{L}^{pc} in (2.10) is very similar to the MLE estimator for \Sigma = \sigma^2 I_N in Theorem 2.1.1, but without the adjustment of \Lambda_r by subtracting \hat{\sigma}^2 I_r. Theorem 2.1.2 shows that in the classical asymptotic regime the MLE estimators are consistent. Thus, even in the simplest case where \Sigma = \sigma^2 I_N, \hat{L}^{pc} is inconsistent for L, since \hat{L}^{pc} - \hat{L} = P_r(\Lambda_r^{1/2} - (\Lambda_r - \hat{\sigma}^2 I_r)^{1/2}) does not approach 0 while \hat{L} is consistent.
N,n→∞ with strong factors
It turns out that the PCA estimates become consistent when N, n \to \infty and the factors' strength grows with N. Bai (2003) and Bai and Ng (2013) derived consistency and asymptotic normality results for the PCA estimates under the same assumptions as for the MLE in Theorem 2.1.3.

Theorem 2.2.1. Assume all the assumptions of Theorem 2.1.3, except that now L^T L instead of L^T\Sigma^{-1}L is diagonal, with L^T L/N \to \Sigma_L. Then when n, N \to \infty, we have \hat{L}^{pc}_{i\cdot} \to L_{i\cdot} at rate \min(\sqrt{n}, N) for each i = 1, 2, \cdots, N. Also, \hat{F}^{pc}_{\cdot j} \to F_{\cdot j} at rate \min(\sqrt{N}, n) for each j = 1, \cdots, n. Moreover, under the non-random factor score model, we have the following asymptotic normality:

1. If \sqrt{n}/N \to 0 as N, n \to \infty, then for each i

\sqrt{n}(\hat{L}^{pc}_{i\cdot} - L_{i\cdot}) \to_d N(0, \sigma_i^2 I_r).

2. If \sqrt{N}/n \to 0 as N, n \to \infty, then for each j

\sqrt{N}(\hat{F}^{pc}_{\cdot j} - F_{\cdot j}) \to_d N(0, \Sigma_L^{-1} Q \Sigma_L^{-1})

where Q = \lim_{N\to\infty} L^T\Sigma L/N \in R^{r \times r}.
The limiting covariances under the random factor score model have much more complicated forms; one can refer to Bai (2003). Bai (2003) and Bai and Ng (2013) actually prove results under more general assumptions which allow the noise covariance \Sigma to change with n. This makes the results more applicable to some econometrics applications.

Comparing the results for the MLE in Theorem 2.1.3 and for PCA in Theorem 2.2.1, the asymptotic efficiency in estimating L is the same, while \hat{F}^{pc} is less efficient than the MLE estimate \hat{F}. The difference is analogous to solving for F given the true L by OLS rather than GLS.
In Fan et al. (2013), the authors strengthened the consistency result of Theorem 2.2.1 to uniform consistency.

Assumption 6. There exist m_t and b_t for t = 1, 2 such that for any s > 0, i \le N and j \le n,

P[|e_{ij}| > s] \le \exp(1 - (s/b_1)^{m_1}).

Also, under the non-random factor score model, assume that the entries of F are bounded, |F_{kj}| \le C for some constant C; or, under the random factor score model, assume that

P[|F_{kj}| > s] \le \exp(1 - (s/b_2)^{m_2})

for any k \le r.
Theorem 2.2.2. Under the assumptions of Theorem 2.2.1 and Assumption 6, with n = o(N^2) and \log N = o(n^{\gamma}) where (6\gamma)^{-1} = 3m_1^{-1} + 1.5m_2^{-1} + 1, we have

\max_{i \le N} \|\hat{L}^{pc}_{i\cdot} - L_{i\cdot}\| = O_p\left(\sqrt{\frac{1}{N}} + \sqrt{\frac{\log N}{n}}\right), and

\max_{j \le n} \|\hat{F}^{pc}_{\cdot j} - F_{\cdot j}\| = O_p\left(\sqrt{\frac{1}{n}} + \sqrt{\frac{n^{1/2}}{N}}\right).
The original authors showed that both Theorem 2.2.1 and Theorem 2.2.2 also hold under much weaker assumptions than Assumption 1 and Assumption 2. They also hold for approximate factor models, where the noise can be weakly correlated.
N,n→∞, N/n→ γ > 0 with weak factors
For high-dimensional data, there are in many cases weak factors for which it is inappropriate to assume that \|L_{\cdot k}\|_2^2 is O(N). For example, if factor k has sparse loadings, say \|L_{\cdot k}\|_0 = O(1), then \|L_{\cdot k}\|_2^2 = O(1) when \|L_{\cdot k}\|_\infty = O(1) (Witten et al., 2009). Sparse factors are very common in high-dimensional data; for example, the observed variables may be only locally correlated and have a block structure. Weak factors, however, become really hard to estimate accurately. One reason is that though the true factors are sparse, the singular vectors of the signal matrix X may not be sparse. Research in random matrix theory has shown that even for the homoscedastic noise model \Sigma = \sigma^2 I_N, PCA is not consistent. There is also a detection threshold: if some eigenvalue of L^T L is below the threshold, there is no hope of recovering any of its information from spectral analysis.
There is a rich literature in Random Matrix Theory (RMT) on the asymptotic properties of estimating the weak factors by PCA (or the SVD), especially when \Sigma = \sigma^2 I_N. Under the identification condition that L^T L is diagonal, let L = UD where U \in R^{N \times r}, U^T U = I_r and D = diag(d_1, \cdots, d_r). We impose the following assumptions for weak factors:

Assumption 7. 1. For each k = 1, 2, \cdots, r, when n, N \to \infty, d_k \to \rho_k almost surely for some constant \rho_k < \infty. For simplicity, also assume that \rho_1 > \rho_2 > \cdots > \rho_r > 0.

2. The noise entries have finite fourth moment: E[e_{ij}^4] < \infty for i = 1, 2, \cdots, N and j = 1, 2, \cdots, n.

It has been shown that \hat{D} and \hat{U} are then inconsistent estimates of D and U, and thus the PCA estimates \hat{L}^{pc} and \hat{F}^{pc} are inconsistent for L and F. There are many references: for instance, the random factor model result can be found in Paul (2007), Yao et al. (2015) and Nadler (2008), and the non-random factor model result in Perry (2009), Benaych-Georges and Raj Rao (2012) and Onatski (2012).
Theorem 2.2.3. Assume either Assumption 1 or Assumption 2, Assumption 7 and \Sigma = \sigma^2 I_N. When n, N \to \infty with N/n \to \gamma > 0, we have for k, k_1, k_2 = 1, 2, \cdots, r:

1.

\hat{d}_k \to_{a.s.} \bar{\rho}_k = \begin{cases} \sqrt{(\rho_k + \frac{1}{\rho_k})(\rho_k + \frac{\gamma}{\rho_k})}\,\sigma & \text{if } \rho_k > \gamma^{1/4}\sigma \\ (1 + \sqrt{\gamma})\sigma & \text{if } \rho_k \le \gamma^{1/4}\sigma \end{cases}

2.

\hat{U}_{\cdot k_1}^T U_{\cdot k_2} \to_{a.s.} \begin{cases} \theta_k = \sqrt{\frac{\rho_k^4 - \gamma}{\rho_k^4 + \gamma\rho_k^2}} & \text{if } k_1 = k_2 = k \text{ and } \rho_k > \gamma^{1/4}\sigma \\ 0 & \text{otherwise} \end{cases}

3. For the estimated factor scores,

\frac{\hat{F}^{pc}_{k_1\cdot} F_{k_2\cdot}^T}{n} \to_{a.s.} \begin{cases} \tilde{\theta}_k = \sqrt{\frac{\rho_k^4 - \gamma}{\rho_k^4 + \rho_k^2}} & \text{if } k_1 = k_2 = k \text{ and } \rho_k > \gamma^{1/4}\sigma \\ 0 & \text{otherwise} \end{cases}

If we further assume that N/n = \gamma + o(n^{-1/2}) and d_k - \rho_k = o(n^{-1/2}) as n, N \to \infty, then if \rho_k > \gamma^{1/4}\sigma, \sqrt{n}(\hat{d}_k - \bar{\rho}_k), \sqrt{n}(\hat{U}_{\cdot k}^T U_{\cdot k} - \theta_k) and \sqrt{N}(\hat{F}^{pc}_{k\cdot} F_{k\cdot}^T/n - \tilde{\theta}_k) all have a limiting Gaussian distribution with mean 0.
The limiting variances for \sqrt{n}(\hat{d}_k - \bar{\rho}_k), \sqrt{n}(\hat{U}_{\cdot k}^T U_{\cdot k} - \theta_k) and \sqrt{N}(\hat{F}^{pc}_{k\cdot} F_{k\cdot}^T/n - \tilde{\theta}_k) are different for the random and non-random factor score models, and their specific forms can be found in the above references.
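The limits in Theorem 2.2.3 are easy to evaluate numerically. The sketch below (our own helper, with the formulas above taken at face value) returns the predicted limits of \hat{d}_k and of the loading-vector inner product for given \rho_k, \gamma and \sigma.

```python
import numpy as np

def spiked_limits(rho, gamma, sigma=1.0):
    """Predicted a.s. limits from Theorem 2.2.3 for one factor of strength rho."""
    if rho > gamma ** 0.25 * sigma:
        d_lim = sigma * np.sqrt((rho + 1.0 / rho) * (rho + gamma / rho))
        theta = np.sqrt((rho ** 4 - gamma) / (rho ** 4 + gamma * rho ** 2))
    else:                                   # below the detection threshold
        d_lim = (1.0 + np.sqrt(gamma)) * sigma
        theta = 0.0                         # the estimated direction carries no signal
    return d_lim, theta

# example: a factor above threshold when gamma = 0.5 and sigma = 1
print(spiked_limits(rho=1.0, gamma=0.5))    # d_lim = sqrt(2 * 1.5) ~ 1.73, theta ~ 0.58
```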
For the heteroscedastic noise factor analysis model, where \Sigma = diag(\sigma_1^2, \cdots, \sigma_N^2) has arbitrary diagonal entries, the problem is much more complicated, as the distribution of the noise matrix \Sigma^{1/2}E is no longer invariant under orthogonal transformations of its rows. One result we can show combines the results in Onatski (2012) and Benaych-Georges and Raj Rao (2012) to get the limits of \hat{d}_k, \hat{U}_{\cdot k} and \hat{F}^{pc}_{k\cdot} under a random factor score model which also has uniformly distributed factor loading entries for each factor. Let the cumulative distribution function (CDF) of the empirical distribution of the noise variances be G_N(x) = \frac{1}{N}\sum_{i=1}^{N} 1_{\sigma_i^2 \le x}. Then we need the following assumption:

Assumption 8. When n, N \to \infty, G_N(x) \to G_0(x) for all x \in R, where the limiting cumulative distribution function (CDF) G_0(x) has bounded support [a_0, b_0] and the corresponding density function g_0(x) satisfies \min_{x \in (a_0, b_0)} g_0(x) > 0. Also \max_{i \le N} \sigma_i^2 \to b_0 and \min_{i \le N} \sigma_i^2 \to a_0.
Based on the results of Onatski (2010, 2012), we know that under Assumption 8 the empirical distribution of the eigenvalues of \frac{1}{n}\Sigma^{1/2}EE^T\Sigma^{1/2} converges, for N/n \to \gamma, to some limiting distribution with CDF G(x). If \gamma \le 1, G(x) has bounded support, and the largest and smallest eigenvalues of \frac{1}{n}\Sigma^{1/2}EE^T\Sigma^{1/2} converge to the upper and lower edges of that support respectively. If \gamma > 1, G(x) = \frac{1}{\gamma}\bar{G}(x) + (1 - \frac{1}{\gamma})\delta_0, where \bar{G}(x) has bounded support and the largest and smallest non-zero eigenvalues of \frac{1}{n}\Sigma^{1/2}EE^T\Sigma^{1/2} converge to its upper and lower edges respectively. To use Benaych-Georges and Raj Rao (2012), we assume that the factor loading matrix is also random and impose an extra condition on L under the random factor score model:

Assumption 9. For the factor loading matrix L = UD, \sqrt{N} \cdot U \in R^{N \times r} has i.i.d. entries with mean 0 and variance 1, and D = diag(d_1, \cdots, d_r) is a deterministic diagonal matrix.
Based on the results of Benaych-Georges and Raj Rao (2012), let the D-transform of G(x) be defined as

D_G(z) = \left[\int \frac{z}{z^2 - x}\, dG(x)\right] \times \left[\gamma \int \frac{z}{z^2 - x}\, dG(x) + \frac{1 - \gamma}{z}\right]  for z > a,

where a denotes the almost sure limit of the largest singular value of the noise matrix \frac{1}{\sqrt{n}}\Sigma^{1/2}E. For a function f and x \in R, denote f(x^+) = \lim_{z \downarrow x} f(z). Also, in the theorem below, D_G^{-1}(\cdot) denotes the functional inverse of D_G on [a, \infty). We then have results analogous to Theorem 2.2.3.
Theorem 2.2.4. Under Assumption 1, Assumption 8 and Assumption 9, when n, N \to \infty and N/n \to \gamma, we have

1. For k = 1, 2, \cdots, r,

\hat{d}_k \to_{a.s.} \bar{\rho}_k = \begin{cases} D_G^{-1}(1/\rho_k^2) & \text{if } \rho_k > (D_G(a^+))^{-1/2} \\ a & \text{otherwise.} \end{cases}

2. For k_1 = 1, 2, \cdots, k_0 and k_2 = 1, 2, \cdots, r, where k_0 = \sum_{k=1}^{r} 1_{\rho_k > (D_G(a^+))^{-1/2}},

\hat{U}_{\cdot k_1}^T U_{\cdot k_2} \to_{a.s.} \begin{cases} \sqrt{\frac{-2\varphi_G(\bar{\rho}_k)}{\rho_k^2 D_G'(\bar{\rho}_k)}} & \text{if } k_1 = k_2 = k \\ 0 & \text{otherwise} \end{cases}

where \varphi_G(z) = \int \frac{z}{z^2 - x}\, dG(x).

3. For the estimated factor scores, with k_1 = 1, 2, \cdots, k_0 and k_2 = 1, 2, \cdots, r,

\frac{\hat{F}^{pc}_{k_1\cdot} F_{k_2\cdot}^T}{n} \to_{a.s.} \begin{cases} \sqrt{\frac{-2\varphi_{\tilde{G}}(\bar{\rho}_k)}{\rho_k^2 D_{\tilde{G}}'(\bar{\rho}_k)}} & \text{if } k_1 = k_2 = k \\ 0 & \text{otherwise} \end{cases}

where \tilde{G} = \gamma G + (1 - \gamma)\delta_0 when \gamma \le 1, and \bar{\rho}_k is as defined in part 1.
Because of the inconsistency of PCA even when \Sigma = \sigma^2 I_N, there have been several improvements of the PCA estimates to reduce the estimation error. One direction is to shrink \hat{d}_1, \hat{d}_2, \cdots, \hat{d}_r towards 0 while keeping the estimated eigenvectors (singular vectors) unchanged (Shabalin and Nobel, 2013). Based on the result of Gavish and Donoho (2014), define the PCA optimal shrinkage estimator as

\hat{L}^{sk} = \hat{U}_r\,\eta(\hat{D}_r),  \hat{F}^{sk} = \hat{F}^{pc}   (2.12)

where \eta(\hat{D}_r) = diag(\eta(\hat{d}_1), \cdots, \eta(\hat{d}_r)). The shrinkage function \eta(\cdot) is defined as

\eta(d) = \begin{cases} \frac{\sigma^2}{d}\sqrt{\left(\frac{d^2}{\sigma^2} - \gamma - 1\right)^2 - 4\gamma} & \text{if } d \ge (1 + \sqrt{\gamma})\sigma \\ 0 & \text{otherwise} \end{cases}

and is the optimal function minimizing the asymptotic estimation error \lim_{n,N\to\infty, N/n\to\gamma} \|X - \hat{L}^{sk}\hat{F}^{sk}\|_F^2. In practice, when \sigma is unknown, Gavish and Donoho (2014) proposed a consistent estimator \hat{\sigma} based on the median of the singular values of Y. Raj Rao (2014) considered this optimal shrinkage of sample singular values for a general noise variance \Sigma, including the heteroscedastic noise factor analysis model provided it satisfies the assumptions in Theorem 2.2.4.
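The shrinkage function \eta is simple to implement; the following sketch (our own helper name) vectorizes it over the sample singular values, with \sigma and \gamma supplied by the user (\sigma can be estimated from the median singular value, as mentioned above).

```python
import numpy as np

def gd_shrink(d, gamma, sigma):
    """Optimal singular value shrinker eta(d) used in (2.12),
    following Gavish and Donoho (2014)."""
    d = np.asarray(d, dtype=float)
    out = np.zeros_like(d)
    keep = d >= (1.0 + np.sqrt(gamma)) * sigma      # below this threshold, shrink to zero
    x = d[keep]
    out[keep] = sigma ** 2 / x * np.sqrt((x ** 2 / sigma ** 2 - gamma - 1.0) ** 2 - 4.0 * gamma)
    return out
```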
2.3 Estimating the number of factors
Now we review methods for estimating r under the three different regimes discussed in the previous section.
2.3.1 Classical methods
For the classical problem, where N is relatively small compared with the sample size n, estimating the number of factors r is a very hard problem. Many methods have been proposed for estimating the number of principal components, but very few methods work specifically for the factor analysis model, which has additive heteroscedastic noise that is not present in PCA. One method is based on likelihood ratio tests (Lawley, 1956; Bartlett, 1950; Anderson and Rubin, 1956). For a given r, define the null hypothesis H_{0r}: \Sigma_Y = LL^T + \Sigma for some L \in R^{N \times r} and diagonal matrix \Sigma. The calculation in Anderson and Rubin (1956) shows that the likelihood ratio test statistic based on (2.1) is

U_r = n\left[\log\det\hat{\Sigma} + \log\det(I_r + \hat{L}^T\hat{\Sigma}^{-1}\hat{L}) - \log\det S\right]

and, under the asymptotic regime where N is fixed and n \to \infty, U_r asymptotically follows a chi-square distribution with N(N-1)/2 + r(r-1)/2 - rN degrees of freedom. To estimate r, one can sequentially test H_{00}, H_{01}, \cdots and stop at r if H_{0r} is not rejected. However, this sequential testing method does not have any theoretical guarantees and has been shown to perform poorly in practice (Tucker and Lewis, 1973; Velicer et al., 2000); for example, it is sensitive to the normality assumption and tends to underestimate r when n is large.
Based on empirical evaluations, some researchers (Velicer et al., 2000; Buja and Eyuboglu, 1992; Velicer, 1976) suggest that even if one believes the factor analysis model, one can first assume \Sigma = \sigma^2 I_N to determine the number of principal components r, and then estimate the factors based on (1.3) given r. For estimating the number of principal components, popular methods include the scree test (Cattell, 1966; Cattell and Vogelmann, 1977), Kaiser's rule (Kaiser, 1960), parallel analysis (PA) (Horn, 1965; Buja and Eyuboglu, 1992), the minimum average partial test of Velicer (1976), and information criteria based methods such as the minimum description length (MDL) (or Bayesian Information Criterion (BIC)) and the Akaike Information Criterion (AIC) (Wax and Kailath, 1985; Fishler et al., 2002). To use these methods effectively for factor analysis, one essentially applies the various rules to the sample correlation matrix. For example, the two simplest rules are the scree test, which plots the eigenvalues in decreasing order and determines r by identifying an "elbow" in the eigenvalue curve, and Kaiser's rule, which estimates r as the number of eigenvalues of the sample correlation matrix that exceed 1.
Among all these methods, there is a large amount of evidence (Zwick and Velicer, 1986; Hubbard and Allen, 1987; Velicer et al., 2000; Peres-Neto et al., 2005) showing that PA is one of the most accurate of the classical methods for determining the number of factors. Parallel analysis compares the observed eigenvalues of the correlation matrix to those obtained in a Monte Carlo simulation. The first factor is retained if and only if its associated eigenvalue is larger than the 95th percentile of the simulated first eigenvalues. For k \ge 2, the kth factor is retained when the first k - 1 factors were retained and the observed kth eigenvalue is larger than the 95th percentile of the simulated kth eigenvalues. The permutation version of PA was introduced by Buja and Eyuboglu (1992), where the eigenvalues are simulated by applying independent uniform random permutations to each of the variables stored in Y. The earlier method of Horn (1965) resamples from a Gaussian distribution. Parallel analysis has been used recently in bioinformatics (Leek and Storey, 2008b; Sun et al., 2012). Though there exist no theoretical results guaranteeing the accuracy of PA, it performs very well in practice.
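For concreteness, here is a sketch of permutation parallel analysis in the spirit of Buja and Eyuboglu (1992); the number of permutations, the 95th percentile cutoff, and the function name are illustrative choices of ours.

```python
import numpy as np

def parallel_analysis(Y, n_perm=100, q=0.95, seed=0):
    """Permutation parallel analysis: retain the k-th factor while the
    observed k-th eigenvalue of the sample correlation matrix exceeds the
    q-quantile of its permutation null distribution."""
    rng = np.random.default_rng(seed)
    N, _ = Y.shape
    eigs = lambda M: np.sort(np.linalg.eigvalsh(np.corrcoef(M)))[::-1]
    obs = eigs(Y)
    null = np.empty((n_perm, N))
    for b in range(n_perm):
        Yp = np.array([rng.permutation(row) for row in Y])   # permute each variable separately
        null[b] = eigs(Yp)
    cutoff = np.quantile(null, q, axis=0)
    r_hat = 0
    while r_hat < N and obs[r_hat] > cutoff[r_hat]:
        r_hat += 1
    return r_hat
```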
2.3.2 Methods for large matrices and strong factors
This collection of methods is designed for an asymptotic regime where both n, N \to \infty while r is fixed. Again, for strong factors, it is assumed that L^T L/N \to \Sigma_L. Under such an asymptotic regime and the strong factor assumption, it is theoretically easy to find a consistent estimator of r, since the first r eigenvalues of LL^T + \Sigma explode as n, N \to \infty.
Some of the most popular methods for estimating the number of factors in this scenario are based on the information criteria developed by Bai and Ng (2002). Define

V(k) = \frac{1}{Nn}\left\|Y - \hat{L}^{pc}_k \hat{F}^{pc}_k\right\|_F^2   (2.13)

where \hat{L}^{pc}_k and \hat{F}^{pc}_k are defined in (2.10) with the number of factors given as k. Let K \ge r be some fixed known constant; then Bai and Ng proposed a series of information criteria and showed the following results:

Theorem 2.3.1. Under the assumptions of Theorem 2.2.1, if g(N,n) \to 0 and \min(N, n)\, g(N,n) \to \infty as N, n \to \infty, then \hat{r} defined by

\hat{r} = \mathrm{argmin}_{0 \le k \le K}\; V(k) + k\, g(N,n)

or

\hat{r} = \mathrm{argmin}_{0 \le k \le K}\; \log V(k) + k\, g(N,n),

where V(k) is defined in (2.13), is a consistent estimator: \lim_{N,n\to\infty} P[\hat{r} = r] = 1.

They then proposed six specific forms of the criteria; one that performs among the best in their empirical evaluations is

\hat{r}_{IC1} = \mathrm{argmin}_{0 \le k \le K}\; \log V(k) + k\left(\frac{N+n}{Nn}\right)\log\left(\frac{Nn}{N+n}\right).   (2.14)
The bound K is not specified and depends on the researcher's prior knowledge of the problem. Bai and Ng's criteria are known not to be robust in practice, so Alessi et al. (2010) proposed a modified version:

\hat{r}^\star_{IC1} = \mathrm{argmin}_{0 \le k \le K}\; \log V(k) + ck\left(\frac{N+n}{Nn}\right)\log\left(\frac{Nn}{N+n}\right)   (2.15)

where c is a tuning parameter determined adaptively from the data. For determining c, the authors used a stability principle which chooses the c that yields a stable \hat{r}^\star_{IC1} across randomly sub-sampled rows and columns. This improvement may be hard to execute in practice, as a proper range of c needs to be given: a large enough c will stably estimate r as 0, while a small enough c will stably estimate a large r.
Onatski (2010) developed an estimator based on the difference of two adjacent eigenvalues (ED) of the sample covariance matrix. The estimator he proposed is

\hat{r}_{ED} = \max\{k \le K : \hat{d}_k^2 - \hat{d}_{k+1}^2 \ge \delta\}   (2.16)

where \delta > 0 is some fixed number. Denote the ordered noise variances as \sigma^2_{(1)} \le \sigma^2_{(2)} \le \cdots \le \sigma^2_{(N)}. Roughly speaking, the estimator is based on the result that if \sigma^2_{(i+1)} - \sigma^2_{(i)} \to 0 for any i as N \to \infty, then \hat{d}_k^2 - \hat{d}_{k+1}^2 \to 0 for any k > r. An advantage of this estimator is that its consistency,

\lim_{n,N\to\infty, N/n\to\gamma>0} P[\hat{r}_{ED} = r] = 1

for any fixed \delta > 0, holds under a much weaker factor strength requirement: it is enough that the smallest eigenvalue of L^T L explodes (tends to infinity). To optimize the performance of \hat{r}_{ED}, he also gave an iterative procedure to adaptively determine \delta from the data.
Another simple criterion was proposed by Ahn and Horenstein (2013). They proposed two estimators that determine the number of factors by simply maximizing the ratio of two adjacent eigenvalues of the sample covariance matrix. The same idea can also be found in Lam and Yao (2012) and Lan and Du (2014). One specific form is the eigenvalue ratio estimator

\hat{r}_{ER} = \mathrm{argmax}_{0 \le k \le K}\; \frac{\hat{d}_k^2}{\hat{d}_{k+1}^2}   (2.17)

with \hat{d}_0^2 = \sum_{k=1}^{\min(n,N)} \hat{d}_k^2 / \log\min(n,N).
Besides the above criteria, there are further methods for estimating the number of factors in dynamic factor models (Forni et al., 2000; Amengual and Watson, 2007; Hallin and Liska, 2007). As we have mentioned, such dependency models are beyond the scope of this thesis.
Remark. To use \hat{r}_{IC1}, \hat{r}_{ED} and \hat{r}_{ER}, we need to determine the upper bound K for r. There is no theoretical result to guide the choice of K. For practical usage, Onatski (2010) suggested trying several different K to see how \hat{r} changes. Ahn and Horenstein (2013) suggested using

K = \min\left\{ \left|\left\{ i \ge 1 : \hat{d}_i^2 \ge \sum_{k=1}^{\min(n,N)} \hat{d}_k^2 / \min(n,N) \right\}\right|,\; 0.1\min(n,N) \right\}.
2.3.3 Methods for large matrices and weak factors
In contrast to strong factors, for weak factors the asymptotic assumption L^T L \to \Sigma_L, instead of L^T L/N \to \Sigma_L, for a positive definite matrix \Sigma_L, is more appropriate. Based on our discussion in the previous sections, even for homoscedastic noise \Sigma = \sigma^2 I_N, neither the PCA nor the MLE estimates of the factor loadings and scores are consistent. Moreover, Theorem 2.2.3 shows that there is a phase transition phenomenon in the limit: if \rho_k < \gamma^{1/4}\sigma, then the spectral analysis of the samples contains no information about the kth principal component of the signal matrix. In other words, under the identification condition that L^T L is diagonal, for n, N large enough, if some d_k < \gamma^{1/4}\sigma then there is little chance of detecting the kth factor using PCA or MLE (Kritchman and Nadler, 2009). For the general factor model with heteroscedastic noise, a phase transition phenomenon will also exist. Thus, it can be impossible, and also very likely useless, to estimate the true number of factors.
For \Sigma = \sigma^2 I_N, we define r = |\{d_k : d_k > \gamma^{1/4}\sigma\}| as the number of detectable factors. One goal is to estimate this number of detectable factors. Raj Rao and Edelman (2008) used an AIC-type information criterion method based on RMT. The criterion is based on the distribution of

t_k = (N - k) \cdot \frac{\sum_{i=k+1}^{N} \hat{d}_i^4}{\left(\sum_{i=k+1}^{N} \hat{d}_i^2\right)^2}

which is asymptotically Gaussian when n, N \to \infty and N/n \to \gamma > 0, for k = 0, 1, \cdots, \min(n,N). The estimator they proposed is

\hat{r}_{RE} = \mathrm{argmin}_{0 \le k \le \min(N,n)}\left[\frac{1}{4\gamma_n^2}\left(N\left[t_k - (1 + \gamma_n)\right] - \gamma_n\right)^2 + 2(k+1)\right]   (2.18)

where \gamma_n = N/n.
In Raj Rao and Edelman (2008), the authors conjectured that \hat{r}_{RE} is a consistent estimator of r; however, Kritchman and Nadler (2009) proved that the conjecture is not true and that \hat{r}_{RE} tends to underestimate r. Instead, Nadler (2010) proposed another modification of AIC which estimates r more accurately:

\hat{r}_{AIC'} = \mathrm{argmin}_{0 \le k \le \min(N,n)}\left[-L(Y; \hat{L}, I_r, \hat{\sigma}^2 I_N) + 2k(2N + 1 - k)\right]   (2.19)

where L(\cdot) is defined in (2.2) with \hat{L} and \hat{\sigma}^2 derived in Theorem 2.1.1. For estimating r, Kritchman and Nadler (2008) also developed a consistent estimator based on a sequence of hypothesis tests connected with Roy's classical largest root test (Roy, 1953). It has the form

\hat{r}_{RMT} = \mathrm{argmin}_{1 \le k \le \min(N,n)}\left\{\hat{d}_k^2 < \hat{\sigma}^2(k)\left(\mu_{n,N-k} + s(\alpha)\xi_{n,N-k}\right)\right\} - 1   (2.20)

which is derived from the sequential tests H_{0k}: at most k - 1 signals, versus H_{1k}: at least k signals. Here \alpha is a significance level, the values of \mu_{n,N-k}, s(\alpha) and \xi_{n,N-k} are derived using RMT, and \hat{\sigma}^2(k) is estimated essentially without bias via an iterative algorithm.
Instead of estimating r, one may prefer to estimate the number of useful factors: r^\star = \mathrm{argmin}_k\, E[\|\hat{X}_k - X\|_F], where X = LF is the signal matrix and \hat{X}_k is a rank k estimate of X. r^\star best fits the purpose when one wants an accurate estimate of the signal matrix. For \Sigma = \sigma^2 I_N and \hat{X}^{pc}_k = \hat{L}^{pc}_k\hat{F}^{pc}_k, Perry (2009) showed that, under the assumptions of Theorem 2.2.3, when n, N \to \infty and N/n \to \gamma > 0 we have

r^\star \to \left|\left\{\rho_k : \rho_k^2 \ge \mu^\star_F\right\}\right|

where

\mu^\star_F = \sigma^2\left[\frac{1 + \gamma}{2} + \sqrt{\left(\frac{1 + \gamma}{2}\right)^2 + 3\gamma}\right].

The reason that r^\star \le r is that some factors are too weak to be estimated accurately even though they are detectable; we prefer to ignore them to increase the accuracy of our estimate. Perry (2009) proposed a bi-cross-validation (BCV) method to estimate r^\star that we will discuss in more detail in the next chapter.
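The threshold \mu^\star_F is a closed-form function of \gamma and \sigma, so the limiting number of useful factors is easy to tabulate; a small sketch (helper name ours):

```python
import numpy as np

def useful_factor_threshold(gamma, sigma=1.0):
    """mu*_F from Perry (2009): factors with rho_k^2 >= mu*_F improve
    the rank-k reconstruction of X in the limit."""
    t = (1.0 + gamma) / 2.0
    return sigma ** 2 * (t + np.sqrt(t ** 2 + 3.0 * gamma))

# example: for a square matrix (gamma = 1), mu*_F = sigma^2 * (1 + 2) = 3 sigma^2
print(useful_factor_threshold(1.0))   # prints 3.0
```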
2.4 Comments
Theorems 2.1.3, 2.2.1 and 2.2.3 hold for both random and non-random factor models. Thus, without additional assumptions, the random and non-random factor score models are effectively equivalent. With additional assumptions, one of the two models can be more convenient. For instance, if the latent variables have a time series structure, then it is more convenient to assume random factor scores. On the other hand, if the factor scores are assumed to be sparse or non-negative (especially for specific samples or at specific locations), then assuming non-random factor scores is more natural.
For high-dimensional data and large matrices, where both n and N are large, we believe that many real data sets contain both weak and strong factors. The strong factors are uniformly influential on all variables, while weaker factors may only have large effects on a subset of the variables. However, there is not much theoretical work considering the presence of both strong and weak factors.
Apparently, for the general factor model with heteroscedastic noise, there are hardly any estimators of r (either r or r^\star), and not many methods for estimating the factors and the signal matrix are designed for large matrices in the presence of weak factors. In the next two chapters, we propose two methods for estimating the signal matrix X = LF and the noise variance \Sigma without knowing r. One method is based on maximum likelihood and BCV; the other is based on combining the optimization of a convex penalized loss function with the optimal shrinkage proposed by Gavish and Donoho (2014).
Chapter 3
Bi-cross-validation for factor
analysis
3.1 Problem Formulation
Our data matrix is Y \in R^{N \times n}, with a row for each variable and a column for each sample. In the bioinformatics problems we have worked on, it is usual to have N > n or even N \gg n, but this is not assumed. In a factor model, Y can be decomposed into a low-rank signal matrix plus noise:

Y = X + \Sigma^{1/2}E = LF + \Sigma^{1/2}E,   (3.1)

where the low-rank signal matrix X \in R^{N \times n} is a product of factors L \in R^{N \times r} and F \in R^{r \times n}, both of rank r. The noise matrix E \in R^{N \times n} has independent and identically distributed (IID) entries with mean 0 and variance 1. Each variable has its own noise variance, given by \Sigma = diag(\sigma_1^2, \sigma_2^2, \cdots, \sigma_N^2). The matrix X is the signal that we wish to recover despite the heteroscedastic noise.
The factor model is usually applied when we anticipate that r \ll \min(n, N). Identifying those factors then suggests possible data interpretations to guide further study. When the factors correspond to real world quantities there is no reason why they must be few in number, and then we should not insist on finding them all in our data, as some factors may be too small to estimate. We should instead seek the relatively important ones: the factors that are strong enough to contribute most to the signal and to be accurately estimated.
We focus on the non-random factor score model, as we treat the signal matrix X as the parameter. As we have discussed previously, a random factor score model can be treated as a non-random factor score model by conditioning on F. Our goal is to recover X, seeking to minimize

Err_X(\hat{X}) = E\left[\|\hat{X} - X\|_F^2\right].   (3.2)

This criterion was used for factor models in Onatski (2015), and for truncated SVDs and nonnegative matrix factorizations in Owen and Perry (2009). After recovering X, we can estimate the factor loadings and scores using corresponding identification conditions.
Definition 3.1.1 (Oracle rank and estimate). Let M be a method that for each integer k \ge 0 gives a rank k estimate \hat{X}_M(k) of X using Y from model (3.1). The oracle rank for M is

r^\star_M = \mathrm{argmin}_k\; \|\hat{X}_M(k) - X\|_F^2,   (3.3)

and the corresponding oracle estimate of X is

\hat{X}_M^{opt} = \hat{X}_M(r^\star_M).   (3.4)
If all the factors are strong enough, then for a good method M we anticipate that r^\star_M should equal the true number of factors r. With weak enough factors we will have r^\star_M < r.

Our algorithm has two steps. First we devise a method M that can effectively estimate X given the oracle rank r^\star_M. Then, with such a method in hand, we need a means to estimate r^\star_M. Section 3.2 describes our early stopping alternation (ESA) algorithm for finding \hat{X}(k) for each k, which has the best performance among the methods we compared when each is given its own oracle rank. Then Section 3.3 describes our BCV method for estimating r^\star_{ESA} for the ESA algorithm.
3.2 Estimating X given the rank k
Here we consider how to estimate X using exactly k factors. This will be the inner loop for an algorithm that tries various k. The goal in this section is to find a method that performs well at its oracle rank. We start with the likelihood function

L(Y; X, \Sigma) = -\frac{Nn}{2}\log(2\pi) - \frac{n}{2}\log\det\Sigma - \frac{1}{2}\mathrm{tr}\left[\Sigma^{-1}(Y - X)(Y - X)^T\right],   (3.5)

which is similar to (2.4). If \Sigma were known it would be straightforward to estimate X using an SVD, but \Sigma is unknown. Given an estimate of X, it is straightforward to optimize the likelihood over \Sigma. Thus, if we want to maximize (3.5), it is very natural to design an alternating algorithm that iteratively estimates X given \Sigma and then estimates \Sigma given X. Specifically, define the rank k truncated SVD of a matrix Y as

Y^{(k)} = \sqrt{n}\,U^{(k)}D^{(k)}V^{(k)T}   (3.6)
where D^{(k)} is the diagonal matrix of the k largest singular values of Y/\sqrt{n}, and U^{(k)} and V^{(k)} are the matrices of the corresponding singular vectors. The iterative algorithm starts from an initial estimate \hat{\Sigma}^{(0)} using the sample variances:

\hat{\Sigma}^{(0)} = \frac{1}{n}\,diag\left(\left(Y - \frac{1}{n}Y1_{n\times n}\right)\left(Y - \frac{1}{n}Y1_{n\times n}\right)^T\right).   (3.7)

Given an estimate \hat{\Sigma}, the rank k estimate \hat{X} is the truncated SVD of the reweighted matrix \tilde{Y} = \hat{\Sigma}^{-1/2}Y, transformed back to the original scale:

\hat{X} = \hat{\Sigma}^{1/2}\tilde{Y}^{(k)}.   (3.8)

Given an estimate \hat{X}, the new variance estimate \hat{\Sigma} contains the mean squares of the residuals:

\hat{\Sigma} = \frac{1}{n}\,diag\left[(Y - \hat{X})(Y - \hat{X})^T\right].   (3.9)
Each of the above two steps can increase \log L(\hat{X}, \hat{\Sigma}) but not decrease it. However, as we have discussed in Chapter 2, the likelihood (3.5) is ill-posed, so this alternating algorithm cannot be used directly. Here we propose an even simpler early-stopping algorithm.

The main challenge in using (3.5) is to prevent any \hat{\sigma}_i from approaching 0. One solution is to instead optimize the quasi-likelihood (2.5) by an EM algorithm. The other solution is to regularize \hat{\Sigma} to prevent \hat{\sigma}_i \to 0. One could model the \sigma_i as IID draws from some prior distribution. However, such a distribution must also avoid putting too much mass near zero. We believe that this transfers the singularity avoidance problem to the choice of hyperparameters in the \sigma distribution and does not really solve it. We have also found, in trying it, that even when the \sigma_i really are drawn from our prior, the algorithm still converged towards some zero estimates.
A related approach is to employ a penalized likelihood

L_{reg}(Y; \lambda, \hat{X}, \hat{\Sigma}) = -n\log\det\hat{\Sigma} + \mathrm{tr}\left[\hat{\Sigma}^{-1}(Y - \hat{X})(Y - \hat{X})^T\right] + \lambda P(\hat{\Sigma}),   (3.10)

where P penalizes small components \sigma_i. This approach has two challenges. It is hard to select a penalty P that is strong enough to ensure boundedness of the likelihood without introducing too much bias. Additionally, it requires a choice of \lambda. Tuning \lambda by cross-validation within our bi-cross-validation algorithm is unattractive. Also, there is a risk that cross-validation might choose \lambda = 0, allowing one or more \hat{\sigma}_i \to 0.
We do not claim that the regularization methods cannot be made to work in the future. However, we propose a much simpler approach that works surprisingly well: early stopping. We start at (3.7), iterate the pair (3.8) and (3.9) some number m of times, and then stop.
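In code, one ESA pass is a reweighted truncated SVD followed by a variance update. This sketch (our own naming) implements (3.7)-(3.9) with early stopping after m iterations:

```python
import numpy as np

def esa(Y, k, m=3):
    """Early stopping alternation: start from (3.7), then iterate
    (3.8) and (3.9) m times. Returns (X_hat, sigma2_hat)."""
    C = Y - Y.mean(axis=1, keepdims=True)          # Y - (1/n) Y 1_{n x n}
    sigma2 = np.mean(C * C, axis=1)                # (3.7): row-wise sample variances
    for _ in range(m):
        s = np.sqrt(sigma2)
        U, d, Vt = np.linalg.svd(Y / s[:, None], full_matrices=False)
        Xt = (U[:, :k] * d[:k]) @ Vt[:k, :]        # rank-k truncated SVD of Y-tilde
        X_hat = s[:, None] * Xt                    # (3.8): rescale back to original units
        R = Y - X_hat
        sigma2 = np.mean(R * R, axis=1)            # (3.9): residual mean squares
    return X_hat, sigma2
```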
To choose m, we investigated 180 test cases based on the six factor designs in Table 3.1, three dispersion levels for the \sigma_i^2, five aspect ratios \gamma and two data sizes. The details are in the Appendix.
The finding is that taking m = 3 works almost as well as using whichever m gives the smallest error for each given data set.

More specifically, define the oracle estimation error when stopping early at m steps as

Err_X(m) = \min_k \|\hat{X}_m(k) - X\|_F^2   (3.11)

where \hat{X}_m(k) is the estimate of X using m iterations and rank k. We judge each number of steps m by the best k that might be used with it.
For early stopping alternation (ESA), we define the oracle stopping number of steps on a data
Table 3.4: Worst case REE values for each method of choosing k, for white noise and two heteroscedastic noise settings.
compares with the oracle rank of the SVD. For Figure 3.1d, the BCV method also uses the SVD instead of ESA. Though the results in Table 3.2 (Appendix) suggest that the SVD is in general not recommended for heteroscedastic noise data, if one does use the SVD, BCV is still the best method for choosing r to recover X.

The proportion of simulations with REE = 0 (matching the oracle's rank) for BCV was 51.6%, 75.1%, 28.1% and 47.0% in the four scenarios of Figure 3.1. BCV's percentage was always the highest among the six methods we used. The fraction of REE = 0 increases sharply with sample size and is somewhat better for ESA than for the SVD.
Table 3.4 briefly summarizes the REE values for all three noise variance cases. It shows the worst case REE over all the 10 matrix sizes and 6 factor strength scenarios. As the variance of the σ²_i rises, it becomes more difficult to attain a small REE. BCV has a substantially smaller worst case REE for heteroscedastic noise than all other methods, but is slightly worse than NE for the white noise case. This is not surprising as NE is designed for the white noise model.
To better understand the differences among the methods, we compare them directly with the oracle in estimating the number of factors. As an example, Figure 3.2 plots the distribution of r̂ for all methods and all 6 cases, on 5000 × 100 data matrices with Var[σ²_i] = 1. The results are summarized in more detail in Tables 3.5 and 3.6. In Figure 3.2, BCV closely tracks the oracle. Among the other methods, ED performs the best in estimating the oracle rank, though it is more variable and less accurate than BCV. ER is the most conservative method, trying to estimate at most the number of strong factors. IC1 also tries to estimate the number of strong factors, but is less conservative than ER. NE estimates some number between the number of strong factors and the number of useful (including strong) factors. PA has trouble identifying the useful weak factors when strong factors are present, and also has trouble rejecting the detectable but not useful factors in the hard case with no strong factor. This is due to the fact that PA uses the sample correlation matrix, which has a fixed sum of eigenvalues, so the magnitude of each eigenvalue is influenced by every other one.
Tables 3.5 and 3.6 provide more details of the simulation results for this mildly heteroscedastic case Var[σ²_i] = 1. We can see that some methods behave very differently for different sized datasets. For example, IC1 is very non-robust and sharply over-estimates the number of factors for small datasets, while ED tends to estimate only the number of strong factors when the aspect ratio γ is small. Overall, BCV has the most robust and accurate performance in estimating r*_ESA among the methods we investigated.
[Figure 3.1 appears here: REE survival curves for PA, ED, ER, IC1, NE and BCV; panels (a) All datasets, ESA; (b) Large datasets only, ESA; (c) Small datasets only, ESA; (d) All datasets, SVD.]

Figure 3.1: REE survival plots: the proportion of samples with REE exceeding the number on the horizontal axis. Figures 3.1a to 3.1c are for REE calculated using the method ESA. Figure 3.1a shows all 6000 samples. Figure 3.1b shows only the 3000 simulations of larger matrices of each aspect ratio. Figure 3.1c shows only the 3000 simulations of smaller matrices. For comparison, Figure 3.1d is the REE plot for all samples, calculating REE using the method SVD.
3.5 Real data example
We investigate a real data example to show how our method works in practice. The observed matrix Y is 15 × 8192, where each row is a chemical element and each column represents a position on a 64 × 128 map of a meteorite.
[Figure 3.2 appears here: for each of the six scenarios (Type-1: 0/6/1/1, Type-2: 2/4/1/1, Type-3: 3/3/1/1, Type-4: 3/1/3/1, Type-5: 1/3/3/1, Type-6: 0/1/6/1), the distribution of r̂ for True, PA, ED, ER, IC1, NE, BCV and Oracle, with r̂ ranging from 0 to 10.]

Figure 3.2: The distribution of r̂ for each factor strength case when the matrix size is 5000 × 100. The y axis is r̂. Each image depicts 100 simulations with counts plotted in grey scale (larger equals darker). For the different scenarios, the factor strengths are listed as the number of "strong/useful/harmful/undetectable" factors in the title of each subplot. The true rank is always r = 8. The "Oracle" method corresponds to r*_ESA.
Table 3.5: Comparison of REE and r̂ for rank selection methods with various (N, n) pairs and scenarios. For each scenario, the factors' strengths are listed as the number of "strong/useful/harmful/undetectable" factors. For each (N, n) pair, the first column is the REE and the second column is the chosen rank. Both values are averages over 100 simulations. Var[σ²_i] = 1.
We thank Ray Browning for providing this data. Similar data are discussed in Paque et al. (1990). Each entry in Y is the amount of a chemical element at a grid point. The task is to analyze the distribution patterns of the chemical elements on that meteorite, helping us further understand its composition.
[Figure 3.3 appears here: BCV prediction error (vertical axis, roughly 200 to 500) against rank k = 0, 1, ..., 9 (horizontal axis).]

Figure 3.3: BCV prediction error for the meteorite. The BCV partitions have been repeated 200 times. The solid red line is the average over all held-out blocks, with the cross marking the minimum BCV error.
A factor structure seems reasonable for the elements, as various compounds are distributed over the map. The amounts of some elements such as Iron and Calcium are on a much larger scale than those of other elements like Sodium and Potassium, and so it is necessary to assume a heteroscedastic noise model as in (3.1). We center the data for each element before applying our method.
BCV chooses r̂ = 4 factors, while PA chooses r̂ = 3. Figure 3.3 plots the BCV error for each rank, showing that among the selected factors, the first two can be considered strong factors, which are much more influential than the last two. The first column of Figure 3.4 plots the four factors ESA has found at their positions. They represent four clearly different patterns.
As a comparison, we also apply a straight SVD to the centered data, with and without standardization, to analyze the hidden structure. The second and third columns of Figure 3.4 show the first five factors of the locations that SVD finds for the original and scaled data respectively. If we do not scale the data, then the factor (F5) showing the concentration of Sulfur at some specific locations strangely comes after the factor (F4), which has no apparent pattern; F5 would have been neglected in a model of three or four factors as BCV or PA suggest. The figure shows that ESA can estimate the weak factors more accurately than SVD.
Paque et al. (1990) investigate this sort of data by clustering the pixels based on the values of the first two factors of a factor analysis.
[Figure 3.4 appears here: grey scale maps on the 64 × 128 grid for ESA factors F1 to F4 (first column), SVD factors F1 to F5 on the unscaled data (second column), and SVD factors F1 to F5 on the scaled data (third column).]

Figure 3.4: Distribution patterns of the estimated factors. The first column has the four factors found by ESA. The second column has the top five factors found by applying SVD to the unscaled data. The third column has the top five factors found by applying SVD to scaled data in which each element has been standardized. The values are plotted in grey scale, and a darker color indicates a higher value.
We apply such a clustering in Figure 3.5. The plot shows that ESA can estimate the factor scores of the strong factors more accurately. Column (a) shows the resulting clusters. The factors found by ESA clearly divide the locations into five clusters, while the factors found by an SVD on the original data blur the boundary between clusters 1 and 5. An SVD on normalized data (third plot in column (a)) blurs together three of the clusters. Columns (b) and (c) of Figure 3.5 show the quality of clustering using k-means based on the first two plots of column (a). Clusters, especially C1 and C5, have much clearer boundaries and are less noisy when we use ESA factors than when we use SVD factors. A k-means clustering depends on the starting points. For the ESA data the clustering was stable. For SVD the smallest group C3 was sometimes merged into one of the other clusters; we chose a clustering for SVD that preserved C3.
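As a small illustration of this clustering step, the R sketch below clusters the pixels on the first two factor scores and maps the clusters back onto the grid; `scores`, a hypothetical 8192 × 2 matrix of estimated scores, is assumed, and using several random starts is one way to reduce the sensitivity to starting points noted above.

    # A minimal sketch of the pixel clustering, assuming `scores` holds the
    # 8192 x 2 matrix of the first two estimated factor scores.
    set.seed(1)
    cl <- kmeans(scores, centers = 5, nstart = 20)   # several starts stabilize the clusters
    cluster_map <- matrix(cl$cluster, nrow = 64, ncol = 128)
    image(t(cluster_map), col = grey.colors(5))      # view the clustered 64 x 128 map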
In this data the ESA based factor analysis found factors that, visually at least, seem better. They
have better spatial coherence, and they provide better clusters than the SVD approaches do. For
data of this type it would be reasonable to use spatial coherence of the latent variables to improve the
fitted model. Here we have used spatial coherence as an informal confirmation that BCV is making
a reasonable choice, which we could not do if we had exploited spatial coherence in estimating our
factors.
[Figure 3.5 appears here: column (a) scatter plots of the first two factor scores (F1 versus F2) for ESA, SVD and scaled SVD; columns (b) and (c) grey scale maps of clusters C1 to C5 on the 64 × 128 grid for ESA and SVD respectively.]

Figure 3.5: Plots of the first two factors and the location clusters. The three plots of column (a) are the scatter plots of pixels for the first two factors found by the three methods: ESA, SVD on the original data and SVD on normalized data. The coloring shows a k-means clustering result for 5 clusters. Column (b) has the five clustered regions based on the first two factors of ESA. Column (c) has the five clustered regions based on the first two factors of SVD on the original data after centering. The same color represents the same cluster.
Chapter 4
An optimization-shrinkage hybrid
method for factor analysis
4.1 A joint convex optimization algorithm POT
4.1.1 The objective function
As we have discussed in Chapter 3, maximizing the log-likelihood function (3.5) directly would not work, as the global optimum can arbitrarily have σ_i → 0. In this chapter, we switch to an alternative objective function which is not ill-posed and allows us to jointly estimate X and Σ.
Let Y = (yij)N×n, X = (xij)N×n and still consider the model (3.1). Define the objective function
as
L_λ(X, Σ; Y) = L_0(X, Σ; Y) + 2√n λ ‖X‖_* = n ∑_{i=1}^N σ_i + ∑_{i=1}^N ∑_{j=1}^n (y_ij − x_ij)²/σ_i + 2√n λ ‖X‖_*.   (4.1)

We estimate X and Σ by

(X̂_λ, Σ̂_λ) = argmin_{X,Σ} L_λ(X, Σ; Y).   (4.2)
The loss L_0(X, Σ; Y) is based on an idea proposed by Huber (2011) to jointly estimate σ and β in a regression model Y_i = X_i^T β + σ E_i. Huber estimated β and σ by minimizing nσ + ∑_{i=1}^n (Y_i − X_i^T β)²/σ, which is jointly convex in (β, σ) and yields the same estimates of β and σ as the MLE. This technique is also called the perspective transformation in convex optimization (Owen, 2007). L_0(X, Σ; Y) is likewise jointly convex in (X, σ_1, ..., σ_N). More importantly, it is not ill-posed, since L_0(X, Σ; Y) is bounded below by 0. To get a low-rank matrix estimate X̂, we impose a nuclear norm penalty on X. The nuclear norm penalty has been widely used in low-rank matrix recovery
and completion (Recht et al., 2010), as ‖X‖_* is convex in X. The nuclear penalty is a convex relaxation of a rank constraint on X. A larger value of the tuning parameter λ results in a lower rank of X̂_λ. We call the joint optimization algorithm with objective function (4.1) POT (Perspective transformation Optimization with Trace norm penalty).
4.1.2 Connection with singular value soft-thresholding
When Σ = I_N, then

L_λ(X, I_N; Y) = nN + ‖Y − X‖²_F + 2√n λ ‖X‖_*   (4.3)

and X̂_λ has an explicit form (Parikh and Boyd, 2014, Chap. 6.7.3). As before, denote the SVD of Y as Y = √n U D V^T where D = diag(d_1, ..., d_min(N,n)). Define D_λ = diag((d_1 − λ)_+, ..., (d_min(N,n) − λ)_+). Then minimizing the objective function (4.3) gives

X̂_λ = √n U D_λ V^T.   (4.4)

D_λ is a soft-thresholding of the singular values of Y. In other words, the solution X̂_λ keeps the sample singular vectors but applies soft-thresholding to the sample singular values.
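The closed form (4.4) is straightforward to compute; below is a minimal R sketch (the function name is ours), using the convention Y = √n U D V^T from the text.

    # A minimal sketch of singular value soft-thresholding, the solution (4.4)
    # of (4.3) for white noise; lambda is on the scale of the d_i.
    svt <- function(Y, lambda) {
      n <- ncol(Y)
      s <- svd(Y / sqrt(n))                   # singular values d_1 >= d_2 >= ...
      d <- pmax(s$d - lambda, 0)              # soft-thresholding: (d_i - lambda)_+
      sqrt(n) * s$u %*% (d * t(s$v))          # X_hat = sqrt(n) U D_lambda V^T
    }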
4.1.3 Connection with square-root lasso
By taking the derivative of (4.1) with respect to σ_i, we can plug in

σ̂_i = √( (1/n) ∑_{j=1}^n (y_ij − x_ij)² ) = (1/√n) ‖Y_i· − X_i·‖_2

into (4.1) and change the objective function to

L_λ(X; Y) = ∑_{i=1}^N ‖Y_i· − X_i·‖_2 + λ ‖X‖_*.   (4.5)
Equation (4.5) is closely related to the square-root lasso method proposed in Belloni et al. (2011)
for linear regression. Consider the multiple regression problem Y = BZ+Σ1/2E where B ∈ RN×p is
the coefficient parameter matrix and Z ∈ Rp×n is the matrix of known covariates. Consider the case
where p is very large and we want a sparse estimate of B. The square-root lasso method estimates
each Bi· separately by minimizing the objective function
Li(Bi·;Yi·) = ‖Yi· −Bi·Z‖2 + λi ‖Bi·‖1
As discussed in Belloni et al. (2011), the main advantage of square-root lasso is that it is "pivotal": the appropriate scale of the tuning parameter λ_i does not depend on the noise variance σ_i. Thus, we can set
λ_i ≡ λ for some λ and rewrite the square-root lasso objective function for the multiple regression as

L_λ(B; Y) = ∑_{i=1}^N L_i(B_i·; Y_i·) = ∑_{i=1}^N ‖Y_i· − B_i· Z‖_2 + λ ‖B‖_1,

which has a form very similar to (4.5).
4.2 Some heuristics of the method
In this section, we want to provide some understanding of the solutions Xλ, and the choice of λ.
These results, though lacking rigorous mathematical justifications, can guide for using the method
in practice.
Negahban et al. (2012) have provided a general theory for the error rate of estimates θ̂_λn ∈ argmin_θ { L(θ; Z_n) + λ_n R(θ) }, where θ is the vector of parameters, Z_n is the observed data and R(θ) is some penalty function which is supposed to be a norm. They require L(θ; Z_n) to be a convex and differentiable function of θ. However, in our problem, L_0(X; Y) = ∑_i √( ∑_j (x_ij − y_ij)² ) is not differentiable in X, so unfortunately we cannot apply their theory directly.
4.2.1 The theoretical scale of λ
We define the optimal λ minimizing the estimation error of X as
λOpt = argminλ
∥∥∥Xλ −X∥∥∥
2(4.6)
How would the scale of λOpt change with the dimension?
To avoid confusion, we denote the true value of X as X* in this subsection. For the estimates θ̂_λn ∈ argmin_θ { L(θ; Z_n) + λ_n R(θ) } discussed in Negahban et al. (2012), they require λ_n ≥ c R*(∇L(θ*; Z_n)) with high probability to guarantee an upper bound controlling the error rate ‖θ̂_λn − θ*‖_2. Here, θ* is the true value of θ, R*(·) is the dual norm of R(·), defined as R*(v) = sup_{R(u)≤1} u^T v, and c > 1 is some constant. Also, once λ_n ≥ R*(∇L(θ*; Z_n)), the upper bound is monotonically increasing in λ_n.
If we use their result in our problem, then we would need λ > ‖∇L_0(X*; Y)‖_op, since the dual norm of the nuclear norm is the operator norm ‖·‖_op (the largest singular value of the matrix) and
L_0(X; Y) is differentiable at X* with probability 1 if P[e_ij = 0] = 0. In other words,

λ > ‖ [ (y_ij − x*_ij) / √( ∑_{j'=1}^n (x*_{ij'} − y_{ij'})² ) ]_{N×n} ‖_op
  = ‖ [ ε_ij / √( ∑_{j'=1}^n ε²_{ij'} / n ) ]_{N×n} ‖_op / √n
  = ( 1/√n + o(1/√n) ) ‖E‖_op.

Under the asymptotics that n, N → ∞ and N/n → γ > 0, we have ‖E‖_op/√n → 1 + √γ, thus we can set the theoretical value of λ as λ_theo = 1 + √γ, which is the detection threshold in RMT for σ = 1.
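As a quick numerical sanity check (ours, not part of the simulations reported later), ‖E‖_op/√n is indeed close to 1 + √γ already for moderately sized Gaussian noise matrices:

    # Check that the largest singular value of pure noise, divided by sqrt(n),
    # approaches the detection threshold 1 + sqrt(gamma).
    set.seed(1)
    N <- 2000; n <- 1000; gamma <- N / n
    E <- matrix(rnorm(N * n), N, n)
    c(observed = svd(E, nu = 0, nv = 0)$d[1] / sqrt(n),
      theory = 1 + sqrt(gamma))               # both are approximately 2.41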
Another heuristic for deriving λ_theo is that we would always want the true parameter value to be a solution. In particular, when X* = 0 we would like the estimate to be X̂ = 0 as well, which is equivalent to

0 ∈ ∇L_0(0; Σ^{1/2}E) + λ ∂‖X‖_* |_{X=0}.

This results in λ > ‖∇L_0(0; Σ^{1/2}E)‖_op, which gives the same λ_theo.
In practice, we find from our simulations that the actual λOpt follows the trend of λtheo well as
the size of the data matrix changes, though it’s likely to be smaller. We develop a cross-validation
technique to find the actual λOpt from a sequence of candidate λ around λtheo. This will be discussed
in detail in Section 4.4.
4.2.2 The bias in using the nuclear penalty
In our simulations, we find that even X̂_{λ_Opt} can have a rank much higher than the true rank of X. Also, when there are strong factors in the data, X̂_{λ_Opt} can be a worse estimator than even the PC estimator X̂_pc = L̂_pc F̂_pc. This phenomenon persists even for the white noise model where Σ = I_N and X̂_{λ_Opt} is estimated from (4.3). The phenomenon is due to the bias in estimating a low-rank matrix introduced by the nuclear penalty. RMT provides tools to understand (4.3) under Σ = I_N.
As discussed in Section 4.1.2, (4.3) has the closed form solution (4.4), which keeps the singular vectors but applies soft-thresholding to the singular values of Y. Under the assumptions of Theorem 2.2.3 and the asymptotics N, n → ∞ and N/n → γ, based on the calculations in Shabalin and
Nobel (2013) and Gavish and Donoho (2014), if λ ≥ λ_theo = 1 + √γ we have

(1/n) ‖X̂_λ − X‖²_F →a.s. ∑_{k=1}^r ρ²_k + ∑_{k=1}^r [ (ρ̂_k − λ)²_+ − 2 (ρ̂_k − λ)_+ ρ_k θ_k θ̃_k ]   (4.7)

where ρ_k, ρ̂_k, θ_k and θ̃_k are defined in Theorem 2.2.3. We can show
Proposition 4.2.1. Define

L_∞(λ) = ∑_{k=1}^r ρ²_k + ∑_{k=1}^r [ (ρ̂_k − λ)²_+ − 2 (ρ̂_k − λ)_+ ρ_k θ_k θ̃_k ].

Then under the assumptions of Theorem 2.2.3, L_∞(λ) is an increasing function of λ when λ ≥ 1 + √γ. Moreover, lim_{λ↓(1+√γ)} ∇_λ L_∞(λ) > 0.
Proof. Denote ρ̂_{r+1} = 1 + √γ. As L_∞(λ) is continuous, to show that it is increasing in λ we only need to show that L_∞(λ) is increasing on [ρ̂_{K+1}, ρ̂_K) for any K = 1, 2, ..., r. Given K, the function L_∞(λ) is quadratic on [ρ̂_{K+1}, ρ̂_K):

L_∞(λ) = ∑_{k=1}^r ρ²_k + ∑_{k=1}^K [ (ρ̂_k − λ)² − 2 (ρ̂_k − λ) ρ_k θ_k θ̃_k ].

Then

∇_λ L_∞(λ) = −2 ∑_{k=1}^K (ρ̂_k − λ) + 2 ∑_{k=1}^K ρ_k θ_k θ̃_k = 2K [ λ − (1/K) ∑_{k=1}^K ( ρ̂_k − ρ_k θ_k θ̃_k ) ].

It is enough to show that ρ̂_k − ρ_k θ_k θ̃_k is a strictly decreasing function of ρ_k when ρ_k > γ^{1/4}, and that ρ̂_k − ρ_k θ_k θ̃_k = 1 + √γ when ρ_k ≤ γ^{1/4}. The reason is that then we have ∇_λ L_∞(λ) > 0 inside [ρ̂_{K+1}, ρ̂_K) for any K = 1, 2, ..., r. By plugging in the expressions for ρ̂_k, θ_k and θ̃_k in Theorem 2.2.3 with σ = 1, when ρ_k > γ^{1/4} we get

ρ̂_k − ρ_k θ_k θ̃_k = √( (ρ²_k + 1)(ρ²_k + γ) ) / ρ_k − (1/ρ_k) · (ρ⁴_k − γ) / √( (ρ²_k + 1)(ρ²_k + γ) ).

Taking the derivative with respect to ρ_k, we get

∇_{ρ_k} ( ρ̂_k − ρ_k θ_k θ̃_k ) = − [ (1 + γ)ρ⁶_k + 6γρ⁴_k + 3γ(1 + γ)ρ²_k + 2γ² ] / [ ρ²_k ( (ρ²_k + 1)(ρ²_k + γ) )^{3/2} ] < 0.

Thus ρ̂_k − ρ_k θ_k θ̃_k is a strictly decreasing function of ρ_k when ρ_k > γ^{1/4}. When ρ_k ≤ γ^{1/4}, Theorem 2.2.3 gives ρ̂_k = 1 + √γ and θ_k θ̃_k = 0, so ρ̂_k − ρ_k θ_k θ̃_k = 1 + √γ, completing the proof.
Proposition 4.2.1 indicates that asymptotically λ_Opt < 1 + √γ; in other words, the optimal soft-thresholding will include even more than the true number of factors in order to minimize the estimation error of X. Also, from the above proof we can see that the larger the factor strengths (ρ_1, ..., ρ_r) are, the smaller λ_Opt is, which means that the rank of X̂_{λ_Opt} increases when there are stronger factors, making X̂_{λ_Opt} less accurate. From our simulations, we find that the solution of (4.1) has the same problem.
To overcome the bias of soft-thresholding, Shabalin and Nobel (2013) proposed the optimal shrinker (2.12), the shrinkage of the singular values that minimizes the asymptotic estimation error. Assuming Σ = σ²I_N, the optimal shrinkage estimator has the form

X̂_sk = √n U η(D) V^T   (4.8)

where η(D) = diag( η(d_1), ..., η(d_min(N,n)) ). The shrinkage function η(·) is defined as

η(d) = (σ²/d) √( (d²/σ² − γ − 1)² − 4γ )  if d ≥ (1 + √γ)σ,  and 0 otherwise.   (4.9)
Comparing (4.9) with soft-thresholding, the optimal shrinkage has the property that it shrinks larger singular values less but shrinks smaller singular values more. This provides another point of view on why soft-thresholding works badly when there are strong factors: it shrinks the larger sample singular values too much even though they are actually close to the true values. The optimal shrinkage can be more accurate than soft-thresholding, but it is hard to generalize it to fit the heteroscedastic noise factor analysis model. Unfortunately, there is no convex penalty function to replace the nuclear norm ‖X‖_* that has the optimal shrinkage as its solution.
4.3 A hybrid method: POT-S
Based on the discussion in Section 4.2.2, we propose a hybrid method (POT-S) that combines POT, which minimizes (4.1), with the optimal shrinkage (4.8).
If we knew Σ, we could apply the optimal shrinkage (4.8) with σ² = 1 to the whitened data matrix Σ^{−1/2}Y. However, we do not know Σ, so one hybrid approach is to first estimate Σ by the Σ̂_λ obtained from minimizing (4.1), and then apply the optimal shrinkage to Σ̂_λ^{−1/2} Y. More specifically, for a given λ, let Ỹ_λ = Σ̂_λ^{−1/2} Y and

X̂*_λ = √n · Σ̂_λ^{1/2} U_{Ỹ_λ} η(D_{Ỹ_λ}) V^T_{Ỹ_λ},  Σ̂*_λ = Σ̂_λ.   (4.10)
Here, for a matrix Z ∈ R^{N×n}, we use U_Z, V_Z and D_Z to denote its left and right singular vectors and its singular values, with Z = √n U_Z D_Z V^T_Z.
However, one drawback of X̂*_λ is that it depends on the estimate Σ̂_λ only. Another choice is to have our estimate depend on both Σ̂_λ and X̂_λ. As the main problem with X̂_λ is that it shrinks the large singular values too much while inadequately shrinking the small singular values, we can replace the singular values of X̂_λ with those of X̂*_λ:

X̂**_λ = √n · U_{X̂_λ} D_{X̂*_λ} V^T_{X̂_λ},  Σ̂**_λ = Σ̂_λ.   (4.11)
We also define the optimal λ for X̂*_λ and X̂**_λ respectively as

λ*_Opt = argmin_λ ‖X̂*_λ − X‖²_F,  λ**_Opt = argmin_λ ‖X̂**_λ − X‖²_F.

From our simulations (see Section 4.6), we find that X̂**_{λ**_Opt} is more accurate than X̂*_{λ*_Opt}. Thus, given λ, we propose the POT-S method using X̂**_λ and Σ̂**_λ as the final estimates.
When applying POT-S to a dataset, we need to determine λ. The goal is to find a λ that is as close as possible to the unknown λ**_Opt. We use a Wold-style cross-validation approach, which is discussed in the next section.
4.4 Wold-style cross-validatory choice of λ
We use a cross-validation technique to determine the optimal λ.
There are two types of cross-validation for unsupervised learning. One is bi-cross-validation
(BCV) discussed in Chapter 3 which randomly holds out blocks of the data matrix. The other one
is Wold-style cross-validation which randomly holds out entries of the data matrix. Both techniques
are effective for selecting tuning parameters based on prediction performance if used properly. In
Chapter 3, we used BCV because of the theory proposed by Perry (2009) on choosing the size of
the holdout matrix. For POT and POT-S, we use Wold-style cross-validation mainly for three reasons: 1) for BCV, we find empirically that the estimate of λ is sensitive to the size of the holdout matrix, and we lack a corresponding theory for choosing that size for POT/POT-S; 2) for Wold-style cross-validation, the convex optimization step in POT/POT-S can easily handle missing entries; 3) in Wold-style cross-validation, the estimate of λ is not very sensitive to the fraction of held-out entries once that fraction is small.
Wold-style cross-validation (Wold, 1978) for a matrix Y starts by uniformly at random selecting some entries of Y as held-out data, with the rest as held-in data. We apply POT-S to the held-in data, and calculate the prediction error on the held-out entries. We then repeat this random entry selection step several times and choose the λ that minimizes the average prediction error.
Specifically, for one random selection, define an index matrix M = (m_ij)_{N×n} ∈ {0, 1}^{N×n}, where m_ij = 1 if the entry is held in and m_ij = 0 if it is held out. Then we estimate X and Σ from only the held-in data, treating the held-out entries as missing. Let n_i = ∑_{j=1}^n m_ij for i = 1, 2, ..., N. For the joint convex optimization step, the objective function changes from (4.5) to

L_λ(X; Y, M) = ∑_{i=1}^N √(n_i/n) √( ∑_j m_ij (x_ij − y_ij)² ) + λ ‖X‖_*.   (4.12)

Define the joint convex optimization estimates as

X̂_{λ,M} = argmin_X L_λ(X; Y, M), and σ̂²_{i,λ,M} = ( ∑_j m_ij (x̂_{ij,λ,M} − y_ij)² ) / n_i.
For the optimal shrinkage step, we need a full data matrix Y, so the strategy is to first fill in the held-out entries based on the held-in entries. A direct approach is to replace each missing held-out entry by the entry of X̂_{λ,M} in the corresponding position. However, for an entry y_ij = x_ij + σ_i e_ij, x̂_{ij,λ,M} would not be a good approximation of y_ij, since it approximates x_ij but does not include the noise term σ_i e_ij. Thus, another approach is to also estimate σ_i e_ij, using the bootstrap. The estimate σ̂_i ê_ij is obtained by randomly sampling from { y_{ij_t} − x̂_{ij_t,λ,M} : t = 1, 2, ..., n_i }, where the y_{ij_t} are the non-missing entries in the ith row. The held-out entry y_ij is then filled in by x̂_{ij,λ,M} + σ̂_i ê_ij.
Denote the filled-in matrix by Ỹ_{λ,M}. The POT-S estimate X̂**_{λ,M} is then given by applying the optimal shrinkage step of POT-S to Ỹ_{λ,M} with Σ̂_{λ,M}. The prediction error for the held-out entries is

PE_λ(Y, M) = ( ∑_{i,j: m_ij=0} ( y_ij − x̂**_{ij,λ,M} )² ) / ( ∑_{i,j} 1{m_ij = 0} ).
The above random entry selection step is repeated independently S times, yielding the average Wold-style cross-validation mean squared prediction error for Y:

PE(λ) = (1/S) ∑_{s=1}^S PE_λ(Y, M^(s)),

where M^(s) is the index matrix for the sth repetition of the random entry selection. Finally, the cross-validation estimate of λ is

λ̂** = argmin_{λ ∈ [λ_min, λ_max]} PE(λ).   (4.13)

We use a grid search to find λ̂** within the range [λ_min, λ_max]. We find that setting λ_min = 0.5 λ_theo and λ_max = 1.3 λ_theo, where λ_theo = 1 + √γ, gives a wide enough range.
We also find empirically (see Tables 4.5 and 4.6) that this bootstrap-CV approach is better than simply filling in the held-out entries by the entries of X̂_{λ,M}. Thus, the POT-S method adopts the bootstrap-CV approach for cross-validation. For the fraction of entries to be held out, we find empirically that holding out 10% of the entries makes λ̂** a good estimate of λ**_Opt.
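The cross-validation loop itself is simple; the R sketch below outlines it with `pot_s_fit` as a hypothetical wrapper for the POT-S steps above, and, for brevity, it fills held-out entries with x̂_{ij,λ,M} directly instead of the bootstrap fill-in.

    # A minimal sketch of Wold-style cross-validation over a grid of lambda values.
    wold_cv <- function(Y, lambdas, S = 5, frac_out = 0.1) {
      N <- nrow(Y); n <- ncol(Y)
      pe <- matrix(NA, S, length(lambdas))
      for (s in seq_len(S)) {
        M <- matrix(1, N, n)                              # held-in indicator
        M[sample(N * n, round(frac_out * N * n))] <- 0    # hold out ~10% of entries
        for (l in seq_along(lambdas)) {
          fit <- pot_s_fit(Y, M, lambdas[l])              # hypothetical POT-S fit
          pe[s, l] <- mean((Y[M == 0] - fit$X[M == 0])^2) # held-out prediction error
        }
      }
      lambdas[which.min(colMeans(pe))]
    }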
4.5 Computation: an ADMM algorithm
In this section, we describe the algorithm minimizing the objective function L_λ(X; Y, M) in (4.12). Optimizing the original objective (4.5) is the special case with n_i ≡ n. We denote the Hadamard product of two matrices A = (a_ij)_{m×n} and B = (b_ij)_{m×n} as A ∘ B = (a_ij b_ij)_{m×n}, and M = (m_ij)_{N×n} is the indicator matrix of held-in entries.
4.5.1 The ADMM algorithm
For notational convenience, denote

f(X) = ∑_{i=1}^N √(n_i/n) √( ∑_j m_ij (x_ij − y_ij)² ),  g(X) = ‖X‖_*.

Since both f and g are non-smooth, we use ADMM to solve the problem (Boyd et al., 2011). For a given α, the ADMM iteration for this problem is

X^{k+1} = prox_{αf}(Z^k − U^k),
Z^{k+1} = prox_{αλg}(X^{k+1} + U^k), and
U^{k+1} = U^k + X^{k+1} − Z^{k+1},   (4.14)

where prox_h(·) is the proximal operator of a function h(·), defined as

prox_h(v) = argmin_u { h(u) + (1/2) ‖u − v‖²_2 }.

Recall that for a matrix Z, its SVD is denoted Z = √n U_Z D_Z V^T_Z. Also, for a diagonal matrix D = diag(d_1, d_2, ..., d_m), we denote by D_λ = diag((d_1 − λ)_+, ..., (d_m − λ)_+) its soft-thresholding.
Fact 1. Both prox_{αf}(·) and prox_{αλg}(·) have closed forms:

prox_{αf}(W) = ( Y + diag( ( 1 − α √(n_i/n) / ‖(W_i· − Y_i·) ∘ M_i·‖_2 )_+ ) (W − Y) ) ∘ M + W ∘ (1_N 1_n^T − M),   (4.15)

prox_{αλg}(W) = √n U_W D_{W,αλ} V^T_W.   (4.16)
Proof. For prox_{αf},

prox_{αf}(W) = argmin_X { f(X) + (1/(2α)) ‖X − W‖²_F }
             = argmin_X ∑_{i=1}^N { √(n_i/n) √( ∑_j m_ij (x_ij − y_ij)² ) + (1/(2α)) ‖X_i· − W_i·‖²_2 },

which separates across rows. For row i, substituting X̃_i· = X_i· − Y_i·,

prox_{αf}(W)_{i·} = argmin_{X_i·} { √(n_i/n) √( ∑_{j: m_ij=1} (x_ij − y_ij)² ) + (1/(2α)) ‖X_i· − W_i·‖²_2 }
                 = Y_i· + argmin_{X̃_i·} { √(n_i/n) √( ∑_j m_ij x̃²_ij ) + (1/(2α)) ‖X̃_i· + Y_i· − W_i·‖²_2 }
                 = ( Y_i· + ( 1 − α √(n_i/n) / ‖(W_i· − Y_i·) ∘ M_i·‖_2 )_+ (W_i· − Y_i·) ) ∘ M_i· + W_i· ∘ (1_n^T − M_i·).

Thus, (4.15) holds.
For prox_{αλg},

prox_{αλg}(W) = argmin_X { ‖X‖_* + (1/(2αλ)) ‖X − W‖²_F } = √n U_W D_{W,αλ} V^T_W.
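Both operators are cheap to implement. Here is a minimal R sketch (function names are ours) of the block soft-thresholding (4.15) and the singular value soft-thresholding (4.16):

    # prox of alpha*f: rows are pulled towards Y by block soft-thresholding (4.15);
    # held-out entries (m_ij = 0) are left at W.
    prox_f <- function(W, Y, M, alpha) {
      n <- ncol(Y); ni <- rowSums(M)
      R <- (W - Y) * M
      rnorm2 <- pmax(sqrt(rowSums(R^2)), .Machine$double.eps)  # ||(W_i - Y_i) o M_i||_2
      shrink <- pmax(1 - alpha * sqrt(ni / n) / rnorm2, 0)
      (Y + shrink * (W - Y)) * M + W * (1 - M)
    }

    # prox of alpha*lambda*g: soft-threshold the singular values as in (4.16).
    prox_g <- function(W, tau) {
      n <- ncol(W)
      s <- svd(W / sqrt(n))
      sqrt(n) * s$u %*% (pmax(s$d - tau, 0) * t(s$v))
    }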
The above fact shows that at each step of the ADMM iteration, X is first shrunk towards Y and then shrunk towards 0. The size of λ decides which direction of shrinkage dominates. We obtain a low-rank estimate X̂ when λ is large enough; when λ is too small, some of the σ̂²_i will be estimated as 0.
We adopt the stopping rule used in Boyd et al. (2011): the ADMM algorithm is terminated when both the primal and dual residuals are small. Here, the primal and dual residuals are

R^k = X^k − Z^k, and S^k = (1/α)(Z^{k+1} − Z^k).

It can be shown that

f(X^k) + λ g(Z^k) − p* ≤ (1/α) ‖U^k‖_F ‖R^k‖_F + ‖X^k − X*‖_F ‖S^k‖_F.

Let ε_abs > 0 be an absolute tolerance and ε_rel the relative error per entry. We stop when both

‖R^k‖_F ≤ √(nN) ε_abs + ε_rel max{ ‖X^k‖_F, ‖Z^k‖_F } and
‖S^k‖_F ≤ √(nN) ε_abs + (ε_rel/α) ‖U^k‖_F.
4.5.2 Techniques to reduce computational cost
The above algorithm is quite expensive when used inside the POT-S method. In this section, we discuss three techniques we use to reduce the computational cost: varying the penalty parameter, an approximate SVD, and warm starts.
Varying step size α
The above ADMM converges slowly if we insist on high accuracy. One modification that reduces the number of iterations to convergence is to change the step size α on every iteration (Boyd et al., 2011). We use this simple scheme:

α^{k+1} = α^k/2 if ‖R^k‖_F > 10 ‖S^k‖_F;  α^{k+1} = 2α^k if ‖S^k‖_F > 10 ‖R^k‖_F;  α^{k+1} = α^k otherwise.

The intuition is that a smaller α puts more weight on reducing the primal residual R^k, while a larger α reduces the dual residual S^k.
Acceleration by avoiding full SVD
The computational bottleneck in each iteration is the SVD required to compute prox_{αλg}(W). A full SVD computing all the singular values and vectors of W would be very time-consuming. However, if W is known to be of low rank, then there is no need for a full SVD. Moreover, computing the singular value and vector pairs only for d_i(W) > αλ is adequate for the soft-thresholding. Both of these reasons suggest a partial SVD. Computing partial SVDs for such singular value soft-thresholding iterations is also widely suggested in the matrix completion literature (Cai et al., 2010).
A partial SVD computes only the first K singular values and vectors. As our code is written in R, we use the "svd" package, which provides the PROPACK (Larsen, 1998) and nu-TRLAN (Yamazaki et al., 2010) implementations of the partial SVD. From our experience, nu-TRLAN is slightly faster than PROPACK and less likely to yield an error message. Also, if K is not small enough (K > 0.2 min(N, n)), there is no acceleration from computing a partial SVD with either PROPACK or nu-TRLAN, as suggested in Wen et al. (2012) and confirmed in our simulations. In that situation we switch back to the full SVD.
To compute the partial SVD of a matrix W, we need to decide the rank K for W. As suggested in Cai et al. (2010), we can use information from previous iterations. Here is how this is done. We initialize with a low-rank Z (either using Z = 0 or the Z = X̂ computed for a larger λ). Then, as the iterations go on, the rank of Z^k tends to increase slowly. We guess an upper bound on rank(Z^{k+1}) as r^{k+1} = rank(Z^k) + 5. After computing Z^{k+1} using r^{k+1}, if rank(Z^{k+1}) < r^{k+1} then our guess succeeded. If not, we recompute Z^{k+1} using the full SVD.
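A minimal R sketch of this adaptive prox step follows; it uses propack.svd from the "svd" package for simplicity (nu-TRLAN would be used analogously), and the rank-guessing logic is a simplified version of the scheme just described.

    # A sketch of prox_{alpha*lambda*g} using a partial SVD with rank guessing.
    library(svd)

    prox_g_adaptive <- function(W, tau, K_guess) {
      n <- ncol(W); Wn <- W / sqrt(n)
      if (K_guess > 0.2 * min(dim(W))) {
        s <- base::svd(Wn)                        # a partial SVD no longer pays off
      } else {
        s <- propack.svd(Wn, neig = K_guess)      # first K_guess singular triplets
        if (sum(s$d > tau) == K_guess) {          # all computed values survive the
          s <- base::svd(Wn)                      # threshold: the guess may be too
        }                                         # small, so redo a full SVD
      }
      d <- pmax(s$d - tau, 0)
      list(Z = sqrt(n) * s$u %*% (d * t(s$v)), rank = sum(d > 0))
    }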
Warm start
In the cross-validation step, we need to select the best λ from the range [λ_min, λ_max] via a grid search. Thus, we in fact need a solution path minimizing the objective function L_λ(X; Y, M) in (4.12) for λ_1 < λ_2 < ··· < λ_M.
To compute the solution path, we start from the largest λ, which gives a very low-rank estimate X̂_{λ,M}. To compute the solution for λ_{m+1}, we use the final values of X, Z and U defined in (4.14) when solving for λ_m as the starting values of X, Z and U in the optimization for λ_{m+1}. This is called a "warm start" in the optimization literature (Boyd et al., 2011). Also, since a small λ results in a higher rank X̂, which takes a long time to compute, we want to avoid useless cross-validation work at small λ. In the cross-validation step, we therefore start from the largest λ and stop when σ̂_i = 0 for some i ∈ {1, 2, ..., N}.
4.6 Simulation results
For our simulations, we use the same data generating scheme as described in Section 3.4.1. As the properties of ESA-BCV have already been compared thoroughly with other existing methods, with ESA-BCV showing an advantage, here we mainly compare POT-S with ESA-BCV.
4.6.1 Compare the oracle performances
Here we compare the oracle performances of five methods: the ESA method of Chapter 3 (ESA), the quasi-maximum likelihood method (QMLE), the method using only the joint convex optimization solution X̂_λ in (4.2) (POT), the hybrid method using X̂*_λ (POT-S-0), and the hybrid method using X̂**_λ (POT-S). The oracle estimation error of a method M is denoted Err_X(M). For ESA and
QMLE, it’s defined as
ErrX(ESA) = Err(XESA(kESA
Opt ))
= mink
Err(XESA(k)
)
CHAPTER 4. AN OPTIMIZATION-SHRINKAGE HYBRIDMETHOD FOR FACTORANALYSIS60
ErrX(QMLE) = Err(XQMLE(kQMLE
Opt ))
= mink
Err(XQMLE(k)
)where XESA(k) and XQMLE(k) are the estimates given the rank k. For the three methods based on
the joint convex optimization, the oracle estimation errors of X are denoted as
ErrX(POT) = Err(XλOpt
), ErrX(POT-S-0) = Err
(X?λ?Opt
), and
ErrX(POT-S) = Err(X??λ??Opt
Table 4.1 compares the oracle error in estimating X of POT-S with the four other approaches. It is clear that POT-S has the smallest oracle error. The detailed result for each factor strength scenario and matrix size combination is shown in Table 4.3. Table 4.3 shows that when there are many strong factors but few weak factors (Scenarios 2, 3 and 4), POT, which uses only the joint convex optimization solution, performs the worst among the methods. Because of the optimal shrinkage step, POT-S and POT-S-0 perform better than ESA, which essentially applies hard-thresholding to the singular values, and POT, which is based on singular value soft-thresholding. Comparing POT-S and POT-S-0, we see that X̂**_λ, which depends on both X̂_λ and Σ̂_λ, has a better oracle error than X̂*_λ, which relies only on Σ̂_λ. This should convince the reader that in POT-S we should adopt X̂**_λ as the estimator instead of X̂*_λ.
Table 4.1: Assessment of the oracle error in estimating X using four measurements. For each of Var[σ²_i] = 0, 1 and 10, the average for every measurement is the average over 10 × 6 × 100 = 6000 simulations, and the standard deviation is the standard deviation over these 6000 simulations.
POT-S can also estimate Σ more accurately. For an estimator Σ̂, define the estimation error of Σ as

R(Σ̂, Σ) = ∑_{i=1}^N | log(σ̂²_i) − log(σ²_i) |.
We then compare the error in estimating Σ at the tuning parameter where the oracle error in estimating X is achieved. In other words, define

Err_Σ(ESA) = R( Σ̂_ESA(k^ESA_Opt), Σ ),  Err_Σ(QMLE) = R( Σ̂_QMLE(k^QMLE_Opt), Σ ),
Err_Σ(POT) = R( Σ̂_{λ_Opt}, Σ ),  Err_Σ(POT-S-0) = R( Σ̂_{λ*_Opt}, Σ ),  Err_Σ(POT-S) = R( Σ̂_{λ**_Opt}, Σ ).
The comparison among the methods is summarized in Table 4.2, with a more detailed breakdown in Table 4.3. We do not compare the oracle error in estimating Σ the way we did for X, for two reasons. One is that the oracle errors in estimating X and Σ usually cannot be achieved at the same tuning parameter, and the other is that achieving the oracle error in estimating Σ
Table 4.4: Four measurements comparing the error in estimating Σ when the oracle error of X is achieved, under various (N, n) pairs and factor strength scenarios with Var(σ²_i) = 1. Type-1 to Type-6 correspond to the six scenarios in Table 3.1.
larger due to the gap between the oracles of ESA and POT-S.
Figure 4.1 shows the survival curves of the REE in estimating X for the three methods we compare. All three methods perform better for larger matrices, where CV-Boot attains an almost perfect recovery of λ**_Opt. For smaller matrices, there is barely any significant improvement of CV-Boot over Wold-CV on average, mainly because the bootstrap estimate of the noise needs a larger n for good accuracy.
Tables 4.5 and 4.6 give more details of the simulation results. First, CV-Boot is more accurate than Wold-CV in most of the cases. Also, the rank of the oracle estimate tracks well the theoretical threshold that there are 7 detectable factors.
Finally, we compare these methods in estimating Σ against Σ̂_{λ**_Opt}. Define

REE_Σ(λ) = R(Σ̂_λ, Σ) / R(Σ̂_{λ**_Opt}, Σ) − 1
[Figure 4.1 appears here: REE survival curves for ESA-BCV, Wold-CV and CV-Boot; panels (a) All datasets, (b) Large datasets only, (c) Small datasets only.]

Figure 4.1: REE survival plots for estimating X: the proportion of samples with REE exceeding the number on the horizontal axis. Figure 4.1a shows all 6000 samples. Figure 4.1b shows only the 3000 simulations of larger matrices of each aspect ratio. Figure 4.1c shows only the 3000 simulations of smaller matrices.
and redefine REE_Σ(ESA-BCV) accordingly. In Table 4.7 we see that CV-Boot still performs better than Wold-CV in most of the cases.
Table 4.5: Comparison of REE and the rank of X̂ with various (N, n) pairs and scenarios. For each scenario, the factors' strengths are listed as the number of "strong/useful/harmful/undetectable" factors. For each (N, n) pair, the first column is the REE and the second column the rank of the estimated matrix. Both values are averages over 100 simulations. Var[σ²_i] = 1.
Table 4.7: Comparison of REE_Σ for various (N, n) pairs and scenarios. For each scenario, the factors' strengths are listed as the number of "strong/useful/harmful/undetectable" factors. The values are averages over 100 simulations. Var[σ²_i] = 1.
Chapter 5
Confounder adjustment with factor
analysis
In this chapter, we discuss a multiple regression model whose bias is corrected by factor analysis.
The motivating problem is to correct for the biases and correlation of individual test statistics in multiple hypothesis testing. In many scientific studies, for example microarray analysis, tens of thousands of tests are typically performed simultaneously. A typical model is that each individual test statistic comes from a linear regression of the response variable on the variable of interest, together with other known covariates. However, there can be unknown factors that affect the response variables in many of the individual hypotheses, inducing correlation among the individual test statistics. Moreover, those latent factors can also be correlated with the variable of interest, in which case the test statistics are not only correlated but also confounded. We use the word "confounding" to emphasize that these latent factors can significantly bias the individual p-values. Simultaneous inference such as false discovery rate (FDR) control requires independent and correct individual p-values. Many confounder adjustment methods have been proposed for multiple testing over the last decade (Gagnon-Bartsch et al., 2013; Leek and Storey, 2008b; Price et al., 2006; Sun et al., 2012). Our goal is to unify these methods in a common framework and study their statistical properties based on theoretical results in factor analysis.
In microarray data analysis, common sources of confounding factors include unknown technical bias (Gagnon-Bartsch et al., 2013), environmental changes (Fare et al., 2003; Gasch et al., 2000) and surgical manipulation (Lin et al., 2006). See Lazar et al. (2013) for a survey. In many studies, especially observational clinical research and human expression data, the latent factors, whether genetic or technical, are confounded with the primary variables of interest due to the observational nature of the studies and the heterogeneity of samples (Ransohoff, 2005; Rhodes and Chinnaiyan, 2005). Similar confounding problems also occur in other high-dimensional datasets such as brain imaging (Schwartzman et al., 2008) and metabonomics (Craig et al., 2006).
Notation. Subscripts of matrices are used to indicate row(s) whenever possible. For example, if
C is a set of indices, then XC is the corresponding rows of a matrix X. A random matrix E ∈ Rn×p
is said to follow a matrix normal distribution with mean M ∈ Rn×p, row covariance U ∈ Rn×n
and column covariance V ∈ Rp×p, abbreviated as E ∼ MN (M,U, V ), if the vectorization of E by
column follows the multivariate normal distribution vec(E) ∼ N(vec(M), V ⊗ U). When U = In,
this means the rows of E are i.i.d. N(0, V ). We use the usual notation in asymptotic statistics that a
random variable is Op(1) if it is bounded in probability, and op(1) if it converges to 0 in probability.
Bold symbols Op(1) or op(1) mean each entry of the vector is Op(1) or op(1).
5.1 The model and the algorithm
5.1.1 A statistical model for confounding factors
We consider a single primary variable of interest and no other known control variables in this section.
It is common to add intercepts and known effects (such as lab and batch effects) in the regression
model. This extension to multiple linear regression does not change the main theoretical results of this chapter and is discussed later in Section 5.3.
For simplicity, all the variables in this section are assumed to have mean 0 marginally. Our model is built on the linear model already widely used in the existing literature, which we rewrite here:

Y_{N×n} = β_{N×1} Z_{1×n} + L_{N×r} F_{r×n} + Σ^{1/2} E_{N×n},   (5.1a)

where Y is the observed data matrix of responses, Z is the variable of interest and F contains the latent factor variables (or confounders). Each row represents a variable and each column represents a sample. We assume a random factor score model for the dependence between F and the primary variable Z; specifically, a linear relationship

F = αZ + W,   (5.1b)

and in addition some distributional assumptions on Z, W and the noise matrix E:

Z_j ~iid (mean 0, variance 1), j = 1, ..., n,   (5.1c)
W ~ MN(0, I_r, I_n), W ⊥ Z,   (5.1d)
E ~ MN(0, I_N, I_n), E ⊥ (Z, F).   (5.1e)

The parameters in the model (5.1) are β ∈ R^{N×1}, the primary effects we are most interested in; L ∈ R^{N×r}, the factor loadings; α ∈ R^{r×1}, the association of the primary variable with the confounding factors; and Σ ∈ R^{N×N}, the noise covariance matrix. We assume Σ is diagonal, Σ = diag(σ²_1, ..., σ²_N),
so the noise for different outcome variables is independent.
In (5.1c), Z_j is not required to be Gaussian or even continuous. For example, a binary or categorical variable after normalization also meets this assumption. The parameter vector α measures how severely the data are confounded. For a more intuitive interpretation, consider an oracle procedure for estimating β when the confounders F in (5.1a) are observed. The best linear unbiased estimator in this case is the ordinary least squares (β̂^OLS_i, L̂^OLS_i), whose variance is σ²_i Var[(Z_j, F_j)]^{−1}/n. Using (5.1b) and (5.1d), it is easy to show that Var(β̂^OLS_i) = (1 + ‖α‖²_2) σ²_i / n and Cov(β̂^OLS_{i1}, β̂^OLS_{i2}) = 0 for i1 ≠ i2. In summary,

Var(β̂^OLS) = (1/n) (1 + ‖α‖²_2) Σ.   (5.2)

Notice that in the unconfounded linear model, in which F = 0, the variance of the OLS estimator of β is Σ/n. Therefore, 1 + ‖α‖²_2 represents the relative loss of efficiency when we add observed variables F, correlated with Z, to the regression. In Section 5.2, we show that the oracle efficiency (5.2) can be asymptotically achieved even when F is unobserved.
5.1.2 Model identification
Following Sun et al. (2012), we introduce a transformation of the data to make the identification issues clearer. Consider a Householder rotation matrix Q ∈ R^{n×n} such that ZQ = ‖Z‖_2 e_1^T = (‖Z‖_2, 0, 0, ..., 0). Right-multiplying Y by Q, we get Ỹ = YQ = β ‖Z‖_2 e_1^T + L F̃ + Σ^{1/2} Ẽ, where

F̃ = FQ = (αZ + W)Q = ‖Z‖_2 α e_1^T + W̃,   (5.3)

and W̃ = WQ =d W, Ẽ = EQ =d E. As a consequence, the first column and the remaining columns of Ỹ are
where F̃ = FQ and Ẽ = EQ =d E. Equation (5.29) corresponds to the nuisance parameters B_0 and is discarded according to the ancillarity principle. Equation (5.30) is the multivariate extension of (5.4) that is used to estimate B_1, and (5.31) plays the same role as (5.5) in estimating L and Σ.
We consider the asymptotics where n, N → ∞ and d, r are fixed and known. Since d is fixed, the estimation of L is no different from the simple regression case, and the results for the QMLE in Corollary 5.2.1 still hold under the same assumptions.
Let

Σ_Z^{−1} = Ω = [ Ω_00  Ω_01
                 Ω_10  Ω_11 ].
In the proofs of Theorems 5.2.1 and 5.2.3, we consider a fixed sequence of Z such that ‖Z‖_2/√n → 1. Similarly, we have the following lemma in the multiple regression scenario:

Lemma 5.3.1. As n → ∞, U_11 U_11^T / n →a.s. Ω_11^{−1}.
Proof. First, notice that by the strong law of large numbers,

(1/n) (Z_0; Z_1) (Z_0^T  Z_1^T) →a.s. Σ_Z.

Using the QR decomposition (Z_0^T  Z_1^T) = QU and writing U^T = (V  0) with

V = [ U_00  0
      U_10  U_11 ],

it is clear that V V^T / n →a.s. Σ_Z. Since Σ_Z is nonsingular, V, U_00 and U_11 are full rank square matrices with probability 1. Using the block matrix inversion formula, we have

V^{−1} = [ U_00^{−1}                    0
           −U_11^{−1} U_10 U_00^{−1}    U_11^{−1} ].

Therefore the bottom right block of n V^{−T} V^{−1} is n U_11^{−T} U_11^{−1}, which converges to Ω_11 almost surely. Thus the statement of the lemma holds.
Similar to (5.4), we can rewrite (5.30) as

Ỹ_1 U_11^{−1} = B_1 + L (A_1 + W̃_1 U_11^{−1}) + Σ^{1/2} Ẽ_1 U_11^{−1},

where W̃_1 ~ MN(0, I_r, I_{d_1}) is independent of Ẽ_1. As in Section 5.2, we derive statistical properties of the estimate of B_1 for a fixed sequence of Z, W̃_1 and F, which then also hold unconditionally. For simplicity, we assume that the negative controls are a known set of variables C with B_{1,C} = 0.
We can then estimate each column of A_1 by applying the negative control (NC) or robust regression (RR) approach discussed in Section 5.1.3 to the corresponding column of Ỹ_1 U_11^{−1}, and then estimate B_1 by

B̂_1 = Ỹ_1 U_11^{−1} − L̂ Â_1.

Notice that Σ^{1/2} Ẽ_1 U_11^{−1} ~ MN(0, Σ, U_11^{−T} U_11^{−1}). Thus the "samples" in the robust regression, which are actually the N variables of the original problem, are still independent within each column. Though the estimates of the different columns of A_1 may be correlated, we will show that this correlation does not affect inference on B_1. As a result, we still get asymptotic results similar to Theorem 5.2.3 for the multiple regression model (5.27):
Theorem 5.3.1. Under the assumptions of Corollary 5.2.1 and Assumptions 10 to 12, if n, N → ∞, then for any fixed index set S with finite cardinality |S|,

√n (B̂^NC_{1,S} − B_{1,S}) →d MN(0_{|S|×d_1}, Σ_S + Δ_S, Ω_11 + A_1^T A_1), and   (5.32)
√n (B̂^RR_{1,S} − B_{1,S}) →d MN(0_{|S|×d_1}, Σ_S, Ω_11 + A_1^T A_1),   (5.33)

where Δ_S is defined in Theorem 5.2.1.
Proof. First, for the known zero indices scenario, Â^NC_1 has the following formula, similar to (5.7):

Â^NC_1 = (L̂_C^T Σ̂_C^{−1} L̂_C)^{−1} L̂_C^T Σ̂_C^{−1} Ỹ_{1,C} U_11^{−1},   (5.34)

which implies a formula similar to (5.20):

√n (B̂^NC_{1,S} − B_{1,S}) = √n Ẽ_{1,S} U_11^{−1} − √n · L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} Ẽ_{1,C} U_11^{−1}
  + √n · (L̂^(0)_S − L_S) A^(0)_1 + √n · L_S (L_C^T Σ_C^{−1} L_C)^{−1} L_C^T Σ_C^{−1} (L_C − L̂^(0)_C) A^(0)_1 + o_p(1),   (5.35)

where A^(0)_1 = R^{−1}(A_1 + W̃_1 U_11^{−1}). Following the proof of Theorem 5.2.1 and using Lemma 5.3.1, we get (5.32).
For the sparsity scenario, Lemma 5.3.1 guarantees the consistency of each column of Â^RR_1 by Theorem 5.2.2. The Taylor expansion used in the proof of Theorem 5.2.3 then still works for each column of A^(0)_1. Similar to (5.25), we get

√n (B̂^RR_1 − B_1) = √n Ẽ_1 U_11^{−1} + √n (L̂^(0) − L) Â^RR_1 + L̂ (g_1  g_2  ···  g_{d_1}),   (5.36)

where g_i = [∇Ψ_p(A^(0)_{1,i})]^{−1} ( √n Ψ_p(A^(0)_{1,i}) + o_p(1) ). Following the proof of Theorem 5.2.3, each g_i = o_p(1). Thus

√n (B̂^RR_1 − B_1) = √n Ẽ_1 U_11^{−1} + √n · (L̂^(0) − L) Â^RR_1 + o_p(1)

and (5.33) holds.
As for the asymptotic efficiency of this estimator, we again compare it to the oracle OLS estimator
of B1 which observes confounding variables Z in (5.27). In the multiple regression model, we claim
that B̂^RR_1 still attains the oracle asymptotic efficiency. In fact, let B = (B_0  B_1  L). The oracle OLS estimator of B, B̂^OLS, is unbiased and its vectorization has variance Σ ⊗ V^{−1}/n, where

V = [ Σ_Z       Σ_Z A^T
      A Σ_Z     I_r + A Σ_Z A^T ],  for A = (A_0  A_1).

By the block-wise matrix inversion formula, the top left d×d block of V^{−1} is Σ_Z^{−1} + A^T A. The variance of B̂^OLS_1 only depends on the bottom right d_1×d_1 sub-block of this d×d block, which is simply Ω_11 + A_1^T A_1. Therefore B̂^OLS_1 is unbiased and its vectorization has variance Σ ⊗ (Ω_11 + A_1^T A_1)/n, matching the asymptotic variance of B̂^RR_1 in Theorem 5.3.1.
5.4 Numerical experiments
5.4.1 Simulation results
In this section we use numerical simulations to verify the theoretical asymptotic results and to further study the finite sample properties of our estimators and test statistics.
The simulation data are generated from the single primary variable model (5.1). More specifically, Z_j is a centered binary variable, with (Z_j + 1)/2 ~iid Bernoulli(0.5), and Y_{·j}, F_{·j} are generated according to (5.1).
For the parameters in the model, the noise variances are generated as σ²_i ~iid InvGamma(3, 2), i = 1, ..., N, so that E(σ²_i) = Var(σ²_i) = 1. We set all the entries equal, α_k = ‖α‖_2/√r for k = 1, 2, ..., r, with ‖α‖²_2 set to 1, so the proportion of the variance of Z explained by the confounding factors is R² = 50%. The primary effect β has independent components β_i taking the values 3√(1 + ‖α‖²_2) and 0 with probabilities π = 0.05 and 1 − π = 0.95, respectively, so the nonzero effects are sparse and have effect size 3. This implies that the oracle estimator has power approximately P(N(3, 1) > z_{0.025}) = 0.85 to detect the signals at a significance level of 0.05. We set the number of latent factors r to be either 2 or 10. For the latent factor loading matrix L, we take L = UD where U is an N×r orthogonal matrix sampled uniformly from the Stiefel manifold V_r(R^N), the set of all N×r orthogonal matrices. As we assume strong factors, we set the latent factor strengths to D = √N · diag(d_1, ..., d_r) where d_k = 3 − 2(k − 1)/(r − 1), so that d_1 to d_r are evenly spaced over the interval [3, 1]. As the number of factors r can easily and consistently be estimated in this strong factor setting, we assume that the number r of factors is known to all of the algorithms in this simulation.
We set N = 5000 and n = 100 or 500, to mimic the data sizes of many genetic studies. For the negative control scenario, we choose |C| = 30 negative controls at random from the zero positions of β. We expect that the negative control methods would perform better with a larger value of |C| and worse with a smaller value. The choice |C| = 30 is around the size of the spike-in controls in many microarray experiments (Gagnon-Bartsch and Speed, 2012). For the loss function in our sparsity scenario, we use Tukey's bisquare, which is optimized via IRLS with an ordinary least squares fit as the starting value of the coefficients. Finally, each of the four combinations of n and r is randomly repeated 100 times.
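For reference, the R sketch below generates one dataset from this design for r = 2; the uniform Stiefel draw is approximated by the QR decomposition of a Gaussian matrix, and all variable names are ours.

    # One simulated dataset from model (5.1) under the settings described above.
    set.seed(1)
    N <- 5000; n <- 100; r <- 2
    Z <- sample(c(-1, 1), n, replace = TRUE)              # centered binary primary variable
    alpha <- rep(1 / sqrt(r), r)                          # ||alpha||_2^2 = 1
    beta <- ifelse(runif(N) < 0.05, 3 * sqrt(1 + sum(alpha^2)), 0)
    sigma2 <- 1 / rgamma(N, shape = 3, rate = 2)          # InvGamma(3, 2): mean = variance = 1
    U <- qr.Q(qr(matrix(rnorm(N * r), N, r)))             # approximately uniform orthonormal columns
    D <- sqrt(N) * diag(3 - 2 * (0:(r - 1)) / (r - 1), r) # strengths evenly spaced in [3, 1]
    L <- U %*% D
    Fmat <- alpha %*% t(Z) + matrix(rnorm(r * n), r, n)   # F = alpha Z + W, per (5.1b)-(5.1d)
    Y <- beta %*% t(Z) + L %*% Fmat + sqrt(sigma2) * matrix(rnorm(N * n), N, n)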
We compare the performance of nine different approaches. There are two baseline methods: the "naive" method estimates β by a linear regression of Y on just the observed primary variable Z and calculates p-values using the classical t-tests, while the "oracle" method regresses Y on both Z and the confounding variables F, as described in Section 5.1.1. There are three methods in the RUV-4/negative controls family: the RUV-4 method (Gagnon-Bartsch et al., 2013); our "NC" method, which computes test statistics using β̂^NC and its variance estimate (1 + ‖α̂‖²_2)(Σ̂ + Δ̂); and our "NC-ASY" method, which uses the same β̂^NC but estimates its variance by (1 + ‖α̂‖²_2)Σ̂. We compare four methods in the SVA/LEAPP/sparsity family: IRW-SVA (Leek and Storey, 2008b), LEAPP (Sun et al., 2012), the "LEAPP(RR)" method, which is our RR estimator using M-estimation at the robustness stage and computes the test statistics using (5.26), and the "LEAPP(RR-MAD)" method, which uses the median absolute deviation (MAD) of the test statistics in (5.26) to calibrate them (see Section 5.2).
To measure the performance of these methods, we report the type I error, the power, the false discovery proportion (FDP) and the precision of the hypotheses with the smallest 100 p-values over the 100 simulations. For both the type I error and the power, we set the significance level to 0.05. For the FDP, we use the Benjamini-Hochberg procedure with the FDR controlled at 0.2. These metrics are plotted in Figure 5.1 under the different settings of n and r.
First, from Figure 5.1, we see that the oracle method has exactly the type I error and FDP specified, while the naive method and SVA fail drastically. SVA performs better than the naive method in terms of the precision of the smallest 100 p-values, but is still much worse than the other methods. Next, for the negative control scenario, as we only have |C| = 30 negative controls, ignoring the inflated variance term Δ_S in Theorem 5.2.1 leads to overdispersed test statistics, which is why the type I error and FDP of both NC-ASY and RUV-4 are much larger than the nominal level. By contrast, the NC method correctly controls the type I error and FDP by accounting for the variance inflation, though as expected it loses some power compared with the oracle. For the sparsity scenario, the LEAPP(RR) method performs as the asymptotic theory predicts when n = 500, while for n = 100 the p-values seem a bit too small. This is not surprising because
CHAPTER 5. CONFOUNDER ADJUSTMENT WITH FACTOR ANALYSIS 86
r = 2 r = 10
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
n = 100
n = 500
NaiveIRW−SVA
NCNC−ASY
RUV−4
LEAPP(RR)
LEAPP(RR−MAD)
LEAPPOracle
NaiveIRW−SVA
NCNC−ASY
RUV−4
LEAPP(RR)
LEAPP(RR−MAD)
LEAPPOracle
Type I error Power FDP Top 100
Figure 5.1: Compare the performance of nine different approaches (from left to right): naive regres-sion ignoring the confounders (Naive), IRW-SVA, negative control with finite sample correction (NC) ineq. (5.17), negative control with asymptotic oracle variance (NC-ASY) in eq. (5.18), RUV-4, robust re-gression (LEAPP(RR)), robust regression with calibration (LEAPP(RR-MAD)), LEAPP, oracle regressionwhich observes the confounders (Oracle). The error bars are one standard deviation over 100 repeated sim-ulations. The three dashed horizontal lines from bottom to top are the nominal significance level, FDR leveland oracle power, respectively.
the asymptotic oracle variance in Theorem 5.2.3 can be optimistic when the sample size is not
sufficiently large. On the other hand, the methods which use empirical calibration for the variance
of test statistics, namely the original LEAPP and “LEAPP(RR-MAD)”, control both FDP and type
I error for data of small sample size in our simulations. The price for the finite sample calibration
is that it tends to be slightly conservative, resulting in a loss of power to some extent.
In conclusion, the simulation results are consistent with our theoretical guarantees when N is as
large as 5000 and n is as large as 500. When n is small, the variance of the test statistics will be
larger than the asymptotic variance for the sparsity scenario and we can use empirical calibrations
(such as MAD) to adjust for the difference.
Chapter 6
Conclusions
Factor analysis is a powerful dimension reduction tool with an explicit statistical model and assumptions. The main difference between factor analysis and the more popular PCA technique is that factor analysis accounts for heteroscedastic noise. There is no reason to believe that all the collected variables have the same noise level. Moreover, even if the raw data have homoscedastic noise, heteroscedasticity can arise from data transformations in the preprocessing steps (Woodward et al., 1998). In other words, the factor analysis model has more flexible and reasonable assumptions than the white noise model in many data problems. However, as the diagonal matrix of noise variances is also unknown, a factor analysis model is harder to fit, since there are more parameters to estimate.
For high-dimensional data, especially when both the variable and sample dimensions are large, noise heteroscedasticity is not a serious problem when there are only strong factors (defined in Chapter 2). Strong factors are easy to estimate for high-dimensional data, as more information is collected when an increasing number of variables is observed. PCA still gives consistent estimates of the factor loadings, scores and the noise variances. However, there are no theoretical results for PCA with weak factors and heteroscedastic noise.
The presence of weak factors complicates solving the factor model of a high-dimensional data
matrix. As discussed in Chapter 2, researchers in econometrics consider approximate factor
models where the data can be decomposed as linear combinations of a few strong factors plus weakly
correlated noise. In other words, in the econometrics literature, weak factors are treated as noise and
are responsible for the weak correlations in the noise. In Random Matrix Theory, on the other hand,
weak factors are treated as signals and the goal is to estimate them. Throughout this thesis, we
treat weak factors as signals.
In Chapter 3 and Chapter 4 we developed two approaches for estimating both the signal matrix
(factor loadings × factor scores) and the noise variances. Chapter 3 proposes an iterative algorithm,
ESA, and a bi-cross-validation (BCV) technique to estimate the number of factors. ESA can be considered
a heteroscedastic noise version of PCA/SVD, and bi-cross-validation randomly selects a block of
the matrix as held-out data. In Chapter 4, we proposed an alternative approach called POT-S,
which starts with a joint convex optimization using a perspective transformation and a nuclear norm penalty. At
the final stage, an optimal shrinkage of the singular values is applied to correct for the bias of the
solution of the optimization. The tuning parameter is selected using Wold-style cross-validation,
which randomly selects entries of the matrix as held-out data. A schematic contrast of the two
hold-out patterns is sketched below.
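Here is the schematic sketch of the two hold-out patterns (our own illustration; the matrix size and hold-out fractions are arbitrary), contrasting the (row block) × (column block) hold-out of BCV with the randomly scattered held-out entries of Wold-style cross-validation:

import numpy as np

rng = np.random.default_rng(2)
N, n = 12, 10
Y = rng.standard_normal((N, n))

# BCV: hold out the block formed by a random subset of rows and columns.
rows = rng.choice(N, size=N // 3, replace=False)
cols = rng.choice(n, size=n // 3, replace=False)
bcv_mask = np.zeros((N, n), dtype=bool)
bcv_mask[np.ix_(rows, cols)] = True

# Wold: hold out roughly 10% of the entries, chosen uniformly at random.
wold_mask = rng.random((N, n)) < 0.10

print("BCV held-out entries: ", int(bcv_mask.sum()))
print("Wold held-out entries:", int(wold_mask.sum()))
# In either scheme the model is fit to the visible entries only and scored by
# the squared error on the held-out ones, e.g. ((Y - Y_hat)[mask] ** 2).mean().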
Empirically, the factors retained by ESA-BCV are always fewer than those retained by POT-S.
One explanation is that the nuclear norm penalty in POT-S shrinks the barely detectable factors,
which makes them more "useful" for estimating the signal matrix than they would be in ESA-BCV.
In practice, ESA-BCV is the better tool for giving an interpretable estimate of
the number of factors, while POT-S is superior in reducing the error of the estimated signal matrix
and noise variances. Besides, the cross-validation error plot in ESA-BCV helps in analyzing the strength
of each factor.
ESA-BCV and POT-S are two algorithms for fitting a factor analysis model to high-dimensional
data empirically, while many questions remain to be answered in the theoretical analysis
of the model. One direction for future work is to develop random matrix theory for the factor model with both
strong and weak factors along with heteroscedastic noise. Another is to give upper
bounds for the estimation errors of the signal matrix and noise variances in the convex optimization algorithm
POT.
Neither ESA-BCV nor POT-S has made use of sparsity of the factors. It has been shown that
a sparsity assumption can greatly improve estimation of the weak factors. However, the scenario
becomes complicated when there are both strong and weak factors. A reasonable assumption is that
the sparsity of a factor's loadings is correlated with the factor's strength: a strong factor usually
has dense loadings, while the loadings of a weak factor are likely to be sparse. Adding a penalty that
encourages sparsity will decrease the accuracy in estimating the strong factors but improve estimation
of the weak factors. It is an interesting topic to design an adaptive penalization term based on the
strength and initial estimates of the factors.
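A hypothetical sketch of what such an adaptive penalty could look like (ours, not a method developed in this thesis): soft-threshold each factor's estimated loadings with a threshold inversely proportional to that factor's estimated strength, so that strong, dense factors are barely shrunk while weak factors are pushed toward sparse loadings.

import numpy as np

def adaptive_soft_threshold(L_hat, strengths, lam=0.5):
    # Entrywise soft-thresholding of an N x r loading matrix; the per-factor
    # threshold lam / strengths[k] decreases as factor k gets stronger.
    thresholds = lam / np.asarray(strengths)            # shape (r,)
    return np.sign(L_hat) * np.maximum(np.abs(L_hat) - thresholds, 0.0)

rng = np.random.default_rng(3)
L_hat = rng.standard_normal((8, 2))
# Factor 1 is strong (strength 10), factor 2 is weak (strength 0.8): the first
# column is barely shrunk while the second is heavily sparsified.
print(adaptive_soft_threshold(L_hat, strengths=[10.0, 0.8]))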
Confounder adjustment in multiple regression is an important application of high-dimensional
factor analysis. In Chapter 5 we analyzed a two-step algorithm for the linear regression model with
Gaussian noise. If there are only strong latent factors, we showed that we can get asymptotically
valid p-values for the individual primary effects with good power, even when the factors are con-
founded with the primary variables. The conditions for this result are that the primary effects are
either sparse enough or contain negative controls. When there are also weak factors, we find
empirically that the ranking of the p-values is still meaningful, while the p-values themselves can be biased
if the weak factors are confounded.
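The following simplified sketch conveys the flavor of the two-step procedure (our own stand-in, not the exact LEAPP or RUV-4 algorithm; the simulation settings, the SVD-based factor estimate and the plain Huber-type reweighting are all simplifications). It residualizes against the primary variable to estimate the loadings, then robustly regresses the naive effects on those loadings, relying on sparsity of the primary effects:

import numpy as np

rng = np.random.default_rng(4)
n, N, r = 200, 1000, 2
x = rng.standard_normal(n)                            # primary variable
Z = 0.5 * x[:, None] + rng.standard_normal((n, r))    # confounders correlated with x
Gamma = rng.standard_normal((N, r))                   # confounder loadings
beta = np.zeros(N); beta[:20] = 2.0                   # sparse primary effects
Y = np.outer(x, beta) + Z @ Gamma.T + rng.standard_normal((n, N))

# Naive per-variable effects: OLS of each column of Y on x (biased by confounding).
b = (x @ Y) / (x @ x)

# Step 1: estimate the loadings from the residuals after regressing out x.
resid = Y - np.outer(x, b)
_, _, Vt = np.linalg.svd(resid, full_matrices=False)
G_hat = Vt[:r].T                                      # N x r, spans col(Gamma)

# Step 2: b is roughly beta + G_hat @ delta with sparse beta, so a robust
# (Huber-type IRLS) regression of b on G_hat estimates the confounding part.
w = np.ones(N)
for _ in range(50):
    sw = np.sqrt(w)
    delta = np.linalg.lstsq(G_hat * sw[:, None], b * sw, rcond=None)[0]
    res = b - G_hat @ delta
    w = np.minimum(1.0, 1.0 / np.maximum(np.abs(res), 1e-12))  # Huber weights

beta_hat = b - G_hat @ delta
print("mean abs error, naive:   ", np.abs(b - beta).mean())
print("mean abs error, adjusted:", np.abs(beta_hat - beta).mean())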
To broaden the use of the confounder adjustment model, a future research direction is
to extend the model to confounder adjustment in multiple generalized linear regression. The model
could then be applied to non-Gaussian response matrices, such as binary data or counts, which appear
often in applications such as analyzing SNPs and DNA/RNA sequencing data.
Appendix A
Proof
This appendix gives the proof of Theorem 2.1.4.
We need the following two lemmas before we prove the results. The first lemma shows that the
product of two independent sub-Gaussian random variables is sub-exponential.
Lemma A.0.1. If Z1 and Z2 are independent sub-Gaussian random variables, then their product
Z1Z2 is a sub-exponential random variable.
Proof. W.l.o.g. assume that E[Z_1] = E[Z_2] = 0, thus E[Z_1 Z_2] = 0. As Z_1 and Z_2 are sub-
Gaussian random variables, using the results in Rivasplata (2012), there exist a_1 > 0 and b_2 > 0
such that
\[
E\big[e^{a_1 Z_1^2}\big] \le 2, \qquad E\big[e^{t Z_2}\big] \le e^{b_2^2 t^2 / 2}.
\]
Then for all |\lambda| \le \sqrt{2 a_1}/b_2, conditioning on Z_1 and applying the sub-Gaussian moment
bound to Z_2 gives
\[
E\big[e^{\lambda Z_1 Z_2}\big] \le E\big[e^{\lambda^2 b_2^2 Z_1^2 / 2}\big] \le E\big[e^{a_1 Z_1^2}\big] \le 2.
\]
Thus, there exists some b > 0 such that
\[
E\big[e^{\lambda Z_1 Z_2}\big] \le e^{v^2 \lambda^2 / 2}
\]
for all |\lambda| \le 1/b and v^2 \ge E[Z_1^2 Z_2^2]. Thus Z_1 Z_2 is a sub-exponential variable.
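As a quick numerical sanity check of the lemma (our own illustration, not part of the proof), the product of two independent standard normals shows the exponential-type tail of a sub-exponential variable: successive exceedance probabilities drop by a roughly constant factor as the threshold grows by a fixed step.

import numpy as np

rng = np.random.default_rng(5)
z1, z2 = rng.standard_normal((2, 2_000_000))
prod = z1 * z2
for t in (2.0, 4.0, 6.0, 8.0):
    print(f"P(|Z1*Z2| > {t:.0f}) = {np.mean(np.abs(prod) > t):.2e}")
# The probabilities drop by a roughly constant factor per fixed step in t,
# the signature of an exponential (sub-exponential) rather than Gaussian tail.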
Here is a restatement of Theorem 2.1.4:
Theorem 2.1.4. Under the assumptions of Theorem 2.1.3, and assuming that the e_{ij} are sub-Gaussian
random variables, if (\log N)^2 / n \to 0 as n, N \to \infty, then
\[
\max_{i \le N} \|\hat{L}_{i\cdot} - L_{i\cdot}\|_2 = O_p\big(\sqrt{\log N / n}\big), \qquad
\max_{i \le N} |\hat{\sigma}_i^2 - \sigma_i^2| = O_p\big(\sqrt{\log N / n}\big). \tag{2.8}
\]
For the non-random factor model,
\[
\max_{i=1,2,\cdots,N} \Big\| \hat{L}_{i\cdot} - L_{i\cdot} - \frac{1}{n} \sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Big\|_2 = o_p(n^{-1/2}). \tag{2.9}
\]
We prove uniform convergence of the estimated factor loadings and noise variances by making
intensive use of technical results in Bai and Li (2012a), and we also modify internal parts of their proof.
Before reading the following proof, we recommend that the reader first read the original proofs in
Bai and Li (2012a,b). To help the reader follow, the variables N, T, Λ (or Λ*) and f (or f*) in
Bai and Li (2012a) correspond to N, n, L and F in our notation. The identification condition in
Theorem 2.1.4 for the non-random factor score model corresponds to the IC3 identification condition
in Bai and Li (2012a). Define
\[
H = (\hat{L}^T \hat{\Sigma}^{-1} \hat{L})^{-1}, \qquad H_N = N H.
\]
The lemma below integrates Equation (A.14) of Bai and Li (2012a) with Equation (B.9) and the
statements of Lemma C.1 in Bai and Li (2012b).
Lemma A.0.2. Under the assumptions of Theorem 2.1.4, we have for any i = 1, 2, · · · , N :
\[
\begin{aligned}
\hat{L}_{i\cdot}^T - L_{i\cdot}^T
&= (\hat{L} - L)^T \hat{\Sigma}^{-1} L H L_{i\cdot}^T
 - H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L)(\hat{L} - L)^T \hat{\Sigma}^{-1} L H L_{i\cdot}^T \\
&\quad - H \hat{L}^T \hat{\Sigma}^{-1} L \Big(\frac{1}{n} F E^T\Big) \Sigma^{1/2} \hat{\Sigma}^{-1} L H L_{i\cdot}^T
 - H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} \Big(\frac{1}{n} E F^T\Big) L^T \hat{\Sigma}^{-1} L H L_{i\cdot}^T \\
&\quad - H \Big( \sum_{i_1=1}^N \sum_{i_2=1}^N \frac{\sigma_{i_1}\sigma_{i_2}}{\hat{\sigma}_{i_1}^2 \hat{\sigma}_{i_2}^2} \Big( \frac{1}{n} \big( E_{i_1\cdot} E_{i_2\cdot}^T - E[E_{i_1\cdot} E_{i_2\cdot}^T] \big) \Big) L_{i_1\cdot}^T L_{i_2\cdot} \Big) H L_{i\cdot}^T
 + H \sum_{i_1=1}^N \frac{\hat{\sigma}_{i_1}^2 - \sigma_{i_1}^2}{\hat{\sigma}_{i_1}^4} L_{i_1\cdot}^T L_{i_1\cdot} H L_{i\cdot}^T \\
&\quad + H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} \Big(\frac{1}{n} E F^T\Big) L_{i\cdot}^T
 + H \hat{L}^T \hat{\Sigma}^{-1} L \Big(\frac{1}{n} F E_{i\cdot}^T\Big) \sigma_i \\
&\quad + H \Big( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \Big( \frac{1}{n} \big( E_{i_1\cdot} E_{i\cdot}^T - E[E_{i_1\cdot} E_{i\cdot}^T] \big) \Big) L_{i_1\cdot}^T \Big)
 - H\, \frac{\hat{\sigma}_i^2 - \sigma_i^2}{\hat{\sigma}_i^2}\, L_{i\cdot}^T
\end{aligned}
\tag{A.1}
\]
\[
\begin{aligned}
\hat{\sigma}_i^2 - \sigma_i^2
&= \frac{1}{n} \sum_{j=1}^n (e_{ij}^2 - \sigma_i^2) - (\hat{L}_{i\cdot} - L_{i\cdot})(\hat{L}_{i\cdot} - L_{i\cdot})^T \\
&\quad + L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L)(\hat{L} - L)^T \hat{\Sigma}^{-1} L H L_{i\cdot}^T
 + 2 L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} L \Big(\frac{1}{n} F E^T\Big) \Sigma^{1/2} \hat{\Sigma}^{-1} L H L_{i\cdot}^T \\
&\quad + L_{i\cdot} H \Big( \sum_{i_1=1}^N \sum_{i_2=1}^N \frac{\sigma_{i_1}\sigma_{i_2}}{\hat{\sigma}_{i_1}^2 \hat{\sigma}_{i_2}^2} \Big( \frac{1}{n} \big( E_{i_1\cdot} E_{i_2\cdot}^T - E[E_{i_1\cdot} E_{i_2\cdot}^T] \big) \Big) L_{i_1\cdot}^T L_{i_2\cdot} \Big) H L_{i\cdot}^T
 - L_{i\cdot} H \sum_{i_1=1}^N \frac{\hat{\sigma}_{i_1}^2 - \sigma_{i_1}^2}{\hat{\sigma}_{i_1}^4} L_{i_1\cdot}^T L_{i_1\cdot} H L_{i\cdot}^T \\
&\quad - 2 L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} \Big(\frac{1}{n} E F^T\Big) L_{i\cdot}^T
 + 2 L_{i\cdot} H\, \frac{\hat{\sigma}_i^2 - \sigma_i^2}{\hat{\sigma}_i^2}\, L_{i\cdot}^T \\
&\quad - 2 L_{i\cdot} H \Big( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \Big( \frac{1}{n} \big( E_{i_1\cdot} E_{i\cdot}^T - E[E_{i_1\cdot} E_{i\cdot}^T] \big) \Big) L_{i_1\cdot}^T \Big)
 + 2 L_{i\cdot} H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L) \Big(\frac{1}{n} F E_{i\cdot}^T\Big) \sigma_i
\end{aligned}
\tag{A.2}
\]
Also, we have the following approximations:
\[
\big\| H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L) \big\|_F = O_p(n^{-1}) + O_p(n^{-1/2} N^{-1/2}) \tag{A.3}
\]
\[
\Big\| H \Big( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \Big( \frac{1}{n} \big( E_{i_1\cdot} E_{i\cdot}^T - E[E_{i_1\cdot} E_{i\cdot}^T] \big) \Big) L_{i_1\cdot}^T \Big) \Big\|_F = O_p(N^{-1/2} n^{-1/2}) + O_p(n^{-1}) \tag{A.4}
\]
\[
\Big\| H \Big( \sum_{i_1=1}^N \sum_{i_2=1}^N \frac{\sigma_{i_1}\sigma_{i_2}}{\hat{\sigma}_{i_1}^2 \hat{\sigma}_{i_2}^2} \Big( \frac{1}{n} \big( E_{i_1\cdot} E_{i_2\cdot}^T - E[E_{i_1\cdot} E_{i_2\cdot}^T] \big) \Big) L_{i_1\cdot}^T L_{i_2\cdot} \Big) H \Big\|_F = O_p(N^{-1} n^{-1/2}) + O_p(n^{-1}) \tag{A.5}
\]
\[
\frac{1}{n} \big\| H \hat{L}^T \hat{\Sigma}^{-1} \Sigma^{1/2} E F^T \big\|_F = O_p(n^{-1/2} N^{-1/2}) + O_p(n^{-1}) \tag{A.6}
\]
\[
\Big\| H \sum_{i_1=1}^N \frac{\hat{\sigma}_{i_1}^2 - \sigma_{i_1}^2}{\hat{\sigma}_{i_1}^4} L_{i_1\cdot}^T L_{i_1\cdot} H \Big\|_F = O_p(N^{-1} n^{-1/2}) \tag{A.7}
\]
Proof. Comparing these results with Equation (A.14) of Bai and Li (2012a), and with Equation (B.9) and
the statements of Lemma C.1 in Bai and Li (2012b), we only need to prove (A.3) and (A.6).
To show (A.6), one just needs to apply H_N = O_p(1) (Bai and Li, 2012a, Corollary A.1) and
the identification condition M_F = I_r to simplify Lemma C.1(e) of Bai and Li (2012b) using the central
limit theorem.
To prove (A.3), notice that under our conditions (or the IC3 condition of Bai and Li (2012a)),
the left hand side of Equation (A.13) in Bai and Li (2012a) is actually 0, as both the terms M_{ff}
and M^*_{ff} in their notation are exactly I_r. Also, H \hat{L}^T \hat{\Sigma}^{-1} L = I_r + o_p(1) from Bai and Li (2012a,
Corollary A.1). Thus, (A.3) holds by applying (A.5), (A.6) and (A.7) to Equation (A.13) of Bai and
Li (2012b).
We are now ready to prove Theorem 2.1.4.
First, notice that we only need to prove the result for the non-random factor score model. For the random
factor score model, we can condition on the factor scores and rotate the factor loadings
and scores to satisfy the identifiability condition that L^T \Sigma^{-1} L is diagonal. The rotation matrix has
size r \times r and thus does not affect the uniform convergence rate.
Based on equation (F.1) in Bai and Li (2012b), we have
\[
\sqrt{n}\,(\hat{L}_{i\cdot} - L_{i\cdot}) = \frac{1}{\sqrt{n}} \sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T + o_p(1). \tag{A.8}
\]
Now we prove (2.8). Let \hat{L}_{i\cdot} - L_{i\cdot} = b_{1i} + b_{2i} + \cdots + b_{10,i}, where b_{ki} represents the kth term on
the right hand side of (A.1). Also, let \hat{\sigma}_i^2 - \sigma_i^2 = a_{1i} + a_{2i} + \cdots + a_{10,i}, where a_{ki} represents the kth
term on the right hand side of (A.2).
By applying (A.3), (A.4), (A.5), (A.6) and (A.7) and the boundedness of L, we immediately
get \max_i |b_{ki}| = o_p(n^{-1/2}) for k \ne 8, 10 and \max_i |a_{ki}| = o_p(n^{-1/2}) for k \ne 1, 2, 8, 9, 10. Using
the independence of the noise, the boundedness of \sigma_i and the exponential-decay tail assumption, we
find that \max_i |b_{8i}| = O_p(\sqrt{\log N / n}) and \max_i |a_{ki}| = O_p(\sqrt{\log N / n}) for k = 1, 10 by simply using
the central limit theorem.
Next, we show the following facts under the assumption that \log N / \sqrt{n} \to 0: for each s =
1, 2, \cdots, r,
\[
\max_{i=1,2,\cdots,N} \frac{1}{nN} \Big| \sum_{i_1=1}^N L_{i_1 s} \sum_{j=1}^n \big[ e_{i_1 j} e_{ij} - E[e_{i_1 j} e_{ij}] \big] \Big| = o_p(n^{-1/2}), \quad \text{and} \tag{A.9}
\]
\[
\max_{i=1,2,\cdots,N} \frac{1}{n^2 N} \sum_{i_1=1}^N \Big( \sum_{j=1}^n \big[ e_{i_1 j} e_{ij} - E[e_{i_1 j} e_{ij}] \big] \Big)^2 = o_p(n^{-1/2}). \tag{A.10}
\]
To prove (A.9), we only need to show \max_i \frac{1}{nN} \big| \sum_{i_1 \ne i} \sum_{j=1}^n L_{i_1 s} e_{i_1 j} e_{ij} \big| = o_p(n^{-1/2}), as the remaining
i_1 = i term is o_p(n^{-1/2}) because of independence. This approximation is proven by the union bound
and the boundedness of L: for any \varepsilon > 0,
\[
\begin{aligned}
\lim_{n,N\to\infty} P\Big( \sqrt{n} \max_{i=1,2,\cdots,N} \frac{1}{nN} \Big| \sum_{i_1\ne i} \sum_{j=1}^n L_{i_1 s} e_{i_1 j} e_{ij} \Big| > \varepsilon \Big)
&\le \lim_{n,N\to\infty} 2N \cdot P\Big( \frac{C}{\sqrt{n}\,N} \sum_{i_1\ne 1} \sum_{j=1}^n e_{i_1 j} e_{1j} > \varepsilon \Big) \\
&= \lim_{n,N\to\infty} 2N \cdot P\Big( \frac{1}{\sqrt{n}} \sum_{j=1}^n e_{1j} \Big( \frac{1}{\sqrt{N-1}} \sum_{i_1\ne 1} e_{i_1 j} \Big) > \frac{\varepsilon}{C} \frac{N}{\sqrt{N-1}} \Big) \\
&\le \lim_{n,N\to\infty} 2N \cdot E\Big[ \Big( \frac{1}{\sqrt{n}} \sum_{j=1}^n e_{1j} \Big( \frac{1}{\sqrt{N-1}} \sum_{i_1\ne 1} e_{i_1 j} \Big) \Big)^4 \Big] \Big/ \Big( \frac{\varepsilon}{C} \frac{N}{\sqrt{N-1}} \Big)^4 = 0.
\end{aligned}
\]
To see why the last inequality holds, note that (N-1)^{-1/2} \sum_{i_1 \ne 1} e_{i_1 j} \sim N(0, 1) is independent of e_{1j};
thus the fourth moment of n^{-1/2} \sum_{j=1}^n e_{1j} \big( (N-1)^{-1/2} \sum_{i_1 \ne 1} e_{i_1 j} \big) is bounded, which enables us to use
the Markov inequality. To prove (A.10), we start with the same union bound as for (A.9):
\[
\begin{aligned}
\lim_{n,N\to\infty} P\Big( \sqrt{n} \max_{i=1,2,\cdots,N} \frac{1}{n^2 N} \sum_{i_1\ne i} \Big( \sum_{j=1}^n e_{i_1 j} e_{ij} \Big)^2 > \varepsilon \Big)
&\le \lim_{n,N\to\infty} N \cdot P\Big( \frac{\sqrt{n}}{n^2 N} \sum_{i_1=2}^N \Big( \sum_{j=1}^n e_{i_1 j} e_{1j} \Big)^2 > \varepsilon \Big) \\
&\le \lim_{n,N\to\infty} 2N^2 \cdot P\Big( \frac{1}{n} \sum_{j=1}^n e_{2j} e_{1j} > \sqrt{\varepsilon}\, n^{-1/4} \Big) \\
&\le \lim_{n,N\to\infty} 2N^2 \exp(-\sqrt{n}\,\varepsilon_2) = 0,
\end{aligned}
\]
where \varepsilon_2 is some positive constant. The last inequality holds because, by Lemma A.0.1, e_{1j} e_{2j} is a
sub-exponential variable, so we can use the Bernstein inequality for sub-exponential variables to bound
the tail probability. The last limit holds as we assume \log N / \sqrt{n} \to 0.
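For concreteness, the Bernstein-type bound invoked here can be written as follows (a standard formulation with generic constants of our choosing; the thesis does not display it): if X_1, \cdots, X_n are i.i.d. mean-zero sub-exponential variables with parameter b, then for some absolute constant c > 0,
\[
P\Big( \Big| \frac{1}{n} \sum_{j=1}^n X_j \Big| > t \Big) \le 2 \exp\Big( -c\, n \min\Big( \frac{t^2}{b^2}, \frac{t}{b} \Big) \Big).
\]
Taking X_j = e_{1j} e_{2j} and t = \sqrt{\varepsilon}\, n^{-1/4} lands in the quadratic regime and yields a bound of the form \exp(-\varepsilon_2 \sqrt{n}) used above.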
Equation (A.9) directly implies that
\[
\max_{i=1,\cdots,N} \Big| H \Big( \sum_{i_1=1}^N L_{i_1\cdot}^T \frac{1}{n} \sum_{j=1}^n \big[ e_{i_1 j} e_{ij} - E(e_{i_1 j} e_{ij}) \big] \Big) \Big| = o_p(n^{-1/2})
\]
as H = O_p(N^{-1}). Using (A.10) and N^{-1} \sum_i \|\hat{L}_{i\cdot} - L_{i\cdot}\|_2^2 = O_p(n^{-1}) from Theorem 2.1.3, we get by
the Cauchy–Schwarz inequality:
\[
\max_{i=1,\cdots,N} \Big| H \Big( \sum_{i_1=1}^N (\hat{L}_{i_1\cdot} - L_{i_1\cdot})^T \frac{1}{n} \sum_{j=1}^n \big[ e_{i_1 j} e_{ij} - E(e_{i_1 j} e_{ij}) \big] \Big) \Big| = o_p(n^{-1}).
\]
By writing \hat{L}_{i_1\cdot} = (\hat{L}_{i_1\cdot} - L_{i_1\cdot}) + L_{i_1\cdot} and using the boundedness of both \hat{\sigma}_i and \sigma_i,
\[
\max_{i=1,\cdots,N} \Big| H \Big( \sum_{i_1=1}^N \frac{\sigma_{i_1}\sigma_i}{\hat{\sigma}_{i_1}^2} \hat{L}_{i_1\cdot}^T \frac{1}{n} \sum_{j=1}^n \big[ e_{i_1 j} e_{ij} - E(e_{i_1 j} e_{ij}) \big] \Big) \Big| = o_p(n^{-1/2}), \tag{A.11}
\]
which indicates that \max_i |a_{9i}| = o_p(n^{-1/2}).
To bound the remaining terms, we use the fact that
\[
\max_{i=1,\cdots,N} \|\hat{L}_{i\cdot}\|_2 = O_p(1). \tag{A.12}
\]
To see this, first notice that because of the boundedness of \hat{\sigma}_i and \sigma_i and the fact that H = O_p(N^{-1}),
we have \max_i |b_{10,i}| = O_p(N^{-1} \max_i \|\hat{L}_{i\cdot}\|_2). Combining the previous results on (A.1), we have
\max_i \|\hat{L}_{i\cdot} - L_{i\cdot}\|_2 = O_p(\sqrt{\log N / n}) + o_p(\max_i \|\hat{L}_{i\cdot}\|_2), which indicates that \max_i \|\hat{L}_{i\cdot}\|_2 = O_p(1). Thus,
\max_i |a_{8i}| = o_p(\max_i |\hat{\sigma}_i^2 - \sigma_i^2|) is negligible, and \max_i \|\hat{L}_{i\cdot} - L_{i\cdot}\|_2 = O_p(\sqrt{\log N / n}) + o_p(\max_i |\hat{\sigma}_i^2 - \sigma_i^2|).
The latter conclusion also indicates that \max_i |a_{2i}| = O_p(\sqrt{\log N / n}) + o_p(\max_i |\hat{\sigma}_i^2 - \sigma_i^2|). As a
consequence, the second claim in (2.8) holds.
Finally, to prove (2.9), we have actually already shown that \max_i \|\hat{L}_{i\cdot} - L_{i\cdot} - b_{8i}\| = o_p(n^{-1/2}).
Then,
\[
\begin{aligned}
\max_{i=1,2,\cdots,N} \Big\| \hat{L}_{i\cdot} - L_{i\cdot} - \frac{1}{n} \sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Big\|
&\le \max_{i=1,2,\cdots,N} \big\| \hat{L}_{i\cdot} - L_{i\cdot} - b_{8i} \big\|
 + \max_{i=1,2,\cdots,N} \Big\| b_{8i} - \frac{1}{n} \sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Big\| \\
&\le o_p(n^{-1/2}) + \big\| H \hat{L}^T \hat{\Sigma}^{-1} (\hat{L} - L) \big\|_F \max_{i=1,2,\cdots,N} \Big\| \frac{1}{n} \sum_{j=1}^n \sigma_i e_{ij} F_{\cdot j}^T \Big\| = o_p(n^{-1/2}).
\end{aligned}
\]
Thus, (2.9) holds.
Bibliography
S. C. Ahn and A. R. Horenstein. Eigenvalue ratio test for the number of factors. Econometrica, 81
(3):1203–1227, 2013.
L. Alessi, M. Barigozzi, and M. Capasso. Improved penalization for determining the number of
factors in approximate factor models. Statistics & Probability Letters, 80(23):1806–1813, 2010.
Y. Amemiya, W. A. Fuller, and S. G. Pantula. The asymptotic distributions of some estimators for
a factor analysis model. Journal of Multivariate Analysis, 22(1):51–64, 1987.
D. Amengual and M. W. Watson. Consistent estimation of the number of dynamic factors in a large
N and T panel. Journal of Business & Economic Statistics, 25(1):91–96, 2007.
J. C. Anderson and D. W. Gerbing. Structural equation modeling in practice: A review and recom-