Robust Likelihood Cross Validation for Kernel Density Estimation
Ximing Wu∗
Abstract
Likelihood cross validation for kernel density estimation is known to be sensitive to extreme observations and heavy-tailed distributions. We propose a robust likelihood-based cross validation method to select bandwidths in multivariate density estimation. We derive this bandwidth selector within the framework of robust maximum likelihood estimation and establish its connection to minimum density power divergence estimation. This method effects a smooth transition from likelihood cross validation for non-extreme observations to least squares cross validation for extreme observations, thereby combining the efficiency of likelihood cross validation and the robustness of least squares cross validation. An automatic transition threshold is suggested. We demonstrate the finite sample performance and practical usefulness of the proposed method via Monte Carlo simulations and empirical applications to a British income dataset and a Chinese air pollution dataset.
Key Words: Multivariate Density Estimation; Bandwidth Selection; Likelihood Cross Val-
idation; Robust Maximum Likelihood
∗Department of Agricultural Economics, Texas A&M University, College Station, TX 77843, USA; [email protected]
1 Introduction
The kernel density estimator (KDE) has been the workhorse of nonparametric density estimation for decades. Consider an I.I.D. sample of d-dimensional random vectors $\{X_i\}_{i=1}^n$ from an absolutely continuous distribution F defined on $\mathcal{X}$ with density f. In this study, we are concerned with the following product KDE of multivariate densities:
$$f(x;h) = \frac{1}{n}\sum_{i=1}^n K_h(x - X_i) \equiv \frac{1}{n}\sum_{i=1}^n \left\{ \prod_{s=1}^d K_{h_s}(x_s - X_{i,s}) \right\}, \qquad (1)$$
where $x = (x_1, \ldots, x_d)'$, $K : \mathbb{R} \to \mathbb{R}_+$ is taken to be a univariate density function, $K_h(\cdot) = K(\cdot/h)/h$, and $h = (h_1, \ldots, h_d)'$ is a positive vector of bandwidths. Kernel estimation depends crucially on the bandwidth. There exist two major approaches to bandwidth selection:
the plug-in approach and the classical approach. Readers are referred to Scott (1992) and
Wand and Jones (1995) for general overviews of KDE and Park and Marron (1990), Sain
et al. (1994), Jones et al. (1996), and Loader (1999) for in-depth examinations of bandwidth
selection.
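To fix ideas, here is a minimal sketch of the product KDE (1) with a Gaussian kernel; it is our own illustration in Python (assuming NumPy), not code from the paper.

```python
import numpy as np

def product_kde(x, X, h):
    """Evaluate the product kernel density estimate (1) at a point x.

    X is an (n, d) sample, h a length-d bandwidth vector, and the
    univariate kernel K is taken to be the standard Gaussian density.
    """
    u = (x - X) / h                               # (n, d) standardized differences
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # univariate Gaussian kernel values
    return np.mean(np.prod(K / h, axis=1))        # average of the product kernels K_h

# toy usage on a simulated bivariate sample
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(product_kde(np.zeros(2), X, h=np.array([0.4, 0.4])))
```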
This study focuses on the method of Cross Validation (CV), which is a member of the
classical approach and one of the most commonly used methods of bandwidth selection. Some
plug-in methods, cf. Sheather and Jones (1991), are known to provide excellent performance.
However, these plug-in methods often require some complicated derivations and preliminary
estimates, and general-purpose plug-in methods for multivariate densities are not available in the literature. In contrast, CV entails neither complicated derivations nor preliminary estimates.
Furthermore, it works for univariate and multivariate densities alike and is suggested to be
advantageous for multivariate densities (Sain et al. (1994)). See also Loader (1999) on the
advantages of CV methods over plug-in approaches.
Habbema et al. (1974) introduced the Likelihood Cross Validation (LCV), which is defined by

$$\max_h \frac{1}{n}\sum_{i=1}^n \ln f_i(h), \qquad (2)$$
where $f_i(h) = \frac{1}{n-1}\sum_{j\neq i} K_h(X_i - X_j)$ is the leave-one-out density estimate. Another
popular method, the Least Squares Cross Validation (LSCV), proposed by Rudemo (1982)
and Bowman (1984), is given by
$$\min_h \int_{\mathcal{X}} f^2(x;h)\,dx - \frac{2}{n}\sum_{i=1}^n f_i(h). \qquad (3)$$
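For concreteness, the sketch below (our own illustration, assuming NumPy and SciPy; the helper names are ours) computes the leave-one-out estimates and the criteria (2) and (3) for the product Gaussian KDE. The integral in (3) is available in closed form here because the convolution of two N(0, h²) kernels is N(0, 2h²).

```python
import numpy as np
from scipy.stats import norm

def loo_density(X, h):
    """Leave-one-out estimates f_i(h) for the product Gaussian KDE."""
    n, d = X.shape
    u = (X[:, None, :] - X[None, :, :]) / h       # (n, n, d) pairwise differences
    K = np.prod(norm.pdf(u) / h, axis=2)          # product kernel matrix
    np.fill_diagonal(K, 0.0)                      # drop own observation
    return K.sum(axis=1) / (n - 1)

def lcv(X, h):
    """LCV objective (2); larger is better."""
    return np.mean(np.log(loo_density(X, h)))

def lscv(X, h):
    """LSCV objective (3); smaller is better."""
    n, d = X.shape
    u = X[:, None, :] - X[None, :, :]
    s = h * np.sqrt(2)                            # bandwidth of the convolved kernel
    int_f2 = np.mean(np.prod(norm.pdf(u / s) / s, axis=2))
    return int_f2 - 2.0 * np.mean(loo_density(X, h))
```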
Each method has its limitations. LSCV is known to have high variability and a tendency to undersmooth the data. It is also computationally more expensive than LCV, especially for multivariate densities, due to the calculation of the integrated squared density in (3). On the other hand, LCV suffers from one critical drawback: sensitivity to extreme observations and
heaviness of the underlying distribution; cf. Schuster and Gregory (1981) and Hall (1987)
on this detrimental tail effect. Some possible remedies have been proposed to alleviate
the tail problem of LCV. Marron (1985, 1987) explored trimming of extreme observations.
Hall (1987) suggested using a heavy-tailed kernel function for heavy-tailed densities. This
method, however, performs poorly for thin- or moderate-tailed densities. All these studies
focus on univariate densities.
This study proposes a robust alternative to LCV for multivariate kernel density estima-
tion. The key innovation of our method is to replace the logarithm function (in the LCV
objective) with a function that is robust against extreme observations. In particular, we
consider the following piecewise function: for x > 0,
$$\ln_\star(x;a) = \begin{cases} \ln x, & \text{if } x \geq a \\ \ln a - 1 + x/a, & \text{if } x < a, \end{cases} \qquad (4)$$
where a ≥ 0. For x < a, we replace ln x with its linear approximation at a, which is larger than ln x by the concavity of the logarithm function; see in Figure 1 an illustration of ln⋆(x; a = 0.1) versus ln x. These two curves coincide for x ≥ 0.1; while ln x goes to minus infinity rapidly as x → 0, ln⋆(x; a) declines linearly for x < 0.1, effectively mitigating the detrimental tail effect associated with LCV. We therefore name our method Robust Likelihood Cross Validation (RLCV).
Figure 1: ln⋆(x; a = 0.1), depicted by the solid line, and ln x, depicted by the dashed line.
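In code, the robust logarithm is a one-liner. The sketch below (ours, assuming NumPy) implements (4) and guards the unused logarithm branch against evaluating the log at values near zero.

```python
import numpy as np

def ln_star(x, a):
    """Robust logarithm ln_star(x; a) of equation (4): ln x for x >= a,
    and its tangent line at a, ln(a) - 1 + x/a, for x < a."""
    x = np.asarray(x, dtype=float)
    # np.maximum keeps the masked-out log branch finite for x near zero
    return np.where(x >= a, np.log(np.maximum(x, a)), np.log(a) - 1.0 + x / a)

# e.g. ln_star(0.05, 0.1) = ln(0.1) - 1 + 0.5 ≈ -2.803, versus ln(0.05) ≈ -2.996
```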
The proposed RLCV is defined by
$$\max_h \frac{1}{n}\sum_{i=1}^n \ln_\star f_i(h) - b_\star(h),$$
where b⋆(h) is a bias correction term to be given below. We show that LSCV, LCV and
RLCV can all be obtained within a unifying framework of robust maximum likelihood es-
timation. We also establish a connection between CV-based bandwidth selection and the
minimum density power divergence estimator of Basu et al. (1998). RLCV is in spirit close
to Huber’s (1964) robustification of location estimation, which replaces the least squares
objective function $\rho(t) = \frac{1}{2}t^2$ with its linear approximation $\rho(t) = k|t| - \frac{1}{2}k^2$ when $|t| > k$ for some k ≥ 0. Huber's estimator nests the sample mean (when k → ∞) and the sample median (when
k = 0) as limiting cases. Similarly, RLCV can be viewed as a hybrid bandwidth selector
that nests LCV (when a = 0) and LSCV (when a→∞). Loosely speaking, RLCV conducts
LSCV on extreme observations, avoiding the tail sensitivity of LCV; at the same time with
a small a, LCV is undertaken on the majority of observations, entailing little efficiency loss.
Therefore it essentially combines the efficiency of LCV and the robustness of LSCV while
eschewing their respective drawbacks.
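The analogy is easy to make concrete. A sketch of Huber's loss (ours, assuming NumPy), mirroring the piecewise construction of ln⋆:

```python
import numpy as np

def huber_rho(t, k):
    """Huber's (1964) loss: quadratic for |t| <= k, continued
    linearly as k|t| - k^2/2 beyond the threshold k."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= k, 0.5 * t**2, k * np.abs(t) - 0.5 * k**2)
```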
To make the proposed bandwidth selector fully automatic, we further propose a simple
rule to select the threshold a in ln⋆(·; a):

$$a_n = |\Sigma|^{-1/2}(2\pi)^{-d/2}\,\Gamma(d/2)\,(\ln n)^{1-d/2}\,n^{-1},$$
where Σ is the variance of X and Γ is the Gamma function. No preliminary estimates or
additional tuning parameters are required. We conduct a series of Monte Carlo simulations
on densities with varying degrees of tail-heaviness and dimensions. Our results demonstrate
good finite sample performance of RLCV relative to that of LCV and LSCV. RLCV per-
forms similarly to LCV for thin- and moderate-tailed densities and clearly outperforms LCV
for heavy-tailed densities. It also generally performs better than LSCV. We illustrate the
usefulness of RLCV via applications to a British income dataset and a Beijing PM2.5 air pollution dataset.
2 Preliminaries
In this section we present a brief introduction to the approach of robust maximum likelihood
estimation. We shall show in the next section that it provides a unifying framework to
explore bandwidth selection via cross validation, including LSCV, LCV and the proposed
RLCV.
Given I.I.D. observations $\{X_i\}_{i=1}^n$ from an unknown density f defined on $\mathcal{X}$, let us consider a statistical model $\{g(x;\theta) : \theta \in \Theta\}$ of f, where Θ is the parameter space of the finite-dimensional parameter θ. Eguchi and Kano (2001) presented a family of robust MLEs associated with an increasing, convex and differentiable function $\Psi : \mathbb{R} \to \mathbb{R}$. Let $l(x;\theta) = \ln[g(x;\theta)]$. They
defined the Ψ-likelihood function as
$$L_\Psi(\theta) = \frac{1}{n}\sum_{i=1}^n \Psi(l(X_i;\theta)) - b_\Psi(\theta),$$
where
$$b_\Psi(\theta) = \int_{\mathcal{X}} \Xi(l(x;\theta))\,dx, \qquad \Xi(z) = \int_{-\infty}^{z} \exp(s)\,\frac{\partial \Psi(s)}{\partial s}\,ds.$$
Since Ψ generally transforms the log likelihood nonlinearly, a bias correction term bΨ is
introduced to ensure the Fisher consistency of the estimator. The maximum Ψ-likelihood estimator is then given by

$$\max_{\theta\in\Theta} L_\Psi(\theta).$$
Let $\psi(z) = \partial\Psi(z)/\partial z$ and let the score function be $S(x;\theta) = \partial l(x;\theta)/\partial\theta$. The estimating equation associated with the Ψ-estimator is given by
$$\frac{1}{n}\sum_{i=1}^n \psi(l(X_i;\theta))\,S(X_i;\theta) = \frac{\partial}{\partial\theta} b_\Psi(\theta), \qquad (5)$$

with

$$\frac{\partial}{\partial\theta} b_\Psi(\theta) = \int_{\mathcal{X}} \psi(l(x;\theta))\,S(x;\theta)\,g(x;\theta)\,dx.$$
Equation (5) can be rewritten as
$$\int_{\mathcal{X}} \psi(l(x;\theta))\,S(x;\theta)\,d\left(F_n(x) - G(x;\theta)\right) = 0,$$

where $F_n$ and $G(\cdot;\theta)$ are the empirical CDF of the sample and the CDF associated with $g(\cdot;\theta)$, respectively. Evidently, if f is a member of the family $\{g(\cdot;\theta)\}$, this estimating equation is unbiased.
By the monotonicity and convexity of Ψ, ψ(·) ≥ 0 and ψ′(·) ≥ 0. Thus $\psi(l(X_i;\theta))$ can be interpreted as the implicit weight assigned to $X_i$, which increases with its log-density.
Extreme observations are weighted down in the estimating equation, providing desirable
robustness. Note that the classical MLE is a special case of Ψ-estimator with Ψ(z) = z and
ψ(z) = 1. It is efficient in the sense that all observations from an I.I.D. sample are assigned
equal weights in the estimating equation. On the other hand MLE is not robust: since the
log-density of an observation tends to minus infinity as its density approaches zero, extreme
observations exert unduly large influence on the estimation.
One member of this family, termed the Ψβ function by Eguchi and Kano (2001), turns out to be particularly useful for the present study. It is defined as
$$\Psi_\beta(z) = \frac{\exp(\beta z) - 1}{\beta}, \qquad \beta > 0.$$
The corresponding Ψβ estimator is given by
$$\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n \frac{g^\beta(X_i;\theta)}{\beta} - \int_{\mathcal{X}} \frac{g^{\beta+1}(x;\theta)}{\beta+1}\,dx. \qquad (6)$$
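As an illustration of (6), the sketch below (our own, in Python with NumPy and SciPy; the function names and data are ours) fits a univariate Gaussian model by maximizing the Ψβ likelihood, using the closed-form integral of a Gaussian density raised to a power. A small fraction of gross outliers barely moves the fit.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_psi_beta_likelihood(theta, X, beta):
    """Negative Psi_beta likelihood (6) for a univariate Gaussian model
    g(x; mu, sigma), parametrized by (mu, log sigma)."""
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    g = norm.pdf(X, mu, sigma)
    # closed form: int g^(1+beta) dx = (1+beta)^(-1/2) (2 pi sigma^2)^(-beta/2)
    int_g = (1 + beta) ** -0.5 * (2 * np.pi * sigma**2) ** (-beta / 2)
    return -(np.mean(g**beta) / beta - int_g / (1 + beta))

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 95), rng.normal(8, 1, 5)])  # 5% outliers
fit = minimize(neg_psi_beta_likelihood, x0=[np.median(X), 0.0], args=(X, 0.5))
print(fit.x[0], np.exp(fit.x[1]))  # robust location and scale estimates
```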
The Ψβ estimator is closely related to the minimum density power divergence estimator
by Basu et al. (1998). Let g be a generic statistical model for f; the density power divergence is defined as
$$\Delta_\beta(g,f) = \int_{\mathcal{X}} \left\{ g^{1+\beta}(x) - \frac{\beta+1}{\beta}\,g^\beta(x) f(x) + \frac{1}{\beta}\,f^{1+\beta}(x) \right\} dx, \qquad \beta > 0, \qquad (7)$$
where the third term is constant. It nests the Kullback-Leibler divergence as a limiting case:
$$\Delta_{\beta\to 0}(g,f) = \int_{\mathcal{X}} f(x)\,\ln\left\{ f(x)/g(x) \right\} dx. \qquad (8)$$
It is seen that minimizing the density power divergence (7) with respect to g is equivalent
to maximizing the Ψβ likelihood (6).
The density power divergence is appealing as it is linear in the unknown density f and
does not require a separate nonparametric estimate of f . To see this, note that the second
term in (7) is linear in f and can be readily estimated by its sample analog
$$\frac{1}{n}\sum_{i=1}^n \frac{(\beta+1)\,g^\beta(X_i)}{\beta}.$$
Most density divergence indices do not afford this advantage. For instance, Csiszár's divergence, with the exception of the Kullback-Leibler divergence, is nonlinear in f; cf. Beran's
(1977) minimum Hellinger distance estimator of a parametric model, which requires an ad-
ditional nonparametric estimate of f . Interested readers are referred to Basu et al. (2011)
for a general treatment of minimum density power divergence estimation.
3 Robust Likelihood Cross Validation
3.1 Formulation
RLCV is motivated by replacing the logarithm function in the objective function of LCV with
its robust alternative ln⋆ to alleviate sensitivity to extreme observations. A naive estimator that maximizes $\sum_{i=1}^n \ln_\star(f_i(h))$, however, does not deliver consistency. In this section we
show that bandwidths selected using LCV or LSCV can be interpreted as robust maximum
likelihood estimates and also derive RLCV within the robust MLE framework.
We recognize that the Ψβ estimator provides a unifying framework to explore cross validation
methods of KDE. The basic idea is to select the bandwidth that maximizes the Ψβ likelihood
of a KDE. In particular, we replace in (6) the parametric model g(x;θ) with the kernel
estimate f(x;h) and the summand g(Xi;θ) with its leave-one-out counterpart fi(h) (to
avoid overfitting), yielding:
$$\max_h \frac{1}{n}\sum_{i=1}^n \frac{f_i^\beta(h)}{\beta} - \int_{\mathcal{X}} \frac{f^{\beta+1}(x;h)}{\beta+1}\,dx.$$
It follows that LSCV and LCV can be viewed as a special case and a limiting case, respectively, of this family. Setting
β = 1, we obtain
$$h_1 = \arg\max_h \frac{1}{n}\sum_{i=1}^n f_i(h) - \int_{\mathcal{X}} \frac{f^2(x;h)}{2}\,dx,$$
which coincides with LSCV given in (3). Alternatively letting β → 0 yields
$$h_0 = \arg\max_h \frac{1}{n}\sum_{i=1}^n \ln f_i(h) - \int_{\mathcal{X}} f(x;h)\,dx,$$
where the second term is constant at unity. This is equivalent to LCV given in (2).
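Putting the cases together, a sketch of the unified criterion (ours; it reuses the loo_density helper from the earlier LCV/LSCV sketch):

```python
import numpy as np
from scipy.stats import norm

def cv_beta(X, h, beta):
    """Unified Psi_beta cross validation criterion, to be maximized.
    beta -> 0 recovers LCV and beta = 1 recovers LSCV."""
    fi = loo_density(X, h)                  # leave-one-out estimates, defined earlier
    if beta == 0:                           # limiting case: LCV (constant term dropped)
        return np.mean(np.log(fi))
    if beta == 1:                           # LSCV, with the exact integral term
        u = X[:, None, :] - X[None, :, :]
        s = h * np.sqrt(2)
        int_f2 = np.mean(np.prod(norm.pdf(u / s) / s, axis=2))
        return np.mean(fi) - int_f2 / 2.0
    raise NotImplementedError("general beta requires numerical integration")
```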
Furthermore, the Ψβ estimator provides a natural framework to derive the proposed
RLCV. Define
$$\Psi_\star(z) = \begin{cases} z, & \text{if } z \geq \ln a \\ \ln a - 1 + \exp(z)/a, & \text{if } z < \ln a, \end{cases}$$
and the corresponding
$$\Xi_\star(z) = \begin{cases} \exp(z), & \text{if } z \geq \ln a \\ \exp(2z)/(2a), & \text{if } z < \ln a. \end{cases}$$
Since Ψ⋆ is increasing, convex and piecewise differentiable, a robust MLE based on Ψ⋆ inherits the properties of the Ψ-estimator. In particular, we define the robustified likelihood associated
with ln⋆ as follows:

$$L_\star(h) = \frac{1}{n}\sum_{i=1}^n \ln_\star(f_i(h)) - b_\star(h),$$

where

$$b_\star(h) = \int_{\mathcal{X}} I(f(x;h) \geq a)\,f(x;h)\,dx + \frac{1}{2a}\int_{\mathcal{X}} I(f(x;h) < a)\,f^2(x;h)\,dx.$$
The resulting RLCV bandwidth selector is then given by
$$h_\star = \arg\max_h L_\star(h). \qquad (9)$$
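A direct, if crude, univariate implementation of (9) is sketched below (ours; it reuses product_kde, loo_density and ln_star from the earlier sketches). The bias correction b⋆(h) is approximated by a Riemann sum over a uniform grid, and the bandwidth is found by one-dimensional search.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rlcv_objective(h, X, a, grid):
    """RLCV criterion L_star(h) of (9) for univariate data X."""
    h = np.array([float(h)])
    fi = loo_density(X[:, None], h)                   # leave-one-out estimates
    f_grid = np.array([product_kde(np.array([g]), X[:, None], h) for g in grid])
    integrand = np.where(f_grid >= a, f_grid, f_grid**2 / (2 * a))
    b_star = integrand.sum() * (grid[1] - grid[0])    # Riemann sum for b_star(h)
    return np.mean(ln_star(fi, a)) - b_star

rng = np.random.default_rng(2)
X = rng.standard_t(df=2, size=300)                    # heavy-tailed sample
grid = np.linspace(X.min() - 3.0, X.max() + 3.0, 2000)
res = minimize_scalar(lambda h: -rlcv_objective(h, X, a=0.01, grid=grid),
                      bounds=(0.05, 3.0), method="bounded")
print("RLCV bandwidth:", res.x)
```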
Figure 2: Ψ⋆(z) with a = 0.1, solid; Ψβ→0(z), dashed; Ψβ=1(z), dotted; z = ln f, f ∈ [0.001, 5].
To help readers appreciate the role of Ψ in robust MLE, in Figure 2 we plot Ψβ→0(z), Ψβ=1(z) and Ψ⋆(z) with a = 0.1, where z = ln f signifies the log-density with f ∈ [0.001, 5]. Note that Ψβ→0(z) = ln f, corresponding to LCV, tends rapidly towards minus infinity as f declines. In contrast, Ψβ=1(z) = f − 1, corresponding to LSCV, is linear in f and therefore robust against small densities. Lastly, Ψ⋆(z) coincides with Ψβ→0(z) when f ≥ a but switches to f/a + ln a − 1 when f < a. Thus, like Ψβ=1, Ψ⋆ is linear in f for f < a and robust against small densities.
3.2 Discussions
Define
$$\psi_\star(z;a) = \begin{cases} 1, & \text{if } z \geq \ln a \\ \exp(z)/a, & \text{if } z < \ln a. \end{cases}$$
It follows readily that the estimating equation associated with RLCV is given by
$$\frac{1}{n}\sum_{i=1}^n \psi_\star(\ln f_i(h))\,S(X_i;h) = \int_{\mathcal{X}} \psi_\star(\ln f(x;h))\,S(x;h)\,f(x;h)\,dx. \qquad (10)$$
This estimating equation is asymptotically unbiased provided that f(·;h) is a consistent
estimator of f .
Equation (10) can be rewritten as follows:
$$\frac{1}{n}\sum_{i=1}^n \left\{ I(f_i(h) \geq a) + I(f_i(h) < a)\,f_i(h)/a \right\} S(X_i;h)$$
$$= \int_{\mathcal{X}} \left\{ I(f(x;h) \geq a) + I(f(x;h) < a)\,f(x;h)/a \right\} S(x;h)\,f(x;h)\,dx,$$
which aptly captures the main thrust of RLCV. When $f_i \geq a$, RLCV executes LCV and $\psi_\star(\ln f_i) = 1$; when $f_i < a$, RLCV executes LSCV and $\psi_\star(\ln f_i) = f_i/a < 1$, which tends to zero with $f_i$. Given a small positive value for the threshold a, the majority of observations are
assigned a unitary weight while extreme observations, if any, are assigned smaller weights
proportional to their densities. Thus RLCV effectively combines the efficiency of LCV and
the robustness of LSCV.
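The implied weighting scheme is trivial to compute from the leave-one-out estimates; a sketch (ours, assuming NumPy):

```python
import numpy as np

def psi_star_weights(fi, a):
    """Implicit RLCV weights from (10): unit weight in the LCV regime
    (f_i >= a) and a density-proportional weight f_i / a in the LSCV regime."""
    fi = np.asarray(fi, dtype=float)
    return np.where(fi >= a, 1.0, fi / a)
```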
As discussed in the previous section, the Ψβ estimator can be interpreted as a minimum density power divergence estimator. It follows that CV-based bandwidth selectors can
also be obtained as minimum density power divergence estimators. In particular, LSCV
is obtained by minimizing (7) under β = 1 while LCV is obtained by minimizing (8), the
Kullback-Leibler divergence. Define
$$\Delta_\star(g,f) = \int_{g(x)\geq a} f(x)\,\ln\left\{f(x)/g(x)\right\} dx + \frac{1}{2a}\int_{g(x)<a} \left\{ g^2(x) - 2g(x)f(x) + f^2(x) \right\} dx. \qquad (11)$$
It is seen that RLCV can also be obtained as a minimum density power divergence estimator
by minimizing the hybrid density power divergence (11).
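Noting that the second integrand in (11) is simply (g(x) − f(x))², the hybrid divergence is easy to evaluate numerically; a univariate sketch (ours, on a uniform grid, assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import norm

def hybrid_divergence(g, f, a, grid):
    """Hybrid density power divergence Delta_star (11) between densities
    g and f, approximated by a Riemann sum on a uniform grid."""
    gx, fx = g(grid), f(grid)
    kl_part = np.where(gx >= a, fx * np.log(fx / gx), 0.0)     # KL regime
    l2_part = np.where(gx < a, (gx - fx) ** 2 / (2 * a), 0.0)  # L2 regime
    return (kl_part + l2_part).sum() * (grid[1] - grid[0])

grid = np.linspace(-10.0, 10.0, 4000)
print(hybrid_divergence(norm(0, 1).pdf, norm(0.3, 1.2).pdf, 0.01, grid))
```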
There exists a large body of work on the theoretical properties of cross validation band-
width selection; cf. Hall (1983), Stone (1984), Burman (1985) on LSCV and Hall (1982,
1987) and Chow et al. (1983) on LCV. Marron (1985) considered, for univariate densities, a
modified LCV of the form
$$\max_h \frac{1}{n}\sum_{i=1}^n \ln f_i(h)\,I(s_1 \leq X_i \leq s_2) - \int_{s_1}^{s_2} f(x;h)\,dx, \qquad (12)$$
where f is assumed to be bounded away from zero on the interval [s1, s2]. Marron (1987) proposed a unifying framework to explore the asymptotic optimality of LSCV and the modified LCV (12). Let h be the optimizer of LSCV or modified LCV and define
Average Square Error: $D_A(\hat f, f) = \frac{1}{n}\sum_{i=1}^n \left\{ f(X_i;h) - f(X_i) \right\}^2 w^2(X_i)$

Integrated Square Error: $D_I(\hat f, f) = \int \left\{ f(x;h) - f(x) \right\}^2 w(x)\,dx$

Mean Integrated Square Error: $D_M(\hat f, f) = E\left[ \int \left\{ f(x;h) - f(x) \right\}^2 w(x)\,dx \right]$
where w(x) is a nonnegative weight function. Marron (1987) established, under some mild
regularity conditions, the asymptotic optimality of LSCV and modified LCV as follows:
$$\lim_{n\to\infty} \frac{D(f(h), f)}{\inf_{h\in H_n} D(f(h), f)} = 1 \quad \text{a.s.}, \qquad (13)$$
where D is any of DA, DI or DM , and Hn is a finite set whose cardinality grows algebraically
fast. The asymptotic optimality of LSCV is obtained under w(x) = 1 while setting w(x) =
f−1(x)I(s1 ≤ x ≤ s2) gives the desired result for modified LCV.
Below we show that the asymptotic optimality of RLCV can be established in a similar
manner. Define a hybrid weight function
$$w_\star(x) = f^{-1}(x)\,I(f(x) \geq a) + a^{-1}\,I(f(x) < a). \qquad (14)$$
We then have the following.
Theorem. Under Assumptions A1-A3 given in the Appendix, if h⋆ is given by (9), then

$$\lim_{n\to\infty} \frac{D(f(h_\star), f)}{\inf_{h\in H_n} D(f(h), f)} = 1 \quad \text{a.s.},$$
where D is any of DA, DI or DM with the weight function w given by (14).
Unlike the modified LCV, RLCV does not require that f be bounded away from zero on a compact support, and the error criterion is minimized over the entire support of the underlying density.
4 Specification of Threshold a
RLCV reduces to LCV if we set a = 0. On the other hand, it is equivalent to LSCV under a sufficiently large a. Between these two extremes, the threshold a controls the trade-off between efficiency and robustness inherent in RLCV. In this section, we present a simple
method to select this threshold.
Our method is motivated by Hall’s (1987) investigation into the interplay between tail-
heaviness of the kernel function and that of the underlying density in LCV. Hall (1987)
focused on univariate densities and considered a scenario wherein $f(x) \sim c|x|^{-\alpha}$ as $|x| \to \infty$, with c > 0 and α > 1. Suppose that the kernel function takes the form

$$K(x) = A_2\,\exp(-A_1 |x|^\kappa), \qquad (15)$$
where A1, A2 and κ are positive constants such that K integrates to unity. Hall showed
that if κ > α − 1, LCV selects a bandwidth diverging to infinity and becomes inconsistent.
We employ in this study the Gaussian kernel, which is a member of (15) with κ = 2. Hall
(1987) suggested that LCV with a Gaussian kernel is consistent when the underlying density
is Gaussian or sub-Gaussian (the tails of a sub-Gaussian distribution decay at least as fast
as those of a Gaussian distribution).
In practice, tail-heaviness of the underlying density is usually unknown and often difficult
to estimate, especially for multivariate random variables. We therefore opt for a somewhat
conservative strategy that uses the Gaussian density as the benchmark. Denote by Mn the
extreme observation of an I.I.D. random sample $\{Z_i\}_{i=1}^n$ from the d-dimensional Gaussian distribution N(µ,Σ). We advocate the following simple rule: a = E[φ(Mn; µ,Σ)], where
φ(·;µ,Σ) is the density function of N (µ,Σ). Under this rule, if the estimated kernel den-
sity of a given observation is smaller than the expected density of the sample extremum under
Gaussianity, LCV is deemed vulnerable. When this occurs, RLCV replaces the log-density
with its linear approximation, effectively weighting down its influence.
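The closed-form threshold a_n given in the Introduction can be computed directly from the data; a sketch (ours, with the sample covariance plugged in for Σ and assuming NumPy/SciPy):

```python
import numpy as np
from scipy.special import gamma

def threshold_a(X):
    """Automatic RLCV threshold
    a_n = |Sigma|^(-1/2) (2 pi)^(-d/2) Gamma(d/2) (ln n)^(1 - d/2) / n,
    with Sigma replaced by the sample covariance of the (n, d) data X."""
    n, d = X.shape
    det_sigma = np.linalg.det(np.cov(X, rowvar=False).reshape(d, d))
    return (det_sigma ** -0.5 * (2 * np.pi) ** (-d / 2)
            * gamma(d / 2) * np.log(n) ** (1 - d / 2) / n)
```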
Assuming for simplicity that Σ is non-singular, we define the extremum of a Gaussian