THE GAUSSIAN RADON TRANSFORM AND MACHINE LEARNING

IRINA HOLMES AND AMBAR N. SENGUPTA

Abstract. There has been growing recent interest in probabilistic interpretations of kernel-based methods as well as learning in Banach spaces. The absence of a useful Lebesgue measure on an infinite-dimensional reproducing kernel Hilbert space is a serious obstacle for such stochastic models. We propose an estimation model for the ridge regression problem within the framework of abstract Wiener spaces and show how the support vector machine solution to such problems can be interpreted in terms of the Gaussian Radon transform.

1. Introduction

A central task in machine learning is the prediction of unknown values based on ‘learning’ from a given set of outcomes. More precisely, suppose X is a non-empty set, the input space, and

D = {(p1, y1), (p2, y2), . . . , (pn, yn)} ⊂ X × R    (1.1)

is a finite collection of input values pj together with their corresponding real outputs yj. The goal is to predict the value y corresponding to a yet unobserved input value p ∈ X. The fundamental assumption of kernel-based methods such as support vector machines (SVM) is that the decision or prediction function fλ belongs to a reproducing kernel Hilbert space (RKHS) H over X, corresponding to a positive definite function K : X × X → R (see Section 2.1).

Date: October 2013.
2010 Mathematics Subject Classification. Primary 44A12; Secondary 28C20, 62J07.
Key words and phrases. Gaussian Radon Transform, Ridge Regression, Kernel Methods, Abstract Wiener Space, Gaussian Process.
Irina Holmes is a Dissertation Year Fellow at Louisiana State University, Department of Mathematics; [email protected].
Ambar Niel Sengupta is Professor of Mathematics at Louisiana State University; [email protected].


In particular, in ridge regression, the predictor function fλ is the solution to the minimization problem:

fλ = arg min_{f∈H} ( ∑_{j=1}^{n} (yj − f(pj))^2 + λ‖f‖^2 ),    (1.2)

where λ > 0 is a regularization parameter and ‖f‖ denotes the norm of f in H. The regularization term λ‖f‖^2 penalizes functions that ‘overfit’ the training data.

The problem considered in (1.2) has a unique solution and this solution is a linear combination of the functions Kpj = K(pj, ·). Specifically:

fλ = ∑_{j=1}^{n} cj Kpj    (1.3)

where c ∈ Rn is given by:

c = (KD + λIn)^{-1} y    (1.4)

where KD ∈ Rn×n is the matrix with entries [KD]ij = K(pi, pj), In is the identity matrix of size n and y = (y1, . . . , yn) ∈ Rn is the vector of outputs. We present a geometrical proof of this classic result in Theorem 5.1.
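As a concrete numerical illustration, here is a short Python sketch of (1.3)–(1.4); the Gaussian kernel of Section 2 is used as a stand-in choice of K, and the data and scale parameter are hypothetical:

    import numpy as np

    def gaussian_kernel(p, q, s=1.0):
        # K(p, q) = exp(-|p - q|^2 / (2 s)), the example kernel from Section 2.
        return np.exp(-np.sum((p - q) ** 2) / (2.0 * s))

    def ridge_coefficients(points, y, lam):
        # c = (K_D + lambda I_n)^(-1) y, as in (1.4).
        n = len(points)
        KD = np.array([[gaussian_kernel(points[i], points[j]) for j in range(n)]
                       for i in range(n)])
        return np.linalg.solve(KD + lam * np.eye(n), y)

    def f_lambda(p, points, c):
        # f_lambda(p) = sum_j c_j K_{p_j}(p), as in (1.3).
        return sum(c[j] * gaussian_kernel(points[j], p) for j in range(len(points)))

    points = [np.array([0.0]), np.array([1.0]), np.array([2.5])]   # inputs p_j
    y = np.array([0.0, 1.0, 0.5])                                  # outputs y_j
    c = ridge_coefficients(points, y, lam=0.1)
    print(f_lambda(np.array([1.5]), points, c))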

Recently there has been a surge in interest in probabilistic interpretations of kernel-based learning methods; see for instance [1, 12, 17, 18, 20, 25]. Let us first consider a Bayesian perspective: suppose the data arises as

yj = f(pj) + εj

where f is random, ε1, . . . , εn are independent Gaussian with mean 0 and unknown variance σ^2 > 0, and are independent of f. This gives us the likelihood:

P(y|p, f) = ∏_{j=1}^{n} (1/√(2πσ^2)) e^{−(yj − f(pj))^2/(2σ^2)}

If the RKHS H under consideration is finite-dimensional we have standard Gaussian measure on H, with density function:

P(f) = (1/√(2π))^{dim(H)} e^{−‖f‖^2/2}

with respect to Lebesgue measure on H. This gives rise to the posterior distribution:

P(f | x, y) ∝ e^{−(1/(2σ^2)) ∑_{j=1}^{n} (yj − f(pj))^2 − ‖f‖^2/2}.    (1.5)


The maximum a posteriori (MAP) estimator then yields the same function as in (1.3), with regularization parameter λ = σ^2.

Equivalently, one may assume that the data arises as a sample path of a Gaussian process, perturbed by a Gaussian measurement error:

y(p) = f(p) + ε(p)

where {f(p) : p ∈ X} is a Gaussian process indexed by X with mean 0 and covariance K:

Cov(f(p), f(q)) = K(p, q),

and {ε(p) : p ∈ X} is a Gaussian process independent of f, with mean 0 and covariance σ^2 I: Cov(ε(p), ε(q)) = σ^2 δ_{p,q}. Then {y(p) : p ∈ X} is also a Gaussian process, with mean 0 and covariance K + σ^2 I. The natural choice for the prediction ŷ at a future input value p is the conditional expectation:

ŷ = E[f(p) | y(p1) = y1, . . . , y(pn) = yn].    (1.6)

Then, as we show in Lemma 4.1,

ŷ = [K(p, p1) . . . K(p, pn)] (KD + σ^2 In)^{-1} y = ∑_{j=1}^{n} cj Kpj(p) = fλ(p),

where fλ is again the SVM estimator in (1.3) with regularization parameter λ = σ^2.
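A short numerical sketch of this equivalence in a hypothetical finite-dimensional setting (where the Bayesian computation is unproblematic): with a finite-dimensional feature map Φ, kernel K(p, q) = 〈Φ(p), Φ(q)〉, prior w ∼ N(0, I) and noise variance σ^2, the posterior-mean/MAP prediction agrees with (1.3)–(1.4) for λ = σ^2.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma2 = 5, 3, 0.2
    Phi = rng.normal(size=(n, d))            # rows stand in for Phi(p_1), ..., Phi(p_n)
    phi_new = rng.normal(size=d)             # Phi(p) for a new input p
    y = rng.normal(size=n)

    # Weight-space (Bayesian) solution: posterior mean of w given the data.
    w_map = np.linalg.solve(Phi.T @ Phi + sigma2 * np.eye(d), Phi.T @ y)
    pred_bayes = phi_new @ w_map

    # Function-space (kernel ridge / SVM) solution with lambda = sigma^2.
    KD = Phi @ Phi.T                         # [K(p_i, p_j)]
    k = Phi @ phi_new                        # (K(p_1, p), ..., K(p_n, p))
    pred_ridge = k @ np.linalg.solve(KD + sigma2 * np.eye(n), y)

    print(pred_bayes, pred_ridge)            # equal up to round-off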

Clearly the Bayesian approach does not work whenever H is infinite-dimensional, which is often the case, because of the absence of a useful Lebesgue measure in infinite dimensions. The SVM approach though still goes through in infinite dimensions. Our goal in this article is to show that so does the Gaussian process approach and, moreover, that they remain equivalent. Since the SVM solution in (1.3) is contained in a finite-dimensional subspace, the possibly infinite dimensionality of H is less of a problem in this setting. However, if we want to predict based on a stochastic model for f and H is infinite-dimensional, the absence of Lebesgue measure becomes a significant problem. In particular, we cannot have the desired Gaussian process {f(p) : p ∈ X} on H itself.

The approach we propose is to work within the framework of abstract Wiener spaces, introduced by L. Gross in the celebrated work [8]. The concept of abstract Wiener space was born exactly from this need for a “standard Gaussian measure” in infinite dimensions and has become


the standard framework in infinite-dimensional analysis. We outline the basics of this theory in Section 2.2 and showcase the essentials of the classical Wiener space, as the “original” special case of an abstract Wiener space, in Section 2.4.

Simply put, we construct a Gaussian measure with the desired properties on a larger Banach space B that contains H as a dense subspace; the geometry of this measure is dictated by the inner-product on H. This construction is presented in Section 2.3. We then show in Section 4 how the ridge regression learning problem outlined above can be understood in terms of the Gaussian Radon transform. The Gaussian Radon transform associates to a function f on B the function Gf, defined on the set of closed affine subspaces of H, whose value Gf(L) on an affine subspace L ⊂ H is the integral ∫ f dµL, where µL is a Gaussian measure on the closure L̄ of L in B obtained from the given Gaussian measure on B. This is explained in more detail in Section 3. For more on the Gauss-Radon transform we refer to the works [2, 3, 10, 11, 16].

In Section 5 we present a geometric view. First we present a geometric proof of the representer theorem for the SVM minimization problem and then describe the relationship with the Gaussian Radon transform in geometric terms.

Finally, another area that has recently seen strong activity is learning in Banach spaces; see for instance [6, 21, 24]. Of particular interest are reproducing kernel Banach spaces, introduced in [24], which are special Banach spaces whose elements are functions. The Banach space B we use in Section 4 is a completion of a reproducing kernel Hilbert space, but does not directly consist of functions. A notable exception is the classical Wiener space, where the Banach space is C[0, 1]. In Section 6 we address this issue and propose a realization of B as a space of functions.

2. Realizing covariance structures with Banach spaces

In this section we construct a Gaussian measure on a Banach space along with random variables defined on this space for which the covariance structure is specified in advance. In more detail, suppose X is a non-empty set and

K : X × X → R (2.1)

a function that is symmetric and positive definite (in the sense that the matrix [K(p, q)]p,q∈X is symmetric and positive definite); we will construct a measure µ on a certain Banach space B along with a family of


Gaussian random variables K̃p, for each p ∈ X, such that

K(p, q) = Cov(K̃p, K̃q)    (2.2)

for all p, q ∈ X. A well-known choice for K(p, q), for p, q ∈ Rd, is given by

e^{−|p−q|^2/(2s)},

where s > 0 is a scale parameter.

The strategy is as follows: first we construct a Hilbert space H along with elements Kp ∈ H, for each p ∈ X, for which 〈Kp, Kq〉 = K(p, q) for all p, q ∈ X. Next we describe how to obtain a Banach space B, equipped with a Gaussian measure, along with random variables K̃p that have the required covariance structure (2.2). The first step is a standard result for reproducing kernel Hilbert spaces.

2.1. Constructing a Hilbert space from a covariance structure. For this we simply quote the well-known Moore-Aronszajn theorem (see, for example, Chapter 4 of Steinwart [22]).

Theorem 2.1. Let X be a non-empty set and K : X × X → R a function for which the matrix [K(p, q)]p,q∈X is symmetric and positive definite:

(i) K(p, q) = K(q, p) for all p, q ∈ X, and

(ii) ∑_{j,k=1}^{N} cj ck K(pj, pk) ≥ 0 holds for all integers N ≥ 1, all points p1, . . . , pN ∈ X, and all c1, . . . , cN ∈ R.

Then there is a unique Hilbert space H consisting of real-valued functions defined on the set X, containing the functions Kp = K(p, ·) for all p ∈ X, with the inner product on H being such that

f(p) = 〈Kp, f〉 for all f ∈ H and p ∈ X.    (2.3)

Moreover, the linear span of {Kp : p ∈ X} is dense in H.

The Hilbert space H is called the reproducing kernel Hilbert space (RKHS) over X with reproducing kernel K.

For the mapping

Φ : X → H : p 7→ Φ(p) = Kp,

known as the canonical feature map, we have

‖Φ(p) − Φ(q)‖^2 = K(p, p) − 2K(p, q) + K(q, q)    (2.4)

for all p, q ∈ X. Hence if X is a topological space and K is continuous then so is the function (p, q) 7→ ‖Φ(p) − Φ(q)‖^2 given in (2.4), and the value of this being 0 when p = q, it follows that p 7→ Φ(p) is continuous.


In particular, if X has a countable dense subset then so does the image Φ(X), and since this spans a dense subspace of H it follows that H is separable.
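For a finite-dimensional illustration of the feature map and of the identity (2.4), here is a short sketch using the polynomial kernel K(p, q) = (p · q)^2 on R^2 (a hypothetical choice, not the kernel used elsewhere in this paper), for which Φ can be written explicitly:

    import numpy as np

    # K(p, q) = (p . q)^2 has explicit feature map Phi(p) = (p1^2, p2^2, sqrt(2) p1 p2).
    def K(p, q):
        return float(np.dot(p, q)) ** 2

    def Phi(p):
        return np.array([p[0] ** 2, p[1] ** 2, np.sqrt(2) * p[0] * p[1]])

    rng = np.random.default_rng(1)
    p, q = rng.normal(size=2), rng.normal(size=2)

    lhs = np.sum((Phi(p) - Phi(q)) ** 2)            # ||Phi(p) - Phi(q)||^2
    rhs = K(p, p) - 2 * K(p, q) + K(q, q)           # right-hand side of (2.4)
    print(np.isclose(lhs, rhs))

    # The Gram matrix [K(p_i, p_j)] is symmetric positive semi-definite,
    # as required in Theorem 2.1 (ii).
    pts = rng.normal(size=(5, 2))
    G = np.array([[K(a, b) for b in pts] for a in pts])
    print(np.all(np.linalg.eigvalsh(G) >= -1e-10))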

2.2. Gaussian measures on Banach spaces. The theory of abstract Wiener spaces developed by Gross [8] provides the machinery for constructing a Gaussian measure on a Banach space B obtained by completing a given Hilbert space H using a special type of norm | · | called a measurable norm. Conversely, according to a fundamental result of Gross, every centered, non-degenerate Gaussian measure on a real separable Banach space arises in this way from completing an underlying Hilbert space called the Cameron-Martin space. We work here with real Hilbert spaces, as a complex structure plays no role in the Gaussian measure in this context.

Figure 1. A measurable norm | · | on H.

Definition 2.1. A norm | · | on a real separable Hilbert space H is said to be a measurable norm provided that for any ε > 0, there is a finite-dimensional subspace Fε of H such that:

γF{h ∈ F : |h| > ε} < ε    (2.5)

for every finite-dimensional subspace F of H with F ⊥ Fε, where γF denotes standard Gaussian measure on F.

We denote the norm on H arising from the inner-product 〈·, ·〉 by ‖ · ‖, which is not to be confused with a measurable norm | · |. Here are three facts about measurable norms (for proofs see Gross [8], Kuo [15] or Eldredge [7]):

(1) A measurable norm is always weaker than the original norm: there is c > 0 such that:

|h| ≤ c‖h‖,    (2.6)

for all h ∈ H.

(2) If H is infinite-dimensional, the original Hilbert norm ‖ · ‖ is not a measurable norm.


(3) If A is an injective Hilbert-Schmidt operator on H then

|h| = ‖Ah‖, for all h ∈ H (2.7)

specifies a measurable norm on H.

Henceforth we denote by B the Banach space obtained by completion of H with respect to | · |.

Any element f ∈ H gives a linear functional

H → R : g 7→ 〈f, g〉

that is continuous with respect to the norm ‖ · ‖. Let

HB∗

be the subspace of H consisting of all f for which g 7→ 〈f, g〉 is continuous with respect to the norm | · |. By fact (1) above, any linear functional on H that is continuous with respect to | · | is also continuous with respect to ‖ · ‖ and hence is of the form 〈f, ·〉 for a unique f ∈ H by the traditional Riesz theorem. Thus HB∗ consists precisely of those f ∈ H for which the linear functional 〈f, ·〉 is the restriction to H of a (unique) continuous linear functional on the Banach space B. We denote this extension of 〈f, ·〉 to an element of B∗ by I(f):

I(f) = the continuous linear functional on B agreeing with 〈f, ·〉 on H    (2.8)

for all f ∈ HB∗. Moreover, B∗ consists exactly of the elements I(f) for f ∈ HB∗: for any φ ∈ B∗ the restriction φ|H is in H∗ (because of the relation (2.6)) and hence φ = I(f) for a unique f ∈ HB∗.

The fundamental result of Gross [8] is that there is a unique Borel measure µ on B such that for every f ∈ HB∗ the linear functional I(f), viewed as a random variable on (B, µ), is Gaussian with mean 0 and variance ‖f‖^2; thus:

∫_B e^{iI(f)} dµ = e^{−‖f‖^2_H/2}    (2.9)

for all f ∈ HB∗ ⊂ H. The mapping

I : HB∗ → L^2(B, µ)

is a linear isometry.

A triple (H, B, µ), where H is a real separable Hilbert space, B is the Banach space obtained by completing H with respect to a measurable norm, and µ is the Gaussian measure in (2.9), is called an abstract Wiener space. The measure µ is known as Wiener measure on B.


Suppose g ∈ H is orthogonal to HB∗ ; then for any φ ∈ B∗ we have

φ(g) = 〈f, g〉 = 0

where f is the element of HB∗ for which I(f) = φ. Since φ(g) = 0 for all φ ∈ B∗ it follows that g = 0. Thus,

HB∗ is dense in H.    (2.10)

Consequently I extends uniquely to a linear isometry of H into L^2(B, µ). We denote this extension again by I:

I : H → L2(B, µ). (2.11)

Figure 2. Abstract Wiener space.

It follows by taking L^2-limits that the random variable I(f) is again Gaussian, satisfying (2.9), for every f ∈ H.

It will be important for our purposes to emphasize again that if f ∈ H is such that the linear functional 〈f, ·〉 on H is continuous with respect to the norm | · | then the random variable I(f) is the unique continuous linear functional on B that agrees with 〈f, ·〉 on H. If 〈f, ·〉 is not continuous on H with respect to the norm | · | then I(f) is obtained by L^2(B, µ)-approximating with elements of B∗.

2.3. Construction of the random variables K̃p. As before, we assume given a model covariance structure

K : X × X → R.

We have seen how this gives rise to a Hilbert space H, which is the closed linear span of the functions Kp = K(p, ·), with p running over X. Now let | · | be any measurable norm on H, and let B be the Banach space obtained by completion of H with respect to | · |. Let µ be the centered Gaussian measure on B as discussed above in the context of (2.9). We set

K̃p def= I(Kp),    (2.12)


for all p ∈ X; thus K̃p ∈ L^2(B, µ). As a random variable on (B, µ) the function K̃p is Gaussian, with mean 0 and variance ‖Kp‖^2, and the covariance structure is given by:

Cov(K̃p, K̃q) = 〈K̃p, K̃q〉_{L^2(µ)} = 〈Kp, Kq〉_H (since I is an isometry) = K(p, q).    (2.13)

Thus we have produced Gaussian random variables K̃p, for each p ∈ X, on a Banach space, with a given covariance structure.
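In finite-dimensional marginals this construction can be simulated directly: by (2.13) the vector (K̃p1, . . . , K̃pn) has law N(0, KD). A short sketch (with the Gaussian kernel as a hypothetical choice of K):

    import numpy as np

    rng = np.random.default_rng(2)
    pts = np.linspace(0.0, 3.0, 5)[:, None]
    KD = np.exp(-((pts - pts.T) ** 2) / 2.0)               # [K(p_i, p_j)]

    # Sample the marginal N(0, K_D) and compare its empirical covariance with K_D.
    L = np.linalg.cholesky(KD + 1e-12 * np.eye(len(pts)))
    samples = (L @ rng.normal(size=(len(pts), 20000))).T   # rows ~ N(0, K_D)

    print(np.round(np.cov(samples, rowvar=False), 2))      # approximately K_D
    print(np.round(KD, 2))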

2.4. Classical Wiener space. Let us look at the case of the classical Wiener space, as an example of the preceding structures. We take X = [0, T] for some positive real T, and

KBM(s, t) = min{s, t} = 〈1_{[0,s]}, 1_{[0,t]}〉_{L^2},    (2.14)

for all s, t ∈ [0, T], where the superscript refers to Brownian motion. Then KBM_s(x) is x for x ∈ [0, s] and is constant at s when x ∈ [s, T]. It follows then that H0, the linear span of the functions KBM_s for s ∈ [0, T], is the set of all functions on [0, T] with initial value 0 and graph consisting of linear segments; in other words, H0 consists of all functions f on [0, T] whose derivative exists and is locally constant outside a finite set of points. The inner product on H0 is determined by observing that

〈KBM_s, KBM_t〉 = KBM(s, t) = min{s, t};    (2.15)

we can verify that this coincides with

〈f, g〉 = ∫_0^T f′(u) g′(u) du    (2.16)

for all f, g ∈ H0. As a consequence, H is the Hilbert space consisting of functions f on [0, T] that can be expressed as

f(t) = ∫_0^t g(x) dx for all t ∈ [0, T],    (2.17)

for some g ∈ L^2[0, T]; equivalently, H consists of all absolutely continuous functions f on [0, T], with f(0) = 0, for which the derivative f′ exists almost everywhere and is in L^2[0, T]. The sup norm | · |_sup is a measurable norm on H (see Gross [8, Example 2] or Kuo [15]). The completion B is then C0[0, T], the space of all continuous functions starting at 0. If f ∈ H0 then I(f) is Gaussian of mean 0 and variance ‖f‖^2, relative to the Gaussian measure µ on B. We can check readily that K̃BM_t = I(KBM_t)


is Gaussian of mean 0 and variance t for all t ∈ [0, T]. The process t 7→ K̃BM_t yields standard Brownian motion, after a continuous version is chosen.
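As a numerical sketch of this example, discretized Brownian paths can be drawn directly from the covariance min{s, t} on a grid (the grid and sample sizes below are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(3)
    T, n = 1.0, 500
    t = np.linspace(T / n, T, n)
    C = np.minimum(t[:, None], t[None, :])             # covariance K^BM(s, t) = min{s, t}

    L = np.linalg.cholesky(C + 1e-12 * np.eye(n))
    path = L @ rng.normal(size=n)                      # one discretized Brownian path
    print(path[:5])

    # Variance at the final time is approximately T, as expected for K~^BM_T.
    many = L @ rng.normal(size=(n, 5000))
    print(np.var(many, axis=1)[-1], t[-1])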

3. The Gaussian Radon Transform

The classical Radon transform Rf of a function f : Rn → R associates to each hyperplane P ⊂ Rn the integral of f over P; this is useful in scanning technologies and image reconstruction. The Gaussian Radon transform generalizes this to infinite dimensions and works with Gaussian measures (instead of Lebesgue measure, for which there is no useful infinite dimensional analog); the motivation for studying this transform comes from the task of reconstructing a random variable from its conditional expectations. We have developed this transform in the setting of abstract Wiener spaces in our work [10] (earlier works, in other frameworks, include [2, 3, 16]).

Let H be a real separable Hilbert space, and B the Banach space obtained by completing H with respect to a measurable norm | · |. Let M0 be a closed subspace of H. Then

Mp = p+M0 (3.1)

is a closed affine subspace of H, for any point p ∈ H. In [10, Theorem 2.1] we have constructed a probability measure µ_{Mp} on B uniquely specified by its Fourier transform

∫_B e^{ix∗} dµ_{Mp} = e^{i(p, x∗) − ‖P_{M0} x∗‖^2/2}, for all x∗ ∈ B∗ ≅ HB∗ ⊂ H    (3.2)

where P_{M0} denotes orthogonal projection on M0 in H. For our present purposes we formulate the description of the measure µ_{Mp} in slightly different terms. For every closed affine subspace L ⊂ H there is a Borel probability measure µL on B and there is a linear mapping

I : H → L2(B, µL)

such that for every h ∈ H, the random variable I(h) satisfies

∫_B e^{iI(h)} dµL = e^{i〈fL, h〉 − ‖Ph‖^2/2}, for all h ∈ H,    (3.3)

where fL is the point on L closest to the origin in H, and P : H → H is the orthogonal projection operator onto the closed subspace −fL + L. Moreover, µL is concentrated on the closure of L in B:

µL(L̄) = 1,


where L̄ denotes the closure in B of the affine subspace L ⊂ H. (Note that the mapping I depends on the subspace L. For more details about I_L, see Corollary 3.1.1 and observation (v) following Theorem 2.1 in [11].)

From the condition (3.3), holding for all h ∈ H, we see that I(h) is Gaussian with mean 〈h, fL〉 and variance ‖Ph‖^2.

The Gaussian Radon transform Gf for a Borel function f on B is a function defined on the set of all closed affine subspaces L in H by:

Gf(L) def= ∫_B f dµL.    (3.4)

(Of course, Gf is defined if the integral is finite for all such L.)

The value Gf(L) corresponds to the conditional expectation of f, the conditioning being given by the closed affine subspace L. For a precise formulation of this and proof for L being of finite codimension see the disintegration formula given in [11, Theorem 3.1]. The following is an immediate consequence of [11, Corollary 3.2] and will play a role in Section 4:

Proposition 3.1. Let (H, B, µ) be an abstract Wiener space and h1, h2, . . . , hn linearly independent elements of H. Let f be a Borel function on B, square-integrable with respect to µ, and let

F : Rn → R : (y1, . . . , yn) 7→ F(y1, . . . , yn) = Gf( ⋂_{j=1}^{n} [〈hj, ·〉 = yj] ).

Then F(I(h1), I(h2), . . . , I(hn)) is a version of the conditional expectation E[f | σ(I(h1), I(h2), . . . , I(hn))].
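A finite-dimensional toy version of this picture (a sketch under the simplifying assumption H = B = R^d with standard Gaussian measure, which is not the infinite-dimensional setting of this section): for L = {x : 〈h, x〉 = y}, the measure µL is the Gaussian on L with mean (y/‖h‖^2)h and covariance the orthogonal projection onto h⊥, and Gf(L) = ∫ f dµL can be compared with a closed-form value.

    import numpy as np

    rng = np.random.default_rng(4)
    d, y = 3, 0.7
    h = rng.normal(size=d)
    f = lambda x: np.sum(x ** 2, axis=-1)                 # test function f(x) = |x|^2

    mean = (y / np.dot(h, h)) * h                         # point of L closest to the origin
    P = np.eye(d) - np.outer(h, h) / np.dot(h, h)         # projection onto h-perp

    z = rng.normal(size=(200000, d))
    Gf_L = f(mean + z @ P).mean()                          # Monte Carlo for Gf(L) = int f dmu_L

    # For f(x) = |x|^2 the exact value is |mean|^2 + trace(P) = y^2/||h||^2 + (d - 1).
    print(Gf_L, y ** 2 / np.dot(h, h) + (d - 1))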

4. Machine Learning using the Gaussian Radon Transform

Consider again the learning problem outlined in the introduction: suppose X is a separable topological space and H is the RKHS over X with reproducing kernel K : X × X → R. Let:

D = {(p1, y1), (p2, y2), . . . , (pn, yn)} ⊂ X × R    (4.1)

be our training data and suppose we want to predict the value y at a future value p ∈ X. Note that H is a real separable Hilbert space so, as outlined in Section 2.2, we may complete H with respect to a measurable norm | · | to obtain the Banach space B and Gaussian measure µ. Moreover, recall from Section 2.3 that we constructed for every p ∈ X the random variable

K̃p = I(Kp) ∈ L^2(B, µ),    (4.2)


where I : H → L2(B, µ) is the isometry in (2.11), and

Cov(K̃p, K̃q) = 〈Kp, Kq〉 = K(p, q)    (4.3)

for all p, q ∈ X. Therefore {K̃p : p ∈ X} is a centered Gaussian process with covariance K.

Now since for every f ∈ H the value f(p) is given by 〈Kp, f〉, we work with the random variable K̃p as our prediction. The first guess would therefore be to use:

ŷ = E[K̃p | K̃p1 = y1, . . . , K̃pn = yn]    (4.4)

as our prediction of the output y corresponding to the input p.

We will now need the following observation.

Lemma 4.1. Suppose that (Z, Z1, Z2, . . . , Zn) is a centered Rn+1-valued Gaussian random variable on a probability space (Ω, F, P) and let A ∈ Rn×n be the covariance matrix:

Ajk = Cov(Zj, Zk), for all 1 ≤ j, k ≤ n,

and suppose that A is invertible. Then:

E[Z | Z1, Z2, . . . , Zn] = a1Z1 + a2Z2 + . . . + anZn    (4.5)

where a = (a1, . . . , an) ∈ Rn is given by

a = A^{-1} z,    (4.6)

where z ∈ Rn is given by zk = Cov(Z, Zk) for all 1 ≤ k ≤ n.

Proof. Let Z0 be the orthogonal projection of Z on the linear span of Z1, . . . , Zn; thus Y = Z − Z0 is orthogonal to Z1, . . . , Zn and, of course, (Y, Z1, . . . , Zn) is Gaussian. Hence, being all jointly Gaussian, the random variable Y is independent of (Z1, . . . , Zn). Then for any S ∈ σ(Z1, . . . , Zn) we have

E[Z 1_S] = E[Z0 1_S] + E[Y 1_S] = E[Z0 1_S] + E[Y] E[1_S] = E[Z0 1_S].    (4.7)

Since this holds for all S ∈ σ(Z1, . . . , Zn), and since the random variable Z0, being a linear combination of Z1, . . . , Zn, is measurable with respect to σ(Z1, . . . , Zn), we conclude that

Z0 = E [Z |σ(Z1, . . . , Zn)] . (4.8)

Thus the conditional expectation of Z is the orthogonal projection Z0 onto the linear span of the variables Z1, . . . , Zn.


Writing

Z0 = a1Z1 + a2Z2 + . . . + anZn,

we have

E[Z Zj] = E[Z0 Zj] = ∑_{k=1}^{n} E[Zj Zk] ak = (Aa)j,

noting that Ajk = Cov(Zj, Zk) = E[Zj Zk] since all these variables have mean 0 by hypothesis. Hence we have a = A^{-1} z.
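A quick numerical sanity check of Lemma 4.1 (a sketch with an arbitrary randomly generated covariance): the coefficients a = A^{-1}z agree with the linear regression of Z on (Z1, . . . , Zn) estimated from samples.

    import numpy as np

    rng = np.random.default_rng(5)
    n, N = 3, 400000
    M = rng.normal(size=(n + 1, n + 1))
    C = M @ M.T                                    # covariance of (Z, Z_1, ..., Z_n)

    A = C[1:, 1:]                                  # A_jk = Cov(Z_j, Z_k)
    z = C[0, 1:]                                   # z_k = Cov(Z, Z_k)
    a = np.linalg.solve(A, z)                      # coefficients from (4.6)

    X = rng.multivariate_normal(np.zeros(n + 1), C, size=N)
    Z, Zs = X[:, 0], X[:, 1:]
    a_mc = np.linalg.lstsq(Zs, Z, rcond=None)[0]   # least squares = orthogonal projection

    print(np.round(a, 3), np.round(a_mc, 3))       # approximately equal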

Applying Lemma 4.1, our prediction in (4.4) becomes:

ŷ = [K(p1, p) . . . K(pn, p)] (KD)^{-1} y = ∑_{j=1}^{n} bj Kpj(p)

where b = (KD)^{-1} y and KD is, as in the introduction, the matrix with entries K(pi, pj). As we shall see in Theorem 5.2, the function ∑_{j=1}^{n} bj Kpj is the element f0 ∈ H of minimal norm such that f0(pi) = yi for all 1 ≤ i ≤ n. Therefore (4.4) is the solution in the traditional spline setting, and does not take into account the regularization parameter λ. However, the next theorem shows that the ridge regression solution can be obtained in terms of the Gaussian Radon transform, by taking the Gaussian process approach outlined in the introduction.

Theorem 4.1. Let H be the RKHS over a separable topological space X with real-valued reproducing kernel K and B be the completion of H with respect to a measurable norm | · | with Wiener measure µ. Let

D = {(p1, y1), (p2, y2), . . . , (pn, yn)} ⊂ X × R    (4.9)

be fixed and p ∈ X. Let {e1, e2, . . . , en} ⊂ H be an orthonormal set such that:

{e1, . . . , en} ⊂ [span{Kp1, . . . , Kpn, Kp}]^⊥    (4.10)

and for every 1 ≤ j ≤ n let ẽj = I(ej), where I : H → L^2(B, µ) is the isometry in (2.11). Then for any λ > 0:

E[K̃p | K̃p1 + √λ ẽ1 = y1, . . . , K̃pn + √λ ẽn = yn] = fλ(p)    (4.11)

where fλ is the solution to the ridge regression problem in (1.2) with regularization parameter λ. Consequently:

fλ(p) = GK̃p( ⋂_{j=1}^{n} [〈Kpj + √λ ej, ·〉 = yj] )    (4.12)


where GK̃p is the Gaussian Radon transform of K̃p.

Note that a completely precise statement of the relation (4.11) is as follows. Let us for the moment write the quantity fλ(p), involving the vector y, as

fλ(p; y).

Then

fλ(p; (K̃p1 + √λ ẽ1, . . . , K̃pn + √λ ẽn))    (4.13)

is a version of the conditional expectation

E[K̃p | σ(K̃p1 + √λ ẽ1, . . . , K̃pn + √λ ẽn)].    (4.14)

Proof. By our assumption in (4.10):

Cov(K̃pi + √λ ẽi, K̃pj + √λ ẽj) = 〈Kpi + √λ ei, Kpj + √λ ej〉 = K(pi, pj) + λ δ_{i,j}

for all 1 ≤ i, j ≤ n. We note that this covariance matrix is invertible (see the argument preceding equation (5.16) below). Moreover,

Cov(K̃p, K̃pj + √λ ẽj) = K(pj, p)

for all 1 ≤ j ≤ n. Then by Lemma 4.1:

E[K̃p | K̃p1 + √λ ẽ1 = y1, . . . , K̃pn + √λ ẽn = yn]
= [K(p1, p) . . . K(pn, p)] (KD + λIn)^{-1} y
= ∑_{j=1}^{n} cj Kpj(p), where c = (KD + λIn)^{-1} y,
= fλ(p).

Finally, (4.12) follows directly from Proposition 3.1.

The interpretation of the predicted value in terms of the Gaussian Radon transform allows for quite a broad class of functions that can be considered for prediction. As a simple example, consider the task of predicting the maximum value of an unknown function over a future period using knowledge from the training data. For this we take both the RKHS and the Banach space to be spaces of real-valued functions on X (see Section 6 for more on this); the predicted value would be

GF(L)    (4.15)


where L is the closed subspace of the RKHS reflecting the training data, and F is, for example, a function of the form

F(b) = sup_{p∈S} K̃p(b)    (4.16)

for some given set S of ‘future dates’. We note that the prediction (4.15) is, in general, different from

sup_{p∈S} GK̃p(L),

where GK̃p(L) is the SVM prediction as in (4.12); in other words, the prediction (4.15) is not the same as simply taking the supremum over the predictions given by the SVM minimizer. We note also that in this type of problem the Hilbert space, being a function space, is necessarily infinite-dimensional.
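The difference between these two quantities can be seen already in a finite-dimensional Monte Carlo surrogate (a sketch, assuming a Gaussian kernel and with the usual Gaussian-process conditional law standing in for the conditioning on the training data): the average of the maximum over conditional samples exceeds the maximum of the pointwise predictions.

    import numpy as np

    rng = np.random.default_rng(6)
    K = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / 2.0)

    ptrain = np.array([0.0, 0.5, 1.0, 1.5])            # training inputs
    y = np.array([0.0, 0.8, 0.9, 0.2])                 # training outputs
    S = np.linspace(2.0, 3.0, 25)                      # 'future dates'
    lam = 0.1                                          # regularization / noise variance

    A = K(ptrain, ptrain) + lam * np.eye(len(ptrain))
    mean_S = K(S, ptrain) @ np.linalg.solve(A, y)      # pointwise predictions f_lambda(p), p in S
    cov_S = K(S, S) - K(S, ptrain) @ np.linalg.solve(A, K(ptrain, S))

    # Average of sup_{p in S} over conditional samples, an estimate of the sup-type prediction.
    Lc = np.linalg.cholesky(cov_S + 1e-10 * np.eye(len(S)))
    samples = mean_S + (Lc @ rng.normal(size=(len(S), 50000))).T
    print(samples.max(axis=1).mean())                  # expected maximum
    print(mean_S.max())                                # sup of the pointwise predictions: smaller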

5. A geometric formulation

In this section we present a geometric view of the relationship between the Gaussian Radon transform and the representer theorem used in support vector machine theory; thus, this will be a geometric interpretation of Theorem 4.1.

Given a reproducing kernel Hilbert space H of functions defined on a set X with reproducing kernel K : X × X → R, we wish to find a function fλ ∈ H that minimizes the functional

Eλ(f) = ∑_{j=1}^{n} [yj − f(pj)]^2 + λ‖f‖^2,    (5.1)

where p1, . . . , pn are given points in X, y1, . . . , yn are given values in R, and λ > 0 is a parameter.

Our first goal in this section is to present a geometric proof of the following representer theorem widely used in support vector machine theory. The result has its roots in the work of Kimeldorf and Wahba [13, 14] (for example, [13, Lemmas 2.1 and 2.2]) in the context of splines; in this context it is also worth noting the work of de Boor and Lynch [5] where Hilbert space methods were used to study splines.

Theorem 5.1. With notation and hypotheses as above, there is a unique fλ ∈ H such that Eλ(fλ) is inf_{f∈H} Eλ(f). Moreover, fλ is given explicitly


by

fλ = ∑_{i=1}^{n} ci Kpi    (5.2)

where the vector c ∈ Rn is (KD + λIn)^{-1} y, with KD being the n × n matrix with entries [KD]ij = K(pi, pj) and y = (y1, . . . , yn) ∈ Rn.

Proof. It will be convenient to scale the inner-product 〈·, ·〉 of H by λ. Consequently, we denote by Hλ the space H with inner-product:

〈f, g〉Hλ = λ〈f, g〉, for all f, g ∈ H. (5.3)

We shall use the linear mapping

T : Rn → Hλ (5.4)

that maps ei to λ^{-1}Kpi for i ∈ {1, . . . , n}, where {e1, . . . , en} is the standard basis of Rn:

T(ei) = λ^{-1}Kpi for i ∈ {1, . . . , n}.

We observe then that for any f ∈ Hλ

〈T∗f, ei〉Rn = 〈f, Tei〉Hλ = λ〈f, λ^{-1}Kpi〉 = f(pi)    (5.5)

for each i, and so

T∗f = ∑_{j=1}^{n} f(pj) ej.    (5.6)

Consequently, we can rewrite Eλ(f) as

Eλ(f) = ‖y − T∗f‖^2_{Rn} + ‖f‖^2_{Hλ},    (5.7)

and from this we see that Eλ(f) has a geometric meaning as the squared distance from the point (f, T∗f) ∈ Hλ ⊕ Rn to the point (0, y) in Hλ ⊕ Rn:

Eλ(f) = ‖(0, y) − (f, T∗f)‖^2_{Hλ⊕Rn}.    (5.8)

Thus the minimization problem for Eλ(·) is equivalent to finding the point on the subspace

M = {(f, T∗f) : f ∈ Hλ} ⊂ Hλ ⊕ Rn    (5.9)

closest to (0, y). Now the subspace M is just the graph Gr(T∗) and it is a closed subspace of Hλ ⊕ Rn because it is the orthogonal complement of a subspace (as we see below in (5.13)). Hence by standard Hilbert space theory there is indeed a unique point on M that is closest to (0, y), and this point is in fact of the form

(fλ, T∗fλ) = (0, y) + (a, b), (5.10)


Figure 3. A geometric interpretation of Theorem 5.1.

where the vector (a, b) ∈ Hλ ⊕ Rn is orthogonal to M. Now the condition for orthogonality to M means that

〈(a, b), (f, T∗f)〉Hλ⊕Rn = 0 for all f ∈ H,

and this is equivalent to

0 = 〈a, f〉Hλ + 〈b, T∗f〉Rn = 〈a + Tb, f〉Hλ = λ〈a + Tb, f〉

for all f ∈ H. Therefore

a + Tb = 0.    (5.11)

Thus,

[Gr(T∗)]^⊥ = {(−Tc, c) : c ∈ Rn}.    (5.12)

Conversely, we can check directly that

Gr(T∗) = {(−Tc, c) : c ∈ Rn}^⊥.    (5.13)

Returning to (5.10) we see that the point on M closest to (0, y) is of the form

(fλ, T∗fλ) = (a, y + b) = (−Tb, y + b)    (5.14)

for some b ∈ Rn. Since the second component is T∗ applied to the first, we have

y + b = −T∗Tb,

and solving for b we obtain

b = −(T∗T + In)^{-1} y.    (5.15)

Note that the operator T∗T + In on Rn is invertible, since 〈(T∗T + In)w, w〉Rn ≥ ‖w‖^2_{Rn}, so that if (T∗T + In)w = 0 then w = 0. Then from (5.14) we have fλ = a = −Tb given by

fλ = T[(T∗T + In)^{-1} y].    (5.16)


Now we just need to write this in coordinates. The matrix for T∗T has entries

〈(T∗T)ei, ej〉Rn = 〈Tei, Tej〉Hλ = 〈λ^{-1}Kpi, λ^{-1}Kpj〉Hλ = λ〈λ^{-1}Kpi, λ^{-1}Kpj〉 = λ^{-1}[KD]ij,    (5.17)

and so

fλ = T[(T∗T + In)^{-1} y] = T[ ∑_{i,j=1}^{n} (In + λ^{-1}KD)^{-1}_{ij} yj ei ].

Since Tei = λ^{-1}Kpi, we can write this as

fλ = ∑_{i=1}^{n} [ ∑_{j=1}^{n} (In + λ^{-1}KD)^{-1}_{ij} yj ] λ^{-1}Kpi,

which simplifies readily to (5.2).
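As a quick numerical check of this last simplification (a sketch with a randomly generated Gram matrix standing in for KD): the coefficients produced by the geometric formula (5.16)–(5.17) coincide with the representer coefficients (KD + λIn)^{-1}y of (5.2).

    import numpy as np

    rng = np.random.default_rng(7)
    n, lam = 4, 0.3
    pts = rng.normal(size=(n, 1))
    KD = np.exp(-((pts - pts.T) ** 2) / 2.0)            # Gram matrix [K(p_i, p_j)]
    y = rng.normal(size=n)

    TstarT = KD / lam                                   # matrix of T*T, from (5.17)
    w = np.linalg.solve(TstarT + np.eye(n), y)          # (T*T + I_n)^(-1) y
    coeffs_geometric = w / lam                          # applying T: coefficient of K_{p_i} is w_i / lambda

    coeffs_representer = np.linalg.solve(KD + lam * np.eye(n), y)   # (K_D + lambda I_n)^(-1) y
    print(np.allclose(coeffs_geometric, coeffs_representer))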

The observations about the graph Gr(T∗) used in the preceding proof are in the spirit of the analysis of adjoints of operators carried out by von Neumann [23].

With fλ being the minimizer as above, we can calculate the minimum value of Eλ(·):

Eλ(fλ) = ‖(0, y) − (fλ, T∗fλ)‖^2_{Hλ⊕Rn} = ‖(a, b)‖^2_{Hλ⊕Rn}    (using (5.10))
= 〈Tb, Tb〉Hλ + 〈b, b〉Rn = 〈T∗Tb, b〉Rn + 〈b, b〉Rn = 〈(T∗T + In)b, b〉Rn
= 〈−y, b〉Rn    (using (5.15))
= ‖(T∗T + In)^{-1/2} y‖^2_{Rn}    (again using (5.15)).    (5.18)


It is useful to keep in mind that our definition of T in (5.4), and hence of T∗, depends on λ. We note that the norm squared of fλ itself is

‖fλ‖^2 = λ^{-1}〈T(T∗T + In)^{-1} y, T(T∗T + In)^{-1} y〉Hλ
= λ^{-1}‖(T∗T)^{1/2}(T∗T + In)^{-1} y‖^2_{Rn}
= λ^{-1}‖λ^{-1/2} KD^{1/2} (λ^{-1}KD + In)^{-1} y‖^2_{Rn}
= ‖KD^{1/2} (KD + λIn)^{-1} y‖^2_{Rn}.    (5.19)

Let us now turn to the traditional spline setting. We must now find a function f ∈ H of minimum norm whose graph passes through the training points (pi, yi), for i ∈ {1, . . . , n}. We present here a geometrical description in the spirit of Theorem 5.1. This is in fact the result one would obtain by formally taking λ = 0 in Theorem 5.1.

Theorem 5.2. Let H be a reproducing kernel Hilbert space of functions on a set X, and let (p1, y1), . . . , (pn, yn) be points in X × R. Let K : X × X → R be the reproducing kernel for H, and let Kq : X → R : q′ 7→ K(q, q′), for every q ∈ X. Assume that the functions Kp1, . . . , Kpn are linearly independent. Then, for any y = (y1, . . . , yn) ∈ Rn, the element in

{f ∈ H : f(p1) = y1, . . . , f(pn) = yn}

of minimum norm is given by

f0 = ∑_{i=1}^{n} c0i Kpi,    (5.20)

where c0 = (c01, . . . , c0n) = KD^{-1} y, with KD being the n × n matrix whose (i, j)-th entry is K(pi, pj).

The assumption of linear independence of the Kpi is simply to ensure that there does exist a function f ∈ H with values yi at the points pi.

Proof. Let T0 : Rn → H be the linear mapping specified by T0ei = Kpi, for i ∈ {1, . . . , n}. Then the adjoint T∗0 : H → Rn is given explicitly by

T∗0 f = ∑_{i=1}^{n} 〈f, Kpi〉 ei = ∑_{i=1}^{n} f(pi) ei


and so

{f ∈ H : f(p1) = y1, . . . , f(pn) = yn} = {f ∈ H : T∗0 f = y}.    (5.21)

Since the linear functionals 〈Kpi, ·〉 are linearly independent, no nontrivial linear combination of them is 0 and so the only vector in Rn orthogonal to the range of T∗0 is 0; thus

Im(T∗0) = Rn.    (5.22)

Let f0 be the point on the closed affine subspace in (5.21) that is nearest the origin in H. Then f0 is the point on {f ∈ H : T∗0 f = y} orthogonal to ker T∗0.

Figure 4. The point on {f ∈ H : T∗0 f = y} closest to the origin.

Now it is a standard observation that

kerT ∗0 = [Im(T0)]⊥ .

(If T ∗0 v = 0 then 〈v, T0w〉 = 〈T ∗0 v, w〉 = 0, so that kerT ∗0 ⊂ [Im(T0)]⊥;

conversely, if v ∈ [Im(T0)]^⊥ then 〈T∗0 v, w〉 = 〈v, T0w〉 = 0 for all w ∈ Rn

and so T∗0 v = 0.) Hence (ker T∗0)^⊥ is the closure of Im(T0). Now Im(T0) is a finite-dimensional subspace of H and hence is closed; therefore

(ker T∗0)^⊥ = Im(T0).    (5.23)

Returning to our point f0 we conclude that f0 ∈ Im(T0). Thus, f0 = T0g for some g ∈ Rn. The requirement that T∗0 f0 be equal to y means that (T∗0 T0)g = y, and so

f0 = T0(T∗0 T0)^{-1} y.    (5.24)


We observe here that T∗0 T0 : Rn → Rn is invertible because any v ∈ ker(T∗0 T0) satisfies ‖T0v‖^2 = 〈T∗0 T0v, v〉 = 0, so that ker(T∗0 T0) = ker(T0), and this is {0}, again by the linear independence of the functions Kpi. The matrix for T∗0 T0 is just the matrix KD because its (i, j)-th entry is

(T∗0 T0)ij = 〈T0ei, T0ej〉 = 〈Kpi, Kpj〉 = K(pi, pj) = (KD)ij.

Thus f0 works out to ∑_{i,j=1}^{n} (KD^{-1})_{ij} yj Kpi.
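A short numerical sketch of Theorem 5.2 (with a hypothetical Gaussian-kernel Gram matrix): the coefficients c0 = KD^{-1}y interpolate the data, and they are the λ → 0 limit of the ridge coefficients of Theorem 5.1.

    import numpy as np

    pts = np.array([[0.0], [1.0], [2.0]])
    KD = np.exp(-((pts - pts.T) ** 2) / 2.0)            # Gram matrix [K(p_i, p_j)]
    y = np.array([1.0, -0.5, 0.3])

    c0 = np.linalg.solve(KD, y)                         # c0 = K_D^(-1) y, as in (5.20)
    print(np.allclose(KD @ c0, y))                      # f_0(p_i) = y_i

    for lam in [1e-2, 1e-4, 1e-6]:
        c_lam = np.linalg.solve(KD + lam * np.eye(3), y)
        print(lam, np.linalg.norm(c_lam - c0))          # ridge coefficients -> c0 as lambda -> 0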

Working in the setting of Theorem 5.2, and assuming that H is separable, let B be the Banach space obtained as completion of H with respect to a measurable norm. Recall from (3.3) that the Gaussian Radon transform of a function F on B is the function on the set of closed affine subspaces of H given by

GF(L) = ∫_B F dµL,    (5.25)

where L is any closed affine subspace of H, and µL is the Borel measure on B specified by the Fourier transform

∫_B e^{iI(h)} dµL = e^{i〈h, f0〉 − ‖Ph‖^2/2}, for all h ∈ H,    (5.26)

wherein f0 is the point on L closest to the origin in H and P is the orthogonal projection onto the closed subspace L − f0. Let us apply this to the closed affine subspace

L = {f ∈ H : f(p1) = y1, . . . , f(pn) = yn}.

From equation (5.26) we see that I(h) is a Gaussian variable with mean 〈h, f0〉 and variance ‖Ph‖^2. Now let us take for h the function Kp ∈ H, for any point p ∈ X; then I(Kp) is Gaussian, with respect to the measure µL, with mean

EµL(K̃p) = 〈Kp, f0〉 = f0(p),    (5.27)

where K̃p is the random variable I(Kp) defined on (B, µL). The function f0 is as given in (5.20). Now consider the special case where p = pi from the training data. Then

EµL(K̃pi) = f0(pi) = yi,

because f0 belongs to L, by definition. Moreover, the variance of K̃pi is the norm-squared of the orthogonal projection of Kpi onto the closed subspace

L0 = L − f0 = {f ∈ H : f(p1) = 0, . . . , f(pn) = 0}.


However, for any f ∈ L we have

〈f − f0, Kpi〉 = f(pi)− f0(pi) = 0,

and so the variance of K̃pi is 0. Thus, with respect to the measure µL, the functions K̃pi take the constant values yi almost everywhere. This is analogous to our result Theorem 4.1; in the present context the conclusion is

f0(p) = E[K̃p | K̃p1 = y1, . . . , K̃pn = yn].    (5.28)

We can also obtain a geometrical perspective on Theorem 4.1 itself, by working with the closed affine subspace of Hλ ⊕ Rn given by

L1 = −(0, y) + Gr(T ∗).

The point in this set closest to (0, 0) is

v = −(0, y) + (fλ, T∗fλ), (5.29)

where (fλ, T∗fλ) is the point on Gr(T ∗) closest to (0, y). Thus,

v = (fλ,−y + T ∗fλ).

Since Gr(T∗)^⊥ is the subspace {(Tc, −c) : c ∈ Rn} it has basis vectors (Tei, −ei) for i ∈ {1, . . . , n}. We observe that for any (f, z) ∈ L1:

〈(f, z), (Tei, −ei)〉Hλ⊕Rn = 〈(0, −y), (Tei, −ei)〉Hλ⊕Rn = yi.

Thus,

L1 = ⋂_{i=1}^{n} {(f, z) ∈ Hλ ⊕ Rn : 〈(f, z), (Tei, −ei)〉Hλ⊕Rn = yi}.

Using the procedure of Section 2.2, we complete Hλ ⊕ Rn with respect to a measurable norm | · | to obtain a Banach space B. Now consider the measure µL1, specified in (3.2), for the closed affine subspace L1 of Hλ ⊕ Rn, and the map I : Hλ ⊕ Rn → L^2(B, µL1). For every h ∈ Hλ ⊕ Rn, IL1(h) is Gaussian with mean 〈v, h〉, with v being the point of L1 closest to (0, 0). With this in mind, take (λ^{-1}Kq, 0) for h, where q ∈ X and Kq ∈ H. We have

〈(λ^{-1}Kq, 0), v〉Hλ⊕Rn = 〈(λ^{-1}Kq, 0), (fλ, −y + T∗fλ)〉Hλ⊕Rn = fλ(q).

So the “predicted” value fλ(q) is exactly the mean of the random variable IL1(λ^{-1}Kq, 0) on B with respect to the measure µL1.


6. Realizing B as a space of functions

In this section we present some results of a somewhat technical nature to address the question as to whether the elements of the Banach space B can be viewed as functions.

6.1. Continuity of Kp. A general measurable norm on H does not ‘know’ about the kernel function K and hence there seems to be no reason why the functionals 〈Kp, ·〉 on H would be continuous with respect to the norm | · |. To remedy this situation we prove that there exist measurable norms on H relative to which the functionals 〈Kp, ·〉 are continuous for p running along a dense sequence of points in X:

Proposition 6.1. Let H be the reproducing kernel Hilbert space associated to a continuous kernel function K : X × X → R, where X is a separable metric space. Let D be a countable dense subset of X. Then there is a measurable norm | · | on H with respect to which 〈Kp, ·〉 is a continuous linear functional for every p ∈ D.

Proof. Let D consist of the points p1, p2, . . ., forming a dense subset of X. By the Gram-Schmidt process we obtain an orthonormal basis e1, e2, . . . of H such that Kpn is contained in the span of e1, . . . , en, for every n. Now consider the bounded linear operator A : H → H specified by requiring that Aen = (1/n) en for all n; this is Hilbert-Schmidt because ∑_n ‖Aen‖^2 < ∞ and is clearly injective. Hence, by Property (3) discussed in the context of (2.7),

|f| def= ‖Af‖ = [ ∑_n (1/n^2) 〈en, f〉^2 ]^{1/2} for all f ∈ H

specifies a measurable norm on H. Then

|〈en, f〉| ≤ n|f|,

from which we see that the linear functional 〈en, ·〉 on H is continuous with respect to the norm | · |. Hence, by definition of the linear isometry I : H → L^2(B, µ) given in (2.8), I(en) is the element in B∗ that agrees with 〈en, ·〉 on H. In particular each I(en) is continuous and hence K̃p = I(Kp) is a continuous linear functional on B for every p ∈ D. QED
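A finite truncation of this construction can be sketched numerically (an illustration only, working in the span of the first N kernel functions, with hypothetical points and the Gaussian kernel): representing f = ∑_i a_i Kpi by its coefficient vector, the Gram-Schmidt coefficients 〈en, f〉 come from a Cholesky factor of the Gram matrix, and the inequality |〈en, f〉| ≤ n|f| used above can be checked directly.

    import numpy as np

    rng = np.random.default_rng(9)
    N = 6
    pts = np.sort(rng.uniform(0, 1, size=N))[:, None]        # stand-ins for the dense points p_n
    G = np.exp(-((pts - pts.T) ** 2) / 2.0)                   # Gram matrix <K_{p_i}, K_{p_j}>

    L = np.linalg.cholesky(G)                                 # G = L L^T
    # If f = sum_i a_i K_{p_i}, then <e_n, f> = (L^T a)_n for the Gram-Schmidt basis e_n.
    def weighted_norm(a):
        proj = L.T @ a                                        # the coefficients <e_n, f>
        w = 1.0 / np.arange(1, N + 1)                         # A e_n = (1/n) e_n
        return np.sqrt(np.sum((w * proj) ** 2))               # |f| = ||A f||

    a = rng.normal(size=N)                                    # a random f in the span
    proj = L.T @ a
    for n in range(1, N + 1):
        # |<e_n, f>| <= n |f|, the inequality giving continuity of <e_n, .> and hence of <K_p, .>.
        print(abs(proj[n - 1]) <= n * weighted_norm(a) + 1e-12)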

The measurable norm | · | we have constructed in the preceding proof arises from a (new) inner-product on H. However, given any other measurable norm | · |0 on H the sum

| · |′ def= | · |+ | · |0


is also a measurable norm (not necessarily arising from an inner-product) and the linear functional 〈Kp, ·〉 : H → R is continuous with respect to the norm | · |′ for every p ∈ D.

6.2. Elements of B as functions. If a Banach space B is obtained by completing a Hilbert space H of functions, the elements of B need not consist of functions. However, when H is a reproducing kernel Hilbert space as we have been discussing and under reasonable conditions on the reproducing kernel function K it is true that elements of B can ‘almost’ be thought of as functions on X. For this we first develop a lemma:

Lemma 6.1. Suppose H is a separable real Hilbert space and B the Banach space obtained by completing H with respect to a measurable norm | · |. Let B0 be a closed subspace of B that is transverse to H in the sense that H ∩ B0 = {0}, and let

B1 = B/B0 = {b + B0 : b ∈ B}    (6.1)

be the quotient Banach space, with the standard quotient norm, given by

|b + B0|1 def= inf{|x| : x ∈ b + B0}.    (6.2)

Then the mapping

ι1 : H → B/B0 : h 7→ ι(h) +B0, (6.3)

where ι : H → B is the inclusion, is a continuous linear injective map, and

|h|1 def= |ι1(h)| for all h ∈ H,    (6.4)

specifies a measurable norm on H. The image of H under ι1 is a dense subspace of B1.

Proof. Let us first note that by definition of the quotient norm

|b+B0|1 ≤ |b| for all b ∈ B.

Hence

|h|1 ≤ |h| for all h ∈ H ⊂ B.

Let ε > 0. Then since | · | is a measurable norm on H there is a finite-dimensional subspace Fε ⊂ H such that if F is any finite-dimensional subspace of H orthogonal to Fε then, as noted back in (2.5),

γF{h ∈ F : |h| > ε} < ε,

where γF is standard Gaussian measure on F. Then

γF{h ∈ F : |h|1 > ε} ≤ γF{h ∈ F : |h| > ε} < ε,    (6.5)


where the first inequality holds because whenever |h|1 > ε we also have |h| ≥ |h|1 > ε. Thus, | · |1 is a measurable norm on H. The image ι1(H) is the same as the projection of the dense subspace H ⊂ B onto the quotient space B1 = B/B0 and hence this image is dense in B1 (an open set in the complement of ι1(H) would have inverse image in B that is in the complement of H, and would have to be empty because H is dense in B). QED

We can now establish the identification of B as a function space.

Proposition 6.2. Let K : X × X → R be a continuous function, symmetric and non-negative definite, where X is a separable topological space, and D a countable dense subset of X. Let H be the corresponding reproducing kernel Hilbert space. Then there is a measurable norm | · |1 on H such that the Banach space B1 obtained by completing H with respect to | · |1 can be realized as a space of functions on the set D.

Proof. Let B be the completion of H with respect to a measurable norm | · | of the type given in Proposition 6.1. Thus 〈Kp, ·〉 : H → R is continuous with respect to | · | when p ∈ D; let

K̃p : B → R

be the continuous linear extension of 〈Kp, ·〉 to the Banach space B, for p ∈ D. Now let

B0 def= ⋂_{p∈D} ker K̃p,    (6.6)

a closed subspace of B. We observe that B0 is transverse to H; for if x ∈ B0 is in H then 〈Kp, x〉 = 0 for all p ∈ D and so x = 0, since {Kp : p ∈ D} spans a dense subspace of H as noted in Theorem 2.1 and the remark following it. Then by Lemma 6.1, B1 = B/B0 is a Banach space that is a completion of H in the sense that ι1 : H → B1 : h 7→ h + B0 is continuous linear with dense image and |h|1 def= |ι1(h)|1, for h ∈ H, specifies a measurable norm on H. Let K1p be the linear functional on B1 induced by K̃p:

K1p(b + B0) def= K̃p(b) for all b ∈ B,    (6.7)

which is well-defined because the linear functional K̃p is 0 on B0. We note that

K1p(ι1(h)) = K1p(h + B0) = K̃p(h) = 〈Kp, h〉

for all h ∈ H, and so K1p is the continuous linear ‘extension’ of 〈Kp, ·〉 to B1 through ι1 : H → B1, viewed as an ‘inclusion’ map.


Now to each b1 ∈ B1 associate the function fb1 on D given by:

fb1 : D → R : p 7→ K1p(b1).

We will show that the mapping

b1 7→ fb1

is injective; thus it realizes B1 as a set of functions on the set D. To this end, suppose that

fb+B0 = fc+B0

for some b, c ∈ B. This means

K1p(b + B0) = K1p(c + B0) for all p ∈ D,

and so K̃p(b) = K̃p(c) for all p ∈ D. Then

b − c ∈ ker K̃p for all p ∈ D.

Thus b − c ∈ B0 and so b + B0 = c + B0. Thus we have shown that b1 7→ fb1 is injective. QED

We have defined the function fb on the set D ⊂ X, with notation and hypotheses as above. Now taking a general point p ∈ X and a sequence of points pn ∈ D converging to p, the function K̃p on B is the L^2(µ)-limit of the sequence of functions K̃pn. Thus we can define fb(p) = K̃p(b), with the understanding that for a given p ∈ X, the value fb(p) is µ-almost-surely defined in terms of its dependence on b. In the theory of Gaussian random fields one has conditions on the covariance function K that ensure that p 7→ K̃p(b) is continuous in p for µ-a.e. b ∈ B, and in this case the function fb extends uniquely to a continuous function on X, for µ-almost-every b ∈ B.

Acknowledgments. This work is part of a research project covered by NSA grant H98230-13-1-0210. I. Holmes would like to express her gratitude to the Louisiana State University Graduate School for awarding her the LSU Graduate School Dissertation Year Fellowship, which made most of her contribution to this work possible. We thank Kalyan B. Sinha for useful discussions.

References

[1] Aleksandr Y. Aravkin, Bradley M. Bell, James V. Burke, and Gianluigi Pillonetto, The connection between Bayesian estimation of a Gaussian random field and RKHS, Submitted to IEEE Transactions on Neural Networks and Learning Systems, available at http://arxiv.org/pdf/1301.5288v3.pdf.


[2] Jeremy J. Becnel, The support theorem for the Gauss-Radon transform, Infin. Dimens. Anal. Quantum Probab. Relat. Top. 15 (2012), no. 2, 1250006, 21, DOI 10.1142/S0219025712500063. MR2957133
[3] Jeremy J. Becnel and Ambar N. Sengupta, A support theorem for a Gaussian Radon transform in infinite dimensions, Trans. Amer. Math. Soc. 364 (2012), no. 3, 1281–1291, DOI 10.1090/S0002-9947-2011-05365-1, available at https://www.math.lsu.edu/~sengupta/papers/RGBecSenDec09.pdf. MR2869177
[4] Vladimir I. Bogachev, Gaussian measures, Mathematical Surveys and Monographs, vol. 62, American Mathematical Society, Providence, RI, 1998. MR1642391 (2000a:60004)
[5] Carl de Boor and Robert E. Lynch, On splines and their minimum properties, J. Math. Mech. 15 (1966), 953–969. MR0203306 (34 #3159)
[6] Ricky Der and Daniel Lee, Large-Margin Classification in Banach Spaces, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), 2007, pp. 91-98, available at http://jmlr.csail.mit.edu/proceedings/papers/v2/der07a/der07a.pdf.
[7] Nate Eldredge, Math 7770: Analysis and Probability on Infinite-Dimensional Spaces, Online class notes (2012), available at http://www.math.cornell.edu/~neldredge/7770/7770-lecture-notes.pdf.
[8] Leonard Gross, Abstract Wiener spaces, Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. II: Contributions to Probability Theory, Part 1, Univ. California Press, Berkeley, Calif., 1967, pp. 31–42. MR0212152 (35 #3027)
[9] Sigurdur Helgason, The Radon transform, 2nd ed., Progress in Mathematics, vol. 5, Birkhäuser Boston Inc., Boston, MA, 1999. MR1723736 (2000m:44003)
[10] Irina Holmes and Ambar N. Sengupta, A Gaussian Radon transform for Banach spaces, J. Funct. Anal. 263 (2012), no. 11, 3689–3706, DOI 10.1016/j.jfa.2012.09.005, available at http://arxiv.org/pdf/1208.5743v2.pdf. MR2984081
[11] Irina Holmes, An inversion formula for the Gaussian Radon transform for Banach spaces, Infin. Dimens. Anal. Quantum Probab. Relat. Top. 16 (2013), no. 4, available at http://arxiv.org/pdf/1308.1392.pdf.
[12] Ferenc Huszar and Simon Lacoste-Julien, A Kernel Approach to Tractable Bayesian Nonparametrics, arXiv preprint (2011), available at http://arxiv.org/pdf/1103.1761v3.pdf.
[13] George S. Kimeldorf and Grace Wahba, A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, Ann. Math. Statist. 41 (1970), 495–502, available at http://www.stat.wisc.edu/~wahba/ftp1/oldie/kw70bayes.pdf. MR0254999 (40 #8206)
[14] George S. Kimeldorf and Grace Wahba, Some Results on Tchebycheffian Spline Functions, Journal of Mathematical Analysis and Applications 33 (1971), no. 1, 82–95, available at http://www.stat.wisc.edu/~wahba/ftp1/oldie/kw71.pdf.
[15] Hui Hsiung Kuo, Gaussian measures in Banach spaces, Lecture Notes in Mathematics, Vol. 463, Springer-Verlag, Berlin, 1975. MR0461643 (57 #1628)
[16] Vochita Mihai and Ambar N. Sengupta, The Radon-Gauss transform, Soochow J. Math. 33 (2007), no. 3, 415–433. MR2344371 (2009a:44006)


[17] Umut Ozertem and Deniz Erdogmus, RKHS Bayes Discriminant: A Subspace Constrained Nonlinear Feature Projection for Signal Detection, Neural Networks, IEEE Transactions on 20 (2009), no. 7, 1195-1203, DOI 10.1109/TNN.2009.2021473.
[18] Natesh S. Pillai, Qiang Wu, Feng Liang, Sayan Mukherjee, and Robert L. Wolpert, Characterizing the function space for Bayesian kernel models, J. Mach. Learn. Res. 8 (2007), 1769–1797, available at http://jmlr.org/papers/volume8/pillai07a/pillai07a.pdf. MR2332448
[19] Johann Radon, Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten, Computed tomography (Cincinnati, Ohio, 1982), Proc. Sympos. Appl. Math., vol. 27, Amer. Math. Soc., Providence, R.I., 1982, pp. 71–86 (German). MR692055 (84f:01040)
[20] Peter Sollich, Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities, Mach. Learn. 46 (2002), no. 1-3, 21–52, DOI 10.1023/A:1012489924661, available at http://dx.doi.org/10.1023/A:1012489924661.
[21] Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet, Learning in Hilbert vs. Banach Spaces: A Measure Embedding Viewpoint, NIPS, 2011, pp. 1773-1781, available at http://cosmal.ucsd.edu/~gert/papers/nips_rkbs_11.pdf.
[22] Ingo Steinwart and Andreas Christmann, Support vector machines, Information Science and Statistics, Springer, New York, 2008. MR2450103 (2010f:62002)
[23] J. von Neumann, Über adjungierte Funktionaloperatoren, Ann. of Math. (2) 33 (1932), no. 2, 294–310, DOI 10.2307/1968331 (German). MR1503053
[24] Haizhang Zhang, Yuesheng Xu, and Jun Zhang, Reproducing kernel Banach spaces for machine learning, J. Mach. Learn. Res. 10 (2009), 2741–2775, DOI 10.1109/IJCNN.2009.5179093, available at http://machinelearning.wustl.edu/mlpapers/paper_files/jmlr10_zhang09b.pdf. MR2579912 (2011c:62219)
[25] Zhihua Zhang, Guang Dai, Donghui Wang, and Michael I. Jordan, Bayesian Generalized Kernel Models, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10), 2010, pp. 972-979, available at http://www.cs.berkeley.edu/~jordan/papers/zhang-etal-aistats10.pdf.

Department of Mathematics, Louisiana State University, Baton Rouge, LA 70803, e-mail: [email protected]

Department of Mathematics, Louisiana State University, Baton Rouge, LA 70803, e-mail: [email protected]