Interpolation RBF Regularized RBF Generalized RBF XOR problem
Optimization Methods for Machine Learning
Radial Basis function
Laura Palagi
http://www.dis.uniroma1.it/∼palagi
Dipartimento di Ingegneria informatica automatica e gestionale A. Ruberti
Sapienza Università di Roma
Via Ariosto 25
RBF Networks L. Palagi
Interpolation problem
Given P distinct points in R^n:
X = {x^i ∈ R^n, i = 1, . . . , P},
and a corresponding set of real numbers
Y = {y^i ∈ R, i = 1, . . . , P}.
The interpolation problem consists in finding a function f : R^n → R, in a given class of real functions F, which satisfies:
f(x^i) = y^i, i = 1, . . . , P. (1)
Interpolation properties
For n = 1 the interpolation problem can be solved explicitly using polynomials
f(t) = ∑_{i=0}^{P−1} c_i t^i
For n > 1, the 2-layer MLP with g not polynomial satisfies
∑_{j=1}^{P} v_j g(w_j^T x^i − b_j) = y^i, i = 1, . . . , P
for some w_j ∈ R^n, and v_j, b_j ∈ R.
MLP can approximate arbitrarily well a continuous function, provided that an arbitrarily large number of units is available.
Interpolation properties
Being a universal approximator may not be enough from a theoretical point of view. An important property is the
existence of a best approximation
Informally: given a function f belonging to some set of functions F and given a subset A of F, find an element of A which is closest to f. If d(f, g) is the distance between two elements f, g in F, we consider the problem
d*_A = inf_{a∈A} d(f, a)
If there exists a* ∈ A that attains the infimum, namely d*_A = d(f, a*), then a* is the best approximation to f from A.
Best approximation properties
MLP does not have the best approximation property.
Consider another approximation scheme based on Radial Basis functions (RBF)
φ(‖x − x^j‖), j = 1, . . . , P.
φ : R+ → R is a suitable continuous function, called a radial basis function since it is assumed that the argument of φ is the radius r = ‖x − x^j‖.
Gaussian
φ(r) = e^{−(r/σ)²}
with r > 0
Multiquadric
φ(r) = (r² + σ²)^{1/2}
Inverse Multiquadric
φ(r) = (r² + σ²)^{−1/2}
Other RBF
φ(r) = r, linear spline
φ(r) = r³, cubic spline
φ(r) = r² log r, thin plate spline.
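As an aside (not part of the slides), the basis functions above are straightforward to code; a minimal sketch in Python with NumPy, where `sigma` plays the role of σ:

```python
import numpy as np

sigma = 1.0  # width parameter of the basis function

def gaussian(r):
    return np.exp(-(r / sigma) ** 2)

def multiquadric(r):
    return np.sqrt(r ** 2 + sigma ** 2)

def inverse_multiquadric(r):
    return 1.0 / np.sqrt(r ** 2 + sigma ** 2)

def linear_spline(r):
    return r

def cubic_spline(r):
    return r ** 3

def thin_plate_spline(r):
    # r^2 log r, extended by continuity with value 0 at r = 0
    return np.where(r > 0, r ** 2 * np.log(np.maximum(r, 1e-300)), 0.0)
```

Each function maps the radius r = ‖x − x^j‖ ≥ 0 to a real value; only the Gaussian and the inverse multiquadric decay as r grows.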
Interpolation by RBF
Given P distinct points in R^n:
X = {x^i ∈ R^n, i = 1, . . . , P},
consider functions of the form
f(x) = ∑_{j=1}^{P} w_j φ(‖x − x^j‖), (2)
where the data points x^j ∈ X are the so-called centers and the coefficients w_j ∈ R are the weights.
Interpolation by RBF
By imposing the interpolation conditions we get:
∑_{j=1}^{P} w_j φ(‖x^i − x^j‖) = y^i, i = 1, . . . , P. (3)
This is a linear system of P equations in P unknowns. Define the vectors w = (w_1 · · · w_P)^T and y = (y^1 · · · y^P)^T, and the symmetric P × P matrix Φ with elements
Φ_{i,j} = φ(‖x^i − x^j‖), 1 ≤ i, j ≤ P.
System (3) can be written as:
Φw = y.
Matrix Φ is nonsingular, provided that P ≥ 2, that the interpolation points x^j, j = 1, . . . , P are distinct, and using
• the Gaussian (Φ positive definite)
• the multiquadric
• the inverse multiquadric (Φ positive definite)
• the linear spline
Thus, the interpolation problem Φw = y admits a unique solution. When φ is positive definite, it can be computed by minimizing the (strictly) convex quadratic function in R^P
F(w) = (1/2) w^T Φw − y^T w,
whose gradient is given by ∇F(w) = Φw − y.
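A minimal numerical sketch of this construction (assuming the Gaussian basis and NumPy; the data here is synthetic and illustrative): build Φ from the pairwise distances, solve Φw = y, and check that the solution interpolates the data and zeroes the gradient ∇F(w) = Φw − y.

```python
import numpy as np

rng = np.random.default_rng(0)
P, n, sigma = 8, 2, 1.0

X = rng.standard_normal((P, n))   # distinct interpolation points x^i
y = rng.standard_normal(P)        # target values y^i

# Phi_ij = phi(||x^i - x^j||) with the Gaussian phi(r) = exp(-(r/sigma)^2)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Phi = np.exp(-(D / sigma) ** 2)

w = np.linalg.solve(Phi, y)       # unique solution: Phi is positive definite

f = Phi @ w                       # f(x^i) = sum_j w_j phi(||x^i - x^j||)
grad = Phi @ w - y                # gradient of F(w) = 1/2 w^T Phi w - y^T w
assert np.allclose(f, y, atol=1e-6) and np.allclose(grad, 0.0, atol=1e-6)
```

Minimizing F(w) and solving the linear system are equivalent here, since ∇F(w) = 0 is exactly Φw = y.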
From Interpolation to approximation properties
Because of the remarkable properties of the RBFs, the RBF method is one of the most often applied approaches in multivariable interpolation.
This has motivated the attempt of employing RBFs also within approximation algorithms for the solution of classification and regression problems in data mining.
Regularized RBF neural networks
Suppose that the set {(x^p, y^p), p = 1, . . . , P} of data has been obtained by random sampling of a function belonging to some space of functions X in the presence of noise.
The problem of recovering the function, or an estimate of it, from the set of data is clearly ill-posed, since it has an infinite number of solutions.
In order to choose one particular solution we need some a priori knowledge of the function that has to be reconstructed.
The most common form of a priori knowledge consists in assuming that the function is smooth, in the sense that two similar inputs correspond to two similar outputs.
Regularized RBF neural networks
The solution can be obtained from a variational principle which contains both the data and smoothness information.
Smoothness is a measure of the "oscillatory" behavior of f. Within a class of differentiable functions, one function is said to be smoother than another one if it oscillates less. A smoothness functional E_2(f) is defined and we consider
min_f E(f) = E_1(f) + λE_2(f) = (1/2) ∑_{i=1}^{P} [y^i − f(x^i)]² + λE_2(f),
where the first term enforces closeness to the data and the second smoothness, while the regularization parameter λ > 0 controls the tradeoff between these two terms.
Regularized RBF neural networks
It can be shown that for a wide class of smoothness functionals E_2(f), the solutions of the minimization all have the same form
f(x) = ∑_{i=1}^{P} w_i φ(‖x − c^i‖).
The centers coincide with the inputs,
c^i = x^i, i = 1, . . . , P,
and the weights solve the regularized system
(Φ + λI)w = y
where
Φ = {Φ_{ij}}_{i,j=1,...,P} = {φ(‖x^i − x^j‖)}_{i,j=1,...,P}
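A sketch of the regularized fit (illustrative data and names; Gaussian basis assumed): with centers c^i = x^i, the weights come from the linear system (Φ + λI)w = y, and for λ > 0 the fit trades exact interpolation for smoothness.

```python
import numpy as np

rng = np.random.default_rng(1)
P, sigma, lam = 30, 0.5, 1e-2

X = np.sort(rng.uniform(-1, 1, P))                        # 1-D inputs x^i
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(P)  # noisy samples

D = np.abs(X[:, None] - X[None, :])       # pairwise distances |x^i - x^j|
Phi = np.exp(-(D / sigma) ** 2)           # Gaussian RBF, centers c^i = x^i

# regularized weights: (Phi + lambda I) w = y
w = np.linalg.solve(Phi + lam * np.eye(P), y)

# the fit no longer interpolates exactly; lambda controls the tradeoff
residual = Phi @ w - y
```

Cross-validating λ amounts to re-solving this system for each candidate value, as noted on the next slide.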
2-layer Regularized RBF network
[Figure: the input x feeds P hidden units φ(‖x − x^1‖), . . . , φ(‖x − x^P‖); their outputs, weighted by w_1, . . . , w_P, are summed to produce the output y(x).]
2-layer Regularized RBF network
• RBF are universal approximators: any continuous function can be approximated arbitrarily well on a compact set, provided a sufficiently large number of units and an appropriate choice of the parameters
• RBF possess the best approximation property, namely the best approximation exists and in most cases (under assumptions often satisfied) is unique (RBF is linear in the parameters w)
• The value of λ can be selected by employing cross-validation techniques, and this may require that the system (Φ + λI)w = y be solved several times.
2-layer Generalized RBF network
When P is very large, the cost of constructing a regularized RBF network can be prohibitive. Indeed, the computation of the weights w ∈ R^P requires the solution of a possibly ill-conditioned linear system, which costs O(P³).
Generalized RBF neural networks are used, where the number N of neural units is much less than P.
The output of the network can be defined by
y(x) = ∑_{j=1}^{N} w_j φ_j(‖x − c^j‖), (4)
where both the centers c^j ∈ R^n and the weights w_j, j = 1, . . . , N, must be selected appropriately.
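The generalized network with fixed centers can be sketched as follows (illustrative: the centers here are simply a random subset of the inputs, one of several common choices). Once the c^j are fixed, the weights solve a linear least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(2)
P, N, n, sigma = 200, 10, 2, 1.0   # N << P hidden units

X = rng.standard_normal((P, n))            # training inputs x^p
y = np.sin(X[:, 0]) * np.cos(X[:, 1])      # synthetic targets y^p

# pick N centers c^j (here: a random subset of the inputs)
C = X[rng.choice(P, size=N, replace=False)]

# design matrix Phi_pj = phi(||x^p - c^j||), Gaussian phi
D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
Phi = np.exp(-(D / sigma) ** 2)

# with centers fixed, the problem is linear in w: least-squares fit
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

y_hat = Phi @ w                    # network output y(x^p) on the data
```

Note the cost: the P × N least-squares problem replaces the P × P system of the regularized network.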
2-layer Generalized RBF network
[Figure: the same architecture as before, but with N hidden units φ(‖x − c^1‖), . . . , φ(‖x − c^N‖) centered at c^1, . . . , c^N; their outputs, weighted by w_1, . . . , w_N, are summed to produce the output y(x).]
2-layer Generalized RBF network
• GRBF are universal approximators: any continuous function can be approximated arbitrarily well on a compact set, provided a sufficiently large number of units and an appropriate choice of the parameters
• GRBF may NOT possess the best approximation property. However, if the centers are fixed, the approximation problem becomes linear with respect to w and the existence of a best approximation is guaranteed
• in the general case, both the centers and the weights are treated as variable parameters and the approximation is nonlinear
• as N ≪ P, GRBF inherently performs a structural stabilization which may prevent the occurrence of overtraining.
An example: Exclusive OR
The logical function XOR
p   x_1   x_2   y^p
1   −1    −1    −1
2   −1     1     1
3    1    −1     1
4    1     1    −1

[Figure: the four points in the (x_1, x_2) plane; points 2 and 3 (y = 1) lie on one diagonal and points 1 and 4 (y = −1) on the other, so no single line separates the two classes.]
Perceptron (linear separator) doesn’t work
Two layer MLP
[Figure: a 2-layer MLP with inputs x_1, x_2 feeding two hidden sign(·) units through weights w_11, w_12, w_21, w_22 and biases b_1, b_2; the hidden outputs a_1, a_2 feed an output sign(·) unit through weights v_1, v_2 and bias b_3, producing y(x).]
Two layer MLP
Choose w_11 = w_22 = 1 and w_12 = w_21 = −1, b_1 = b_2 = −1, v_1 = v_2 = 1, b_3 = 0.1 (output bias). We get a network that correctly classifies all four XOR points.
Two layer MLP
This MLP network with two hidden nodes realizes a nonlinear separation (each hidden node describes one of the two lines). The output node combines the outputs of the two hidden units.
[Figure: the two separating lines in the (x_1, x_2) plane, one per hidden unit; together they isolate points 2 and 3 from points 1 and 4.]
RBF network
Consider an RBF network with two units (N = 2) with centers c^1, c^2, and assume the activation function is a Gaussian g_j = e^{−(‖x − c^j‖/σ)²}.
[Figure: inputs x_1, x_2 feed the two Gaussian units z_1 = e^{−‖x−c^1‖²/σ²} and z_2 = e^{−‖x−c^2‖²/σ²}; their outputs, weighted by w_1, w_2 and shifted by the bias b, enter a sign(·) output unit producing y(x).]
RBF network
Choose σ = √2 and c^1 = (1, 1)^T, c^2 = (−1, −1)^T. We transform the problem into a linearly separable form.

p   z_1 = e^{−‖x−c^1‖²/σ²}   z_2 = e^{−‖x−c^2‖²/σ²}   y^p
1   e^{−4}   1         −1
2   e^{−2}   e^{−2}     1
3   e^{−2}   e^{−2}     1
4   1        e^{−4}    −1
[Figure: the four points in the (z_1, z_2) plane; points 2 and 3 coincide at (e^{−2}, e^{−2}), and a single line now separates them from points 1 and 4.]
The output takes the form
y(x) = w_1 e^{−‖x−c^1‖²/σ²} + w_2 e^{−‖x−c^2‖²/σ²} + b.
Minimizing the training error
min_{w,b} ∑_{p=1}^{4} (y(x^p) − y^p)²
we get the optimal solution (w*, b*) that gives E = 0.
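The whole XOR construction can be checked numerically (a sketch with NumPy; a least-squares fit stands in for minimizing the training error):

```python
import numpy as np

# XOR data: inputs x^p and targets y^p in {-1, +1}
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])

# two Gaussian units with centers c^1 = (1,1), c^2 = (-1,-1), sigma = sqrt(2)
C = np.array([[1.0, 1.0], [-1.0, -1.0]])
sigma2 = 2.0  # sigma^2
Z = np.exp(-np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2) / sigma2)

# fit w_1, w_2, b by least squares on y(x) = w_1 z_1 + w_2 z_2 + b
A = np.hstack([Z, np.ones((4, 1))])
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]

# in feature space the four points are linearly separable
assert np.all(np.sign(A @ np.array([w1, w2, b])) == y)
```

Since points 2 and 3 map to the same feature vector (e^{−2}, e^{−2}), the system has an exact solution and the training error E is zero.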