
Recent Advances in Radial Basis Function Networks

Mark J. L. Orr

Institute for Adaptive and Neural Computation
Division of Informatics, Edinburgh University
Edinburgh, Scotland, UK

June 1999

Abstract

In 1996 an Introduction to Radial Basis Function Networks was published on the web (1), along with a package of Matlab functions (2). The emphasis was on the linear character of RBF networks and two techniques borrowed from statistics: forward selection and ridge regression.

This document (3) is an update on developments between 1996 and 1999 and is associated with a second version of the Matlab package (4). Improvements have been made to the forward selection and ridge regression methods, and a new method, which is a cross between regression trees and RBF networks, has been developed.

mjo@anc.ed.ac.uk
(1) www.anc.ed.ac.uk/~mjo/papers/intro.ps
(2) www.anc.ed.ac.uk/~mjo/software/rbf.zip
(3) www.anc.ed.ac.uk/~mjo/papers/recad.ps
(4) www.anc.ed.ac.uk/~mjo/software/rbf2.zip


Contents

1 Introduction
  1.1 MacKay's Hermite Polynomial
  1.2 Friedman's Simulated Circuit

2 Maximum Marginal Likelihood
  2.1 Introduction
  2.2 Review
  2.3 The EM Algorithm
  2.4 The DM Algorithm
  2.5 Conclusions

3 Optimising the Size of RBFs
  3.1 Introduction
  3.2 Review
  3.3 Efficient Re-estimation of λ
  3.4 Avoiding Local Minima
  3.5 The Optimal RBF Size
  3.6 Trial Values in Other Contexts
  3.7 Conclusions

4 Regression Trees and RBF Networks
  4.1 Introduction
  4.2 The Basic Idea
  4.3 Generating the Regression Tree
  4.4 From Hyperrectangles to RBFs
  4.5 Selecting the Subset of RBFs
  4.6 The Best Parameter Values
  4.7 Demonstrations
  4.8 Conclusions

5 Appendix
  A Applying the EM Algorithm
  B The Eigensystem of HHᵀ


1 Introduction

In 1996 an introduction to radial basis function (RBF) networks was published on the web [16] along with an associated Matlab software package [17]. The approach taken stressed the linear character of RBF networks, which traditionally have only a single hidden layer, and borrowed techniques from statistics, such as forward selection and ridge regression, as strategies for controlling model complexity, the main challenge facing all methods of nonparametric regression.

That was three years ago. Since then, some improvements have been made, a new algorithm has been devised, and the package of Matlab functions is now in its second version [19]. This document describes the theory of the new developments and will be of interest to practitioners using the new software package and to theorists enhancing existing methods or developing new ones.

Section 2 describes what happens when the expectation-maximisation algorithm is applied to RBF networks. Section 3 describes a simple procedure for optimising the RBF widths, particularly for ridge regression. Finally, section 4 describes the new algorithm, which uses a regression tree to generate the centres and sizes of a set of candidate RBFs and to help select a subset of these for the network. Two simulated data sets, used for demonstration, are described below.

1.1 MacKay's Hermite Polynomial

The first data set is from [10] and is based on a one-dimensional Hermite polynomial,

    y = 1 + (1 − x + 2x²) e^(−x²).

Input values are sampled randomly in the range −4 ≤ x ≤ 4 and Gaussian noise is added to the outputs (figure 1.1).
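To make the setup concrete, here is a small Python/NumPy sketch that generates data of this kind. The report's own software is in Matlab; the sample size and noise level below are illustrative assumptions, not necessarily the values used in the original experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def hermite(x):
    # MacKay's Hermite polynomial: y = 1 + (1 - x + 2x^2) * exp(-x^2)
    return 1.0 + (1.0 - x + 2.0 * x**2) * np.exp(-x**2)

# Placeholder sample size and noise level (assumptions for illustration only).
p, noise_std = 40, 0.2
x = rng.uniform(-4.0, 4.0, size=p)                    # inputs in -4 <= x <= 4
y = hermite(x) + rng.normal(0.0, noise_std, size=p)   # noisy targets
```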

Figure 1.1: Sample Hermite data (stars) and the actual function (curve).


1.2 Friedman's Simulated Circuit

This second data set simulates an alternating current circuit with four parameters: resistance (R ohms), angular frequency (ω radians per second), inductance (L henries) and capacitance (C farads), in the ranges

    0 ≤ R ≤ 100,
    40π ≤ ω ≤ 560π,
    0 ≤ L ≤ 1,
    1 × 10⁻⁶ ≤ C ≤ 11 × 10⁻⁶.

Random samples of the four parameters in these ranges were used to generate corresponding values of the impedance,

    Z = √( R² + (ωL − 1/(ωC))² ),

to which Gaussian noise was added. This resulted in a training set with four-dimensional inputs x = (R ω L C)ᵀ and a scalar output y = Z. The problem originates from [6]. Before applying any learning algorithms to this data, the original inputs, with their very different dynamic ranges, are rescaled to the range [0, 1] in each component.
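A corresponding Python sketch follows. The parameter ranges are taken from the standard formulation of this benchmark; the sample size and noise level are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def friedman_circuit(n, noise_std):
    """Simulated AC circuit (Friedman's benchmark); n and noise_std are
    placeholders, not necessarily the values used in the report."""
    R = rng.uniform(0.0, 100.0, n)                  # resistance (ohms)
    w = rng.uniform(40 * np.pi, 560 * np.pi, n)     # angular frequency (rad/s)
    L = rng.uniform(0.0, 1.0, n)                    # inductance (henries)
    C = rng.uniform(1e-6, 11e-6, n)                 # capacitance (farads)
    Z = np.sqrt(R**2 + (w * L - 1.0 / (w * C))**2)  # impedance
    X = np.column_stack([R, w, L, C])
    # rescale each input component to [0, 1] before training
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return X, Z + rng.normal(0.0, noise_std, n)

X, y = friedman_circuit(n=200, noise_std=175.0)     # assumed values
```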


2 Maximum Marginal Likelihood

2.1 Introduction

The expectation-maximisation (EM) algorithm [5, 13] performs maximum likelihood estimation for problems in which some of the variables are unobserved. Recently it has been successfully applied to density estimation [2] and probabilistic principal components [22, 23], for example. This section discusses the application of EM to RBF networks.

First we review the probability model of a linear neural network and arrive at an expression for the marginal likelihood of the data. It is this likelihood which we ultimately want to maximise. Then we show the results of applying the EM algorithm: a pair of re-estimation formulae for the model parameters. However, it turns out that a similar pair of re-estimation formulae can be derived by a simpler method, and also that they converge more rapidly than the EM versions. Finally, we draw some conclusions.

2.2 Review

The model estimated by a linear neural network from noisy samples {(x_i, y_i)}_{i=1}^{p} can be written

    f(x) = Σ_{j=1}^{m} w_j h_j(x),    (2.1)

where the {h_j}_{j=1}^{m} are fixed basis functions and the {w_j}_{j=1}^{m} are unknown weights (to be estimated). The vector of residual errors between model and data is

    e = y − H w,

where H is the design matrix and has elements H_ij = h_j(x_i). In a Bayesian approach to analysing the estimation process, the a priori probability of the weights w can be modelled as a Gaussian of variance ς²,

    p(w) ∝ ς⁻ᵐ exp( −wᵀw / (2 ς²) ).    (2.2)

The conditional probability of the data y given the weights w can also be modelled as a Gaussian, with variance σ², to account for the noise included in the outputs of the training set, {y_i}_{i=1}^{p},

    p(y|w) ∝ σ⁻ᵖ exp( −eᵀe / (2 σ²) ).    (2.3)

The joint probability of data and weights is the product of p(w) with p(y|w) and can be represented as an equivalent cost function by taking logarithms, multiplying by −2 and dropping constant terms to obtain

    E(y, w) = p ln σ² + m ln ς² + eᵀe / σ² + wᵀw / ς².    (2.4)


The conditional probability of the weights w given the data y is found using Bayes' rule, again involves the product of (2.2) with (2.3), and is another Gaussian,

    p(w|y) ∝ p(y|w) p(w)
           ∝ |W|^(−1/2) exp( −(w − ŵ)ᵀ W⁻¹ (w − ŵ) / 2 ),    (2.5)

where

    ŵ = A⁻¹ Hᵀ y,
    W = σ² A⁻¹,
    A = Hᵀ H + λ I_m,
    λ = σ² / ς².    (2.6)

Finally, the marginal likelihood of the data is

    p(y) = ∫ p(y|w) p(w) dw
         ∝ σ⁻ᵖ |P|^(1/2) exp( −yᵀ P y / (2 σ²) ),    (2.7)

where

    P = I_p − H A⁻¹ Hᵀ.

Note that there is an equivalent cost function for p(y) which is obtained by taking logarithms, multiplying by −2 and dropping the constant terms,

    E(y) = p ln σ² − ln|P| + yᵀ P y / σ².    (2.8)
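The quantities above are straightforward to evaluate numerically. The sketch below (Python/NumPy; the function and variable names are ours, not the package's) builds a Gaussian RBF design matrix and computes the posterior mean weights and the cost (2.8).

```python
import numpy as np

def design_matrix(X, centres, r):
    # Gaussian RBF design matrix, H[i, j] = h_j(x_i)
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / r**2)

def posterior_mean(y, H, lam):
    # w_hat = A^{-1} H' y with A = H'H + lam*I, eq. (2.6)
    m = H.shape[1]
    A = H.T @ H + lam * np.eye(m)
    return np.linalg.solve(A, H.T @ y)

def neg_log_marginal(y, H, sigma2, varsigma2):
    """Cost E(y) = p ln(sigma^2) - ln|P| + y'Py/sigma^2, eq. (2.8),
    with P = I_p - H A^{-1} H' and lam = sigma^2 / varsigma^2."""
    p, m = H.shape
    lam = sigma2 / varsigma2
    A = H.T @ H + lam * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)
    _, logdetP = np.linalg.slogdet(P)
    return p * np.log(sigma2) - logdetP + y @ P @ y / sigma2
```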

2.3 The EM Algorithm

The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an expectation (E) step, which finds the distribution for the unobserved variables, and a maximisation (M) step, which re-estimates the parameters of the model to be those with the maximum likelihood for the observed and missing data combined.

In the context of a linear neural network it is possible to consider the training set {(x_i, y_i)}_{i=1}^{p} as the observed data, the weights {w_j}_{j=1}^{m} as the missing data, and the variance of the noise, σ², and the a priori variance of the weights, ς², as the model parameters.

In the E-step, the expectation of the conditional probability of the missing data (2.5) is taken and substituted, in the M-step, into the joint probability of the combined data, or its equivalent cost function (2.4), which is then optimised with respect to the model parameters σ² and ς². These two steps are guaranteed to increase the marginal probability of the observed data and, when iterated, converge to a local maximum.


Detailed analysis (see appendix A) results in a pair of re-estimation formulae for the parameters σ² and ς²:

    σ² ← ( êᵀê + γ σ² ) / p,    (2.9)
    ς² ← ( ŵᵀŵ + (m − γ) ς² ) / m,    (2.10)

where

    ê = y − H ŵ,
    γ = m − λ tr A⁻¹.

Initial guesses are substituted into the right hand sides, which produce new guesses. The process is repeated until a local minimum of (2.8) is reached.
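A minimal sketch of this loop, assuming a design matrix H and targets y as in section 2.2 (the helper names and stopping rule are ours):

```python
import numpy as np

def em_reestimate(y, H, sigma2, varsigma2, n_iter=100, tol=1e-9):
    """EM updates (2.9)-(2.10) for the noise variance sigma^2 and the
    a priori weight variance varsigma2 (an illustrative sketch)."""
    p, m = H.shape
    for _ in range(n_iter):
        lam = sigma2 / varsigma2
        A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = A_inv @ (H.T @ y)
        gamma = m - lam * np.trace(A_inv)       # effective number of parameters
        e_hat = y - H @ w_hat
        new_sigma2 = (e_hat @ e_hat + gamma * sigma2) / p               # (2.9)
        new_varsigma2 = (w_hat @ w_hat + (m - gamma) * varsigma2) / m   # (2.10)
        if abs(new_sigma2 - sigma2) < tol and abs(new_varsigma2 - varsigma2) < tol:
            break
        sigma2, varsigma2 = new_sigma2, new_varsigma2
    return sigma2, varsigma2
```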

Note that equation (2.10) was derived in [11] by a free energy approach. It has been shown that free energy and the EM algorithm are intimately connected [13].

Figure 2.1 illustrates with the Hermite data described in section 1.1. A centre of fixed radius was created for each training set input. The figure plots logarithmic contours of (2.8) and the sequence of σ² and ς² values re-estimated by (2.9, 2.10).

Figure 2.1: Optimisation of σ² and ς² by EM (axes log(σ²) and log(ς²)).

2.4 The DM Algorithm

An alternative approach to minimising (2.8) is simply to differentiate it and set the results to zero. This is easily done and results in the pair of re-estimation formulae

    σ² ← êᵀê / (p − γ),    (2.11)
    ς² ← ŵᵀŵ / γ.    (2.12)


I call this method the "DM algorithm" after David MacKay, who first derived these equations [10]. Its disadvantage is the absence of any guarantee that the iterations converge, unlike their EM counterparts (2.9, 2.10), which are known to increase the marginal likelihood (or leave it the same if a fixed point has been reached). Any fixed point of DM is also a fixed point of EM, and vice versa, but if there are multiple fixed points there is no guarantee that both methods will converge to the same one, even when starting from the same guess.

Figure 2.2 plots the sequence of re-estimated values using (2.11, 2.12) for the same training set, RBF network and initial values of σ² and ς² used for figure 2.1. It is apparent that convergence is faster for DM than for EM in this example, with DM taking far fewer iterations. In fact, our empirical observation is that DM always converges considerably faster than EM if they start from the same guess and converge to the same local minimum. Furthermore, DM has never failed to converge.
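The DM updates drop into the same loop sketched in section 2.3; only the two re-estimation lines change. A self-contained sketch:

```python
import numpy as np

def dm_reestimate(y, H, sigma2, varsigma2, n_iter=100, tol=1e-9):
    """MacKay's re-estimation formulae (2.11)-(2.12); the intermediate
    quantities are the same as in the EM sketch above."""
    p, m = H.shape
    for _ in range(n_iter):
        lam = sigma2 / varsigma2
        A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = A_inv @ (H.T @ y)
        gamma = m - lam * np.trace(A_inv)
        e_hat = y - H @ w_hat
        new_sigma2 = e_hat @ e_hat / (p - gamma)    # (2.11)
        new_varsigma2 = w_hat @ w_hat / gamma       # (2.12)
        if abs(new_sigma2 - sigma2) < tol and abs(new_varsigma2 - varsigma2) < tol:
            break
        sigma2, varsigma2 = new_sigma2, new_varsigma2
    return sigma2, varsigma2
```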

Figure 2.2: Optimisation of σ² and ς² by DM (axes log(σ²) and log(ς²)).

2.5 Conclusions

We started by applying the EM algorithm to RBF networks using the weight-decay (ridge regression) style of penalised likelihood and ended with a pair of re-estimation formulae for the noise variance σ² and the prior weight variance ς². However, these turned out to be less efficient than a similar pair of formulae which had been known in the literature for some time.

The rbf_rr_2 method in the Matlab software package [19] has an option to use maximum marginal likelihood (MML) as the model selection criterion (instead of GCV or BIC, for example). When this option is selected the regularisation parameter (2.6) is re-estimated using, by default, the DM equations (2.11, 2.12). Another option can be set so that the EM versions (2.9, 2.10) are used instead.


3 Optimising the Size of RBFs

3.1 Introduction

In previous work [15, 16] we concentrated on methods for optimising the regularisation parameter, λ, of an RBF network. However, another key parameter is the size of the RBFs, and until now no methods have been provided for its optimisation. This section describes a simple scheme to find an overall scale size for the RBFs in a network.

We first review the basic concepts already covered elsewhere [16] and then describe an improved version of the re-estimation formula for the regularisation parameter which is considerably more efficient and allows multiple initial guesses for λ to be optimised in an effort to avoid getting trapped in local minima (the details are given in appendix B). We then describe a method for choosing the best overall size for the RBFs from a number of trial values, which is rendered tractable by the efficient optimisation of λ. We then make some concluding remarks.

3.2 Review

In a linear model with fixed basis functions {h_j}_{j=1}^{m} and weights {w_j}_{j=1}^{m},

    f(x) = Σ_{j=1}^{m} w_j h_j(x),    (3.1)

the model complexity can be controlled by the addition of a penalty term to the sum of squared errors over the training set, {(x_i, y_i)}_{i=1}^{p}. When this combined error,

    E = Σ_{i=1}^{p} ( y_i − f(x_i) )² + λ Σ_{j=1}^{m} w_j²,

is optimised, large components in the weight vector w are inhibited. This kind of penalty is known as ridge regression or weight-decay, and the parameter λ, which controls the amount of penalty, is known as the regularisation parameter. While the nominal number of free parameters is m (the weights), the effective number is less, due to the penalty term, and is given [12] by

    γ = m − λ tr A⁻¹,    (3.2)
    A = Hᵀ H + λ I_m,    (3.3)

where H is the design matrix with elements H_ij = h_j(x_i). The expression for γ is monotonic in λ, so model complexity can be decreased (or increased) by raising (or lowering) the value of λ.

The parameter λ has a Bayesian interpretation: it is the ratio of σ², the variance of the noise corrupting the training set outputs, to ς², the a priori variance of the weights (see section 2). If the value of λ is known then the optimal weight vector is

    ŵ = A⁻¹ Hᵀ y.    (3.4)


However, neither σ² nor ς² may be available in a practical situation, so it is usually necessary to establish an effective value for λ in parallel with optimising the weights. This may be done with a model selection criterion such as BIC (Bayesian information criterion), GCV (generalised cross-validation) or MML (maximum marginalised likelihood, see section 2), and in particular with one or more re-estimation formulae. For GCV the single formula is

    λ = ( η êᵀê ) / ( (p − γ) ŵᵀ A⁻¹ ŵ ),    (3.5)

where

    ê = y − H ŵ,
    η = tr( A⁻¹ − λ A⁻² ).

An initial guess for λ is used to evaluate the right hand side of (3.5), which produces a new guess. The resulting sequence of re-estimated values converges to a local minimum of GCV. Each iteration requires the inverse of the m-by-m matrix A and therefore costs of order m³ floating point operations.
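A direct implementation of this iteration might look as follows. Note that the exact form of (3.5) is reconstructed from the surrounding text and appendix B, and the helper names are ours rather than the package's.

```python
import numpy as np

def reestimate_lambda_direct(y, H, lam, n_iter=50, tol=1e-8):
    """Iterate the GCV re-estimation formula (3.5) using the explicit
    inverse of A (order m^3 per iteration)."""
    p, m = H.shape
    for _ in range(n_iter):
        A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = A_inv @ (H.T @ y)
        e_hat = y - H @ w_hat
        gamma = m - lam * np.trace(A_inv)
        eta = np.trace(A_inv - lam * A_inv @ A_inv)    # tr(A^-1 - lam*A^-2)
        new_lam = eta * (e_hat @ e_hat) / ((p - gamma) * (w_hat @ A_inv @ w_hat))
        if abs(new_lam - lam) < tol * lam:
            break
        lam = new_lam
    return lam
```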

3.3 Efficient Re-estimation of λ

The optimisation of λ by iterating the re-estimation formula is burdened by the necessity of computing an expensive matrix inverse at every iteration. However, by reformulating the individual terms of the equation using the eigenvalues and eigenvectors of HHᵀ it is possible to perform most of the work during the first iteration and reuse the results for subsequent ones. Thus the amount of computation required to complete an optimisation which takes q steps to converge is reduced by almost a factor of 1/q. Unfortunately, the technique only works for a single global regularisation parameter [16], not for multiple parameters applying to different groups of weights or to individual weights [14].

Suppose the eigenvalues and eigenvectors of HHᵀ are {λ_i}_{i=1}^{p} and {u_i}_{i=1}^{p}, and that the projections of y onto the eigenvectors are ŷ_i = yᵀ u_i. Then, as shown in appendix B, the four terms involved in the re-estimation formula (3.5) are

    η = Σ_{i=1}^{p} λ_i / (λ_i + λ)²,    (3.6)
    p − γ = Σ_{i=1}^{p} λ / (λ_i + λ),    (3.7)
    êᵀê = Σ_{i=1}^{p} λ² ŷ_i² / (λ_i + λ)²,    (3.8)
    ŵᵀ A⁻¹ ŵ = Σ_{i=1}^{p} λ_i ŷ_i² / (λ_i + λ)³.    (3.9)

If λ is re-estimated by computing (3.6)-(3.9) instead of explicitly calculating the inverse in (3.5), then the computational cost of each iteration is only of order p instead of m³. The overhead of initially calculating the eigensystem, which is of order p³, has to be taken into account but is only incurred once. For problems in which p is not much bigger than m this represents a significant saving in computation time, and it makes it feasible to optimise multiple guesses for the initial value of λ to decrease the chances of getting caught in a local minimum.
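A sketch of the reformulated iteration, which takes the eigenvalues of HHᵀ and the projections ŷ_i as inputs and costs O(p) per step (names are ours):

```python
import numpy as np

def reestimate_lambda_eig(eigvals, yhat, lam, n_iter=100, tol=1e-8):
    """Re-estimate lambda from the eigenvalues lambda_i of H H' and the
    projections yhat_i = y'u_i, using equations (3.6)-(3.9)."""
    yhat2 = yhat ** 2
    for _ in range(n_iter):
        d = eigvals + lam
        eta = np.sum(eigvals / d**2)             # tr(A^-1 - lam*A^-2), eq. (3.6)
        p_minus_gamma = np.sum(lam / d)          # p - gamma, eq. (3.7)
        ee = np.sum(lam**2 * yhat2 / d**2)       # e_hat' e_hat, eq. (3.8)
        wAw = np.sum(eigvals * yhat2 / d**3)     # w_hat' A^-1 w_hat, eq. (3.9)
        new_lam = eta * ee / (p_minus_gamma * wAw)
        if abs(new_lam - lam) < tol * lam:
            break
        lam = new_lam
    return lam

# One-off O(p^3) overhead, paid once per design matrix:
#   eigvals, U = np.linalg.eigh(H @ H.T);  yhat = U.T @ y
```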

3.4 Avoiding Local Minima

If the initial guess for λ is close to a local minimum of GCV (or whatever model selection criterion is employed) then re-estimation using (3.5) is likely to get trapped. We illustrate using Friedman's data set, as described in section 1.2, with an RBF network of Gaussian centres coincident with the inputs of the training set and of fixed radius.

The solid curve in figure 3.1 shows the variation of GCV with λ. The open circles show a sequence of re-estimated λ values with their corresponding GCV scores. The initial guess led to a sequence which converged on a local minimum (shown by the closed circle), and the global minimum was missed.

Figure 3.1: The variation of GCV with λ for Friedman's problem and a sequence of re-estimations from the first initial guess.

Compare figure 3.1 with figure 3.2, where the only change was to use a different initial guess for λ. This time the guess is sufficiently close to the global minimum that the re-estimations are attracted towards it. Note that the sets of eigenvalues and eigenvectors used to compute the sequences in figures 3.1 and 3.2 are identical. Since the calculation of the eigensystem dominates the other computational costs, it is almost as expensive to optimise one trial value as it is to optimise several. Thus, to avoid falling into a local minimum, several trial values spread over a wide range can be optimised and the solution with the lowest GCV selected as the overall winner. This value can then be used to determine the weights (3.4) and ultimately the predictions (3.1) of the network.
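A usage sketch of this idea, reusing a single eigendecomposition for several starting guesses. It assumes a design matrix H and targets y from earlier, plus the hypothetical reestimate_lambda_eig helper sketched in section 3.3; the grid of guesses is an illustrative assumption.

```python
import numpy as np

eigvals, U = np.linalg.eigh(H @ H.T)     # assumes H, y already constructed
yhat = U.T @ y

def gcv(lam):
    # GCV = p * e'e / (p - gamma)^2, evaluated via the eigensystem
    d = eigvals + lam
    ee = np.sum(lam**2 * (yhat**2) / d**2)
    return len(y) * ee / np.sum(lam / d)**2

guesses = 10.0 ** np.arange(-12, 1, 2)   # trial starting points, widely spread
candidates = [reestimate_lambda_eig(eigvals, yhat, lam0) for lam0 in guesses]
best_lam = min(candidates, key=gcv)      # lowest GCV wins
```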


Figure 3.2: Same as figure 3.1 except with a different initial guess for λ.

3.5 The Optimal RBF Size

For Gaussian radial functions of fixed width r the transfer functions of the hidden units are

    h_j(x) = exp( −(x − c_j)ᵀ (x − c_j) / r² ).

Unfortunately, there is no re-estimation formula for r, as there is for λ, even in this simple case where the same scale is used for each RBF and each component of the input [16]. To properly optimise the value of r would thus require the use of a nonlinear optimisation algorithm and would have to incorporate the optimisation of λ (since the optimal value of λ changes as r changes).

An alternative, if rather crude, approach is to test a number of trial values for r. For each value an optimal λ is calculated (by using the re-estimation method above) and the model selection score noted. When all the values have been checked, the one associated with the lowest score wins. The computational cost of this procedure is dominated, once again, by the cost of computing the eigenvalues and eigenvectors of HHᵀ, and these have to be calculated separately for each value of r.

While this procedure is less computationally demanding than a full nonlinear optimisation of r and λ, its drawback is that it is only capable of identifying the best value for r from a finite number of alternatives. On the other hand, given that the value of λ is fully optimised and that the model selection criteria are heuristic (in other words, approximate) in nature, it is arguable that a more precise location for the optimal value of r is unlikely to have much practical significance.
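The whole procedure can be sketched as a loop over trial widths; each pass pays the O(p³) eigendecomposition once. The sketch again assumes the hypothetical reestimate_lambda_eig helper from section 3.3 and scores candidates with GCV.

```python
import numpy as np

def best_rbf_width(X, y, r_trials, lam0=1e-2):
    """For each trial width r: build H (centres at the training inputs),
    eigendecompose H H' once, optimise lambda by re-estimation, score with
    GCV and return the winning (gcv, r, lambda)."""
    best = None
    for r in r_trials:
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        H = np.exp(-d2 / r**2)
        eigvals, U = np.linalg.eigh(H @ H.T)     # O(p^3), once per trial value
        yhat = U.T @ y
        lam = reestimate_lambda_eig(eigvals, yhat, lam0)
        d = eigvals + lam
        gcv = len(y) * np.sum(lam**2 * (yhat**2) / d**2) / np.sum(lam / d)**2
        if best is None or gcv < best[0]:
            best = (gcv, r, lam)
    return best

# e.g. best_rbf_width(x[:, None], y, r_trials=[0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6])
```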

We illustrate the method on the Hermite data described in section 1.1. Once again we use each training set input as an RBF centre. We tried seven different trial values for r: 0.4, 0.6, 0.8, 1.0, 1.2, 1.4 and 1.6. For each trial value figure 3.3 plots the variation of GCV with λ (the curves), as well as the optimal λ (the closed circles) found by re-estimation as described above.


Figure 3.3: The Hermite data set with seven sizes of RBFs (r = 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6): GCV as a function of λ for each trial width.

The radius value which led to the lowest GCV score, together with its corresponding optimal regularisation parameter, is then adopted for the network. Initially, as r increases from its lowest trial value, the GCV score at the optimum λ decreases; eventually it reaches its lowest value, and above that there is not much increase in optimised GCV, although the optimal λ decreases rapidly.

3.6 Trial Values in Other Contexts

The use of trial values is limited to cases where there is a small number of parameters to optimise, such as the single parameter r. If there are several parameters with trial values then the number of different combinations to evaluate can easily become prohibitively large. In RBF networks where there is a separate scale parameter for each dimension, so that the transfer functions are, for example in the case of Gaussians,

    h_j(x) = exp( −Σ_{k=1}^{n} (x_k − c_jk)² / r_jk² ),

there would be t^{mn} combinations to check, where t is the number of trial values for each r_jk, m is the number of basis functions and n the number of dimensions. However, it is possible to test trial values for an overall scale size α if some other mechanism can be used to generate the scales r_jk. Here, the transfer functions are

    h_j(x) = exp( −Σ_{k=1}^{n} (x_k − c_jk)² / (α² r_jk²) ).

This is the approach taken for the method in section 4, where a regression tree determines the values of r_jk but the overall scale size α is optimised by testing trial values.
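For reference, a short sketch of such a transfer function with per-dimension radii r_jk and a single overall scale α (names and array layout are ours):

```python
import numpy as np

def rbf_outputs(X, centres, radii, alpha):
    """Gaussian RBFs with per-dimension radii radii[j, k] (e.g. produced by a
    regression tree) and one overall scale alpha chosen from trial values:
    h_j(x) = exp(-sum_k (x_k - c_jk)^2 / (alpha^2 * r_jk^2))."""
    diff = X[:, None, :] - centres[None, :, :]              # shape (p, m, n)
    return np.exp(-np.sum(diff**2 / (alpha * radii[None, :, :])**2, axis=2))
```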


3.7 Conclusions

We've shown how trial values for the overall size of the RBFs can be compared using a model selection criterion. In the case of ridge regression, an efficient method for optimising the regularisation parameter helps reduce the computational burden of training a separate network for each trial value. However, the same technique can also be used with other methods of complexity control, including those in which there is no regularisation.

In the Matlab software package [19] each method can be configured with a set of trial values for the overall RBF scale. The best value is chosen and used to generate the RBF network which the Matlab function returns.


4 Regression Trees and RBF Networks

4.1 Introduction

This section is about a novel method for nonparametric regression involving a combination of regression trees and RBF networks. The basic idea of a regression tree is to recursively partition the input space in two and approximate the function in each half by the average output value of the samples it contains [4]. Each split is parallel to one of the axes, so it can be expressed by an inequality involving one of the input components (e.g. x_k > b). The input space is thus divided into hyperrectangles organised into a binary tree, where each branch is determined by the dimension (k) and boundary (b) which together minimise the residual error between model and data.

A benefit of regression trees is the information provided in the split statistics about the relevance of each input variable: the components which carry the most information about the output tend to be split earliest and most often. A weakness of regression trees is the discontinuous model caused by the output value jumping across the boundary between two hyperrectangles. There is also the problem of deciding when to stop growing the tree (or, equivalently, how much to prune after it has fully grown), which is the familiar bias-variance dilemma faced by all methods of nonparametric regression [7]. The use of radial basis functions in conjunction with regression trees can help to solve both these problems.

Below we outline the basic method of combining RBFs and regression trees as it appeared originally and describe our version of this idea and why we think it is an improvement. Finally, we show some results and summarise our conclusions.

4.2 The Basic Idea

The combination of trees and RBF networks was first suggested by [9] in the context of classification rather than regression (though the two cases are very similar). Further elaboration of the idea appeared in [8]. Essentially, each terminal node of the classification tree contributes one hidden unit to the RBF network, the centre and radius of which are determined by the position and size of the corresponding hyperrectangle. Thus the tree sets the number, positions and sizes of all RBFs in the network. Model complexity is controlled by two parameters: c, which determines the amount of tree pruning in C4.5 [20] (the software package used by [8] to generate classification trees), and a scaling parameter which fixes the size of RBFs relative to hyperrectangles.

Our major reservation about the approach taken by [8] is the treatment of model complexity. In the case of the scaling parameter, the author claimed it had little effect on prediction accuracy, but this is not in accord with our previous experience of RBF networks. As for the amount of pruning (c), he demonstrated its effect on prediction accuracy yet used a fixed value in his benchmark tests. Moreover, there was no discussion of how to control scaling and pruning to optimise model complexity for a given data set.


Our method is a variation on Kubat's with the following alterations.

1. We address the model complexity issue by using the nodes of the regression tree not to fix the RBF network but rather to generate a set of RBFs from which the final network can be selected. Thus the burden of controlling model complexity shifts from tree generation to RBF selection.

2. The regression tree from which the RBFs are produced can also be used to order selection, such that certain candidate RBFs are allowed to enter the model before others. We describe one way to achieve such an ordering and demonstrate that it produces more accurate models than plain forward selection.

3. We show that, contrary to the conclusions of [8], the method is typically quite sensitive to the scaling parameter α, and we discuss its optimisation by the use of multiple trial values.

4.3 Generating the Regression Tree

The first stage of our method (and Kubat's) is to generate a regression tree. The root node of the tree is the smallest hyperrectangle which contains all the training set inputs, {x_i}_{i=1}^{p}. Its size s_k (the half-width) and centre c_k in each dimension k are

    s_k = ( max_{i∈S}(x_ik) − min_{i∈S}(x_ik) ) / 2,
    c_k = ( max_{i∈S}(x_ik) + min_{i∈S}(x_ik) ) / 2,

where S = {1, 2, ..., p} is the set of training set indices. A split of the root node divides the training samples into left and right subsets, S_L and S_R, on either side of a boundary b in one of the dimensions k, such that

    S_L = { i : x_ik ≤ b },
    S_R = { i : x_ik > b }.

The mean output value on either side of the split is

    ȳ_L = (1 / p_L) Σ_{i∈S_L} y_i,
    ȳ_R = (1 / p_R) Σ_{i∈S_R} y_i,

where p_L and p_R are the numbers of samples in each subset. The residual square error between model and data is then

    E(k, b) = (1/p) [ Σ_{i∈S_L} (y_i − ȳ_L)² + Σ_{i∈S_R} (y_i − ȳ_R)² ].


The split which minimises E(k, b) over all possible choices of k and b is used to create the children of the root node and is easily found by discrete search over n dimensions and p cases. The children of the root node are split recursively in the same manner, and the process terminates when a node cannot be split without creating a child containing fewer samples than a given minimum, p_min, which is a parameter of the method. Compared to their parent node, the child centres will be shifted and their sizes reduced in the k-th dimension.
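A compact sketch of this recursive splitting (Python rather than the package's Matlab; the node layout and helper names are ours, and candidate boundaries are taken at the sample values themselves):

```python
import numpy as np

def grow_tree(X, y, idx, pmin):
    """Recursively split the smallest enclosing hyperrectangle of the samples
    indexed by idx; stop when any split would leave a child with fewer than
    pmin samples.  Each node records its centre c and half-width s."""
    lo, hi = X[idx].min(axis=0), X[idx].max(axis=0)
    node = {"c": (hi + lo) / 2, "s": (hi - lo) / 2, "idx": idx, "children": []}
    best = None                                    # (error, k, b)
    for k in range(X.shape[1]):
        for b in np.unique(X[idx, k])[:-1]:        # candidate boundaries
            left, right = idx[X[idx, k] <= b], idx[X[idx, k] > b]
            if len(left) < pmin or len(right) < pmin:
                continue
            err = ((y[left] - y[left].mean())**2).sum() + \
                  ((y[right] - y[right].mean())**2).sum()
            if best is None or err < best[0]:
                best = (err, k, b)
    if best is not None:
        _, k, b = best
        node["children"] = [grow_tree(X, y, idx[X[idx, k] <= b], pmin),
                            grow_tree(X, y, idx[X[idx, k] > b], pmin)]
    return node

# e.g. root = grow_tree(X, y, np.arange(len(y)), pmin=5)   # pmin is assumed
```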

Since the size of the regression tree does not determine the model complexity, there is no need to perform the final pruning step normally associated with recursive splitting methods [4, 6, 20].

4.4 From Hyperrectangles to RBFs

The regression tree contains a root node, some nonterminal nodes (having children) and some terminal nodes (having no children). Each node is associated with a hyperrectangle of input space having a centre c and size s, as described above. The node corresponding to the largest hyperrectangle is the root node, and the node sizes decrease down the tree as they are divided into smaller and smaller pieces. To translate a hyperrectangle into a Gaussian RBF we use its centre c as the RBF centre and its size s scaled by a parameter α as the RBF radius, r = α s. The scalar α has the same value for all nodes and is another parameter of the method (in addition to p_min). Our α is not quite the same as Kubat's scaling parameter (they are related by an inverse and a factor of √2) but plays exactly the same role.

4.5 Selecting the Subset of RBFs

After the tree nodes are translated into RBFs, the next step of our method is to select a subset of them for inclusion in the model. This is in contrast to the method of [8], where all RBFs from terminal nodes were included in the model, which was thus heavily dependent on the extent of tree pruning to control model complexity. Selection can be performed either using a standard method such as forward selection [15, 16], or in a novel way, by employing the tree to guide the order in which candidate RBFs are considered.

In the standard methods for subset selection the RBFs generated from the regression tree are treated as an unstructured collection with no distinction between RBFs corresponding to different nodes in the tree. However, intuition suggests that the best order in which to consider RBFs for inclusion in the model is large ones first and small ones last, to synthesise coarse structure before fine details. This, in turn, suggests searching for RBF candidates by traversing the tree from the largest hyperrectangle (and RBF) at the root to the smallest hyperrectangles (and RBFs) at the terminal nodes. Thus the first decision should be whether to include the root node in the model, the second whether to include any of the children of the root node, and so on, until the terminal nodes are reached.

The scheme we eventually developed for selecting RBFs goes somewhat beyond this simple picture and was influenced by two other considerations. The first concerns a classic problem with forward selection, namely that one regressor can block the selection of other, more explanatory, regressors which would have been chosen in preference had they been considered first. In our case there was a danger that a parent RBF could block its own children. To avoid this situation, when considering whether to add the children of a node which had already been selected we also considered the effect of deleting the parent. Thus our method has a measure of backward elimination as well as forward selection. This is reminiscent of the selection schemes developed for the MARS [6] and MAPS [1] algorithms.

A second reason for departing from a simple breadth-first search is that the size of a hyperrectangle (in terms of volume) on one level is not guaranteed to be smaller than the sizes of all the hyperrectangles in the level above (only its parent), so it is not easy to achieve a strict largest-to-smallest ordering. In view of this, we abandoned any attempt to achieve a strict ordering and instead devised a search algorithm which dynamically adjusts the set of selectable RBFs by replacing selected RBFs with their children.

The algorithm depends on the concept of an active list of nodes. At any given moment during the selection process only these nodes and their children are considered for inclusion in or exclusion from the model. Every time RBFs are added to or subtracted from the model the active list expands by having a node replaced by its children. Eventually the active list becomes coincident with the terminal nodes and the search is terminated. In detail, the steps of the algorithm are as follows (a simplified sketch in code is given after the list).

1. Initialise the active list with the root node and the model with the root node's RBF.

2. For all nonterminal nodes on the active list consider the effect (on the model selection criterion) of adding both or just one of the children's RBFs (three possible modifications to the model). If the parent's RBF is already in the model, also consider the effect of first removing it before adding one or both children's RBFs, or of just removing it (a further four possible modifications).

3. The total number of possible adjustments to the model is somewhere between three and seven times the number of active nonterminal nodes, depending on how many of their RBFs are already in the model. From all these possibilities choose the one which most decreases the model selection criterion. Update the current model and remove the node involved from the active list, replacing it with its children. If none of the modifications decreases the selection criterion then choose one of the active nodes at random and replace it by its children, but leave the model unaltered.

4. Return to step 2 and repeat until all the active nodes are terminal nodes.
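The following skeleton is a simplified reading of the steps above, not the package implementation. Here `rbf_of(node)` is a hypothetical helper returning the node's RBF evaluated at the training inputs (one column of H), nodes follow the dict layout of the tree-growing sketch in section 4.3, and BIC (in one common form) is assumed as the model selection criterion.

```python
import random
import numpy as np

def bic(y, H_sel):
    # One common form of BIC for an ordinary-least-squares fit (an assumption).
    p, m = H_sel.shape
    w, *_ = np.linalg.lstsq(H_sel, y, rcond=None)
    sse = np.sum((y - H_sel @ w) ** 2)
    return p * np.log(sse / p) + m * np.log(p)

def tree_guided_select(y, rbf_of, root, criterion=bic):
    active = [root]
    model = {id(root): rbf_of(root)}                  # selected RBFs, keyed by node
    score = criterion(y, np.column_stack(list(model.values())))
    while any(n["children"] for n in active):
        best = None                                   # (score, node, candidate model)
        for node in (n for n in active if n["children"]):
            kids = node["children"]
            moves = []
            for add in ([0], [1], [0, 1]):            # add one or both children
                m2 = dict(model)
                for i in add:
                    m2[id(kids[i])] = rbf_of(kids[i])
                moves.append(m2)
            if id(node) in model:                     # parent already selected:
                extra = [dict(m2) for m2 in moves] + [dict(model)]
                for m2 in extra:                      # ...also try removing it first
                    m2.pop(id(node), None)
                    moves.append(m2)
            for m2 in moves:
                s = criterion(y, np.column_stack(list(m2.values()))) if m2 else np.inf
                if best is None or s < best[0]:
                    best = (s, node, m2)
        if best[0] < score:                           # apply the best modification
            score, expand, model = best
        else:                                         # otherwise expand a random node
            expand = random.choice([n for n in active if n["children"]])
        active = [n for n in active if n is not expand] + expand["children"]
    return model, score
```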

Once the selection process has terminated, the network weights can be calculated in the usual way by solving the normal equation,

    ŵ = ( Hᵀ H )⁻¹ Hᵀ y,

where H is the design matrix. There is no need for a regularisation term, as appears in equations (2.6) and (3.4) for example, because model complexity is limited by the selection process.


4.6 The Best Parameter Values

Our method has three main parameters: the model selection criterion; p_min, which controls the depth of the regression tree; and α, which determines the relative size between hyperrectangles and RBFs.

For the model selection criterion we found that the more conservative BIC, which tends to produce more parsimonious models, rarely performed worse than GCV and often did significantly better. This is in line with the experiences of other practitioners of algorithms based on subset selection, such as [6], who modified GCV to make it more conservative, and [1], who also found that BIC gave better results than GCV.

For p_min and α we use the simple method of comparing the model selection scores of a number of trial values, as for the RBF widths in section 3. This means growing several trees (one for each trial value of p_min) and then, for each tree, selecting models from several sets of RBFs (one for each value of α). The cost is extra computation: the more trial values there are, the longer the algorithm takes to search through them. However, the basic algorithm is not unduly expensive and, if the number of trial values is kept fairly low (a few alternatives for each parameter), the computation time is acceptable.

4.7 Demonstrations

Figure 4.1 shows the prediction of a pure regression tree for a sample of Hermite data (section 1.1). For clarity, the samples themselves are not shown, just the target function and the prediction. Of course, the model is discontinuous and each horizontal section corresponds to one terminal node in the tree.

Figure 4.1: A pure regression tree prediction on the Hermite data.


The tree which produced the prediction shown in figure 4.1 was grown until further splitting would have violated the minimum number of samples allowed per node (p_min). There was no pruning or any other sophisticated form of complexity control, so this kind of tree is not suitable for practical use as a prediction method. However, in our method the tree is only used to create RBF candidates. Model complexity is controlled by a separate process which selects a subset of RBFs for the network.

Figure 4.2 shows the predictions of the combined method on the same data set used in figure 4.1 after a subset of RBFs was selected from the pool of candidates generated by the tree nodes. Now the model is continuous and its complexity is well matched to the data.

Figure 4.2: The combined method on the Hermite data.

As a last demonstration we turn our attention to Friedman's data set (section 1.2). In experiments with the MARS algorithm [6], Friedman estimated the accuracy of his method by replicating data sets and computing the mean and standard deviation of the scaled sum-square-error, with his best results corresponding to the most favourable values for the parameters of the MARS method.

To compare our algorithm with MARS, and also to test the effect of using multiple trial values for our method's parameters, p_min and α, we conducted a similar experiment. Before we started, we tried some different settings for the trial values and identified one which gave good results on test data. Then, for each of the replications, we applied the method twice. In the first run we used the trial values we had discovered earlier. In the second run we used only a single "best" value for each parameter, the average of the trial values, forcing this value to be used for every replicated data set. The results are shown in table 1.

It is apparent that the results are practically identical to MARS when the full sets of trial values are used but significantly inferior when only single "best" values are used.


    p_min                   α                       error
    (set of trial values)   (set of trial values)   … ± …
    (single value)          (single value)          … ± …

Table 1: Results on replications of Friedman's data set.

In another test using replicated data sets we compared the two alternative methods of selecting the RBFs from the candidates generated by the tree: standard forward selection, or the method described above in section 4.5 which uses the tree to guide the order in which candidates are considered. This was the only difference between the two runs; the model parameters were the same as in the first row of table 1. The performance of tree-guided selection was the same as in table 1, but forward selection was significantly worse.

4.8 Conclusions

We have described a method for nonparametric regression based on combining regression trees and radial basis function networks. The method is similar to [8] and has the same advantages (a continuous model and automatic relevance determination) but also some significant improvements. The main enhancement is the addition of an automatic method for the control of model complexity through the selection of RBFs. We have also developed a novel procedure for selecting the RBFs based on the structure of the tree.

We've presented evidence that the method is comparable in performance to the well known MARS algorithm and that some of its novel features (trial parameter values, tree-guided selection) are actually beneficial. More detailed evaluations with DELVE [21] data sets are in preparation and preliminary results support these conclusions.

The Matlab software package [19] has two implementations of the method. One function, rbf_rt_1, uses tree-guided selection, while the other, rbf_rt_2, uses forward selection. The operation of each function is described, with examples, in a comprehensive manual.


5 Appendix

A Applying the EM Algorithm

We want to maximise the marginal probability of the observed data (2.7) by substituting expectations of the conditional probability of the unobserved data (2.5) into the cost function for the joint probability of the combined data (2.4) and minimising this with respect to the parameters σ² (the noise variance) and ς² (the a priori weight variance).

From (2.5), ⟨w⟩ = ŵ and ⟨(w − ŵ)(w − ŵ)ᵀ⟩ = W = σ² A⁻¹. The expectation of wᵀw is then

    ⟨wᵀw⟩ = tr⟨wwᵀ⟩
          = ŵᵀŵ + tr⟨wwᵀ − ŵŵᵀ⟩
          = ŵᵀŵ + tr⟨(w − ŵ)(w − ŵ)ᵀ⟩
          = ŵᵀŵ + σ² tr A⁻¹
          = ŵᵀŵ + (m − γ) ς².    (A.1)

The last step follows from γ = m − λ tr A⁻¹ (the effective number of parameters) and λ = σ²/ς² (the regularisation parameter). Similarly,

    ⟨eᵀe⟩ = tr⟨eeᵀ⟩
          = êᵀê + tr⟨eeᵀ − êêᵀ⟩
          = êᵀê + tr H⟨wwᵀ − ŵŵᵀ⟩Hᵀ
          = êᵀê + σ² tr H A⁻¹ Hᵀ
          = êᵀê + γ σ²,    (A.2)

since e = y − Hw is linear in w and tr H A⁻¹ Hᵀ is another expression for the effective number of parameters γ.

Equations (A.1, A.2) summarise the expectation of the conditional probability for w and can be substituted into the joint probability of the combined data, or the equivalent cost function (2.4), so that the resulting expression can be optimised with respect to σ² and ς². Note that in (A.1, A.2) these parameters are held constant at their old values; only the explicit occurrences of σ² and ς² in (2.4) are varied in the optimisation.

After differentiating (2.4) with respect to σ² and ς², equating the results to zero and finally substituting the expectations (A.1, A.2), we get the re-estimation formulae

    σ² ← ( êᵀê + γ σ² ) / p,
    ς² ← ( ŵᵀŵ + (m − γ) ς² ) / m.


B The Eigensystem of HHᵀ

We want to derive expressions for each of the terms in (3.5) using the eigenvalues and eigenvectors of HHᵀ. We start with a singular value decomposition of the design matrix, H = U S Vᵀ, where U = [u_1 u_2 ... u_p] ∈ R^{p×p} and V ∈ R^{m×m} are orthogonal and S ∈ R^{p×m},

    S = [ diag(√λ_1, √λ_2, ..., √λ_m) ; 0 ],

is zero except on its leading diagonal, which contains the singular values {√λ_i}. Note that, due to the orthogonality of V,

    H Hᵀ = U S Sᵀ Uᵀ = Σ_{i=1}^{p} λ_i u_i u_iᵀ,

so the λ_i are the eigenvalues, and the u_i the eigenvectors, of the matrix HHᵀ. The eigenvalues are non-negative and, we assume, ordered from largest to smallest, so that if p > m then λ_i = 0 for i > m. The eigenvectors are orthonormal (u_iᵀ u_i′ = δ_ii′).

We want to derive expressions for the terms in (3.5) using just the eigenvalues and eigenvectors of HHᵀ. As a preliminary step, we derive some more basic relations. First, the matrix inverse in each re-estimation is

    A⁻¹ = ( Hᵀ H + λ I_m )⁻¹
        = ( V Sᵀ S Vᵀ + λ V Vᵀ )⁻¹
        = V ( Sᵀ S + λ I_m )⁻¹ Vᵀ.    (B.1)

Note that the second step would have been impossible if the regularisation term, λ I_m, had not been proportional to the identity matrix, which is where the analysis breaks down in the case of multiple regularisation parameters. Secondly, the optimal weight vector is

    ŵ = A⁻¹ Hᵀ y
       = V ( Sᵀ S + λ I_m )⁻¹ Sᵀ Uᵀ y
       = V ( Sᵀ S + λ I_m )⁻¹ Sᵀ ŷ,    (B.2)

where ŷ = Uᵀ y is the projection of y onto the eigenbasis U.


Thirdly, from (B.1) we can further derive

    γ = m − λ tr A⁻¹
      = m − λ tr V ( Sᵀ S + λ I_m )⁻¹ Vᵀ
      = m − λ tr ( Sᵀ S + λ I_m )⁻¹
      = m − Σ_{j=1}^{m} λ / (λ_j + λ)
      = Σ_{j=1}^{m} λ_j / (λ_j + λ)
      = Σ_{i=1}^{p} λ_i / (λ_i + λ).    (B.3)

Here we have assumed p ≥ m, so the last step follows (for γ) because if p > m then the last (p − m) eigenvalues are zero. However, the conclusion is also true if p < m, since in that case the last (m − p) singular values are annihilated in the product Sᵀ S.

Fourthly, and last of the preliminary calculations, the vector of residual errors is

    ê = y − H ŵ
      = ( I_p − U S ( Sᵀ S + λ I_m )⁻¹ Sᵀ Uᵀ ) y
      = U ( I_p − S ( Sᵀ S + λ I_m )⁻¹ Sᵀ ) ŷ.    (B.4)

Now we are ready to tackle the terms in (3.5). From (B.3) we have

    p − γ = p − Σ_{i=1}^{p} λ_i / (λ_i + λ)
          = Σ_{i=1}^{p} λ / (λ_i + λ).    (B.5)

From (B.1), and a set of steps similar to the derivation of (B.3), it follows that

    η = tr A⁻¹ − λ tr A⁻²
      = Σ_{j=1}^{m} 1 / (λ_j + λ) − λ Σ_{j=1}^{m} 1 / (λ_j + λ)²
      = Σ_{j=1}^{m} λ_j / (λ_j + λ)²
      = Σ_{i=1}^{p} λ_i / (λ_i + λ)².    (B.6)


The last step follows in a similar way to the last step of (B.3). Next we tackle the term ŵᵀA⁻¹ŵ. From (B.1) and (B.2) we get

    ŵᵀ A⁻¹ ŵ = ŷᵀ S ( Sᵀ S + λ I_m )⁻³ Sᵀ ŷ
              = Σ_{i=1}^{p} λ_i ŷ_i² / (λ_i + λ)³.    (B.7)

The sum of squared residual errors is, from (B.4),

    êᵀ ê = ŷᵀ ( I_p − S ( Sᵀ S + λ I_m )⁻¹ Sᵀ )² ŷ
         = Σ_{j=1}^{m} λ² ŷ_j² / (λ_j + λ)² + Σ_{i=m+1}^{p} ŷ_i²
         = Σ_{i=1}^{p} λ² ŷ_i² / (λ_i + λ)².    (B.8)

For this derivation we assumed that p ≥ m but, for reasons similar to those stated for the derivation of (B.3), the result is also true for p < m.

Equations (B.5)-(B.8) express each of the four terms in (3.5) using the eigenvalues and eigenvectors of HHᵀ, which was our main goal in this appendix. Other useful expressions involving the eigensystem of HHᵀ are

    ln|P| = Σ_{i=1}^{p} ln( σ² / (ς² λ_i + σ²) )
          = p ln σ² − Σ_{i=1}^{p} ln( ς² λ_i + σ² ),

    yᵀ P y = Σ_{i=1}^{p} σ² ŷ_i² / (ς² λ_i + σ²),

where P = I_p − H A⁻¹ Hᵀ, σ² is the noise variance and ς² is the a priori variance of the weights (see section 2). For example, if these expressions are substituted in equation (2.8) for the cost function associated with the marginal likelihood of the data, the two p ln σ² terms cancel, leaving

    E(y) = p ln σ² − ln|P| + yᵀ P y / σ²
         = Σ_{i=1}^{p} ln( ς² λ_i + σ² ) + Σ_{i=1}^{p} ŷ_i² / (ς² λ_i + σ²).
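These identities are easy to confirm numerically. The following sketch checks (B.5)-(B.8) against direct computation with A⁻¹ on random test data (the dimensions and λ value are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, lam = 20, 8, 0.3
H, y = rng.normal(size=(p, m)), rng.normal(size=p)

# Direct quantities via the explicit inverse of A = H'H + lam*I
A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
w_hat = A_inv @ (H.T @ y)
e_hat = y - H @ w_hat
gamma = m - lam * np.trace(A_inv)

# Eigensystem of H H' and projections of y onto the eigenvectors
eigvals, U = np.linalg.eigh(H @ H.T)
yhat2 = (U.T @ y) ** 2
d = eigvals + lam

assert np.allclose(p - gamma, np.sum(lam / d))                             # (B.5)
assert np.allclose(np.trace(A_inv - lam * A_inv @ A_inv),
                   np.sum(eigvals / d**2))                                 # (B.6)
assert np.allclose(w_hat @ A_inv @ w_hat, np.sum(eigvals * yhat2 / d**3))  # (B.7)
assert np.allclose(e_hat @ e_hat, np.sum(lam**2 * yhat2 / d**2))           # (B.8)
```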


References

[1] A.R. Barron and X. Xiao. Discussion of "Multivariate adaptive regression splines" by J.H. Friedman. Annals of Statistics, 19, 1991.

[2] C.M. Bishop, M. Svensen, and C.K.I. Williams. EM optimization of latent-variable density models. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996.

[3] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), 39:1-38, 1977.

[6] J.H. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1-141, 1991.

[7] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

[8] M. Kubat. Decision trees can initialize radial-basis function networks. IEEE Transactions on Neural Networks, 9(5):813-821, 1998.

[9] M. Kubat and I. Ivanova. Initialization of RBF networks with decision trees. In Proc. of the 5th Belgian-Dutch Conf. on Machine Learning, BENELEARN-95, 1995.

[10] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

[11] D.J.C. MacKay. Comparison of approximate methods of handling hyperparameters. Accepted for publication by Neural Computation, 1999.

[12] J.E. Moody. The effective number of parameters: an analysis of generalisation and regularisation in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, San Mateo, CA, 1992.

[13] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Press, 1998.

[14] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.

[15] M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.

[16] M.J.L. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/papers/intro.ps.

[17] M.J.L. Orr. Matlab routines for subset selection and ridge regression in linear neural networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/software/rbf.zip.

[18] M.J.L. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.

[19] M.J.L. Orr. Matlab functions for radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1999. Download from www.anc.ed.ac.uk/~mjo/software/rbf2.zip.

[20] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[21] C.E. Rasmussen, R.M. Neal, G.E. Hinton, D. van Camp, Z. Ghahramani, M. Revow, R. Kustra, and R. Tibshirani. The DELVE Manual, 1996. http://www.cs.utoronto.ca/~delve/.

[22] M.E. Tipping and C.M. Bishop. Mixtures of principal component analysers. Technical report, Neural Computing Research Group, Aston University, UK, 1997.

[23] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Technical report, Neural Computing Research Group, Aston University, UK, 1997.