Recent Advances in Radial Basis Function Networks
Mark J. L. Orr
Institute for Adaptive and Neural Computation
Division of Informatics, Edinburgh University
Edinburgh EH8 9LW, Scotland, UK
June 1999
Abstract
In 1996 an Introduction to Radial Basis Function Networks was published on the web, along with a package of Matlab functions. The emphasis was on the linear character of RBF networks and on two techniques borrowed from statistics: forward selection and ridge regression.

This document is an update on developments between 1996 and 1999 and is associated with a second version of the Matlab package. Improvements have been made to the forward selection and ridge regression methods, and a new method, which is a cross between regression trees and RBF networks, has been developed.
mjo@anc.ed.ac.uk
Introduction: www.anc.ed.ac.uk/~mjo/papers/intro.ps
First Matlab package: www.anc.ed.ac.uk/~mjo/software/rbf.zip
This document: www.anc.ed.ac.uk/~mjo/papers/recad.ps
Second Matlab package: www.anc.ed.ac.uk/~mjo/software/rbf2.zip
Contents
1 Introduction
  1.1 MacKay's Hermite Polynomial
  1.2 Friedman's Simulated Circuit
2 Maximum Marginal Likelihood
  2.1 Introduction
  2.2 Review
  2.3 The EM Algorithm
  2.4 The DM Algorithm
  2.5 Conclusions
3 Optimising the Size of RBFs
  3.1 Introduction
  3.2 Review
  3.3 Efficient Re-estimation of λ
  3.4 Avoiding Local Minima
  3.5 The Optimal RBF Size
  3.6 Trial Values in Other Contexts
  3.7 Conclusions
4 Regression Trees and RBF Networks
  4.1 Introduction
  4.2 The Basic Idea
  4.3 Generating the Regression Tree
  4.4 From Hyperrectangles to RBFs
  4.5 Selecting the Subset of RBFs
  4.6 The Best Parameter Values
  4.7 Demonstrations
  4.8 Conclusions
5 Appendix
  A Applying the EM Algorithm
  B The Eigensystem of HH⊤
1 Introduction
In 1996 an introduction to radial basis function (RBF) networks was published on the web, along with an associated Matlab software package. The approach taken stressed the linear character of RBF networks, which traditionally have only a single hidden layer, and borrowed techniques from statistics, such as forward selection and ridge regression, as strategies for controlling model complexity, the main challenge facing all methods of nonparametric regression.
That was three years ago. Since then, some improvements have been made, a new algorithm devised, and the package of Matlab functions is now in its second version. This document describes the theory of the new developments and will be of interest to practitioners using the new software package and to theorists enhancing existing methods or developing new ones.
Section 2 describes what happens when the expectation-maximisation algorithm is applied to RBF networks. Section 3 describes a simple procedure for optimising the RBF widths, particularly for ridge regression. Finally, section 4 describes the new algorithm, which uses a regression tree to generate the centres and sizes of a set of candidate RBFs and to help select a subset of these for the network. Two simulated data sets, used for demonstration, are described below.
1.1 MacKay's Hermite Polynomial
The first data set is from MacKay and is based on a one-dimensional Hermite polynomial,

    y = 1 + (1 − x + 2x²) e^(−x²) .

Input values are sampled uniformly at random in the range −4 ≤ x ≤ 4 and zero-mean Gaussian noise is added to the outputs (figure 1.1).
Figure 1.1: Sample Hermite data (stars) and the actual function (curve).
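For readers who want to reproduce the experiments, the data set can be generated as follows. This is a minimal Python/NumPy sketch rather than the accompanying Matlab code, and the sample count (100) and noise standard deviation (0.1) are illustrative assumptions.

```python
import numpy as np

def hermite(x):
    # MacKay's Hermite polynomial: y = 1 + (1 - x + 2x^2) exp(-x^2)
    return 1.0 + (1.0 - x + 2.0 * x**2) * np.exp(-x**2)

def make_hermite_data(p=100, noise_sd=0.1, seed=0):
    # Inputs sampled uniformly in [-4, 4]; Gaussian noise added to the outputs.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4.0, 4.0, size=p)
    y = hermite(x) + rng.normal(0.0, noise_sd, size=p)
    return x, y
```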
1.2 Friedman's Simulated Circuit
The second data set simulates an alternating current circuit with four parameters: resistance (R ohms), angular frequency (ω radians per second), inductance (L henries) and capacitance (C farads), in the ranges

    0 ≤ R ≤ 100 ,
    40π ≤ ω ≤ 560π ,
    0 ≤ L ≤ 1 ,
    1 × 10⁻⁶ ≤ C ≤ 11 × 10⁻⁶ .

Random samples of the four parameters in these ranges were used to generate corresponding values of the impedance,

    Z = √( R² + (ωL − 1/(ωC))² ) ,

to which zero-mean Gaussian noise was added. This resulted in a training set with four-dimensional inputs x = (R, ω, L, C)⊤ and a scalar output y = Z. The problem originates with Friedman. Before applying any learning algorithms to this data, the original inputs, with their very different dynamic ranges, are rescaled to the range [0, 1] in each component.
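A sketch of this data set in Python/NumPy, using the parameter ranges above; the sample count (200) and the default of zero output noise are illustrative assumptions.

```python
import numpy as np

def make_circuit_data(p=200, noise_sd=0.0, seed=0):
    # Draw the four circuit parameters uniformly from their ranges.
    rng = np.random.default_rng(seed)
    R = rng.uniform(0.0, 100.0, p)              # resistance (ohms)
    w = rng.uniform(40*np.pi, 560*np.pi, p)     # angular frequency (rad/s)
    L = rng.uniform(0.0, 1.0, p)                # inductance (henries)
    C = rng.uniform(1e-6, 11e-6, p)             # capacitance (farads)
    Z = np.sqrt(R**2 + (w*L - 1.0/(w*C))**2)    # impedance
    y = Z + rng.normal(0.0, noise_sd, p)        # noisy output
    X = np.column_stack([R, w, L, C])
    # Rescale each input component to [0, 1] to equalise the dynamic ranges.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return X, y
```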
2 Maximum Marginal Likelihood
2.1 Introduction
The expectation-maximisation (EM) algorithm performs maximum likelihood estimation for problems in which some of the variables are unobserved. Recently it has been successfully applied to density estimation and to probabilistic principal components, for example. This section discusses the application of EM to RBF networks.
First we review the probability model of a linear neural network and arrive at an expression for the marginal likelihood of the data. It is this likelihood which we ultimately want to maximise. Then we show the results of applying the EM algorithm: a pair of re-estimation formulae for the model parameters. However, it turns out that a similar pair of re-estimation formulae can be derived by a simpler method, and that they converge more rapidly than the EM versions. Finally, we draw some conclusions.
2.2 Review
The model estimated by a linear neural network from noisy samples {(xᵢ, yᵢ)}, i = 1 … p, can be written

    f(x) = Σⱼ₌₁ᵐ wⱼ hⱼ(x) ,    (2.1)

where the {hⱼ} are fixed basis functions and the {wⱼ} are unknown weights (to be estimated). The vector of residual errors between model and data is

    e = y − H w ,

where H is the design matrix with elements Hᵢⱼ = hⱼ(xᵢ). In a Bayesian approach to analysing the estimation process, the a priori probability of the weights w can be modelled as a zero-mean Gaussian of variance ς²:

    p(w) ∝ ς^(−m) exp( − w⊤w / (2ς²) ) .    (2.2)
The conditional probability of the data y given the weights w can also be modelled as a Gaussian, with variance σ², to account for the noise included in the outputs of the training set, {yᵢ}:

    p(y|w) ∝ σ^(−p) exp( − e⊤e / (2σ²) ) .    (2.3)
The joint probability of data and weights is the product of p(w) with p(y|w) and can be represented as an equivalent cost function by taking logarithms, multiplying by −2 and dropping constant terms, to obtain

    E(y, w) = p ln σ² + m ln ς² + e⊤e/σ² + w⊤w/ς² .    (2.4)
The conditional probability of the weights w given the data y is found using Bayes' rule, again involves the product of (2.2) with (2.3), and is another Gaussian:

    p(w|y) = p(y|w) p(w) / p(y)
           ∝ |W|^(−1/2) exp( − (w − ŵ)⊤ W⁻¹ (w − ŵ) / 2 ) ,    (2.5)

where

    ŵ = A⁻¹ H⊤ y ,
    W = σ² A⁻¹ ,
    A = H⊤H + λ Iₘ ,
    λ = σ²/ς² .    (2.6)
Finally, the marginal likelihood of the data is

    p(y) = ∫ p(y|w) p(w) dw
         ∝ σ^(−p) |P|^(1/2) exp( − y⊤P y / (2σ²) ) ,    (2.7)

where

    P = I_p − H A⁻¹ H⊤ .

Note that there is an equivalent cost function for p(y), obtained by taking logarithms, multiplying by −2 and dropping the constant terms:

    E(y) = p ln σ² − ln |P| + y⊤P y / σ² .    (2.8)
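These quantities are straightforward to compute. The following sketch (Python/NumPy, an illustration of the equations rather than the accompanying Matlab code) evaluates the cost function for p(y) from H, y and trial values of σ² and ς²:

```python
import numpy as np

def marginal_cost(H, y, sigma2, varsigma2):
    # E(y) = p ln(sigma^2) - ln|P| + y'P y / sigma^2, with
    # A = H'H + lambda*I, lambda = sigma^2/varsigma^2, P = I - H A^{-1} H'.
    p, m = H.shape
    lam = sigma2 / varsigma2
    A = H.T @ H + lam * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)
    sign, logdetP = np.linalg.slogdet(P)
    return p * np.log(sigma2) - logdetP + (y @ P @ y) / sigma2
```

Because every eigenvalue of P lies in (0, 1], the log-determinant is well defined.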
2.3 The EM Algorithm
The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an expectation (E) step, which finds the distribution of the unobserved variables, and a maximisation (M) step, which re-estimates the parameters of the model to be those with the maximum likelihood for the observed and missing data combined.
In the context of a linear neural network it is natural to treat the training set {(xᵢ, yᵢ)} as the observed data, the weights {wⱼ} as the missing data, and the noise variance σ² and the a priori weight variance ς² as the model parameters.
In the E-step, the expectation of the conditional probability of the missing data (2.5) is taken and substituted, in the M-step, into the joint probability of the combined data, or its equivalent cost function (2.4), which is then optimised with respect to the model parameters σ² and ς². These two steps are guaranteed to increase the marginal probability of the observed data and, when iterated, converge to a local maximum.
Detailed analysis (see appendix A) results in a pair of re-estimation formulae for the parameters σ² and ς²:

    σ² ← ( ê⊤ê + γ σ² ) / p ,    (2.9)
    ς² ← ( ŵ⊤ŵ + (m − γ) ς² ) / m ,    (2.10)

where

    ê = y − H ŵ ,
    γ = m − λ tr A⁻¹ .
Initial guesses are substituted into the right-hand sides, which produce new guesses. The process is repeated until a local minimum of (2.8) is reached.
Note that these re-estimation equations have also been derived by a free energy approach; it has been shown that free energy and the EM algorithm are intimately connected.
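In code, the EM iteration looks like this (a Python/NumPy sketch, assuming H and y are given; not the package implementation):

```python
import numpy as np

def em_reestimate(H, y, sigma2, varsigma2, n_iter=100):
    # Iterate the EM updates for the noise variance sigma^2 and the
    # prior weight variance varsigma^2.
    p, m = H.shape
    for _ in range(n_iter):
        lam = sigma2 / varsigma2
        Ainv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = Ainv @ H.T @ y                 # optimal weights (2.6)
        gamma = m - lam * np.trace(Ainv)       # effective number of parameters
        e_hat = y - H @ w_hat                  # residuals
        sigma2 = (e_hat @ e_hat + gamma * sigma2) / p              # (2.9)
        varsigma2 = (w_hat @ w_hat + (m - gamma) * varsigma2) / m  # (2.10)
    return sigma2, varsigma2
```

Both updates are convex combinations of nonnegative quantities, so the variances remain positive throughout.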
Figure 2.1 illustrates with the Hermite data described in section 1.1. An RBF centre of fixed radius was created for each training set input. The figure plots logarithmic contours of (2.8) and the sequence of σ² and ς² values re-estimated by (2.9)–(2.10).
Figure 2.1: Optimisation of σ² and ς² by EM.
2.4 The DM Algorithm
An alternative approach to minimising (2.8) is simply to differentiate it and set the results to zero. This is easily done and results in the pair of re-estimation formulae

    σ² ← ê⊤ê / (p − γ) ,    (2.11)
    ς² ← ŵ⊤ŵ / γ .    (2.12)
I call this method the "DM algorithm" after David MacKay, who first derived these equations. Its disadvantage is the absence of any guarantee that the iterations converge, unlike their EM counterparts (2.9)–(2.10), which are known to increase the marginal likelihood (or leave it the same if a fixed point has been reached). Any fixed point of DM is also a fixed point of EM, and vice versa, but if there are multiple fixed points there is no guarantee that both methods will converge to the same one, even when starting from the same guess.
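The DM updates can be sketched in the same style (Python/NumPy, illustrative only); only the final two update lines differ from the EM versions:

```python
import numpy as np

def dm_reestimate(H, y, sigma2, varsigma2, n_iter=100):
    # MacKay's re-estimation of sigma^2 and varsigma^2.
    p, m = H.shape
    for _ in range(n_iter):
        lam = sigma2 / varsigma2
        Ainv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = Ainv @ H.T @ y
        gamma = m - lam * np.trace(Ainv)
        e_hat = y - H @ w_hat
        sigma2 = (e_hat @ e_hat) / (p - gamma)   # (2.11)
        varsigma2 = (w_hat @ w_hat) / gamma      # (2.12)
    return sigma2, varsigma2
```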
Figure 2.2 plots the sequence of re-estimated values using (2.11)–(2.12) for the same training set, RBF network and initial values of σ² and ς² used for figure 2.1. It is apparent that convergence is faster for DM than for EM in this example, DM requiring far fewer iterations. In fact, our empirical observation is that DM always converges considerably faster than EM if they start from the same guess and converge to the same local minimum. Furthermore, DM has never failed to converge.
Figure 2.2: Optimisation of σ² and ς² by DM.
2.5 Conclusions
We started by applying the EM algorithm to RBF networks using the weight-decay (ridge regression) style of penalised likelihood and ended with a pair of re-estimation formulae for the noise variance σ² and the prior weight variance ς². However, these turned out to be less efficient than a similar pair of formulae which had been known in the literature for some time.
The rbf_rr_2 method in the Matlab software package has an option to use maximum marginal likelihood (MML) as the model selection criterion (instead of GCV or BIC, for example). When this option is selected, the regularisation parameter λ (2.6) is re-estimated using, by default, the DM equations (2.11)–(2.12). Another option can be set so that the EM versions (2.9)–(2.10) are used instead.
3 Optimising the Size of RBFs
3.1 Introduction
In previous work we concentrated on methods for optimising the regularisation parameter, λ, of an RBF network. However, another key parameter is the size of the RBFs, and until now no methods have been provided for its optimisation. This section describes a simple scheme to find an overall scale size for the RBFs in a network.
We first review the basic concepts already covered elsewhere and then describe an improved version of the re-estimation formula for the regularisation parameter which is considerably more efficient and allows multiple initial guesses for λ to be optimised in an effort to avoid getting trapped in local minima (the details are given in appendix B). We then describe a method for choosing the best overall size for the RBFs from a number of trial values, which is rendered tractable by the efficient optimisation of λ. We end with some concluding remarks.
3.2 Review
In a linear model with fixed basis functions {hⱼ} and weights {wⱼ},

    f(x) = Σⱼ₌₁ᵐ wⱼ hⱼ(x) ,    (3.1)

the model complexity can be controlled by the addition of a penalty term to the sum of squared errors over the training set, {(xᵢ, yᵢ)}. When this combined error,

    E = Σᵢ₌₁ᵖ ( yᵢ − f(xᵢ) )² + λ Σⱼ₌₁ᵐ wⱼ² ,

is optimised, large components in the weight vector w are inhibited. This kind of penalty is known as ridge regression or weight-decay, and the parameter λ, which controls the amount of penalty, is known as the regularisation parameter. While the nominal number of free parameters is m (the weights), the effective number is less, due to the penalty term, and is given by

    γ = m − λ tr A⁻¹ ,    (3.2)
    A = H⊤H + λ Iₘ ,    (3.3)

where H is the design matrix with elements Hᵢⱼ = hⱼ(xᵢ). The expression for γ is monotonic in λ, so model complexity can be decreased (or increased) by raising (or lowering) the value of λ.
The parameter λ has a Bayesian interpretation: it is the ratio of σ², the variance of the noise corrupting the training set outputs, to ς², the a priori variance of the weights (see section 2). If the value of λ is known then the optimal weight vector is

    ŵ = A⁻¹ H⊤ y .    (3.4)
However, neither σ² nor ς² may be available in a practical situation, so it is usually necessary to establish an effective value for λ in parallel with optimising the weights. This may be done with a model selection criterion such as BIC (Bayesian information criterion), GCV (generalised cross-validation) or MML (maximum marginal likelihood; see section 2), and in particular with one or more re-estimation formulae. For GCV the formula is

    λ ← ( ê⊤ê η ) / ( ŵ⊤A⁻¹ŵ (p − γ) ) ,    (3.5)

where

    ê = y − H ŵ ,
    η = tr( A⁻¹ − λ A⁻² ) .
An initial guess for λ is used to evaluate the right-hand side of (3.5), which produces a new guess. The resulting sequence of re-estimated values converges to a local minimum of GCV. Each iteration requires the inverse of the m-by-m matrix A and therefore costs of order m³ floating point operations.
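This iteration can be transcribed directly (a Python/NumPy sketch; the explicit inverse makes each step O(m³)):

```python
import numpy as np

def gcv_reestimate(H, y, lam, n_iter=50):
    # Iterate the GCV re-estimation formula (3.5), inverting the
    # m-by-m matrix A at every step.
    p, m = H.shape
    for _ in range(n_iter):
        Ainv = np.linalg.inv(H.T @ H + lam * np.eye(m))
        w_hat = Ainv @ H.T @ y
        e_hat = y - H @ w_hat
        gamma = m - lam * np.trace(Ainv)
        eta = np.trace(Ainv - lam * (Ainv @ Ainv))
        lam = (e_hat @ e_hat) * eta / ((w_hat @ Ainv @ w_hat) * (p - gamma))
    return lam
```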
3.3 Efficient Re-estimation of λ
The optimisation of λ by iteration of the re-estimation formula is burdened by the necessity of computing an expensive matrix inverse at every iteration. However, by reformulating the individual terms of the equation using the eigenvalues and eigenvectors of HH⊤, it is possible to perform most of the work during the first iteration and reuse the results in subsequent ones. Thus the amount of computation required to complete an optimisation which takes q steps to converge is reduced almost q-fold. Unfortunately, the technique only works for a single global regularisation parameter, not for multiple parameters applying to different groups of weights or to individual weights.
Suppose the eigenvalues and eigenvectors of HH⊤ are {μᵢ} and {uᵢ}, i = 1 … p, and that the projections of y onto the eigenvectors are ỹᵢ = y⊤uᵢ. Then, as shown in appendix B, the four terms involved in the re-estimation formula (3.5) are

    η = Σᵢ₌₁ᵖ μᵢ / (μᵢ + λ)² ,    (3.6)
    p − γ = Σᵢ₌₁ᵖ λ / (μᵢ + λ) ,    (3.7)
    ê⊤ê = Σᵢ₌₁ᵖ λ² ỹᵢ² / (μᵢ + λ)² ,    (3.8)
    ŵ⊤A⁻¹ŵ = Σᵢ₌₁ᵖ μᵢ ỹᵢ² / (μᵢ + λ)³ .    (3.9)
If λ is re-estimated by computing (3.6)–(3.9) instead of explicitly calculating the inverse in (3.5), then the computational cost of each iteration is only of order p instead of m³. The overhead of initially calculating the eigensystem, which is of order p³, has to be taken into account, but it is only incurred once. For problems in which p is not much bigger than m this represents a significant saving in computation time and makes it feasible to optimise multiple guesses for the initial value of λ to decrease the chances of getting caught in a local minimum.
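A sketch of the efficient scheme (Python/NumPy): the eigendecomposition is performed once, after which each re-estimation step costs only O(p). The clipping of tiny negative eigenvalues is an illustrative numerical guard.

```python
import numpy as np

def gcv_reestimate_fast(H, y, lam, n_iter=50):
    # One-off O(p^3) eigendecomposition of HH'.
    mu, U = np.linalg.eigh(H @ H.T)
    mu = np.clip(mu, 0.0, None)     # guard against tiny negative eigenvalues
    yt = U.T @ y                    # projections of y onto the eigenvectors
    for _ in range(n_iter):         # each pass is O(p)
        d = mu + lam
        eta = np.sum(mu / d**2)                   # (3.6)
        p_minus_gamma = np.sum(lam / d)           # (3.7)
        ee = np.sum(lam**2 * yt**2 / d**2)        # (3.8)
        wAw = np.sum(mu * yt**2 / d**3)           # (3.9)
        lam = ee * eta / (wAw * p_minus_gamma)    # (3.5)
    return lam
```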
3.4 Avoiding Local Minima
If the initial guess for λ is close to a local minimum of GCV (or of whatever model selection criterion is employed) then re-estimation using (3.5) is likely to get trapped there. We illustrate using Friedman's data set as described in section 1.2, with an RBF network of Gaussian centres coincident with the inputs of the training set and of fixed radius.
The solid curve in figure 3.1 shows the variation of GCV with λ. The open circles show a sequence of re-estimated λ values with their corresponding GCV scores. The sequence converged to a local minimum (shown by the closed circle), and the global minimum at a smaller value of λ was missed.
Figure 3.1: The variation of GCV with λ for Friedman's problem and a sequence of re-estimations trapped at a local minimum.
Compare figure 3.1 with figure 3.2, where the only change was a different initial guess for λ. This time the guess is sufficiently close to the global minimum that the re-estimations are attracted towards it. Note that the sets of eigenvalues and eigenvectors used to compute the sequences in figures 3.1 and 3.2 are identical. Since the calculation of the eigensystem dominates the other computational costs, it is almost as expensive to optimise one trial value as it is to optimise several. Thus, to avoid falling into a local minimum, several trial values spread over a wide range can be optimised and the solution with the lowest GCV selected as the overall winner. This value can then be used to determine the weights (3.4) and ultimately the predictions of the network.
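The multiple-restart strategy might be sketched as follows (Python/NumPy; the particular set of initial guesses is an illustrative assumption):

```python
import numpy as np

def best_lambda(H, y, guesses=(1e-6, 1e-4, 1e-2, 1.0), n_iter=50):
    # Optimise lambda from several initial guesses, reusing one eigensystem,
    # and keep the converged value with the lowest GCV score.
    p = H.shape[0]
    mu, U = np.linalg.eigh(H @ H.T)
    mu = np.clip(mu, 0.0, None)
    yt = U.T @ y

    def gcv(lam):                       # GCV = p e'e / (p - gamma)^2
        d = mu + lam
        return p * np.sum(lam**2 * yt**2 / d**2) / np.sum(lam / d)**2

    def optimise(lam):                  # re-estimation as in section 3.3
        for _ in range(n_iter):
            d = mu + lam
            eta = np.sum(mu / d**2)
            pg = np.sum(lam / d)
            ee = np.sum(lam**2 * yt**2 / d**2)
            wAw = np.sum(mu * yt**2 / d**3)
            lam = ee * eta / (wAw * pg)
        return lam

    return min((optimise(g) for g in guesses), key=gcv)
```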
Figure 3.2: Same as figure 3.1 except with an initial guess for λ close to the global minimum.
3.5 The Optimal RBF Size
For Gaussian radial functions of fixed width r, the transfer functions of the hidden units are

    hⱼ(x) = exp( − (x − cⱼ)⊤(x − cⱼ) / r² ) .
Unfortunately, there is no re-estimation formula for r, as there is for λ, even in this simple case where the same scale is used for each RBF and each component of the input. To properly optimise the value of r would thus require the use of a nonlinear optimisation algorithm and would have to incorporate the optimisation of λ (since the optimal value of λ changes as r changes).
An alternative, if rather crude, approach is to test a number of trial values for r. For each value an optimal λ is calculated (using the re-estimation method above) and the model selection score noted. When all the values have been checked, the one associated with the lowest score wins. The computational cost of this procedure is dominated, once again, by the cost of computing the eigenvalues and eigenvectors of HH⊤, and these have to be calculated separately for each value of r.
While this procedure is less computationally demanding than a full nonlinear optimisation of r and λ, its drawback is that it is only capable of identifying the best value of r from a finite number of alternatives. On the other hand, given that the value of λ is fully optimised and that the model selection criteria are heuristic (in other words, approximate) in nature, it is arguable that a more precise location for the optimal value of r is unlikely to have much practical significance.
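Putting the pieces together for Gaussian RBFs centred on the training inputs (a Python/NumPy sketch; the trial radii, the fixed iteration count and the initial λ are illustrative choices):

```python
import numpy as np

def gaussian_design(X, r):
    # H_ij = exp(-||x_i - x_j||^2 / r^2), with centres at the training inputs.
    D2 = ((X[:, None, :] - X[None, :, :])**2).sum(axis=2)
    return np.exp(-D2 / r**2)

def best_radius(X, y, radii=(0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6)):
    # For each trial radius: new eigensystem, optimise lambda, score with GCV.
    p = len(y)
    best = None
    for r in radii:
        H = gaussian_design(X, r)
        mu, U = np.linalg.eigh(H @ H.T)
        mu = np.clip(mu, 0.0, None)
        yt = U.T @ y
        lam = 1e-2
        for _ in range(50):                      # re-estimation (section 3.3)
            d = mu + lam
            lam = (np.sum(lam**2 * yt**2 / d**2) * np.sum(mu / d**2)
                   / (np.sum(mu * yt**2 / d**3) * np.sum(lam / d)))
        d = mu + lam
        gcv = p * np.sum(lam**2 * yt**2 / d**2) / np.sum(lam / d)**2
        if best is None or gcv < best[2]:
            best = (r, lam, gcv)
    return best  # (radius, lambda, GCV score)
```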
We illustrate the method on the Hermite data described in section 1.1. Once again we use each training set input as an RBF centre. Seven different trial values for r were tried: 0.4, 0.6, 0.8, 1.0, 1.2, 1.4 and 1.6. For each trial value, figure 3.3 plots the variation of GCV with λ (the curves) as well as the optimal λ found by re-estimation as described above (the closed circles). The radius value which led to the lowest GCV score, together with its corresponding optimal regularisation parameter, is the one selected.

Figure 3.3: The Hermite data set with seven sizes of RBFs (log(GCV) against log(λ), one curve per radius).

Initially, as r increases from its lowest value, the GCV score at the optimal λ decreases, until it eventually reaches its lowest value. Above that there is not much increase in optimised GCV, although the optimal λ decreases rapidly.
3.6 Trial Values in Other Contexts
The use of trial values is limited to cases where there is a small number of parameters to optimise, such as the single parameter r. If there are several parameters with trial values then the number of different combinations to evaluate can easily become prohibitively large. In RBF networks where there is a separate scale parameter for each basis function and each dimension, so that the transfer functions are, for example in the case of Gaussians,

    hⱼ(x) = exp( − Σₖ₌₁ⁿ (xₖ − cⱼₖ)² / rⱼₖ² ) ,

there would be tᵐⁿ combinations to check, where t is the number of trial values for each rⱼₖ, m is the number of basis functions and n is the number of dimensions. However, it is possible to test trial values for an overall scale size α if some other mechanism can be used to generate the scales rⱼₖ. Here, the transfer functions are

    hⱼ(x) = exp( − Σₖ₌₁ⁿ (xₖ − cⱼₖ)² / (α² rⱼₖ²) ) .

This is the approach taken for the method in section 4, where a regression tree determines the values of rⱼₖ but the overall scale size α is optimised by testing trial values.
3.7 Conclusions
We have shown how trial values for the overall size of the RBFs can be compared using a model selection criterion. In the case of ridge regression, an efficient method for optimising the regularisation parameter helps reduce the computational burden of training a separate network for each trial value. However, the same technique can also be used with other methods of complexity control, including those in which there is no regularisation.
In the Matlab software package, each method can be configured with a set of trial values for the overall RBF scale. The best value is chosen and used to generate the RBF network which the Matlab function returns.
4 Regression Trees and RBF Networks
4.1 Introduction
This section is about a novel method for nonparametric regression involving a combination of regression trees and RBF networks. The basic idea of a regression tree is to recursively partition the input space in two and to approximate the function in each half by the average output value of the samples it contains. Each split is parallel to one of the axes, so it can be expressed by an inequality involving one of the input components (e.g. xₖ > b). The input space is thus divided into hyperrectangles organised into a binary tree, where each branch is determined by the dimension (k) and boundary (b) which together minimise the residual error between model and data.
A benefit of regression trees is the information provided in the split statistics about the relevance of each input variable: the components which carry the most information about the output tend to be split earliest and most often. A weakness of regression trees is the discontinuous model caused by the output value jumping across the boundary between two hyperrectangles. There is also the problem of deciding when to stop growing the tree (or, equivalently, how much to prune after it has fully grown), which is the familiar bias-variance dilemma faced by all methods of nonparametric regression. The use of radial basis functions in conjunction with regression trees can help to solve both these problems.
Below we outline the basic method of combining RBFs and regression trees as it appeared originally, describe our version of this idea, and explain why we think it is an improvement. Finally, we show some results and summarise our conclusions.
4.2 The Basic Idea
The combination of trees and RBF networks was first suggested by Kubat, in the context of classification rather than regression (though the two cases are very similar). Further elaboration of the idea appeared later. Essentially, each terminal node of the classification tree contributes one hidden unit to the RBF network, the centre and radius of which are determined by the position and size of the corresponding hyperrectangle. Thus the tree sets the number, positions and sizes of all RBFs in the network. Model complexity is controlled by two parameters: c, which determines the amount of tree pruning in C4.5 (the software package used by Kubat to generate classification trees), and α, which fixes the size of the RBFs relative to the hyperrectangles.
Our major reservation about this approach is its treatment of model complexity. In the case of the scaling parameter α, the author claimed it had little effect on prediction accuracy, but this is not in accord with our previous experience of RBF networks. As for the amount of pruning (c), he demonstrated its effect on prediction accuracy yet used a fixed value in his benchmark tests. Moreover, there was no discussion of how to control scaling and pruning to optimise model complexity for a given data set.
Our method is a variation on Kubat's with the following alterations.
1. We address the model complexity issue by using the nodes of the regression tree not to fix the RBF network but rather to generate a set of candidate RBFs from which the final network is selected. Thus the burden of controlling model complexity shifts from tree generation to RBF selection.
2. The regression tree from which the RBFs are produced can also be used to order selections, such that certain candidate RBFs are allowed to enter the model before others. We describe one way to achieve such an ordering and demonstrate that it produces more accurate models than plain forward selection.
3. We show that, contrary to the conclusions of the original work, the method is typically quite sensitive to the parameter α, and we discuss its optimisation by the use of multiple trial values.
4.3 Generating the Regression Tree
The first stage of our method (and Kubat's) is to generate a regression tree. The root node of the tree is the smallest hyperrectangle which contains all the training set inputs, {xᵢ}. Its size sₖ (the half-width) and centre cₖ in each dimension k are

    sₖ = ( maxᵢ∈S (xᵢₖ) − minᵢ∈S (xᵢₖ) ) / 2 ,
    cₖ = ( maxᵢ∈S (xᵢₖ) + minᵢ∈S (xᵢₖ) ) / 2 ,

where S = {1, 2, …, p} is the set of training set indices. A split of the root node divides the training samples into left and right subsets, S_L and S_R, on either side of a boundary b in one of the dimensions k, such that

    S_L = { i : xᵢₖ ≤ b } ,
    S_R = { i : xᵢₖ > b } .
The mean output value on either side of the split is

    ȳ_L = (1/p_L) Σᵢ∈S_L yᵢ ,
    ȳ_R = (1/p_R) Σᵢ∈S_R yᵢ ,

where p_L and p_R are the number of samples in each subset. The residual squared error between model and data is then

    E(k, b) = (1/p) [ Σᵢ∈S_L (yᵢ − ȳ_L)² + Σᵢ∈S_R (yᵢ − ȳ_R)² ] .
The split which minimises E(k, b) over all possible choices of k and b is used to create the children of the root node and is easily found by discrete search over the n dimensions and p cases. The children of the root node are split recursively in the same manner, and the process terminates when a node cannot be split without creating a child containing fewer samples than a given minimum, p_min, which is a parameter of the method. Compared to their parent nodes, the child centres are shifted and their sizes reduced in the k-th dimension.
Since the size of the regression tree does not determine the model complexity, there is no need to perform the final pruning step normally associated with recursive splitting methods.
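The tree-growing stage can be sketched in Python/NumPy. The node representation (a dict with "centre", "size" and "children") is an implementation choice, the 1/p factor in E(k, b) is dropped since it does not affect the argmin, and the stopping rule shown (reject the best split if it violates p_min) is a simplification of the rule described above.

```python
import numpy as np

def best_split(X, y, idx):
    # Search all dimensions k and boundaries b for the split minimising the
    # residual squared error E(k, b) (up to the constant 1/p factor).
    best = None
    for k in range(X.shape[1]):
        order = idx[np.argsort(X[idx, k])]
        ys = y[order]
        for j in range(1, len(order)):
            if X[order[j-1], k] == X[order[j], k]:
                continue                  # no boundary between equal values
            L, R = ys[:j], ys[j:]
            err = ((L - L.mean())**2).sum() + ((R - R.mean())**2).sum()
            if best is None or err < best[0]:
                best = (err, k, 0.5 * (X[order[j-1], k] + X[order[j], k]))
    return best   # (error, dimension k, boundary b), or None if unsplittable

def grow_tree(X, y, idx, p_min=5):
    # A node is a hyperrectangle (centre, half-width size) plus optional children.
    lo, hi = X[idx].min(axis=0), X[idx].max(axis=0)
    node = {"centre": 0.5 * (hi + lo), "size": 0.5 * (hi - lo), "children": None}
    split = best_split(X, y, idx)
    if split is not None:
        _, k, b = split
        left, right = idx[X[idx, k] <= b], idx[X[idx, k] > b]
        if min(len(left), len(right)) >= p_min:   # stop rule via p_min
            node["children"] = (grow_tree(X, y, left, p_min),
                                grow_tree(X, y, right, p_min))
    return node
```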
4.4 From Hyperrectangles to RBFs
The regression tree contains a root node, some nonterminal nodes (having children) and some terminal nodes (having no children). Each node is associated with a hyperrectangle of input space having a centre c and size s, as described above. The node corresponding to the largest hyperrectangle is the root node, and the node sizes decrease down the tree as the hyperrectangles are divided into smaller and smaller pieces. To translate a hyperrectangle into a Gaussian RBF we use its centre c as the RBF centre and its size s, scaled by a parameter α, as the RBF radius: r = α s. The scalar α has the same value for all nodes and is another parameter of the method (in addition to p_min). Our α is not quite the same as Kubat's (they are related by an inverse and a factor of √2) but plays exactly the same role.
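The translation itself is a one-liner per dimension. A sketch, assuming a node is represented as a dict with "centre" and "size" entries (an implementation choice), with an illustrative guard for zero-width dimensions:

```python
import numpy as np

def node_to_rbf(node, alpha=1.0):
    # Centre c from the hyperrectangle; per-dimension radii r_k = alpha * s_k.
    c = np.asarray(node["centre"], dtype=float)
    r = alpha * np.asarray(node["size"], dtype=float)
    r = np.where(r > 0, r, 1.0)     # zero-width dimensions get a unit radius
    def h(x):
        return float(np.exp(-np.sum((np.asarray(x, dtype=float) - c)**2 / r**2)))
    return h
```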
4.5 Selecting the Subset of RBFs
After the tree nodes are translated into RBFs, the next step of our method is to select a subset of them for inclusion in the model. This is in contrast to the original method, where all RBFs from terminal nodes were included in the model, which was thus heavily dependent on the extent of tree pruning to control model complexity. Selection can be performed either using a standard method, such as forward selection, or in a novel way, by employing the tree to guide the order in which candidate RBFs are considered.
In the standard methods for subset selection, the RBFs generated from the regression tree are treated as an unstructured collection, with no distinction between RBFs corresponding to different nodes in the tree. However, intuition suggests that the best order in which to consider RBFs for inclusion in the model is large ones first and small ones last, to synthesise coarse structure before fine details. This, in turn, suggests searching for RBF candidates by traversing the tree from the largest hyperrectangle (and RBF) at the root to the smallest hyperrectangles (and RBFs) at the terminal nodes. Thus the first decision should be whether to include the root node in the model, the second whether to include any of the children of the root node, and so on, until the terminal nodes are reached.
The scheme we eventually developed for selecting RBFs goes somewhat beyond this simple picture and was influenced by two other considerations. The first concerns a classic problem with forward selection, namely that one regressor can block the selection of other, more explanatory regressors which would have been chosen in preference had they been considered first. In our case there was a danger that a parent RBF could block its own children. To avoid this situation, when considering whether to add the children of a node which had already been selected, we also considered the effect of deleting the parent. Thus our method has a measure of backward elimination as well as forward selection. This is reminiscent of the selection schemes developed for the MARS and MAPS algorithms.
A second reason for departing from a simple breadth-first search is that the size of a hyperrectangle (in terms of volume) on one level is not guaranteed to be smaller than the sizes of all the hyperrectangles on the level above (only that of its parent), so it is not easy to achieve a strict largest-to-smallest ordering. In view of this, we abandoned any attempt at a strict ordering and instead devised a search algorithm which dynamically adjusts the set of selectable RBFs by replacing selected RBFs with their children.
The algorithm depends on the concept of an active list of nodes. At any given moment during the selection process, only these nodes and their children are considered for inclusion in or exclusion from the model. Every time RBFs are added to or subtracted from the model, the active list expands by having a node replaced by its children. Eventually the active list becomes coincident with the terminal nodes and the search is terminated. In detail, the steps of the algorithm are as follows.
1. Initialise the active list with the root node, and the model with the root node's RBF.
2. For all nonterminal nodes on the active list consider the effect (on the model selection criterion) of adding both or just one of the children's RBFs (three possible modifications to the model). If the parent's RBF is already in the model, also consider the effect of first removing it before adding one or both children's RBFs, or of just removing it (a further four possible modifications).
3. The total number of possible adjustments to the model is somewhere between three and seven times the number of active nonterminal nodes, depending on how many of their RBFs are already in the model. From all these possibilities choose the one which most decreases the model selection criterion. Update the current model and remove the node involved from the active list, replacing it with its children. If none of the modifications decreases the selection criterion then choose one of the active nodes at random and replace it by its children, but leave the model unaltered.
4. Return to step 2 and repeat until all the active nodes are terminal nodes.
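As a concrete illustration, the selection loop above can be sketched in a few lines. This is a minimal Python sketch, not the package's Matlab implementation: the `Node` class, the `bic` helper and the way trial models are scored are all hypothetical stand-ins for the real machinery.

```python
import random

import numpy as np

class Node:
    """A regression-tree node; `rbf` is the column index of its candidate RBF."""
    def __init__(self, rbf, children=()):
        self.rbf = rbf
        self.children = list(children)

def bic(model, H, y):
    """Bayesian information criterion for the column subset `model` of H."""
    p = len(y)
    Hm = H[:, sorted(model)]
    w, *_ = np.linalg.lstsq(Hm, y, rcond=None)
    sse = float(np.sum((y - Hm @ w) ** 2))
    return p * np.log(sse / p + 1e-12) + len(model) * np.log(p)

def tree_guided_select(root, H, y, rng):
    model = {root.rbf}                      # step 1: root RBF starts the model
    active = [root]
    while any(n.children for n in active):  # step 4: stop at terminal nodes
        best = None
        for node in [n for n in active if n.children]:
            kids = [c.rbf for c in node.children]
            # step 2: add one or both children; if the parent is in the
            # model, also try removing it first (or just removing it)
            moves = [set(kids), {kids[0]}, {kids[-1]}]
            trials = [model | mv for mv in moves]
            if node.rbf in model:
                out = model - {node.rbf}
                trials += [out | mv for mv in moves] + [out]
            for t in trials:
                if t and t != model:
                    score = bic(t, H, y)
                    if best is None or score < best[0]:
                        best = (score, t, node)
        # step 3: take the best move if it improves the criterion, otherwise
        # expand a random active node and leave the model unaltered
        if best is not None and best[0] < bic(model, H, y):
            _, model, node = best
        else:
            node = rng.choice([n for n in active if n.children])
        active.remove(node)
        active.extend(node.children)
    return model
```

Each pass through the loop replaces exactly one active node by its children, so the active list marches down to the terminal nodes and the loop terminates.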
Once the selection process has terminated the network weights can be calculated in the usual way by solving the normal equation

w = (H^\top H)^{-1} H^\top y ,

where H is the design matrix. There is no need for a regularisation term, as appears, for example, in the regularised equations of earlier sections, because model complexity is limited by the selection process.
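In NumPy terms the weight calculation is a single linear solve. The sketch below uses made-up one-dimensional data and Gaussian basis functions (not the package's Matlab code); in practice a least-squares solve is preferable to forming H^\top H explicitly, since the design matrix can be poorly conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4.0, 4.0, 50)
centres, radius = np.array([-2.0, 0.0, 2.0]), 1.5
H = np.exp(-((x[:, None] - centres[None, :]) / radius) ** 2)  # design matrix
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

w_normal = np.linalg.solve(H.T @ H, H.T @ y)      # literal normal equation
w_lstsq, *_ = np.linalg.lstsq(H, y, rcond=None)   # equivalent, more stable
assert np.allclose(w_normal, w_lstsq)
```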
The Best Parameter Values
Our method has three main parameters: the model selection criterion; p_min, which controls the depth of the regression tree; and the scale factor α, which determines the relative size between hyperrectangles and RBFs.
For the model selection criterion we found that the more conservative BIC, which tends to produce more parsimonious models, rarely performed worse than GCV and often did significantly better. This is in line with the experiences of other practitioners of algorithms based on subset selection, such as [6], who modified GCV to make it more conservative, and others who also found BIC gave better results than GCV.
For p_min and α we use the simple method of comparing the model selection scores of a number of trial values, as for the RBF widths in section 3. This means growing several trees (one for each trial value of p_min) and then, for each tree, selecting models from several sets of RBFs (one for each value of α). The cost is extra computation: the more trial values there are, the longer the algorithm takes to search through them. However, the basic algorithm is not unduly expensive and if the number of trial values is kept fairly low (a handful of alternatives for each parameter), the computation time is acceptable.
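The trial-value search can be pictured with a toy stand-in. Both helpers below are hypothetical: in the real method `candidates` would be replaced by growing a tree with the given p_min and scaling its hyperrectangles by α, and the model selection criterion would be applied to the selected subset rather than the full candidate set.

```python
import numpy as np

def bic_score(H, y):
    """BIC for a linear model with design matrix H (hypothetical helper)."""
    p = len(y)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    sse = np.sum((y - H @ w) ** 2)
    return p * np.log(sse / p) + H.shape[1] * np.log(p)

def candidates(x, p_min, alpha):
    """Stand-in for growing a tree with p_min and scaling its cells by alpha."""
    centres = np.linspace(x.min(), x.max(), max(2, len(x) // p_min))
    r = alpha * (centres[1] - centres[0])
    return np.exp(-((x[:, None] - centres[None, :]) / r) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(-4.0, 4.0, 80)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

# one score per (p_min, alpha) pair; the lowest-scoring pair wins
trials = [(p_min, alpha) for p_min in (5, 10, 20) for alpha in (1.0, 2.0, 4.0)]
scores = {t: bic_score(candidates(x, *t), y) for t in trials}
best = min(scores, key=scores.get)
```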
Demonstrations
The first figure below shows the prediction of a pure regression tree for a sample of Hermite data (section 1.1). For clarity, the samples themselves are not shown, just the target function and the prediction. Of course, the model is discontinuous and each horizontal section corresponds to one terminal node in the tree.
[plot omitted: target function and piecewise-constant prediction; x from −4 to 4, y from 0 to 3]
Figure: A pure regression tree prediction on the Hermite data.
The tree which produced the prediction in the first figure was grown until further splitting would have violated the minimum number of samples allowed per node (p_min). There was no pruning or any other sophisticated form of complexity control, so this kind of tree is not suitable for practical use as a prediction method. However, in our method the tree is only used to create RBF candidates; model complexity is controlled by a separate process which selects a subset of RBFs for the network.
The second figure shows the prediction of the combined method on the same data set, after a subset of RBFs has been selected from the pool of candidates generated by the tree nodes. Now the model is continuous and its complexity is well matched to the data.
[plot omitted: target function and prediction; x from −4 to 4, y from 0 to 3]
Figure: The combined method on the Hermite data.
As a last demonstration we turn our attention to Friedman's data set (section 1.2). In experiments with the MARS algorithm [6], Friedman estimated the accuracy of his method by replicating data sets and computing the mean and standard deviation of the scaled sum-squared-error over the replications. For this data set his best results, corresponding to the most favourable values for the parameters of the MARS method, were … ± … .
To compare our algorithm with MARS, and also to test the effect of using multiple trial values for our method's parameters p_min and α, we conducted a similar experiment. Before we started, we tried some different settings for the trial values and identified one which gave good results on test data. Then, for each replication, we applied the method twice. In the first run we used the trial values we had discovered earlier. In the second run we used only a single 'best' value for each parameter (the average of the trial values), forcing this value to be used for every replicated data set. The results are shown in the table below.
It is apparent that the results are practically identical to MARS when the full sets of trial values are used, but significantly inferior when only single 'best' values are used.
p_min            α                error
(trial values)   (trial values)   … ± …
(single value)   (single value)   … ± …

Table: Results on replications of Friedman's data set.
In another test using replicated data sets we compared the two alternative methods of selecting the RBFs from the candidates generated by the tree: standard forward selection, or the method described above which uses the tree to guide the order in which candidates are considered. This was the only difference between the two runs; the model parameters were the same as in the first row of the table above. The performance of tree-guided selection was unchanged (… ± …, as in the table) but forward selection was significantly worse (… ± …).
Conclusions
We have described a method for nonparametric regression based on combining regression trees and radial basis function networks. The method is similar to that of [8] and has the same advantages (a continuous model and automatic relevance determination) but also some significant improvements. The main enhancement is the addition of an automatic method for the control of model complexity through the selection of RBFs. We have also developed a novel procedure for selecting the RBFs based on the structure of the tree.
We have presented evidence that the method is comparable in performance to the well-known MARS algorithm and that some of its novel features (trial parameter values, tree-guided selection) are actually beneficial. More detailed evaluations with DELVE [21] data sets are in preparation and preliminary results support these conclusions.
The Matlab software package [19] has two implementations of the method. One function, rbf_rt_1, uses tree-guided selection, while the other, rbf_rt_2, uses forward selection. The operation of each function is described, with examples, in a comprehensive manual.
Appendix
A Applying the EM Algorithm
We want to maximise the marginal probability of the observed data by substituting expectations of the conditional probability of the unobserved data into the cost function for the joint probability of the combined data, and minimising this with respect to the parameters \sigma^2 (the noise variance) and \sigma_w^2 (the a priori weight variance).
From the conditional distribution of the unobserved data, \langle w \rangle = \hat w and \langle (w - \hat w)(w - \hat w)^\top \rangle = W = \sigma^2 A^{-1}. The expectation of w^\top w is then

\langle w^\top w \rangle = \mathrm{tr}\,\langle w w^\top \rangle
= \hat w^\top \hat w + \mathrm{tr}\,\langle w w^\top - \hat w \hat w^\top \rangle
= \hat w^\top \hat w + \mathrm{tr}\,\langle (w - \hat w)(w - \hat w)^\top \rangle
= \hat w^\top \hat w + \sigma^2 \,\mathrm{tr}\, A^{-1}
= \hat w^\top \hat w + \sigma^2 (m - \gamma)/\lambda .   (A.1)

The last step follows from \gamma = m - \lambda\,\mathrm{tr}\,A^{-1} (the effective number of parameters) and \lambda = \sigma^2/\sigma_w^2 (the regularisation parameter). Similarly,

\langle e^\top e \rangle = \mathrm{tr}\,\langle e e^\top \rangle
= \hat e^\top \hat e + \mathrm{tr}\,\langle e e^\top - \hat e \hat e^\top \rangle
= \hat e^\top \hat e + \mathrm{tr}\, H \langle w w^\top - \hat w \hat w^\top \rangle H^\top
= \hat e^\top \hat e + \sigma^2 \,\mathrm{tr}\, H A^{-1} H^\top
= \hat e^\top \hat e + \sigma^2 \gamma ,   (A.2)

since e = y - Hw is linear in w and \mathrm{tr}\, H A^{-1} H^\top is another expression for the effective number of parameters \gamma.
Equations (A.1) and (A.2) summarise the expectation of the conditional probability for w and can be substituted into the joint probability of the combined data, or the equivalent cost function, so that the resulting expression can be optimised with respect to \sigma^2 and \sigma_w^2. Note that in (A.1) and (A.2) these parameters are held constant at their old values; only the explicit occurrences of \sigma^2 and \sigma_w^2 in the cost function are varied in the optimisation.
After differentiating the cost function with respect to \sigma^2 and \sigma_w^2, equating the results to zero and finally substituting the expectations (A.1) and (A.2), we get the re-estimation formulae

\hat\sigma^2 = \frac{\hat e^\top \hat e + \sigma^2 \gamma}{p} , \qquad
\hat\sigma_w^2 = \frac{\hat w^\top \hat w + \sigma^2 (m - \gamma)/\lambda}{m} .
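These two formulae, together with the dependent quantities \hat w, \hat e, \gamma and \lambda = \sigma^2/\sigma_w^2, can be iterated as an EM-style fixed-point loop. A sketch with a made-up Gaussian design matrix (the data, centres and iteration count are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4.0, 4.0, 60)
centres = np.linspace(-4.0, 4.0, 10)
H = np.exp(-((x[:, None] - centres[None, :]) / 1.5) ** 2)   # design matrix
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
p, m = H.shape

s2, sw2 = 1.0, 1.0          # initial sigma^2 and sigma_w^2
for _ in range(200):
    lam = s2 / sw2                                    # regularisation parameter
    A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
    w_hat = A_inv @ H.T @ y
    e_hat = y - H @ w_hat
    gamma = m - lam * np.trace(A_inv)                 # effective no. of parameters
    s2_new = (e_hat @ e_hat + s2 * gamma) / p         # re-estimate sigma^2
    sw2_new = (w_hat @ w_hat + s2 * (m - gamma) / lam) / m   # re-estimate sigma_w^2
    s2, sw2 = s2_new, sw2_new
```

Note that the old values of \sigma^2 and \lambda appear on the right-hand sides, exactly as in the formulae above.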
B The Eigensystem of HH^\top
We want to derive expressions for each of the terms in the re-estimation formula of section 3.3 using the eigenvalues and eigenvectors of HH^\top. We start with a singular value decomposition of the design matrix, H = U S V^\top, where U = [u_1 \; u_2 \; \cdots \; u_p] \in R^{p \times p} and V \in R^{m \times m} are orthogonal and S \in R^{p \times m},

S = \begin{bmatrix}
\sqrt{\mu_1} & & \\
& \ddots & \\
& & \sqrt{\mu_m} \\
0 & \cdots & 0 \\
\vdots & & \vdots \\
0 & \cdots & 0
\end{bmatrix}
\quad \text{(shown for } p \ge m\text{)},

contains the singular values \{\sqrt{\mu_i}\}_{i=1}^p. Note that, due to the orthogonality of V,

H H^\top = U S S^\top U^\top = \sum_{i=1}^p \mu_i \, u_i u_i^\top ,

so the \mu_i are the eigenvalues, and the u_i the eigenvectors, of the matrix HH^\top. The eigenvalues are non-negative and, we assume, ordered from largest to smallest, so that if p > m then \mu_i = 0 for i > m. The eigenvectors are orthonormal (u_i^\top u_{i'} = \delta_{ii'}).
We want to derive expressions for the terms using just the eigenvalues and eigenvectors of HH^\top. As a preliminary step, we derive some more basic relations. First, the matrix inverse in each re-estimation is

A^{-1} = (H^\top H + \lambda I_m)^{-1}
= (V S^\top S V^\top + \lambda V V^\top)^{-1}
= V (S^\top S + \lambda I_m)^{-1} V^\top .   (B.1)

Note that the second step would have been impossible if the regularisation term \lambda I_m had not been proportional to the identity matrix, which is where the analysis breaks down in the case of multiple regularisation parameters. Secondly, the optimal weight vector is
\hat w = A^{-1} H^\top y
= V (S^\top S + \lambda I_m)^{-1} S^\top U^\top y
= V (S^\top S + \lambda I_m)^{-1} S^\top \tilde y ,   (B.2)

where \tilde y = U^\top y is the projection of y onto the eigenbasis U.
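Equations (B.1) and (B.2) are easy to check numerically. The sketch below uses a random design matrix (illustrative data only) and compares the direct solve with the SVD route:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, lam = 12, 5, 0.3
H = rng.standard_normal((p, m))
y = rng.standard_normal(p)

# direct route: w = A^{-1} H'y with A = H'H + lambda*I
w_direct = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)

# SVD route, following (B.1)-(B.2)
U, s, Vt = np.linalg.svd(H)             # full SVD: U is p x p, Vt is m x m
S = np.zeros((p, m)); S[:m, :m] = np.diag(s)
y_tilde = U.T @ y                       # projection onto the eigenbasis
w_svd = Vt.T @ np.linalg.solve(S.T @ S + lam * np.eye(m), S.T @ y_tilde)
assert np.allclose(w_direct, w_svd)
```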
Thirdly, from (B.1) we can further derive

\gamma = m - \lambda\,\mathrm{tr}\,A^{-1}
= m - \lambda\,\mathrm{tr}\, V (S^\top S + \lambda I_m)^{-1} V^\top
= m - \lambda\,\mathrm{tr}\, (S^\top S + \lambda I_m)^{-1}
= m - \sum_{j=1}^m \frac{\lambda}{\mu_j + \lambda}
= \sum_{j=1}^m \frac{\mu_j}{\mu_j + \lambda}
= \sum_{i=1}^p \frac{\mu_i}{\mu_i + \lambda} .   (B.3)

Here we have assumed p \ge m, so the last step follows (for \lambda > 0) because if p > m then the last (p - m) eigenvalues are zero. However, the conclusion is also true if p < m, since in that case the last (m - p) singular values are annihilated in the product S^\top S.
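A quick numerical check of (B.3), computing \gamma once from the trace of A^{-1} and once from the eigenvalues of HH^\top (random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, lam = 10, 6, 0.5
H = rng.standard_normal((p, m))

A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
gamma_trace = m - lam * np.trace(A_inv)     # gamma = m - lambda tr(A^-1)

mu = np.linalg.eigvalsh(H @ H.T)            # the p eigenvalues of HH'
gamma_eig = np.sum(mu / (mu + lam))         # gamma = sum_i mu_i/(mu_i + lambda)
assert np.allclose(gamma_trace, gamma_eig)
```

The p - m surplus eigenvalues of HH^\top are zero (up to round-off) and contribute nothing to the sum, as the derivation requires.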
Fourthly, and last of the preliminary calculations, the vector of residual errors is

\hat e = y - H \hat w
= (I_p - U S (S^\top S + \lambda I_m)^{-1} S^\top U^\top)\, y
= U (I_p - S (S^\top S + \lambda I_m)^{-1} S^\top)\, \tilde y .   (B.4)
Now we are ready to tackle the terms themselves. From (B.3) we have

p - \gamma = p - \sum_{i=1}^p \frac{\mu_i}{\mu_i + \lambda}
= \sum_{i=1}^p \frac{\lambda}{\mu_i + \lambda} .   (B.5)
From (B.1) and a set of steps similar to the derivation of (B.3) it follows that

\mathrm{tr}\,A^{-1} - \lambda\,\mathrm{tr}\,A^{-2}
= \sum_{j=1}^m \frac{1}{\mu_j + \lambda} - \sum_{j=1}^m \frac{\lambda}{(\mu_j + \lambda)^2}
= \sum_{j=1}^m \frac{\mu_j}{(\mu_j + \lambda)^2}
= \sum_{i=1}^p \frac{\mu_i}{(\mu_i + \lambda)^2} .   (B.6)
The last step follows in a similar way to the last step of (B.3). Next we tackle the term \hat w^\top A^{-1} \hat w. From (B.1) and (B.2) we get

\hat w^\top A^{-1} \hat w = \tilde y^\top S (S^\top S + \lambda I_m)^{-3} S^\top \tilde y
= \sum_{i=1}^p \frac{\mu_i\, \tilde y_i^2}{(\mu_i + \lambda)^3} .   (B.7)
The sum of squared residual errors is, from (B.4),

\hat e^\top \hat e = \tilde y^\top (I_p - S (S^\top S + \lambda I_m)^{-1} S^\top)^2\, \tilde y
= \sum_{j=1}^m \frac{\lambda^2 \tilde y_j^2}{(\mu_j + \lambda)^2} + \sum_{i=m+1}^p \tilde y_i^2
= \sum_{i=1}^p \frac{\lambda^2 \tilde y_i^2}{(\mu_i + \lambda)^2} .   (B.8)

For this derivation we assumed p \ge m but, for reasons similar to those stated for the derivation of (B.3), the result is also true for p \le m.
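(B.8) can likewise be verified numerically against a direct computation of the residuals (random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, lam = 15, 6, 0.7
H = rng.standard_normal((p, m))
y = rng.standard_normal(p)

# direct route: residuals from the regularised weights
w_hat = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)
sse_direct = np.sum((y - H @ w_hat) ** 2)

# eigensystem route, following (B.8)
mu, U = np.linalg.eigh(H @ H.T)         # eigensystem of HH'
y_tilde = U.T @ y                       # y in the eigenbasis
sse_eig = np.sum(lam ** 2 * y_tilde ** 2 / (mu + lam) ** 2)
assert np.allclose(sse_direct, sse_eig)
```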
Equations (B.5)-(B.8) express each of the four terms using the eigenvalues and eigenvectors of HH^\top, which was our main goal in this appendix. Other useful expressions involving the eigensystem of HH^\top are

\ln |P| = \sum_{i=1}^p \ln\!\left( \frac{\sigma^2}{\sigma_w^2 \mu_i + \sigma^2} \right)
= p \ln \sigma^2 - \sum_{i=1}^p \ln (\sigma_w^2 \mu_i + \sigma^2) ,

y^\top P y = \sum_{i=1}^p \frac{\sigma^2 \tilde y_i^2}{\sigma_w^2 \mu_i + \sigma^2} ,

where P = I_p - H A^{-1} H^\top, \sigma^2 is the noise variance and \sigma_w^2 is the a priori variance of the weights (see section 2). For example, if these expressions are substituted in the cost function associated with the marginal likelihood of the data, the two p \ln \sigma^2 terms cancel, leaving

E(y) = p \ln \sigma^2 - \ln |P| + \frac{y^\top P y}{\sigma^2}
= \sum_{i=1}^p \ln (\sigma_w^2 \mu_i + \sigma^2) + \sum_{i=1}^p \frac{\tilde y_i^2}{\sigma_w^2 \mu_i + \sigma^2} .
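As a check, the two forms of E(y) above agree numerically. The sketch uses random illustrative data, with the eigenvalues clipped at zero to guard against round-off:

```python
import numpy as np

rng = np.random.default_rng(5)
p, m = 12, 5
s2, sw2 = 0.2, 1.5                      # sigma^2 and sigma_w^2
lam = s2 / sw2                          # lambda = sigma^2 / sigma_w^2
H = rng.standard_normal((p, m))
y = rng.standard_normal(p)

# direct form: E(y) = p ln sigma^2 - ln|P| + y'Py / sigma^2
A_inv = np.linalg.inv(H.T @ H + lam * np.eye(m))
P = np.eye(p) - H @ A_inv @ H.T
cost_direct = p * np.log(s2) - np.log(np.linalg.det(P)) + y @ P @ y / s2

# eigensystem form
mu, U = np.linalg.eigh(H @ H.T)
mu = np.clip(mu, 0.0, None)             # zero out negative round-off
y_tilde = U.T @ y
d = sw2 * mu + s2
cost_eig = np.sum(np.log(d)) + np.sum(y_tilde ** 2 / d)
assert np.allclose(cost_direct, cost_eig)
```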
References
[1] A.R. Barron and X. Xiao. Discussion of "Multivariate adaptive regression splines" by J.H. Friedman. Annals of Statistics, 19, 1991.
[2] C.M. Bishop, M. Svensen, and C.K.I. Williams. EM optimization of latent-variable density models. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, 1996.
[3] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), 39:1-38, 1977.
[6] J.H. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1-141, 1991.
[7] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.
[8] M. Kubat. Decision trees can initialize radial-basis function networks. IEEE Transactions on Neural Networks, 9(5), 1998.
[9] M. Kubat and I. Ivanova. Initialization of RBF networks with decision trees. In Proceedings of the Belgian-Dutch Conference on Machine Learning (BENELEARN), 1995.
[10] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
[11] D.J.C. MacKay. Comparison of approximate methods for handling hyperparameters. Accepted for publication by Neural Computation, 1999.
[12] J.E. Moody. The effective number of parameters: An analysis of generalisation and regularisation in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Neural Information Processing Systems 4. Morgan Kaufmann, San Mateo, CA, 1992.
[13] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Press, 1998.
[14] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.
[15] M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.
[16] M.J.L. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/papers/intro.ps.
[17] M.J.L. Orr. Matlab routines for subset selection and ridge regression in linear neural networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/software/rbf.zip.
[18] M.J.L. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.
[19] M.J.L. Orr. Matlab functions for radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1999. Download from www.anc.ed.ac.uk/~mjo/software/rbf2.zip.
[20] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[21] C.E. Rasmussen, R.M. Neal, G.E. Hinton, D. van Camp, Z. Ghahramani, M. Revow, R. Kustra, and R. Tibshirani. The DELVE Manual. 1996. http://www.cs.utoronto.ca/~delve/.
[22] M.E. Tipping and C.M. Bishop. Mixtures of principal component analysers. Technical report, Neural Computing Research Group, Aston University, UK, 1997.
[23] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Technical report, Neural Computing Research Group, Aston University, UK, 1997.