Arguably the most popular and successful of all ensemble generation algorithms, AdaBoost (Adaptive Boosting) extends the original boosting algorithm to multi-class problems.
Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
AdaBoost solves "the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb."
AdaBoost generates an ensemble of classifiers, the training data of each drawn from a distribution that starts out uniform and is iteratively updated to place more weight on misclassified instances. Each classifier therefore focuses increasingly on the instances that are more difficult to classify. The classifiers are then combined through weighted majority voting.
Algorithm AdaBoost:
1. Create a discrete distribution over the training data by assigning a weight to each instance. Initially the distribution is uniform, so all weights are equal.
2. Draw a subset from this distribution and train a weak classifier on it.
3. Compute the error, ε, of this classifier on its own training dataset. Make sure that this error is less than ½.
4. Test the entire training data on this classifier:
   • If an instance x is correctly classified, reduce its weight in proportion to ε.
   • If it is misclassified, increase its weight in proportion to ε.
   • Normalize the weights so that they constitute a distribution.
5. Repeat until T classifiers are generated.
6. Combine the classifiers using weighted majority voting.
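The steps above can be sketched in plain Python for the two-class case. This is a minimal illustration, not an optimized implementation: the weak learner is an exhaustive threshold stump search, and the weight update uses the standard exponential form, which is equivalent (up to normalization) to scaling by βt = εt/(1−εt).

```python
import math

def train_stump(X, y, w):
    """Pick the 1-D threshold stump with the lowest weighted error -- a
    minimal stand-in for WeakLearn.  X is a list of scalars, y in {-1, +1}."""
    best = None
    for thr in sorted(set(X)):
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in X]
            err = sum(wi for wi, pi, yi in zip(w, preds, y) if pi != yi)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def stump_predict(stump, X):
    _, thr, pol = stump
    return [pol if x >= thr else -pol for x in X]

def adaboost(X, y, T=10):
    """AdaBoost for two classes: reweight, retrain, combine by weighted vote."""
    m = len(y)
    w = [1.0 / m] * m                       # start from a uniform distribution
    ensemble = []                           # list of (alpha, stump)
    for _ in range(T):
        stump = train_stump(X, y, w)
        eps = max(stump[0], 1e-10)          # weighted error of this round
        if eps >= 0.5:                      # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)   # classifier voting weight
        preds = stump_predict(stump, X)
        # Increase the weights of misclassified instances, decrease the rest:
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
        s = sum(w)
        w = [wi / s for wi in w]            # renormalize to a distribution
        ensemble.append((alpha, stump))
    def H(Xq):                              # weighted majority vote
        out = []
        for x in Xq:
            vote = sum(a * stump_predict(s, [x])[0] for a, s in ensemble)
            out.append(1 if vote >= 0 else -1)
        return out
    return H
```

In a real implementation step 2's subset would be sampled from the distribution; here the stump is trained directly on the weighted data, which has the same effect for this weak learner.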
You betcha! The training error of AdaBoost can be shown to be bounded by

E ≤ ∏_{t=1}^{T} 2√(εt(1 − εt))

where εt is the error of the tth hypothesis. Each factor is less than 1 whenever εt < ½, so this product gets smaller and smaller with each added classifier. But wait... isn't this against Occam's razor?
For an explanation, see Freund and Schapire's paper, as well as Schapire's tutorial on boosting and margin theory. More about this later.
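The claim that the error bound shrinks with every added classifier is easy to check numerically. The sketch below simply evaluates the product ∏t 2√(εt(1−εt)) for a run of weak classifiers that are only slightly better than chance:

```python
import math

def adaboost_training_error_bound(epsilons):
    """Upper bound on AdaBoost's training error after len(epsilons) rounds:
    prod_t 2*sqrt(eps_t * (1 - eps_t)).  Each factor is strictly less than 1
    whenever eps_t < 1/2, so the bound shrinks with every added classifier."""
    bound = 1.0
    for eps in epsilons:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound

# Weak classifiers with error 0.4 (barely better than chance) still drive
# the bound toward zero as more of them are added:
bounds = [adaboost_training_error_bound([0.4] * t) for t in (1, 5, 10, 50)]
```

Each factor 2√(0.4 · 0.6) ≈ 0.98 is below 1, so the bound decays geometrically even for very weak learners.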
The margin of an instance x roughly describes the confidence of the ensemble in its decision:

margin(x) = μk(x) − max_{j≠k} μj(x)

Loosely speaking, the margin of an instance is simply the difference between the total (or fraction of) vote(s) it receives from the correctly identifying classifiers and the maximum (or fraction of) vote(s) received by any incorrect class, where the kth class is the true class, and μj(x) is the total support (vote) class j receives from all classifiers, such that Σ_{j=1}^{C} μj(x) = 1.
The margin is therefore the strength of the vote: the higher the margin, the more confidence there is in the classification. Incorrect decisions have negative margins.
Large margins imply a tighter bound on the generalization error. If all margins are large, the final decision boundary can be obtained using a simpler classifier (much as polls can predict the outcome of not-so-close races very early on).
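As a concrete illustration, the margin of a single instance can be computed from the per-class vote totals. A minimal sketch, where `votes[j]` holds the total support μj(x) class j receives from the ensemble:

```python
def ensemble_margin(votes, true_class):
    """Margin of one instance: the fraction of (weighted) votes received by
    the true class minus the largest fraction received by any incorrect class.
    Positive margin = correct decision; negative margin = incorrect decision."""
    total = float(sum(votes))
    mu = [v / total for v in votes]         # normalize supports to sum to 1
    rival = max(mu[j] for j in range(len(mu)) if j != true_class)
    return mu[true_class] - rival
```

For example, an instance whose true class draws half the votes while the strongest rival draws 0.3 has margin 0.2; a misclassified instance necessarily has a negative margin.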
They show that boosting tends to increase the margins on the training data examples, and argue that an ensemble classifier with large margins is a simpler classifier, regardless of the number of classifiers that make up the ensemble. More specifically: let H be a finite space of base classifiers. For any δ > 0 and θ > 0, with probability 1 − δ over the random choice of the training dataset S, any ensemble E = {h1, …, hT} ⊆ H combined through weighted majority voting satisfies

P(error) ≤ P(training margin ≤ θ) + O( (1/√N) · ( (log N · log|H|) / θ² + log(1/δ) )^{1/2} )
• N: number of training instances.
• |H|: cardinality of the classifier space (the weaker the classifiers, the smaller |H|).
• Note that P(error) is independent of the number of classifiers!
AdaBoost.M1 requires that all classifiers have a weighted error of less than ½.
This is the least that can be asked of a classifier in a two-class problem, since an error of ½ is equivalent to random guessing. The probability of error for random guessing is much higher in multi-class problems (specifically, (k−1)/k for a k-class problem). Therefore, achieving an error of less than ½ becomes increasingly difficult for larger numbers of classes, particularly if the weak classifiers are really weak.
AdaBoost.M2 addresses this problem by removing the weighted-error-½ restriction on the classifiers; instead, it defines a pseudo-error, which is the quantity that is then required to be no larger than ½.
The pseudo-error recognizes that there is information in the outputs of the classifiers for the non-selected / non-winning classes.
• In an OCR problem, for example, 1 and 7 may look alike, so a classifier faced with a 1 or a 7 may give high plausibility outputs to both of these classes and low outputs to all others.
Input:
• Sequence of m examples S = [(x1, y1), (x2, y2), …, (xm, ym)] with labels yi ∈ Y = {1, …, C}, drawn from a distribution D.
• Weak learning algorithm WeakLearn.
• Integer T specifying the number of iterations.

Let B = {(i, y) : i ∈ {1, 2, …, m}, y ≠ yi} be the set of all mislabel pairs.
Initialize D1(i, y) = 1/|B| for (i, y) ∈ B.
Do for t = 1, 2, …, T:
1. Call WeakLearn, providing it with the distribution Dt.
2. Get back a hypothesis ht : X × Y → [0, 1].
3. Calculate the pseudo-error of ht:
   εt = ½ Σ_{(i,y)∈B} Dt(i, y) · (1 − ht(xi, yi) + ht(xi, y))
4. Set βt = εt / (1 − εt).
5. Update the distribution Dt:
   Dt+1(i, y) = (Dt(i, y) / Zt) · βt^{(½)(1 + ht(xi, yi) − ht(xi, y))}
   where Zt is a normalization constant chosen so that Dt+1 becomes a distribution.
Output the final hypothesis:
   hfin(x) = arg max_{y∈Y} Σ_{t=1}^{T} (log 1/βt) · ht(x, y)
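The pseudo-error computation can be sketched as follows. This is a minimal illustration under my own representational choices: `h(x, c)` stands for the plausibility the hypothesis assigns to class c, and `D` is the mislabel distribution stored as a per-(instance, class) table. Note that a perfect hypothesis gets pseudo-error 0 and a completely uninformative one gets ½ regardless of the number of classes, which is exactly why the ½ requirement becomes attainable again.

```python
def pseudo_error(D, h, X, y, num_classes):
    """AdaBoost.M2 pseudo-loss:
    eps = 1/2 * sum over mislabel pairs (i, y') of
          D(i, y') * (1 - h(x_i, y_i) + h(x_i, y')),
    where h(x, c) in [0, 1] is the plausibility assigned to class c."""
    eps = 0.0
    for i in range(len(X)):
        for c in range(num_classes):
            if c == y[i]:
                continue                  # only mislabel pairs contribute
            eps += 0.5 * D[i][c] * (1.0 - h(X[i], y[i]) + h(X[i], c))
    return eps
```

With two instances, two classes, and a uniform mislabel distribution, a perfect hypothesis yields 0 and a constant h = 0.5 yields exactly ½.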
We now pose the following question: if, after training an algorithm, we receive additional data, how can we update the trained classifier to learn the new data? None of the classic algorithms we have seen so far, including MLP, RBF, PNN, KNN, RCE, etc., is capable of incrementally updating its knowledge base to learn new information. The typical procedure is to scrap the previously trained classifier, combine the old and new data, and start over from scratch. This causes all the information learned so far to be lost: catastrophic forgetting. Furthermore, what if the old data is no longer available, or what if the new data introduces new classes?
The ensemble-of-classifiers approach, which is generally used for improving the generalization accuracy of a classifier, can also be used to address the issue of incremental learning.
One such implementation of ensemble classifiers for incremental learning is Learn++.
So, how do we achieve incremental learning? What, if anything, in the AdaBoost formulation prevents us from learning new data when previously unseen instances are introduced?
Actually, nothing! AdaBoost should work for incremental learning, but it can be made more efficient.
Learn++ modifies the distribution update rule so that the update is based on the decision of the entire ensemble, not just the previous classifier.
Algorithm Learn++ (with major differences from AdaBoost.M1 highlighted)
Input: For each database Dk, k = 1, 2, …, K:
• Sequence of mk training examples Sk = [(x1, y1), (x2, y2), …, (xmk, ymk)].
• Weak learning algorithm WeakLearn.
• Integer Tk, specifying the number of iterations.
Do for k = 1, 2, …, K:
Initialize w1(i) = D1(i) = 1/m, ∀i, unless there is prior knowledge to select otherwise.
Do for t = 1, 2, …, Tk:
1. Set Dt(i) = wt(i) / Σ_{i=1}^{m} wt(i), so that Dt is a distribution.
2. Randomly choose a training data subset TRt according to Dt.
3. Call WeakLearn, providing it with TRt.
4. Get back a hypothesis ht : X → Y, and calculate its error on Sk:
   εt = Σ_{i: ht(xi)≠yi} Dt(i)
   If εt > ½, set t = t − 1, discard ht, and go to step 2. Otherwise, compute the normalized error βt = εt / (1 − εt).
5. Call weighted majority voting to obtain the composite hypothesis
   Ht = arg max_{y∈Y} Σ_{t: ht(x)=y} log(1/βt)
   and compute the composite error
   Et = Σ_{i: Ht(xi)≠yi} Dt(i)
   If Et > ½, set t = t − 1, discard Ht, and go to step 2.
6. Set Bt = Et / (1 − Et), and update the weights of the instances:
   wt+1(i) = wt(i) × Bt, if Ht(xi) = yi
   wt+1(i) = wt(i), otherwise
Call weighted majority voting on the combined composite hypotheses Ht and output the final hypothesis:
   Hfinal(x) = arg max_{y∈Y} Σ_{k=1}^{K} Σ_{t: Ht(x)=y} log(1/Bt)
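The key departure from AdaBoost is the instance-weight update of step 6, which is driven by the composite hypothesis Ht rather than the latest ht. A minimal sketch, where `composite_correct` is my name for the boolean mask Ht(xi) = yi:

```python
def learnpp_weight_update(w, composite_correct, E):
    """Learn++ instance-weight update: weights are multiplied by
    B = E / (1 - E) < 1 only for instances the *composite* hypothesis
    classifies correctly, then renormalized.  The next training subset
    therefore concentrates on what the current ensemble has not yet
    learned -- which is what makes the rule suitable for incremental
    learning when a new dataset arrives."""
    B = E / (1.0 - E)                  # E < 1/2 is enforced by the algorithm
    w = [wi * B if ok else wi for wi, ok in zip(w, composite_correct)]
    s = sum(w)
    return [wi / s for wi in w]        # renormalize to a distribution
```

For example, with a uniform weight vector and a composite error of 0.25, the single misclassified instance ends up carrying half of the total weight.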
Distribution initialization when a new dataset becomes available. Solution: start with a uniform distribution, and update that distribution based on the performance of the existing ensemble on the new data.
When to stop training on each dataset? Solution: use a validation dataset, if one is available.
• Or, keep training until performance on the test data peaks (mild cheating).
Classifier proliferation when new classes are added: Sufficient additional classifiers need to be generated to out-vote existing classifiers which cannot correctly predict the new class.
Learn++.MT: creates a preliminary class confidence on each instance and updates the weights of classifiers that have not seen a particular class.
Each classifier is assigned a weight based on its performance on the training data. The preliminary class confidence is obtained by summing the weights of all classifiers that picked a given class and dividing by the sum of the weights of all classifiers that have been trained on that class.
Learn++.MT then updates (lowers) the weights of the classifiers that have not been trained with the class of the current instance xi.
Pc(xi) = ( Σ_{t: ht(xi)=c} Wt ) / ( Σ_{t: c∈CTrt} Wt ),  for c = 1, 2, …, C

W_{t: c∉CTrt} = W_{t: c∉CTrt} · (1 − Pc(xi))

Here the denominator sums over the set of classifiers that have seen class c, the numerator sums over the set of classifiers that have picked class c, and Pc(xi) is the preliminary confidence of those classifiers that have seen class c that xi belongs to class c.
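In code, the preliminary confidence and the subsequent weight discounting might look like this. This is a sketch under my own data-structure assumptions: `trained_on[t]` is the set CTrt of classes classifier t has seen, and `picked[t]` is the class classifier t chose for the instance at hand.

```python
def preliminary_confidence(W, picked, trained_on, c):
    """Learn++.MT preliminary confidence P_c(x_i) for one instance:
    summed weights of classifiers that picked class c, divided by the
    summed weights of classifiers that were trained on class c."""
    num = sum(W[t] for t in range(len(W)) if picked[t] == c)
    den = sum(W[t] for t in range(len(W)) if c in trained_on[t])
    return num / den if den > 0 else 0.0

def discount_unseen(W, trained_on, c, Pc):
    """Lower the weights of classifiers never trained on class c by the
    factor (1 - P_c(x_i)); classifiers that have seen c keep their weight."""
    return [W[t] * (1.0 - Pc) if c not in trained_on[t] else W[t]
            for t in range(len(W))]
```

The higher the confidence of the class-c-trained classifiers, the more strongly the vote of classifiers that never saw class c is suppressed.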
Algorithm Learn++.MT
Input: For each dataset Dk, k = 1, 2, …, K:
• Sequence of mk instances xi with labels yi ∈ Yk = {1, …, c}.
• Weak learning algorithm BaseClassifier.
• Integer Tk, specifying the number of iterations.
Do for k = 1, 2, …, K:
If k = 1, initialize w1(i) = D1(i) = 1/m for all i, and eT1 = 0.
Else, go to step 5: evaluate the current ensemble on the new dataset Dk, update the weights, and recall the current number of classifiers, eTk = Σ_{j=1}^{k−1} Tj.
Do for t = eTk + 1, eTk + 2, …, eTk + Tk:
1. Set Dt(i) = wt(i) / Σ_{i=1}^{m} wt(i), so that Dt is a distribution.
2. Call BaseClassifier with a subset of Dk chosen according to Dt.
3. Obtain ht : X → Y, and calculate its error:
   εt = Σ_{i: ht(xi)≠yi} Dt(i)
   If εt > ½, discard ht and go to step 2. Otherwise, compute the normalized error βt = εt / (1 − εt).
4. Set CTrt = Yk, to save the labels of the classes used in training ht.
5. Call DWV to obtain the composite hypothesis Ht.
6. Compute the error of the composite hypothesis:
   Et = Σ_{i: Ht(xi)≠yi} Dt(i)
7. Set Bt = Et / (1 − Et), and update the instance weights:
   wt+1(i) = wt(i) × Bt, if Ht(xi) = yi
   wt+1(i) = wt(i), otherwise
Call DWV to obtain the final hypothesis, Hfinal.
Algorithm Dynamically Weighted Voting (DWV)
Input:
• Sequence of i = 1, …, n training instances, or any test instance xi.
• Classifiers ht.
• Corresponding error values βt.
• Classes CTrt used in training ht.
For t = 1, 2, …, T, where T is the total number of classifiers:
1. Initialize the classifier weights Wt = log(1/βt).
2. Create a normalization factor, Z, for each class.
Test procedure: Learn++ and Learn++.MT were each allowed to create a set number of classifiers on each dataset. The number of classifiers generated in each training session was chosen to optimize the algorithm's performance. Learn++ appeared to generate its best results with 6 classifiers on the first dataset, 12 on the next, and 18 on the last. Learn++.MT, on the other hand, performed optimally with 6 classifiers in the first training session, 4 in the second, and 6 in the last.
Learn++.MT2 was created to account for the unbalanced data problem. We define unbalanced data as any discrepancy in the cardinality of the datasets used in incremental learning.
If one dataset has substantially more data than the other(s), the ensemble decision might be unfairly biased toward the data with the lower cardinality (since each dataset contributes a comparable number of classifiers regardless of its size).
Under the generally valid assumptions that
• no instance is repeated in any dataset, and
• the noise distribution remains relatively unchanged among datasets,
it is reasonable to believe that a dataset with more instances carries more information. Classifiers generated from such data should therefore be weighted more heavily.
It is not unusual to see major discrepancies in the cardinalities of datasets that subsequently become available. The cardinality of each dataset, including relative cardinalities of individual classes within a dataset, should be taken into consideration in any ensemble based learning algorithm that employs a classifier combination scheme.
The primary novelty in Learn++.MT2 is the way in which the voting weights are determined.
Learn++.MT2 attempts to address the unbalanced data problem by keeping track of the number of instances from each class with which each classifier is trained. Each classifier is first given a weight based on its performance on its own training data.
• This weight is later adjusted according to its class-conditional weight factor, wt,c.
• For each classifier, this factor is proportional to the ratio of the number of instances from a particular class used for training that classifier to the number of instances from that class used for training all classifiers thus far within the ensemble.
The final decision is made similarly to Learn++, but using the class-conditional weights.
wt,c = pt · nc / Nc

Hfinal(xi) = arg max_{c∈Yk} Σ_{t: ht(xi)=c} wt,c

where pt is the training performance of the tth classifier, nc is the number of class-c instances in the current dataset, and Nc is the number of all class-c instances seen so far.
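The class-conditional weighted vote can be sketched as follows. One assumption on my part: since each classifier is trained on a single dataset, I store nc per classifier as `n[t][c]`, the class-c count in the dataset that classifier t was trained on.

```python
def mt2_final_decision(predictions, p, n, N):
    """Learn++.MT2 final decision for one instance: classifier t votes for
    predictions[t] with weight w_{t,c} = p_t * n_c / N_c, where
    p[t]    -- training performance of the t-th classifier,
    n[t][c] -- class-c instances in the dataset classifier t was trained on,
    N[c]    -- class-c instances seen by the whole ensemble so far."""
    C = len(N)
    scores = [0.0] * C
    for t, c in enumerate(predictions):
        scores[c] += p[t] * n[t][c] / N[c]   # accumulate w_{t,c}
    return max(range(C), key=lambda c: scores[c])
```

A classifier trained on a large share of a class's instances thus casts a proportionally heavier vote for that class, which is how the unbalanced data problem is handled.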
Learn++.MT2 has been tested on three databases: the Wine database from UCI (3 classes, 13 features); the Optical Character Recognition database from UCI (10 classes, 64 features); and a real-world gas identification problem for determining one of five volatile organic compounds (VOCs) from chemical sensor data (5 classes, 6 features).
Base classifiers were all single-layer MLPs (normally incapable of learning incrementally) with 12 to 40 nodes and an error goal between 0.025 and 0.05. In each case the data distributions were designed to simulate unbalanced data.
To compare Learn++ and Learn++.MT2, a set number of classifiers was created instead of selecting the number adaptively. This number was selected through experimental testing, so that the tests show an accurate and unbiased comparison of the algorithms.
• Again, performances are virtually identical after TS1.
• After TS2, Learn++.MT2 outperforms Learn++.
• No performance degradation is seen with Learn++.MT2 after the second dataset is introduced: Learn++.MT2 is more stable, and a precise termination point is not required.
• Reversed scenario: little information is initially provided, followed by more substantial data. The final performance remains unchanged: the algorithm is immune to the order of presentation.
• The momentary dip in Learn++.MT2's performance as a new dataset is introduced ironically justifies the approach taken. Why...?
• Is the distribution update rule used in Learn++ optimal? Could a weighted combination of the AdaBoost and Learn++ update rules be better?
• Is there a better initialization scheme?
• Can Learn++ be used in a non-stationary learning environment, where the data distribution changes (in which case it may be necessary to forget some of the previously learned information, i.e., throw away some classifiers)?
• How can Learn++ be updated / initialized if the training data is known to be very unbalanced, with new classes being introduced?
• Can the performance of Learn++ on incremental learning be theoretically justified?
• Does Learn++ create more or less diverse classifiers? An analysis of the algorithm on several diversity measures is needed.
• Can Learn++ be used on function approximation problems?
• How does Learn++ behave under different combination scenarios?