
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Mar 13, 2023


Technical Reports on Mathematical and Computing Sciences: TR-C136

Title: Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms (extended revised version of TR-C131)

Authors: Carlos Domingo (1), Ricard Gavalda (2), and Osamu Watanabe (1)

Affiliations:
1 Dept. of Math. and Comp. Science, Tokyo Institute of Technology, Tokyo 158-8522 ({carlos, [email protected]})
2 Dept. of LSI, Universitat Politecnica de Catalunya, Barcelona 08034 ([email protected])

Financial support: Carlos Domingo is supported in part by the EU Science and Technology Fellowship Program (STF13) of the European Commission. Ricard Gavalda is supported in part by the EC Working Group NeuroCOLT2 (EP27150). Osamu Watanabe is supported in part by a Grant-in-Aid for Scientific Research on Priority Areas "Discovery Science" from the Ministry of Education, Science, Sports and Culture of Japan.

Abstract. Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that allow the use of large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since for many data mining applications approximate answers are acceptable. However, as argued by several researchers, random sampling is difficult to use because of the difficulty of determining an appropriate sample size. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling method that solves a general problem covering many actual problems arising in applications of discovery science. An algorithm following this method obtains examples sequentially in an on-line fashion and determines from the obtained examples whether it has already seen a large enough number of examples. Thus, the sample size is not fixed a priori; instead, it adaptively depends on the situation.
Due to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, then we can solve the problem with a number of examples much smaller than that required in the worst case. We prove the correctness of our method and estimate its efficiency theoretically. To illustrate its usefulness, we consider one concrete example of using sampling, provide an algorithm based on our method, and show its efficiency by experimental evaluation. (This is an extended revised version of TR-C131.)

1 Introduction

Scalability is a key requirement for any knowledge discovery and data mining algorithm. It has been previously observed that many well-known machine learning algorithms do not scale well. Therefore, one of the biggest research challenges is to develop new methods that allow machine learning techniques to be used with large amounts of data.

Once we face the problem of having a huge input data set, there are typically two possible ways to address it. One way is to redesign known algorithms so that, while almost maintaining their performance, they can be run efficiently with much larger input data sets. The second possible approach is random sampling. For most data mining applications, approximate answers are acceptable. Thus, we could take a random sample of the instance space and do data mining on it. However, as argued by several researchers (see, for instance, [34]), this approach is considered less practical due to the difficulty of determining the appropriate sample size. In this paper, we advocate this second approach of reducing the amount of data through random sampling. For this, we propose a general problem that covers many situations arising in data mining algorithms, and a general sampling algorithm for solving it.

A typical task of knowledge discovery and data mining is to find some "rule" or "law" that explains a huge set of examples well. It is often the case that the number of possible candidates for such rules is still manageable. Then the task is simply to select a rule among all candidates that has a certain "utility" on the dataset. This is the problem we discuss in this paper, and we call it General Rule Selection. More specifically, we are given an input data set X of examples, a set H of rules, and a utility function U that measures the "usefulness" of each rule on X. The problem is to find a nearly best rule h, more precisely, a rule h satisfying U(h) ≥ (1 − ε)U(h*), where h* is the best rule and ε is a given accuracy parameter. Though simple, this problem covers several key problems occurring in data mining.

We would like to solve the General Rule Selection problem by random sampling. From a statistical point of view, this problem can be solved by first taking a random sample S from the domain X and then selecting the h ∈ H with the largest U(h) on S. If we choose a large enough number of examples from X randomly, then we can guarantee that the selected h is nearly best within a certain confidence level. We will refer to this simple method as the batch sampling method.

One of the most important issues when doing random sampling is choosing a proper sample size, i.e., the number of examples. Any sampling method must take into account problem parameters, an accuracy parameter, and a confidence parameter to determine the sample size needed to solve the desired problem.

A widely used method is simply taking a fixed fraction of the data set (say, 70%) for extracting knowledge, leaving the rest, e.g., for validation.
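As a baseline, this fixed-fraction scheme is trivial to implement; a minimal sketch (the function name and the 70/30 split are illustrative, not from the paper):

```python
import random

def fixed_fraction_split(dataset, fraction=0.7, seed=0):
    """Naive baseline: reserve a fixed fraction of the data for mining
    and the rest for validation. Note that the mining part still grows
    linearly with the database size."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

mining, validation = fixed_fraction_split(range(1000))
```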
A minor objection to this method is that there is no theoretical justification for the fraction chosen. More importantly, it is not very appropriate for large amounts of data; most likely, a small portion of the database will suffice to extract almost the same statistical conclusions. In fact, we are looking for sampling methods that examine a number of examples independent of the size of the database.

Widely used and theoretically sound tools for determining the sample size appropriate for given accuracy and confidence parameters are the so-called concentration bounds or large deviation bounds, such as the Chernoff or the Hoeffding bounds. They are commonly used in most theoretical learning research (see [17] for some examples) as well as in many other branches of computer science. For some examples of sample sizes calculated with concentration bounds for data mining problems, see, e.g., [19], [32], and [37]. While these bounds usually allow us to calculate the sample size needed in many situations, the resulting sample size is usually immense if a reasonably good accuracy and confidence are to be obtained. Moreover, in most situations, applying these bounds requires knowledge of certain problem parameters that are unknown in practical applications.

It is important to notice that, in the batch sampling method, the sample size is calculated a priori and thus must be big enough to work well in all the situations we might encounter. In other words, the sample size provided by the above theoretical bounds for batch sampling must be a worst-case sample size, and thus it is overestimated for most situations.
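To see why these worst-case sizes are considered immense, one can evaluate the Hoeffding bound directly. A sketch for estimating the mean of a [0, 1]-bounded variable within ±eps with confidence 1 − delta (the helper name and parameter values are ours, chosen for illustration):

```python
import math

def hoeffding_sample_size(eps, delta):
    """Worst-case sample size from the Hoeffding bound: enough examples
    to estimate a [0, 1]-bounded mean within +/- eps with probability
    at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# The required size blows up quadratically as the accuracy tightens.
for eps in (0.1, 0.01, 0.001):
    print(eps, hoeffding_sample_size(eps, delta=0.01))
```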
This is one of the main reasons why researchers have found that, in practice, these bounds overestimate the necessary sample size in many non-worst-case situations; see, e.g., the discussion in [32] on sampling for association rule discovery.

To overcome this problem, we propose in this paper to do sampling in an on-line sequential fashion instead of in batch. That is, an algorithm obtains examples sequentially one by one (or block by block), and it determines from those obtained examples whether it has already received enough examples for issuing the currently best rule as nearly the best with high confidence. Thus, we do not fix the sample size a priori. Instead, the sample size depends adaptively on the situation at hand. Due to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in most practical cases, we may be able to use significantly fewer examples than in the worst case. Following this approach, we propose a general algorithm, AdaSelect, for solving the General Rule Selection problem, which provides us with an efficient tool for many knowledge discovery applications. This general algorithm evolves from our preliminary work on on-line adaptive sampling in [5, 6].

The idea of adaptive sampling is quite natural, and various methods implementing this idea have been proposed in the literature. In statistics, in particular, these methods have been studied in depth under the name of "sequential test" or "sequential analysis". (See, for instance, the pioneering textbook of A. Wald [33]; according to that textbook, the idea of a sequential test procedure goes back to H. F. Dodge and H. G. Romig [4].) Their main goal has been, however, to test statistical hypotheses. Thus, even though some of their methods are applicable to some instances of the General Rule Selection problem (see, e.g., [36]), as far as the authors know, there has been no method that is as reliable and efficient as AdaSelect for the General Rule Selection problem. More recent work on adaptive sampling comes from the database community [20, 21]. The problem they address is that of estimating the size of a database query by adaptive sampling. While their problem (and their algorithms) is very similar in spirit to ours, they do not deal with selecting among competing hypotheses that may be arbitrarily close in value, and this makes their algorithms simpler but not applicable to our problem.
From the data mining community, the work of [16] and [26] is related to ours, although their technique for stopping sampling is based on fitting the learning curve of some assumed learning algorithm, while our stopping condition is derived from purely statistical bounds. The other crucial difference is that their methods are used on top of the assumed learning algorithm, whereas our method addresses a much simpler problem and is used to speed up some important part, not the whole, of a data mining algorithm.

The paper is organized as follows. In the next section, we explain, still at an intuitive level, the idea of our method and its advantage over other random sampling methods. In Section 3, we present our problem and algorithm, and prove theorems concerning their reliability and complexity. Some improvements are also discussed there. In Section 4, using one particular application as our example, we explain how our general algorithm is instantiated and show its advantage by experimental results. We conclude in Section 5 by highlighting future work.

2 Why Our Sampling Method

In this section we explain, still at an intuitive level, the idea of our method and its advantage over other random sampling methods.

To illustrate the basic idea of our adaptive sampling method and discuss its difference from the other methods, let us consider one very simple problem. The problem is to test whether the probability p* that a given condition C holds in a database X is more than 1/2 or not. If C held in exactly 1/2 of the transactions in X, then we would have to check the whole database to notice it; thus, let us suppose that p* = 1/2 ± γ for some γ, 0 < γ ≤ 1/2. Intuitively, if γ is large, that is, C holds on either many or very few transactions of the database, then sampling just a small number of transactions should be enough to figure out whether the answer is positive or negative. On the other hand, if γ is small, then testing p* > 1/2 becomes difficult and we need to see a large number of transactions.
Thus, it is natural that the sample size, i.e., the number of examples, grows depending on 1/γ, and in fact, it can be shown that a sample of size Ω(1/γ²) is necessary. Our adaptive sampling method provides an algorithm that achieves this test with roughly O(1/γ²) examples; that is, it uses an almost optimal sample size.

For comparison, let us apply the batch sampling method to this example problem. That is, we randomly choose some number of transactions, compute the fraction of them on which C holds, and decide whether p* > 1/2 or not. The sample size can be calculated by using an appropriate large deviation bound. For example, we can use the Hoeffding bound, which has been widely used in computer science (see, e.g., [17, 24]). For any γ', the Hoeffding bound tells us that O(1/γ'²) examples are sufficient to detect, with high confidence, whether C holds on more than a 1/2 + γ' fraction of X or on less than a 1/2 − γ' fraction. Thus, if we knew γ in advance, we could solve our problem with the optimal sample size by setting γ' = γ. In most cases, however, γ is not known in advance, and we have to choose an appropriate γ'. Since high confidence is guaranteed only if γ ≥ γ', we have to use a γ' smaller than γ. But if we underestimate γ, then the sample size O(1/γ'²) becomes unnecessarily large. In other words, the sample size required by the batch sampling method always has to be big enough to cover all possible "expected" values of γ, or one has to choose γ' carefully, which makes sampling difficult to use in practice. With our adaptive sampling method, on the other hand, we do not need to estimate γ; the algorithm automatically detects it and uses an appropriate sample size accordingly. This is the advantage of our method over simple batch sampling.

The notion of "adaptive sampling" has already been discussed in the database context [20, 21]. Roughly speaking, the following problem is discussed there.
Given a database and a query over the database (for instance, a selection or a join), we want to estimate the query size, that is, the number of transactions associated with the query in the database, up to a certain error and confidence level. They designed algorithms for this task; below we refer to their algorithms as Adaptive Estimator. More specifically, for a given database X, a condition C, an accuracy parameter ε, and a confidence parameter δ, Adaptive Estimator estimates (through random sampling) the probability p* that a transaction in X satisfies C up to a multiplicative error ε with probability > 1 − δ. That is, with probability > 1 − δ, the algorithm yields an estimate p of p* such that (1 − ε)p* ≤ p ≤ (1 + ε)p*. To achieve this, Adaptive Estimator collects examples from the database sequentially at random while checking whether the collected examples are sufficient for terminating the execution with the current estimate. The number of examples used by the algorithm depends on p*, ε, and δ, and is roughly O(1/(ε²p*)) for fixed δ. Here again we do not need any a priori estimate of p*; the algorithm can determine the appropriate sample size. Thus, this algorithm is very similar in spirit to ours. Notice, however, that Adaptive Estimator (and any algorithm in their framework) is designed for estimating p*, whereas for our problem it is necessary to estimate p* − 1/2. Moreover, the difficulty of our problem depends on γ = |p* − 1/2|, i.e., the closeness of p* to 1/2, and not on p* itself. Hence, to solve our problem with Adaptive Estimator, we would need to set ε ≤ γ (roughly); here again we would have to estimate γ! In a word, our method is applicable to a wider class of estimation problems, and this example is a typical one of those problems. (It should be remarked here that Adaptive Estimator works better than our algorithms for estimating p* itself; that is, their algorithms are better for that particular problem. See the discussion in Section 3.)

Now we explain intuitively why our adaptive sampling is at all possible.
Here again we use the above example problem, and we try to solve it with the batch sampling algorithm, but instead of solving the problem by executing the algorithm once, we use it several times in the following way. Suppose that we run the batch sampling algorithm with the sample size calculated by the Hoeffding bound for ε₁ = 1/4; that is, we estimate p* within ±ε₁. Then with high probability, the algorithm yields p₁ satisfying p* − 1/4 ≤ p₁ ≤ p* + 1/4. Thus, if p₁ is greater than 3/4, then we can determine (with high confidence) that p* is greater than 1/2; on the other hand, if p₁ is less than 1/4, then we can again determine that p* is less than 1/2. However, if p₁ is not in either range, that is, 1/4 ≤ p₁ ≤ 3/4, then we cannot conclude anything about whether p* is above or below 1/2. In that case we execute the algorithm again with a smaller accuracy parameter, ε₂ = ε₁/2 = 2⁻³. We continue this process until the estimate p_i obtained after the i-th iteration is not in the range [1/2 − 2^{−(i+1)}, 1/2 + 2^{−(i+1)}]. Suppose that the algorithm always gives an ε_i-close estimate of p*. Then routine calculations show that the algorithm terminates after log(1/γ) iterations. On the other hand, we need roughly O(1/ε_i²) examples for each iteration (ignoring the factor depending on the confidence parameter). Hence, altogether we need O(1/γ²) examples (precisely speaking, since an error may occur at each iteration, reducing the probability of such an error requires roughly O((1/γ²)·log(1/γ)) examples), and we can achieve this sample size without knowing γ. This is, very roughly speaking, our adaptive sampling method. Our actual algorithms are designed more carefully, which results in better sample complexity.

The method of sampling just described is referred to in statistics as multiple sampling [2], and it was one of the starting points for the study of the more refined sequential sampling methods [33].

3 The Adaptive Sampling Algorithm

In this section we formally describe the problem we would like to solve. We then present our algorithm, prove its reliability, and estimate its efficiency. We also give some modifications of our algorithm that make it more efficient, asymptotically and/or practically, in certain situations.

3.1 General Rule Selection Problem

We begin by introducing some notation. Let X = {x₁, x₂, ..., x_k} be a (large) set of examples and let H = {h₁, ..., h_n} be a (finite, not too large) set of n functions such that h_i : X → R. That is, each h ∈ H can be thought of as a function that can be evaluated on an example x, producing a real value g_{h,x} as a result. Intuitively, each h ∈ H corresponds to a "rule" or "law" explaining examples, which we call below simply a rule, and g_{h,x} measures the "goodness" of the rule on x.
For example, if the task is to predict a particular Boolean feature of an example x in terms of its other features, then we could set g_{h,x} = 1 if the feature is predicted correctly by h, and g_{h,x} = 0 if it is predicted incorrectly. We also assume that there is some fixed real-valued and nonnegative utility function U(h), measuring some global "goodness" of the rule (corresponding to) h on the set X. More specifically, U(h) is defined by

  U(h) = F(the average value of g_{h,x} in X),

where F is some function from R to R. That is, the "goodness" of h measured by the utility function U is not just the average of g_{h,x}; it can be something else, computed by F from the average of g_{h,x}. For the "average" of g_{h,x}, we simply use the arithmetic average Σ_{x∈X} g_{h,x} / |X|, which we denote avg(g_{h,x} : x ∈ X).

Now we are ready to state our problem.

General Rule Selection
Given: X, H, and ε, 0 < ε < 1.
Goal: Find h ∈ H such that U(h) ≥ (1 − ε) · U(h*), where h* ∈ H is a rule with the maximum value of U.

For any S ⊆ X, we define U(h, S) = F(avg(g_{h,x} : x ∈ S)). Then U(h) is simply U(h, X). Thus, one trivial way to solve our problem is to evaluate all functions h in H over all examples x in X, hence computing U(h, X) for every h, and then to find the h that maximizes this value. Obviously, if X is large, this method can be extremely inefficient. We want to solve the task much more efficiently by random sampling. That is, we want to look only at a fairly small, randomly drawn subset S ⊆ X, find the h that maximizes U(h, S), and still be sure, with high probability, that this h is close enough to the best one. (This is similar to PAC learning, whose goal is to obtain a probably approximately correct hypothesis.)

Remark 1: (Accuracy Parameter ε)
Intuitively, our task is to find some h ∈ H whose utility is reasonably high compared with the maximum U(h*), where the accuracy of U(h) relative to U(h*) is specified by the parameter ε. Certainly, the closer U(h) is to U(h*) the better. However, depending on the choice of U, high accuracy is not essential in some cases, and we may be able to use a large ε. The advantage of our algorithm becomes clear in such cases. (See the discussion at the end of the next subsection and the application in the next section.)

Remark 2: (Confidence Parameter δ)
We want to achieve the goal above by "random sampling", i.e., by using examples randomly selected from X. Then there must be some chance of selecting bad examples that make our algorithm yield an unsatisfactory h ∈ H. Thus, we introduce one more parameter δ > 0 for specifying confidence and require that the probability of such an error be bounded by δ.

Remark 3: (Distribution on X)
In order to simplify our discussion, we assume the uniform distribution over X, in which case U(h) is just the same as U(h, X). But our method works as well for any distribution D over X, so long as we can get each example independently following the distribution D. In fact, this is the case for the example we will consider in Section 4.
There is no difference at all in such cases, except that U(h) is now defined as U(h) = F(E[g_{h,x}]), where E[·] denotes expectation under the distribution that we assume over X.

Remark 4: (Condition on H)
In order to simplify our discussion, we assume in the following that the value of g_{h,x} is in [0, d] for some constant d > 0. (From now on, d will denote this constant.)

Remark 5: (Condition on U)
Our goal does not make sense if U(h*) is negative. Thus, we assume that U(h*) is positive. Also, in order for (any sort of) random sampling to work, it cannot happen that a single example drastically changes the value of U; otherwise, we would be forced to look at all examples of X even to approximate the value of U(h). Thus, we require that the function F that defines U be smooth. Formally, F needs to be c-Lipschitz for some constant c ≥ 0, as defined below. (From now on, c will denote the Lipschitz constant of F.)

Definition 1. A function F : R → R is c-Lipschitz if for all x, y it holds that |F(x) − F(y)| ≤ c · |x − y|. The Lipschitz constant of F is the minimum c ≥ 0 such that F is c-Lipschitz (if there is any).

Observe that all Lipschitz functions are continuous, and that all differentiable functions with a bounded derivative are Lipschitz. In fact, if F is differentiable, then by the Mean Value Theorem, the Lipschitz constant of F is max_x |F'(x)|. Also note that, from the above conditions, we have 0 ≤ U(h) ≤ cd for any h ∈ H.

Remark 6: (Minimization Problem)
In some situations, the primary goal might be not to maximize some utility function over the data but to minimize some penalty function P. That is, we want to find some h such that P(h) ≤ (1 + ε)P(h*). We can solve this minimization version of the General Rule Selection problem by an algorithm and analysis very similar to the ones we present here.

3.2 Adaptive Selection Algorithm

One can easily think of the following simple batch sampling approach: obtain a random sample S from X of a priori fixed size m, and output the function from H that has the highest utility on S. There are several statistical bounds for calculating an appropriate number m of examples. While this batch sampling solves the problem, its efficiency is not satisfactory because, as discussed in the previous section, it has to choose the sample size for the worst case. To overcome this inefficiency, we take a sequential sampling approach. Instead of statically deciding the sample size, our new algorithm obtains examples sequentially one by one, and it stops according to a condition based on the number of examples seen and the values of the functions on the examples seen so far. That is, the algorithm adapts to the situation at hand, and thus, if we are not in the worst case, the algorithm is able to realize this and stop earlier. Figure 1 shows pseudo-code of the algorithm we propose, called AdaSelect, for solving the General Rule Selection problem.

Algorithm AdaSelect(X, H, ε, δ)
  t ← 0; S_t ← ∅;
  repeat
    t ← t + 1;
    x ← randomly drawn example from X;
    S_t := S_{t−1} ∪ {x};
  until ∃h ∈ H [ U(h, S_t) ≥ α_t · (2/ε − 1) ];
  output h ∈ H with the largest U(h, S_t);
end.

Remark: Here α_t is computed as follows:

  α_t = cd · sqrt( ln(nt(t + 1)/δ) / (2t) ).    (1)

Figure 1: Pseudo-code of our on-line sampling algorithm AdaSelect.

Statistical bounds used to determine the sample size for batch sampling still play a key role in designing our algorithm. Here we choose the Hoeffding bound. One can use any reasonable bound, but the reason we choose the Hoeffding bound is that basically no assumption is necessary for using it to estimate the error probability and calculate the sample size. (The only assumption is that we can obtain examples independently following the same distribution, a natural assumption that holds for all the problems considered here.)
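The pseudo-code of Figure 1 translates directly into executable form. Below is a sketch in Python under the simplifying assumption that F is the identity, so that U(h, S) is just the running average of g_{h,x} and c = 1 suffices; the max_steps cap, the toy rules, and all names are our additions for illustration:

```python
import math
import random

def ada_select(draw_example, rules, eps, delta, c=1.0, d=1.0, max_steps=100_000):
    """Sketch of AdaSelect (Figure 1), with F = identity.

    draw_example: callable returning one random example from X.
    rules: dict mapping a rule name to a scoring function g, g(x) in [0, d].
    The max_steps cap is our safeguard, not part of the paper's algorithm."""
    n = len(rules)
    sums = {name: 0.0 for name in rules}
    t = 0
    while t < max_steps:
        t += 1
        x = draw_example()
        for name, g in rules.items():
            sums[name] += g(x)
        # alpha_t from equation (1) in the Remark of Figure 1.
        alpha_t = c * d * math.sqrt(math.log(n * t * (t + 1) / delta) / (2 * t))
        best, best_sum = max(sums.items(), key=lambda kv: kv[1])
        best_u = best_sum / t  # U(h, S_t) for the current best rule
        # Stopping condition of the repeat-loop.
        if best_u >= alpha_t * (2.0 / eps - 1.0):
            break
    return best, best_u, t

# Toy run: two predictors of a random feature; h1 is right 80% of the
# time, h2 only 40%, so h1 should be selected well before max_steps.
rng = random.Random(1)
rules = {"h1": lambda x: 1.0 if x < 0.8 else 0.0,
         "h2": lambda x: 1.0 if x < 0.4 else 0.0}
best, best_u, t = ada_select(rng.random, rules, eps=0.4, delta=0.05)
```

Note how the run stops after a few hundred examples here: the gap between the two rules is large, so the adaptive stopping condition fires long before any worst-case bound would.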
On the other hand, the bound from the Central Limit Theorem, for example, might be more appropriate for some practical situations, since it behaves better. We will discuss this issue in the next subsection.

Now we provide two theorems discussing the reliability and the complexity of the algorithm AdaSelect. For our analysis, we derive the following bounds from the Hoeffding bound.

Lemma 2. Let S ⊆ X be a set of size t obtained by independently drawing t elements from X at random. For any h ∈ H and ε ≥ 0, we have

  Pr{ U(h, S) ≥ U(h) + ε } < exp(−2ε²t/(cd)²), and
  Pr{ U(h, S) ≤ U(h) − ε } < exp(−2ε²t/(cd)²).

Remark. The lemma holds even if we consider some distribution D other than the uniform distribution over X. What we need to assume is that the elements of S are drawn independently according to D. (Note that U(h) is then defined as F(E[g_{h,x}]); on the other hand, we use the same definition for U(h, S).)

Proof. We prove the first inequality; the second one is proved symmetrically. Let g be the random variable avg(g_{h,x} : x ∈ S), i.e., the arithmetic average of t independent random variables. Observe that U(h) = F(E[g]) and U(h, S) = F(g), where E[g] is g's expectation. Then, using the fact that F is c-Lipschitz,

  Pr{ U(h, S) ≥ U(h) + ε } = Pr{ F(g) − F(E[g]) ≥ ε } ≤ Pr{ g − E[g] ≥ ε/c } = Pr{ g ≥ E[g] + ε/c }.

Note that g is the average of t independent random variables with range bounded by d. Hence, by the Hoeffding bound, the last probability is less than exp(−2(ε/c)²t/d²) = exp(−2ε²t/(cd)²), as claimed. □

We first prove the reliability of Algorithm AdaSelect.

Theorem 3. With probability at least 1 − δ, AdaSelect(X, H, ε, δ) outputs a function h ∈ H such that U(h) ≥ (1 − ε)U(h*).

Proof. For the given ε > 0, define H_bad = { h ∈ H : U(h) < (1 − ε)U(h*) }. We show that the function output by AdaSelect(X, H, ε, δ) is in H_bad with probability less than δ. That is, we want to bound the following error probability P_error by δ:

  P_error = Pr{ AdaSelect yields some h ∈ H_bad }.

Here we regard one repeat-loop iteration as a basic step of the algorithm and measure the algorithm's running time by the number of repeat-loop iterations. Note that the variable t keeps the number of executed repeat-loop iterations. Let t₀ be the integer such that the following inequalities hold (where we set α₀ = ∞):

  α_{t₀−1} > U(h*)·ε/2  and  α_{t₀} ≤ U(h*)·ε/2.

Note that α_t is strictly decreasing as a function of t; hence, t₀ is uniquely determined.
As we show below, the algorithm terminates by the t₀-th step (i.e., within t₀ repeat-loop iterations) with high probability. To derive the bound, we consider the following two cases: (Case 1) some h ∈ H_bad satisfies the stopping condition of the repeat-loop before the t₀-th step, and (Case 2) h* does not satisfy the stopping condition during the first t₀ steps. Whenever the algorithm makes an error, one of these cases certainly occurs. Thus, by bounding the probability that either (Case 1) or (Case 2) occurs, we can bound the error probability P_error of the algorithm.

First we bound the probability of (Case 1). Let h_bad be a rule in H_bad with the largest utility, and let EV(h, t') be the event that h satisfies the stopping condition of AdaSelect at the t'-th step. Then we have

  Pr{ (Case 1) holds } ≤ Σ_{1 ≤ t' < t₀} Σ_{h ∈ H_bad} Pr{ EV(h, t') } ≤ Σ_{1 ≤ t' < t₀} n · Pr{ EV(h_bad, t') }.

Note that Pr{EV(h_bad, t')} is bounded by P₁(t') = Pr{ U(h_bad, S_{t'}) ≥ α_{t'}·(2/ε − 1) }. We would now like to bound P₁(t') by using Lemma 2. Note, however, that the Hoeffding bound, and hence Lemma 2, is applicable only when the size of the set S ⊆ X is fixed in advance, whereas in our algorithm t itself is a random variable. Thus, precisely speaking, we cannot use Lemma 2 directly, and we argue as follows. First fix any t', 1 ≤ t' < t₀. We modify our algorithm by replacing the stopping condition of the repeat-loop with "t ≥ t'?"; that is, we modify the algorithm so that it always sees t' examples. Then it is easy to show that if U(h_bad, S_{t'}) ≥ α_{t'}·(2/ε − 1) in the original algorithm, then with the same sampling sequence we have U(h_bad, S_{t'}) ≥ α_{t'}·(2/ε − 1) in the modified algorithm. Thus, P₁(t') is at most P'₁(t') = Pr{ U(h_bad, S_{t'}) ≥ α_{t'}·(2/ε − 1) in the modified algorithm }. On the other hand, since the sample size is fixed in the modified algorithm, we can now use Lemma 2 to bound P'₁(t'). First we bound this as follows (here the probability is considered in the modified algorithm):

  P'₁(t') = Pr{ U(h_bad, S_{t'}) ≥ α_{t'}·(2/ε − 1) }
          = Pr{ U(h_bad, S_{t'}) ≥ α_{t'}·(2/ε − 2) + α_{t'} }
          ≤ Pr{ U(h_bad, S_{t'}) > U(h*)·(ε/2)·(2/ε − 2) + α_{t'} }
          = Pr{ U(h_bad, S_{t'}) > U(h*)(1 − ε) + α_{t'} }
          ≤ Pr{ U(h_bad, S_{t'}) > U(h_bad) + α_{t'} },

where we have used the facts that α_{t'} > U(h*)·(ε/2) for any t' < t₀ and that U(h*)(1 − ε) > U(h_bad). Then, by Lemma 2 and by our choice of α_t, we have

  P'₁(t') < exp(−2α_{t'}² t'/(cd)²) ≤ δ/(n t'(t' + 1)).

In fact, α_t is defined precisely so that the last inequality holds, a point that will be important in Section 3.3. Therefore, we bound Pr{(Case 1) holds} by

  Σ_{1 ≤ t' < t₀} n · exp(−2α_{t'}² t'/(cd)²) ≤ Σ_{1 ≤ t' < t₀} δ/(t'(t' + 1)) ≤ δ·(1 − 1/t₀).

Next consider (Case 2).
Clearly, (Case 2) implies U(h*, S_{t₀}) < α_{t₀}·(2/ε − 1), which implies, as in (Case 1), that U(h*, S_{t₀}) < α_{t₀}·(2/ε − 1) in a modified algorithm that always sees t₀ examples. Thus, the probability P₂ of (Case 2) is bounded as follows (here again the probability is considered in the modified algorithm):

  P₂ ≤ Pr{ U(h*, S_{t₀}) ≤ α_{t₀}·(2/ε − 1) }
     ≤ Pr{ U(h*, S_{t₀}) ≤ U(h*) − α_{t₀} }    (since α_{t₀} ≤ U(h*)·(ε/2), i.e., α_{t₀}·(2/ε) ≤ U(h*))
     < exp(−2α_{t₀}² t₀/(cd)²) = δ/(n t₀(t₀ + 1)) ≤ δ/t₀.

In summary, the probability that either (Case 1) or (Case 2) holds is bounded by δ·(1 − 1/t₀) + δ/t₀ = δ. □

Next we estimate the running time of the algorithm. As above, we regard one repeat-loop iteration as a basic step of the algorithm and measure the algorithm's running time by the number of repeat-loop iterations, which is exactly the number of required examples. In the above proof, we have already shown that the probability that the algorithm does not terminate within t₀ steps, that is, that (Case 2) occurs, is at most δ. Thus, the following theorem is immediate from the above proof.

Theorem 4. With probability at least 1 − δ, AdaSelect(X, H, ε, δ) halts within t₀ steps (in other words, AdaSelect(X, H, ε, δ) needs at most t₀ examples), where t₀ is the smallest integer such that α_{t₀} ≤ U(h*)·(ε/2).

Let us express t₀ in a more convenient form. Recall that t₀ is the smallest integer satisfying α_t ≤ U(h*)·(ε/2), or equivalently,

  t ≥ 2·(cd/(εU(h*)))² · ln( nt(t + 1)/δ ).

Thus, we may assume that equality holds for t₀. Further, approximating "x ≥ z·ln(y·x(x + 1))" by "x ≈ z·ln(y·z²)", we estimate t₀ as follows:

  t₀ ≈ 2·(cd/(εU(h*)))² · ln( (n/δ)·4·(cd/(εU(h*)))⁴ )
     = 2·(cd/(εU(h*)))² · ( ln(4n/δ) + 4·ln(cd/(εU(h*))) ).

Let us discuss the meaning of this formula. Since both n and δ appear only inside the log function, their influence on the complexity is small. In other words, we can handle the case of a relatively large number of rules and/or the case requiring a very high confidence without increasing the needed sample size too much. Roughly, we may consider the sample size to be proportional to (1/ε)² and ((cd)/U(h*))². (Recall that U(h) ≤ cd for any h ∈ H; hence, both 1/ε and (cd)/U(h*) are ≥ 1.) Depending on the choice of U, in some cases we may assume that (cd)/U(h*) is not so large; or, in some other cases, we may not need a small ε, so that 1/ε is not so large. AdaSelect performs particularly well in the latter case.

It should be noted here that, in the case where we want to select a rule depending on its success probability, that is, when g_{h,x} is either 0 or 1 and U(h, S) is defined as avg(g_{h,x} : x ∈ S), we had better use the algorithm of [21].
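For a feel of the magnitudes in the estimate of t₀ above, the closed-form expression can be evaluated numerically; a sketch with hypothetical parameter values (the helper name and the cd_over_u argument are ours):

```python
import math

def t0_estimate(n, delta, eps, cd_over_u):
    """Approximate sample size t0 from the estimate derived above,
    where cd_over_u stands for the ratio (cd) / U(h*)."""
    r = cd_over_u / eps
    return 2 * r * r * (math.log(4 * n / delta) + 4 * math.log(r))

# n and delta sit inside the logarithm, so even a hundredfold increase
# in the number of rules barely moves the estimate.
a = t0_estimate(n=100, delta=0.01, eps=0.2, cd_over_u=2.0)
b = t0_estimate(n=10000, delta=0.01, eps=0.2, cd_over_u=2.0)
```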
Of course, we could also use AdaSelect even in this case, but its sample size is roughly O(1/(εU(h*))²), whereas their sample size is O(1/(ε²U(h*))).

In summary, AdaSelect shows its advantage most when U is chosen so that (1) U(h) is not just the success probability of h, (2) a relatively large ε is sufficient, and (3) though (cd)/U(h*) is not bounded in general, it is not so large in lucky cases, which happen more often than the bad cases. We will see such an example in Section 4.

3.3 Some Improvements

We can still improve the above algorithm while keeping its outline. Here we explain two such improvements.

Although we may roughly consider the sample size to be proportional to ((cd)/(εU(h*)))², it would certainly be nice if we could reduce the factor ln(4n/δ) + 4 ln((cd)/(εU(h*))). First we try reducing this factor. Recall the probability analysis in the proof of Theorem 3. The probability of (Case 1) is bounded by the sum of nP_1(t̃) over all t̃, 1 ≤ t̃ < t_0. This is because our stopping condition is not monotonic, and we have to bound the error probability for each t̃, 1 ≤ t̃ < t_0. This error probability bound could then be reduced if we check the stopping condition less often instead of checking it every time we get an example.
This is the idea of our first improvement. This modification is also practically good in cases where it is not efficient to draw examples one by one, or where it is not so easy to compute U in an incremental way.

Now let us state our modification precisely and discuss how it works. For a given parameter s, we modify the algorithm so that the stopping condition is checked only at the s^k-th step for each k ≥ 0. (In general, s may not be an integer, in which case s^k means ⌊s^k⌋.) In other words, the repeat-condition is replaced with "∃k ≥ 0 [t = s^k] and ∃h ∈ H [U(h, S_t) ≥ α_t · (2/ε − 1)]?". (Of course, we do not have to draw examples one by one, nor to compute U or α_t, if t = s^k does not hold for any k.) This reduces the number of occasions on which some undesired hypothesis may be selected during the first t steps from t to log_s t. Due to this reduction, we can use the following α_t for t = s^k:

α_t = cd √( ln( n k (k + 1) / δ ) / (2t) ).    (2)

Let us refer to this modified algorithm as the geometric AdaSelect, and to the original one as the arithmetic AdaSelect. For this modified algorithm, we can prove the following theorem.

Theorem 5. With probability 1 − δ, the geometric AdaSelect, running on (X, H, δ, ε), outputs a function h ∈ H such that U(h) ≥ (1 − ε)U(h*).

Proof. The argument is essentially the same as for Theorem 3. The main difference is to use a different value for dividing (Case 1) and (Case 2). That is, instead of t_0, we use t_1 defined as t_1 = s^{k_1} with an integer k_1 satisfying

α_{s^{k_1 − 1}} > U(h*) ε/2   and   α_{s^{k_1}} ≤ U(h*) ε/2.

Then we can analyze the probability of (Case 2) in exactly the same way as in Theorem 3. For (Case 1), we only have to sum up nP_1(t̃) over all t̃ satisfying t̃ = s^k for some k < k_1. Then it is easy to see that the rest of the proof works. □

Similarly, we also have the following analysis.

Theorem 6. With probability 1 − δ, the geometric AdaSelect, running on (X, H, δ, ε), halts within t_1 steps, where t_1 is the smallest integer such that t_1 = s^{k_1} for some k_1 and α_{t_1} ≤ U(h*)(ε/2).

For comparing t_0 and t_1 to see how the sample size changes, we first estimate t_1 in a simpler form. By the definition of t_1, we may assume that

t_1 ≤ 2s (cd / (ε U(h*)))² ln( n k_1 (k_1 + 1) / δ ).

Here the new factor s comes in because the condition α_t ≤ U(h*)(ε/2) may first hold just after step s^{k_1 − 1}. In other words, the factor s is what we have to pay for modifying the algorithm to check the stopping condition less often.

As before, we obtain the following approximation from the above:

t_1 ≈ 2s (cd / (ε U(h*)))² ln( (n/δ) · ( log_s( 2s (cd / (ε U(h*)))² ) )² )
    ≲ 2s (cd / (ε U(h*)))² ln( (n/δ) · 4 ( log_s(cd / (ε U(h*))) )² (log_s 2s)² )
    = 2s (cd / (ε U(h*)))² ( ln(4n/δ) + 2 ln log_s(cd / (ε U(h*))) + 2 ln log_s 2s ).
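These closed-form estimates are easy to evaluate numerically. The sketch below (ours, not from the paper; the function names and the sample parameter values are purely illustrative) computes the approximations of t_0 and t_1 derived above, writing a for the ratio cd/(εU(h*)):

```python
import math

def t0_bound(n, delta, a):
    """Arithmetic AdaSelect: t0 ~ 2*a^2*(ln(4n/delta) + 4*ln(a)),
    with a = cd/(eps*U(h*))."""
    return 2 * a * a * (math.log(4 * n / delta) + 4 * math.log(a))

def t1_bound(n, delta, a, s=2.0):
    """Geometric AdaSelect: t1 ~ 2*s*a^2*(ln(4n/delta)
    + 2*ln(log_s a) + 2*ln(log_s 2s))."""
    def log_s(x):
        return math.log(x) / math.log(s)
    return 2 * s * a * a * (math.log(4 * n / delta)
                            + 2 * math.log(log_s(a))
                            + 2 * math.log(log_s(2 * s)))

# n and delta sit inside the logarithm, so their influence is mild;
# for a large ratio a, the 4*ln(a) term dominates t0 while t1 only
# pays 2*ln(log_s a), so the geometric bound becomes the smaller one.
t0 = t0_bound(n=1000, delta=0.1, a=1e6)
t1 = t1_bound(n=1000, delta=0.1, a=1e6)
```

For moderate ratios a the extra factor s can make t1 the larger of the two; the advantage of the geometric schedule appears once 4 ln(a) clearly exceeds the log-log terms.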
Comparing this estimation with the previous one for t_0, we can see that t_1 is asymptotically better than t_0 for any fixed s. In actual applications, we can often improve the sample size by choosing s appropriately. See the next section for such an example.

Next we consider using some other statistical bounds instead of the Hoeffding bound. In particular, we consider a bound from the Central Limit Theorem, because it is again widely applicable and it gives a quite good approximation.

We first recall the Central Limit Theorem. Let ḡ be the average of t independent and identically distributed random variables g_1, ..., g_t that take values in some closed interval, and let E and V be the expectation and variance of each g_i respectively. (Then we can calculate the expectation and variance of ḡ as E[ḡ] = E and V[ḡ] = V/t.) The Central Limit Theorem claims that (ḡ − E[ḡ])/√V[ḡ] converges to the normal distribution as t goes to infinity. More specifically, if t is large enough, then we may be able to use the following approximation:

Pr{ (ḡ − E[ḡ]) / √V[ḡ] ≤ a } = Pr{ (ḡ − E) / √(V/t) ≤ a } ≈ Φ(a),

where Φ(a) = (1/√(2π)) ∫_{−∞}^{a} e^{−x²/2} dx is the normal distribution function. In the following we refer to this approximation as the Central Limit Approximation. It is folklore in statistics that if t exceeds 30, then we can safely assume that (ḡ − E[ḡ])/√V[ḡ] is close enough to the normal distribution to use the Central Limit Approximation.

While Φ(a) and Φ⁻¹(b) are not so difficult to compute, it is still convenient if we can state our bound without using Φ, and it is particularly helpful for comparing the bound from the Central Limit Approximation with the Hoeffding bound. Fortunately, we have 1 − Φ(x) ≤ exp(−x²/2)/(x√(2π)), and this is a reasonably close bound [11]. Thus we use this to derive the following bound, which we call the Central Limit Approximation Bound. (Recall that Φ(−a) = 1 − Φ(a).
Then the derivation is clear.)

Pr{ ḡ ≥ E + ε } ≈ Pr{ ḡ ≤ E − ε } ≲ ( √V / (ε √(2πt)) ) exp( −ε² t / (2V) ).

By using this bound, we can show a lemma corresponding to Lemma 2.

Lemma 7. Let S ⊆ X be a set of size t obtained by independently drawing t elements from X at random. Let V be the variance of g_{h,x}. For any h ∈ H and ε ≥ 0, if t is large enough to use the Central Limit Approximation, then we have

Pr{ U(h, S) ≥ U(h) + ε } ≈ Pr{ U(h, S) ≤ U(h) − ε } ≲ ( c√V / (ε √(2πt)) ) exp( −ε² t / (2c²V) ).

In particular, if we assume that V ≤ b² for some b > 0, then we have

Pr{ U(h, S) ≥ U(h) + ε } ≈ Pr{ U(h, S) ≤ U(h) − ε } ≲ ( bc / (ε √(2πt)) ) exp( −ε² t / (2(bc)²) ).

Now we modify our algorithm based on this lemma. First consider the original, i.e., the arithmetic AdaSelect. Recall the remark on α_t in the proof of Theorem 3. We defined α_t so that the probability given by the Hoeffding bound is bounded by δ/(nt(t+1)). Thus, with our new bound, we need to define α_t so that the following inequality holds.
( bc / (α_t √(2πt)) ) exp( −α_t² t / (2(bc)²) ) ≤ δ / ( n t (t + 1) ).

To define α_t more explicitly, we let α_t = β · bc √(2/t). Then the above inequality is equivalent to β exp(β²) ≥ γ, where γ = n t (t + 1) / (2δ√π). Thus, the inequality holds if β ≥ √( ln γ − (ln ln γ − 1)/2 ), which gives the following definition of α_t:

α_t = bc √( (2 ln γ − ln ln γ + 1) / t ),  where γ = n t (t + 1) / (2δ√π).    (3)

Then it is clear from our discussion that a theorem corresponding to Theorem 3 holds for this choice of α_t. Similarly, we can define α_t for the geometric AdaSelect; we only need to change t in γ to k.

We can also analyze the efficiency of these modified algorithms in the same way as in Theorem 4. For the arithmetic version, we get the following sample size bound t_0′ by letting α_{t_0′} ≈ U(h*)ε/2:

t_0′ ≈ 8β² ( bc / (ε U(h*)) )² = 8 ( bc / (ε U(h*)) )² ( ln γ − (ln ln γ − 1)/2 ),

where γ = n t_0′ (t_0′ + 1) / (2δ√π). Similarly we have, for the geometric version,

t_1′ ≈ 8s ( bc / (ε U(h*)) )² ( ln γ − (ln ln γ − 1)/2 ),

where γ = n log_s t_1′ (log_s t_1′ + 1) / (2δ√π). Though asymptotically these bounds are almost the same as before, the actual values could be much smaller in many practical situations. We will compare these bounds on one typical example in the next section.

4 An Application of AdaSelect: A Case Study

In this section we describe in detail how our method can be instantiated in a particular application, and provide an experimental evaluation of the algorithms proposed in Section 3. Though we concentrate on one application example, the possibility of some other applications is also discussed at the end of this section.

4.1 Some Remarks on Applications and Our Example

Before describing our application, let us make some remarks about the kind of problems to which our algorithms can be applied. First, recall that our algorithms are designed for the General Rule Selection problem described in Section 3. This problem is simple; it may be too simple to capture any sort of actual data mining problem.
We do not propose to use our sampling algorithms for solving data mining problems directly; rather, we propose to use them as a tool. As explained below, some subtasks of data mining algorithms can be cast as instances of the General Rule Selection problem, and then those parts can be scaled up by using our sampling algorithms instead of using all the data. Also, as remarked in the previous sections, our method is not one that we should always use. It shows its advantages most when an appropriate utility function U can be used, and there are some cases, though only very particular ones, in which the other method performs better.

The example we choose is a boosting-based classification algorithm that uses a simple decision stump learner as a base learner, and our sampling algorithms are used to scale up the base learner. Boosting
[31, 12] is a technique for making a sufficiently "strong" learning algorithm out of a "weak" learning algorithm. Since a weak base learning algorithm (a base learner, in short) does not need to obtain a highly accurate classification rule, we can use for a base learner a learning algorithm that produces a not-so-accurate but simple classification rule (which we call a weak hypothesis). Then, since a set H of such simple weak hypotheses is not so large, we may be able to select the best one just by searching through H exhaustively, which is indeed one instance of our General Rule Selection problem. That is, any rule selection algorithm can be used as a base learner, and we do this by random sampling. Here we use a set H_DS of decision stumps as our weak hypothesis class, where a decision stump is a single-split decision tree with only two terminal nodes. We consider the case where the attributes used for a split node are discrete Boolean attributes and the number of all possible decision stumps is not so large. On the other hand, the set X of examples is huge, and it is simply infeasible to see all examples in X. In such a situation, the best possible approach is selection via random sampling, i.e., selecting the weak hypothesis from H_DS that performs the best classification on randomly selected examples from X. We use our sampling algorithms for this selection.

We had better explain why this particular example was chosen. Weak hypothesis selection was the initial target of our series of research [5], and we have chosen this example because it is the easiest but quite illustrative application of AdaSelect, one that reveals why other similar random sampling methods are not suitable (see the discussion about the utility function in the next subsection). It also shows a fundamental difference between our method and other sampling methods described in the knowledge discovery literature, like [16, 26]. These methods are based on learning curves of the induction algorithm used as a black box.
Roughly speaking, these methods run the overall learning algorithm (which itself corresponds to executing a sufficient number of boosting iterations) several times with increasing sample sizes, produce "strong" hypotheses, and determine the point where the accuracy of the obtained hypotheses saturates. That is, these methods should be used on top of some well-designed learning algorithm, while ours is used as a part of some learning algorithm. Finally, the choice of boosting has been motivated by its recent success. Several recent papers [14, 3, 28, 1] report very extensive experiments showing the advantage of boosting over other induction methods, but to the authors' knowledge, not much research has been reported on how to use it on very large datasets.

We should also clarify the type of boosting technique we use here. For the boosting part, the obvious choice would be the AdaBoost algorithm of [13], since this algorithm has been repeatedly reported to outperform any other voting method [14, 28, 3, 1]. However, AdaBoost has the following problem that makes it unsuitable for combination with our sampling method. AdaBoost is designed for boosting by re-weighting, where its base learner is required to produce a hypothesis that tries to minimize error with respect to a weighted training set. The training set is fixed at the beginning and used throughout all the boosting steps, with AdaBoost modifying the weights at each iteration depending on the hypotheses being obtained. However, we aim to use our sampling method to speed up the base learner by changing the size of the training set adaptively. Thus, we cannot use AdaBoost in a straightforward manner. Instead of fixing a training set, we can draw random examples, filter them according to their current weights, and pass some of them to the base learner. This is what is usually called "boosting by filtering" or "boosting by re-sampling".
Notice that in this way we can actually have an algorithm generating instances under the distribution with which the base learner is required to work. Unfortunately, though, the weights produced by AdaBoost cannot be used for this, since some of the weights become quite large in the first several steps. The obvious solution is to normalize them so that they sum to 1 and can thus be used as the filtering probability. However, to normalize the weights we have to go through the entire dataset; hence, the advantage of using sampling is simply lost. Moreover, some empirical evidence shows
that AdaBoost with re-sampling does not work well [28]. This problem of AdaBoost has recently been addressed by Domingo and Watanabe [35, 8]. To overcome it, they proposed MadaBoost, a simple modification of AdaBoost that is suitable for boosting both by re-weighting and by re-sampling while keeping the desired boosting property. (For more details on MadaBoost we refer the reader to [8].) In our experiments, we use MadaBoost and do boosting by re-weighting and re-sampling. Thus, we can assume that the base learner can draw examples under an appropriately modified distribution at each boosting stage.

4.2 The Description of the Problem and the Algorithm

Here we specify our example problem in detail and instantiate AdaSelect for this problem.

As we have already explained, we use our sampling algorithm for the base learner used by the boosting algorithm MadaBoost. Our task is to select, from a weak hypothesis class, a "good" hypothesis that performs nearly the best classification on randomly selected examples from X. Note that the boosting algorithm modifies the distribution over X at each boosting step, and the goodness of hypotheses has to be measured with respect to the current distribution. On the other hand, we can assume some example generator EX, provided by the boosting algorithm, that generates x from X according to the current distribution.

Now we define each item of the General Rule Selection problem specifically. We assume some set of discrete attributes. Each example is regarded as a vector of the values of each attribute on the instance plus its class. Thus, the set X contains all such labeled examples. The class H of rules is H_DS, the set of all possible decision stumps over the set of discrete attributes, and a decision stump is used as a classification rule. For each h ∈ H_DS and each x = (a, b) ∈ X (a being the example and b its class), we define g_{h,x} = 1 if x is correctly classified by h (that is, h(a) = b) and g_{h,x} = 0 otherwise. That is, g_{h,x} indicates the "goodness" of h on x.
Boosting techniques require a base learner to provide a weak hypothesis that performs better than random guessing. Since they were originally designed for binary classification, the worst classifier is assumed to have accuracy equal to 50%. (Notice that for non-binary classification problems, achieving such accuracy is not a trivial task at all.) A weak hypothesis with error not less than 50% would force the boosting process to stop. Thus, in order to make sure we get a weak hypothesis with prediction error less than 1/2, we should measure the goodness of a hypothesis by its "advantage" over the random binary guess, i.e., its probability of being correct minus 1/2. Here it is quite easy to adjust our goodness measure. What we need is to define the utility function U by U(h) = E[g_{h,x}] − 1/2; that is, we use F(y) = y − 1/2 to calculate the total goodness of h from the average goodness of h. Accordingly, U(h, S) is defined by

U(h, S) = avg(g_{h,x} : x ∈ S) − 1/2 = ( Σ_{x∈S} g_{h,x} ) / ‖S‖ − 1/2.

This is the specification of the problem that we want to solve. One important point here is that we do not have to worry so much about the accuracy parameter ε, which is due to our choice of the utility function. For example, we can fix ε = 1/2. Then our sampling algorithm may choose some h whose utility U(h) is U(h*)/2, just half of the best one. But we can still guarantee that the misclassification probability of h is smaller than 1/2; in fact, it is 1/2 − U(h*)/2, whereas that of the best one is 1/2 − U(h*). On the other hand, if we measured the goodness of each hypothesis by its correct classification probability, then we would have to choose ε small enough (depending on the best performance) to ensure that the selected hypothesis is better than the random guess. By introducing the notion of "utility", we can discuss selection problems general enough to let us attack problems like this example easily.

Next we define some other parameters and instantiate our algorithm.
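Before fixing the remaining parameters, here is a toy sketch (ours; the attribute name and the four-example sample are made up for illustration) of a decision stump and the utility just defined:

```python
def stump(attr, if_true, if_false):
    """A decision stump over Boolean attributes: a single-split tree
    with two terminal nodes, used as a classification rule."""
    return lambda a: if_true if a[attr] else if_false

def utility(h, sample):
    """U(h, S) = avg(g_{h,x} : x in S) - 1/2, i.e. the sample accuracy
    of h minus 1/2 (its advantage over random binary guessing)."""
    correct = sum(1 for a, b in sample if h(a) == b)
    return correct / len(sample) - 0.5

# Hypothetical sample S of (attributes, class) pairs.
S = [({"windy": True}, 1), ({"windy": True}, 1),
     ({"windy": False}, 0), ({"windy": False}, 1)]
h = stump("windy", 1, 0)   # predict class 1 iff "windy" holds
# h is correct on 3 of the 4 examples, so U(h, S) = 3/4 - 1/2 = 0.25
```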
Note that the function F used here to define U from avg(g_{h,x} : x ∈ S) is just F(y) = y − 1/2; hence, it is 1-Lipschitz and the parameter
c is set to 1. Also, since g_{h,x} is either 0 or 1, the parameter d is set to 1. We use the Central Limit Approximation bound because it gives a better sample size, and the number of examples is large enough to assume that the Central Limit Approximation holds. For using this bound, we need a bound b with V[g_{h,x}] ≤ b². Note that V[g_{h,x}] = p_h(1 − p_h), where p_h is the probability that h classifies x ∈ X correctly under the assumed distribution over X. Hence, V[g_{h,x}] ≤ 1/4, and we use b = 1/2. With all necessary parameters fixed, we can now state our algorithm as in Figure 2. (Here we state the arithmetic version. For the geometric version, we set the parameter s = 2, because with this setting the sample size t_1′ is smaller than t_0′ in any reasonable situation.)

Algorithm Decision Stump Selector    % We set ε = 0.5 and δ = 0.1.
  t ← 0; S ← ∅;
  repeat
    use EX to generate one example and add it to S;
    t ← t + 1;
    U(h, S) ← ‖{x ∈ S : h classifies x correctly}‖ / t − 1/2;
  until ( ∃h ∈ H [ U(h, S) ≥ α_t (2/ε − 1) ] ) or ( t > LARGE_NUM );
  if t ≤ LARGE_NUM then
    output h_0 ∈ H with the largest U(h, S);
    output U(h_0, S) + 1/2 as an estimate of h_0's success prob.;
  else
    exit and abort MadaBoost;
end.
Remark: Here, based on (3), α_t is computed as follows:
  α_t = (1/2) √( (2 ln γ − ln ln γ + 1) / t ),  where γ = n t (t + 1) / (2δ√π).    (4)

Figure 2: The arithmetic version of Decision Stump Selector using the CLA bound.

Two additional remarks on this algorithm. First, we modified the algorithm to output not only the hypothesis h_0 but also an estimate of the correct classification probability of h_0, because it is needed by the boosting algorithm. Secondly, in real applications it sometimes occurs that there is no appropriate weak hypothesis in the class, in which case our original algorithm would go into an infinite loop. To avoid this situation, we added a "sanity bound" t > LARGE_NUM to the repeat-loop condition.
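The loop of Figure 2 can be rendered in code roughly as follows (our sketch: the example source EX, the hypothesis representation, and the incremental bookkeeping of correct counts are our own simplifications, and the α_t here assumes the bc = 1/2 scaling from (3) with b = 1/2, c = d = 1):

```python
import math

def decision_stump_selector(ex, hypotheses, n, eps=0.5, delta=0.1,
                            large_num=92103):
    """Sketch of the arithmetic Decision Stump Selector of Figure 2.
    ex() draws one labeled example (a, b) under the current boosting
    distribution; hypotheses maps a name to a classifier h(a) -> b;
    n is the number of candidate stumps.  Returns (best name,
    estimated success probability), or (None, None) to abort."""
    correct = {name: 0 for name in hypotheses}
    t = 0
    while True:
        a, b = ex()
        t += 1
        for name, h in hypotheses.items():
            if h(a) == b:
                correct[name] += 1
        # alpha_t as in equation (4), using the CLA bound.
        gamma = n * t * (t + 1) / (2 * delta * math.sqrt(math.pi))
        alpha = 0.5 * math.sqrt(
            (2 * math.log(gamma) - math.log(math.log(gamma)) + 1) / t)
        name = max(correct, key=correct.get)
        u = correct[name] / t - 0.5
        if u >= alpha * (2 / eps - 1):
            return name, u + 0.5      # success prob. estimate U + 1/2
        if t > large_num:
            return None, None         # sanity bound: abort MadaBoost
```

For instance, with two candidate stumps of which one agrees with the (noisy) label 90% of the time, the loop typically stops after a few hundred examples and returns that stump.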
If this bound is met, then it means that (with high probability) the best hypothesis in the class has an accuracy smaller than ALMOST_HALF, and thus the overall boosting process must be stopped here, because no further improvement can be hoped for beyond this stage. For our experiments, we set ALMOST_HALF = 0.505; the Hoeffding bound then gives LARGE_NUM = 92103. This bound was not reached in any run of our experiments.

4.3 Experimental Results

Here we provide some experimental results investigating the performance of our Decision Stump Selector.

We have conducted our experiments on a collection of datasets from the repository at the University of California at Irvine [18]. Two points need to be clarified about the way we used these datasets.

Firstly, some attributes in these datasets are continuous, while our learning algorithm is designed
for discrete Boolean attributes. Hence, we needed to discretize some of the data. As a discretization algorithm, we have used equal-width interval binning with 5 intervals. Although this method has been shown to be inferior to more sophisticated methods like entropy discretization [10], it is very easy to implement, very fast, and the performance difference is small [9]. Missing attributes are handled by treating "missing" as a legitimate value.

Secondly, we had to inflate the datasets, since we need large datasets but most of the UCI datasets are quite small. That is, following [16], we have artificially inflated them by introducing 100 copies of all records and randomizing their order. The datasets Shuttle and Adult already have a large number of examples, so these two datasets have been used in their original form. We are aware that this is perhaps not the best way to test our algorithms, and real large datasets would obviously have been better; but we still believe that the results are informative and convincing enough to show the goodness of our method.

We have chosen datasets with a small number of classes (2 or 3), with one exception, since, according to previous experiments on boosting stumps, we do not expect our base learner to be able to find weak hypotheses with accuracy better than 50% for problems with a large number of classes. Apart from this restriction, the choice of datasets has been made so that it reflects a variety of dataset sizes and combinations of discrete and continuous attributes.

For every dataset, we have used 10-fold cross validation if a test set is not provided, and for the algorithms using sampling (and thus randomized) the results are averaged over 10 runs. All the experiments have been done on a computer with an Alpha 600MHz CPU, 256MB of memory, and a 4.3GB hard disk, running Linux. Since enough memory was available, all the data has been loaded into a table in main memory, and the algorithms have been run from there.
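The equal-width interval binning mentioned above takes only a few lines; a sketch (ours), with 5 intervals and missing values, represented here as None, kept as a legitimate separate value:

```python
def equal_width_bins(values, k=5):
    """Discretize a continuous attribute into k equal-width intervals.
    Missing values (None) are treated as a separate legitimate value."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / k or 1.0   # guard against a constant attribute
    def bin_of(v):
        if v is None:
            return "missing"
        # clamp the maximum into the last interval
        return min(k - 1, int((v - lo) / width))
    return [bin_of(v) for v in values]
```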
Loading the data took a few seconds, and since this time is the same for all the algorithms, it has been omitted from the results. Notice that for the algorithms not using sampling, this is the only way to run them. For the algorithms using sampling, it could have been done from disk using an efficient method for sampling from external memory. Doing experiments under these conditions is part of our future work.

The experiments were done to estimate the performance of the arithmetic and geometric versions of Decision Stump Selector as used by MadaBoost for boosting by filtering. For comparison, we executed MadaBoost for boosting by re-weighting with the whole dataset; that is, MadaBoost with the base learner that selects the best decision stump by searching through the whole dataset exhaustively. Below we use Ar., Ge., and Exh. to denote the arithmetic and geometric versions of Decision Stump Selector and the exhaustive search selector, respectively. We also carried out the experiments with Exh. using the original AdaBoost and found the difference from MadaBoost with Exh. almost negligible. We have set the number of boosting rounds to 10, which is usually enough to converge to a fixed training error (that is, although we keep obtaining hypotheses with accuracy slightly better than 50%, the training error does not get reduced any more).

Table 1 shows the accuracy obtained on these datasets by the three combinations of selectors with MadaBoost. The columns "Size" and "‖H_DS‖" show, for each dataset, the number of examples in it and the number of all possible decision stumps, respectively. As can easily be seen, there is no significant difference between the accuracies obtained by these three methods. These results indicate that our sampling method is accurate enough.

Once we have established that there is no loss in accuracy due to the use of sampling, we should check whether there is any gain in efficiency.
Table 2 shows the running times of MadaBoost combined with the three selection algorithms: the exhaustive one (Exh.), and the arithmetic (Ar.) and geometric (Ge.) versions. We have also provided the running time of C4.5 on those datasets. The reason is that boosting is usually a slow process, and thus, even though we are reducing its running time by using
sampling, we still want to know how fast or slow this running time is compared to that of a state-of-the-art learning method like C4.5.

Name      Size    ‖H_DS‖  Exh.   Ar.    Ge.
agaricus  812400  296     97.74  97.84  98.03
kr-vs-kp  319600  222     93.19  92.89  92.71
splice    319000  3240    85.15  84.92  83.71
german    100000  222     74.10  74.28  74.30
ionos     35100   408     90.26  89.53  89.63
shuttle   43500   2268    93.34  93.34  93.34
adult     30162   286     82.47  82.33  82.22

Table 1: Accuracies of boosted decision stumps with and without sampling.

Name      Size    ‖H_DS‖  Exh.     Ar.     Ge.     C4.5
agaricus  812400  296     892.31   2.07    1.78    21.65
kr-vs-kp  319600  222     265.63   3.68    2.75    31.13
splice    319000  3240    3036.00  235.56  159.54  49.76
german    100000  222     80.75    16.96   10.47   20.34
ionos     35100   408     56.95    6.29    4.74    29.47
shuttle   43500   2268    238.57   43.71   33.48   9.32
adult     30162   286     30.49    15.14   10.02   12.93

Table 2: Running times (in seconds) of MadaBoost with and without sampling, and of C4.5.

Let us discuss these results. First, one can easily see that using the whole dataset is a very slow process, particularly for large and complex datasets that need a large number of decision stumps, like, for instance, the dataset Splice. The running time of MadaBoost with Exh. is a function of the dataset size, the number of decision stumps considered (which depends on how many attributes the dataset has and their ranges of values), and the number of boosting rounds. Notice that even though the dataset Agaricus is more than twice the size of Splice, its decision stump set ‖H_DS‖ is less than 1/10 that of Splice, and this results in a faster running time.

For the algorithms using sampling, we can see that the running times have all been greatly reduced. For instance, for MadaBoost with the arithmetic version, the running time is, on average, reduced to about 1/94 of the Exh. case when including Agaricus, which is a particularly good case where the running time is reduced to 1/446; even if we do not count Agaricus, the average reduction in running time is about 1/36.
Now the slowest dataset is still Splice, but it takes less than 4 minutes. Surprisingly enough, for the sampling versions the fastest dataset becomes the largest one, Agaricus. This is due to the particular structure of this dataset. First, its decision stump set ‖H_DS‖ is very small. Secondly, during all 10 boosting iterations one can find hypotheses with accuracy larger than 70%, and thus the sample sizes needed are very small. This contrasts with datasets like German, where a similar number of decision stumps is considered and, even though the dataset is less than 1/8 the size of Agaricus, the running time on German is 8 times larger. This is because for this dataset, after the third boosting iteration, even the best stump has accuracy smaller than 60%, and this affects the efficiency of the sampling method.

To further understand these differences, we provide in Table 3 the results concerning the data usage of the sampling versions of our algorithm. Column "Sampled" shows the total number of examples sampled during the 10 boosting iterations, and column "Used" shows the total number of examples
actually used by Decision Stump Selector. (Recall that we are using boosting by filtering; so when Decision Stump Selector asks for an example, the filtering method might have to sample several of them until it gets one that passes the filter.)

Dataset                   Ar.              Ge.
Name      Size    ‖H_DS‖  Sampled  Used    Sampled  Used
agaricus  812400  296     82915    13907   75573    11968
kr-vs-kp  319600  222     65460    36150   49972    27776
splice    319000  3240    299624   225478  205023   155392
german    100000  222     264763   162058  179088   109184
ionos     35100   408     78822    35395   58260    26688
shuttle   43500   2268    221193   76503   173156   54400
adult     30162   286     270454   118247  181640   81152

Table 3: Data usage results for boosted decision stumps with sampling.

There, the differences in the running time on datasets like Agaricus and German can be easily understood, since one can see that the number of examples used in Agaricus is less than 1/10 of the number used in German.

As for the difference between arithmetic and geometric sampling, we can see that, for this particular problem and our choice of s = 2, geometric sampling is, on average, around 1.3 times faster than arithmetic sampling, which is what we can expect from our theoretical estimation.

With respect to C4.5, we can see that our algorithm is faster on many datasets. More specifically, for the datasets Shuttle and Splice, C4.5 is roughly 3 times faster. For the other datasets, on the other hand, MadaBoost with geometric sampling is around 6.5 times faster than C4.5.

4.4 Other Applications

We have seen in detail how our method can be instantiated in a particular application. Here we review some other data mining algorithms to which our sampling method is potentially applicable. We remark on how one might try to apply the method and which problems one might encounter. Each of these applications requires an individual in-depth study and experimentation in order to determine whether it is feasible or not.
Our purpose in this section is just to encourage further study of how to apply adaptive sampling methods to these techniques, since we believe that it might lead to successful applications.

Association rules. We already studied this problem in [6], where our algorithm was used to estimate the support of rules through sampling. There, the distance of the estimated support of a rule from the minimum support was used as the utility function. Although the preliminary experiments reported there were very encouraging, further work is still needed to integrate our method into an overall algorithm for mining association rules like Apriori.

Naive Bayes. This induction system seems very suitable for our algorithm. It works by creating a set of discriminant functions, one for each class, which are just certain probabilities obtained from the training set. However, this application is not as straightforward as it seems. One should carefully determine the most appropriate choice of utility function, and we should also decide how to deal with some probabilities having 0 or very small values. Again, we can see here that our method would provide a sampling procedure not for the overall algorithm, as was done in [16], but for a subprocedure of it, for example, one for every discriminant function.

Decision Tree Induction. This application is the most interesting, because decision trees have
been widely used and the decision tree representation appeals to human interpretation. For this problem, our sampling method might be used to decide, every time a new node is constructed, on which attribute to split. Intuitively, the splitting criterion should be used as the utility function, since it is a function of an average over the dataset. It also seems clear that one might not need to go through all the data to decide which split is best (or close to the best), and it is precisely in this situation that our sampling method might be useful. However, this application is far from trivial, due to the complexity of some commonly used splitting functions, like the gain ratio, and due to the fragmentation problem. Even though the amount of data reaching nodes close to the root might be large enough for sampling to decide which split to use, the deeper the tree grows, the smaller the amount of data available at each node becomes. Thus, the effectiveness of sampling might be lost.

5 Concluding Remarks

We have presented a new sampling methodology that, while keeping all the theoretical guarantees of previous ones, is applicable in a wider setting and, moreover, is very likely to be useful in practice. The key point is to perform sampling sequentially and to determine when to stop sampling by a carefully designed stopping condition. We give both theoretical justification and an efficiency analysis of our algorithms, and then show some preliminary experimental results illustrating the advantage of our method. To give a concrete example, we fix one specific problem, namely the design of a base learner for a boosting algorithm, and discuss how our method can be used.
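As a toy illustration of that kind of base-learner design, the following sketch selects the best of a small set of decision stumps by sequential sampling, in the spirit of Hoeffding races [22]. It is our own simplified example, not the actual Decision Stump Selector: all names and parameters (`select_best`, `batch`, `n_max`) are invented for the sketch, and the stopping rule is a plain union-bound Hoeffding test.

```python
import math
import random

def select_best(candidates, draw_example, delta, n_max=100_000, batch=50):
    # Sequentially sample labelled examples and return the candidate
    # predictor whose empirical accuracy is confidently the best: stop
    # once the leader's lower confidence bound exceeds every rival's
    # upper bound (or when n_max examples have been consumed).
    correct = [0] * len(candidates)
    n, best = 0, 0
    while n < n_max:
        for _ in range(batch):
            x, y = draw_example()
            n += 1
            for i, h in enumerate(candidates):
                correct[i] += (h(x) == y)
        acc = [c / n for c in correct]
        # radius so that all 2k one-sided bounds hold simultaneously
        # with probability >= 1 - delta (union bound)
        r = math.sqrt(math.log(2 * len(candidates) / delta) / (2 * n))
        best = max(range(len(candidates)), key=lambda i: acc[i])
        if all(acc[best] - r > acc[i] + r
               for i in range(len(candidates)) if i != best):
            break
    return candidates[best], n

# toy usage: stumps thresholding a 1-D feature; the true concept is x > 0.5
random.seed(1)
stumps = [lambda x, t=t: x > t for t in (0.1, 0.5, 0.9)]

def draw_example():
    x = random.random()
    return x, x > 0.5   # toy concept: label is x > 0.5

best_stump, n_used = select_best(stumps, draw_example, delta=0.05)
```

Because sampling proceeds in small batches and stops as soon as one candidate is confidently ahead, easy selection problems consume far fewer examples than a worst-case fixed sample size would require.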
But this is not an isolated example: as discussed in Section 4.4, there are many other applications of our method where it seems very plausible that it will succeed, and studying them is part of our future work.

Acknowledgements

We would like to thank Heikki Mannila for pointing us to the work on adaptive sampling for database query estimation and for encouraging us to follow up our previous research on adaptive sampling. We would also like to thank Chris Watkins for telling us about Hoeffding races, Pedro Domingos for pointing us to several machine learning papers related to this work, and Tadashi Yamazaki for technical help with the experiments.

References

[1] Bauer, E. and Kohavi, R., 1998. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 1-38.
[2] Bartky, W., 1943. Multiple Sampling with Constant Probability. The Annals of Mathematical Statistics, 14:363-377.
[3] Dietterich, T., 1998. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 32:1-22.
[4] Dodge, H.F. and Romig, H.G., 1929. A Method of Sampling Inspection. The Bell System Technical Journal, 8:613-631.
[5] Domingo, C., Gavaldà, R. and Watanabe, O., 1998. Practical Algorithms for On-line Selection. Proceedings of the First International Conference on Discovery Science, DS'98. Lecture Notes in Artificial Intelligence 1532:150-161.
[6] Domingo, C., Gavaldà, R. and Watanabe, O., 1999. On-line Sampling Methods for Discovering Association Rules. Tech. Rep. C-126, Dept. of Math. and Computing Science, Tokyo Institute of Technology.
[7] Domingo, C., Gavaldà, R. and Watanabe, O., 1999. Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms. To appear in Proceedings of the Second International Conference on Discovery Science.
[8] Domingo, C. and Watanabe, O., 1999. A Modification of AdaBoost: A Preliminary Report. Tech. Rep. C-133, Dept. of Math. and Computing Science, Tokyo Institute of Technology. (www.is.titech.ac.jp/research/research-report/C/index.html)
[9] Dougherty, J., Kohavi, R. and Sahami, M., 1995. Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the Twelfth International Conference on Machine Learning.
[10] Fayyad, U.M. and Irani, K.B., 1993. Multi-interval Discretization of Continuous-valued Attributes for Classification Learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1027.
[11] Feller, W., 1950. An Introduction to Probability Theory and Its Applications. John Wiley and Sons.
[12] Freund, Y., 1995. Boosting a Weak Learning Algorithm by Majority. Information and Computation, 121(2):256-285.
[13] Freund, Y. and Schapire, R.E., 1997. A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences, 55(1):119-139.
[14] Freund, Y. and Schapire, R.E., 1996. Experiments with a New Boosting Algorithm. Proceedings of the 13th International Conference on Machine Learning, pp. 148-156. Morgan Kaufmann.
[15] Holte, R.C., 1993. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Machine Learning, 11:63-91.
[16] John, G.H. and Langley, P., 1996. Static Versus Dynamic Sampling for Data Mining. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI/MIT Press.
[17] Kearns, M.J. and Vazirani, U., 1994. An Introduction to Computational Learning Theory. Cambridge University Press.
[18] Keogh, E., Blake, C. and Merz, C.J., 1998. UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[19] Kivinen, J. and Mannila, H., 1994. The Power of Sampling in Knowledge Discovery. Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 77-85.
[20] Lipton, R.J. and Naughton, J.F., 1995. Query Size Estimation by Adaptive Sampling. Journal of Computer and System Sciences, 51:18-25.
[21] Lipton, R.J., Naughton, J.F., Schneider, D.A. and Seshadri, S., 1995. Efficient Sampling Strategies for Relational Database Operations. Theoretical Computer Science, 116:195-226.
[22] Maron, O. and Moore, A.W., 1994. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. Advances in Neural Information Processing Systems, 6:59-66.
[23] Moore, A.W. and Lee, M.S., 1994. Efficient Algorithms for Minimizing Cross Validation Error. Proceedings of the 11th International Conference on Machine Learning, pp. 190-198.
[24] Motwani, R. and Raghavan, P., 1997. Randomized Algorithms. Cambridge University Press.
[25] Musick, R., Catlett, J. and Russell, S., 1993. Decision Theoretic Subsampling for Induction on Large Databases. Proceedings of the 10th International Conference on Machine Learning, pp. 212-219.
[26] Provost, F., Jensen, D. and Oates, T., 1999. Efficient Progressive Sampling. Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
[27] Provost, F.J. and Kolluri, V., 1998. A Survey of Methods for Scaling Up Inductive Learning Algorithms. Data Mining and Knowledge Discovery, to appear.
[28] Quinlan, J.R., 1996. Bagging, Boosting and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI Press and the MIT Press, pp. 725-730.
[29] Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.
[30] Salzberg, S.L., 1997. On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Mining and Knowledge Discovery, 1:317-327.
[31] Schapire, R.E., 1990. The Strength of Weak Learnability. Machine Learning, 5(2):197-227.
[32] Toivonen, H., 1996. Sampling Large Databases for Association Rules. Proceedings of the 22nd International Conference on Very Large Databases, pp. 134-145.
[33] Wald, A., 1947. Sequential Analysis. Wiley Mathematical Statistics Series.
[34] Wang, M., Iyer, B. and Vitter, J.S., 1998. MIND: A Scalable Mining for Classifiers in Relational Databases. Proceedings of the ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, DMKD.
[35] Watanabe, O., 1999. From Computational Learning Theory to Discovery Science. Proceedings of the 26th International Colloquium on Automata, Languages and Programming (invited talk), Lecture Notes in Computer Science 1644:134-148.
[36] Wetherill, G.B., 1975. Sequential Methods in Statistics (Second Edition). Chapman and Hall.
[37] Wrobel, S., 1997. An Algorithm for Multi-relational Discovery of Subgroups. Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, pp. 78-87.