Rule Generalization Strategies in Incremental Learning of Disjunctive Concepts
Stefano Ferilli, Andrea Pazienza, Floriana Esposito
[email protected]
Dipartimento di Informatica, Centro Interdipartimentale per la Logica e sue Applicazioni, Università di Bari
9th International Web Rule Symposium (RuleML), August 2-5, 2015, Berlin, Germany
● Symbolic knowledge representations
● Mandatory for applications that
  – Reproduce human inferential behavior
  – May be required to explain their decisions in human-understandable terms
● 2 kinds of concept definitions
  – Conjunctive: a single definition accounts for all instances of the concept
  – Disjunctive: several alternate conjunctive definitions (components)
    ● Each covers part of the full concept extension
● Psychological studies have established that capturing and dealing with the latter is much harder for humans
● Pervasive and fundamental in most real-world domains
Introduction
● Knowledge acquisition bottleneck
● (Symbolic) Machine Learning (ML) systems
  – Supervised setting: concept definitions inferred from descriptions of valid (positive) or invalid (negative) instances (examples)
● Batch: the whole set of examples is available (classical setting in ML)
  – Components learned by progressive coverage strategies
  – Definitions are immutable
    ● If additional examples are provided, learning must start from scratch on the extended set of examples
● Incremental: new examples may be provided after a tentative definition is already available
  – Components emerge as they are found
  – Definitions may be changed/refined if wrong
Motivations
● Incremental approach
  – If the available concept definition cannot properly account for a new example, it must be refined (revised, changed, modified) so that the new version properly accounts for both the old and the new examples
  – Progressive covering strategy not applicable
● The issue of disjunctive definitions becomes particularly relevant
  – When many components could be refined, there is no unique way to determine which one is most profitably refined
  – Refining different components yields different updated definitions, which become implicit constraints on how the definition may evolve when additional examples become available in the future
● The learned model depends
  – on the order in which the examples are provided
  – on the choice of the component to be refined at each step
Motivations
● Abstract diagnosis: identification of the part of the theory that misclassifies the example
  – Tricky in the case of disjunctive concepts
    ● Several alternate conjunctive definitions are available
  – If a positive example is not covered, no component of the current definition accounts for it (omission error)
    ● All are candidates for generalization
  – If all components were generalized, each component alone would account for all positive examples
    ● Contradiction: the concept would be conjunctive
    ● Over-generalization: the theory would be more prone to covering forthcoming negative examples
  – Problem not present in batch learning
Motivations
● Problem
  – A single component of a disjunctive concept is to be generalized
● Guided solutions may improve the effectiveness and efficiency of the overall outcome compared to a random strategy
● Objective
  – Propose and evaluate different strategies for determining the order in which the components should be considered for generalization
    ● If the generalization of a component fails, generalization of the next component in the ordering is attempted
Motivations
● Questions:
  – What sensible strategies can be defined?
  – What are their expected pros and cons?
  – What is their effect on the quality of the theory?
  – What are their consequences for the effectiveness and efficiency of the learning process?
● Starting point
  – InTheLEx (Incremental Theory Learner from Examples)
    ● Fully incremental
    ● Can learn disjunctive concept definitions
    ● Refinement strategy can be tuned to suitably adapt its behavior
InTheLEx
● Learns hierarchical theories from positive and negative examples
● Fully incremental
  – May start from an empty theory and from the first available example
    ● Necessary in most real-world application domains
● DatalogOI
  – Concept definitions and examples expressed as rules
    ● Example: example :- observation
    ● Rule: concept :- conjunctive definition
  – Disjunctive concepts: several rules for the same concept
InTheLEx: Representation
– Theory that defines a disjunctive concept (4 components)
  ● ball(A) :- weight_medium(A), air_filled(A), has_patches(A), …
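As a rough illustration only (not InTheLEx's actual DatalogOI machinery), a disjunctive definition can be modeled propositionally as a list of components, each a set of ground atoms, with coverage as a subset test; the second component below is invented for the example:

```python
# A disjunctive concept definition: one frozenset of atoms per component.
# The first component echoes the slide's ball/1 rule; the second is hypothetical.
ball_def = [
    frozenset({"weight_medium", "air_filled", "has_patches"}),
    frozenset({"weight_light", "spongy"}),
]

def covers(definition, observation):
    """True if at least one component's premise holds in the observation."""
    return any(component <= observation for component in definition)

print(covers(ball_def, {"weight_light", "spongy", "small"}))   # True
print(covers(ball_def, {"weight_heavy", "solid"}))             # False
```

Adding a rule for the same concept head thus widens coverage without touching the existing components, which is what makes the concept disjunctive.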
● Logic theory revision process
  – Given a new example
    ● No effect on the theory if
      – negative and not covered (not predicted by the theory to belong to the concept)
      – positive and covered (predicted by the theory to belong to the concept)
    ● In all other cases, the theory needs to be revised
      – Positive example not covered → generalization of the theory
      – Negative example covered → specialization of the theory
  – Refinements (generalizations or specializations) must preserve correctness with respect to the entire set of currently available examples
    ● If no candidate refinement fulfills this requirement, the problematic example is stored as an exception
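The revision dispatch can be sketched as follows, again in a propositional simplification (subset-test coverage instead of DatalogOI subsumption); the generalize/specialize callbacks are placeholders, not InTheLEx's operators:

```python
def covers(definition, observation):
    """A component covers an observation when its premise holds (subset test)."""
    return any(component <= observation for component in definition)

def revise(definition, observation, positive, generalize, specialize):
    """Revise the definition only when it misclassifies the new example."""
    if covers(definition, observation) == positive:
        return definition                        # correct prediction: no change
    if positive:                                 # omission error
        return generalize(definition, observation)
    return specialize(definition, observation)   # commission error
```

For instance, with naive callbacks that add the example as a new component or drop the covering components, correctly classified examples leave the definition untouched and only the two error cases trigger a refinement.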
InTheLEx: Generalization
– Procedure Generalize(E: positive example, T: theory, M: negative examples)
  ● L := list of the rules in the definition of E's concept
  ● while not generalized and L ≠ ∅ do
    – Select from L a rule C for generalization
    – L' := generalize(C, E)   (* list of generalizations *)
    – while not generalized and L' ≠ ∅ do
      ● Select the next best generalization C' from L'
      ● if (T \ {C}) ∪ {C'} is consistent wrt M then
        – Implement C' in T
    – Remove C from L
  ● if not generalized then
    – C' := E with constants turned into variables
    – if T ∪ {C'} is consistent wrt M then
      ● Implement C' in T
    – else
      ● Implement E in T as an exception
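The control flow above can be sketched in a propositional simplification (components and examples as atom sets, set intersection standing in for the generalization operator, which in InTheLEx may actually return several incomparable generalizations):

```python
def consistent(definition, negatives):
    """The definition must cover no negative observation."""
    return not any(comp <= neg for comp in definition for neg in negatives)

def generalize_theory(definition, example, negatives, order=None):
    """Try components in the given order; replace the first one whose
    intersection-based generalization keeps the theory consistent.
    Fall back to adding the example itself, then to storing an exception."""
    order = list(order) if order is not None else list(range(len(definition)))
    for i in order:
        candidate = definition[i] & example          # drop unmatched conjuncts
        trial = definition[:i] + [candidate] + definition[i + 1:]
        if candidate and consistent(trial, negatives):
            return trial, None                       # generalized component i
    trial = definition + [frozenset(example)]        # new component from E
    if consistent(trial, negatives):
        return trial, None
    return definition, frozenset(example)            # store E as an exception
```

The `order` parameter is exactly where the clause selection strategies discussed next plug in.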
InTheLEx: Generalization
– Comments
  ● Due to theoretical and implementation details, the generalization operator used might return several incomparable generalizations
  ● An implementation of the theoretical generalization operator would be computationally infeasible, even for relatively small rules
    – Similarity-based approximation
      ● Experiments have shown that it comes very close to, and often catches, least general generalizations
● First example of a new concept
  – First rule added for that concept
    ● Initial tentative conjunctive definition of the concept
● Conjunctive definition turns out to be insufficient
  – Second rule added for that concept
    ● The concept becomes disjunctive
● Subsequent additions of rules for that concept
  – Extend the ‘disjunctiveness’ of the concept
Clause Selection Strategy
● 5 strategies for determining the order in which the components of a disjunctive concept definition are to be considered for generalization
  – Each component in the ordering is considered only after generalization attempts have failed on all previous components
    ● Initial components have more chances to be generalized
  – No direct connection between the age and the length of a rule
    ● Older rules might have had more chances of refinement
    ● Whether this means that they are also shorter (i.e., more general) mainly depends on
      – The ranking strategy
      – The specific examples that are encountered, and their order
        ● Not controllable in a real-world setting
Clause Selection Strategy
● O: Older elements first
  – Same order as they were added to the theory
    ● The most straightforward; a sort of baseline
    ● Static: the position of each component in the processing order is fixed when the component is added
  – Generalizations are monotonic (they progressively remove constraints)
    ● Strategy expected to yield very refined (short, more human-readable and understandable) initial components and very raw (long) final ones
  – After several refinements, initial components are likely to have reached a nearly final and quite stable form
    ● All attempts to further refine them are likely to fail
    ● The computational time spent in these attempts, albeit presumably not huge, will be wasted
    ● Runtime expected to grow as the life of the theory proceeds
Clause Selection Strategy
● N: Newer elements first
  – Reverse of the order in which they were added to the theory
    ● Also quite straightforward
    ● Static, but not obvious to foresee what the shape and evolution of the components will be
  – Immediately after the addition of a new component, it undergoes a generalization attempt at the first non-covered positive example
    ● Good chances that such an attempt succeeds for any example, but the resulting generalization might leverage features not very related to the correct concept definition
    ● The ‘average’ level of generality in the definition is expected to be lower than in the previous option
    ● There should be no completely raw components in the definition
Clause Selection Strategy
● L: Longer elements first
  – Decreasing number of conjuncts
    ● Specifically considers the level of refinement
  – Low variance in degree of refinement among components expected
    ● Can be considered an evolution of N
      – Not just the most recently added component is favored for generalization
  – The more conjuncts in a component, the more specialized the component
    ● More room for generalization
    ● Avoids wasting time trying to generalize very refined rules that would hardly yield consistent generalizations
    ● On the other hand, generalizing a longer rule is expected to take more time than generalizing shorter ones
Clause Selection Strategy
● S: Shorter elements first
  – Increasing number of conjuncts
    ● Specifically considers the level of refinement
  – Opposite behavior to L
    ● May confirm the possible advantage of spending time on harder but more promising generalizations versus trying easier but less promising ones first
    ● Can be considered an evolution of O
      – Tries first to generalize the more refined components
  – Largest variance in degree of refinement (number of conjuncts in rule premises) among components expected
Clause Selection Strategy
● ~: More similar elements first
  – Decreasing similarity with the uncovered example
    ● The only content-based strategy
      – Same similarity measure as in InTheLEx’s generalization operator
    ● Disjunctive components are different actualizations of a concept
      – Small intersection expected between the sets of examples covered by different components
    ● Similarity assessment may allow identifying the appropriate component to be generalized for a given example
      – Mismatched component-example pairs yield odd generalizations
        ● Coverage and generalization problems
        ● Bad theory, inefficient refinements
      – One might expect that over-generalization is avoided
      – Generalization more easily computed, but overhead to compute the similarity
        ● Does the improvement compensate for the overhead?
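The five orderings can be sketched as sort keys over component indices. This is a simplification: components are plain atom sets here, and Jaccard similarity stands in for InTheLEx's structural similarity measure, which is richer:

```python
def ordering(strategy, components, ages, example=None):
    """Return component indices in the order in which they should be
    tried for generalization, under one of the five strategies."""
    idx = list(range(len(components)))
    if strategy == "O":                       # older elements first
        return sorted(idx, key=lambda i: ages[i])
    if strategy == "N":                       # newer elements first
        return sorted(idx, key=lambda i: -ages[i])
    if strategy == "L":                       # longer elements first
        return sorted(idx, key=lambda i: -len(components[i]))
    if strategy == "S":                       # shorter elements first
        return sorted(idx, key=lambda i: len(components[i]))
    if strategy == "~":                       # more similar elements first
        def jaccard(i):                       # stand-in similarity measure
            inter = len(components[i] & example)
            union = len(components[i] | example)
            return inter / union if union else 0.0
        return sorted(idx, key=jaccard, reverse=True)
    raise ValueError(f"unknown strategy: {strategy}")
```

Since Python's `sorted` is stable, ties (e.g., equal lengths) fall back to the insertion order, i.e., to strategy O.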
Evaluation
● Real-world dataset: scientific papers
  – 353 layout descriptions of first pages
  – 4 classes: Elsevier journals, SVLN, JMLR, MLJ
    ● Classification: learn definitions for these classes
    ● Understanding: identify the significant components in the papers
● Complex dataset
  – Some layout styles are quite similar
  – 67920 atoms in the observations
    ● On average >192 atoms per observation, some >400 atoms
  – Much indeterminacy
    ● Membership relation of layout components to pages
Evaluation: Parameters
● Qualitative
  – # comp: number of components in the disjunctive concept definition
    ● Fewer components yield a more compact theory, which does not necessarily provide greater accuracy
  – avg length of components
    ● More conjuncts: more specific (and less refined) concept
  – # exc: number of negative exceptions
    ● More exceptions, worse theory
● Quantitative
  – acc (accuracy)
    ● Prediction capability of the theory on test examples
  – time needed to carry out the learning task
    ● Efficiency (computational cost) of the different ranking strategies
Evaluation
● Incremental approach justified
  – New instances of documents continuously become available over time
● Comparison of all strategies
  – Random not tried
    ● O and N are already somewhat random: they just process definitions in the order in which they were generated, without any insight
● 10-fold cross-validation procedure
  – Classification task used to check in detail the behavior of the different strategies
  – Understanding task used to assess the statistical significance of the differences in performance between strategies
Evaluation: Classification
● Useful results only for MLJ and SVLN
  – Elsevier and JMLR: always single-rule definitions
    ● Ranking approach not applicable
  – Examples of Elsevier and JMLR still played a role as negative examples for the other classes
● Some expectations confirmed
  – Runtime: N always best (often the first generalization attempt succeeds); ~ worst (it needs to compute the similarity of each component to the example)
  – Number of exceptions: ~ always best (improved selection of components for generalization); S also good
  – Aggregated indicator: N always wins
    ● Sum of ranking positions for the different parameters (the smaller, the better)
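The aggregated indicator is a plain rank sum. A minimal sketch, with hypothetical scores (not the paper's actual results) and naive tie handling:

```python
def rank_sum(scores_by_param, higher_is_better):
    """Sum of per-parameter ranking positions (1 = best) for each strategy.
    Ties get distinct consecutive positions (no fractional ranks)."""
    strategies = list(next(iter(scores_by_param.values())).keys())
    totals = {s: 0 for s in strategies}
    for param, scores in scores_by_param.items():
        order = sorted(strategies, key=lambda s: scores[s],
                       reverse=higher_is_better[param])
        for pos, s in enumerate(order, start=1):
            totals[s] += pos
    return totals

# Hypothetical scores for three strategies on two parameters:
scores = {
    "acc":  {"O": 0.82, "N": 0.90, "~": 0.88},   # higher is better
    "time": {"O": 12.0, "N": 5.0,  "~": 20.0},   # lower is better
}
totals = rank_sum(scores, {"acc": True, "time": False})
print(totals)  # smaller total = better overall
```

With these made-up numbers, N ranks first on both parameters and therefore wins the aggregated indicator.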
Evaluation: Classification
– Rest of the behavior somewhat mixed
  ● SVLN: much less variance than in MLJ
    – Definition quite clear and examples quite significant?
  ● MLJ: impact of the ranking strategies much clearer
    – Substantial agreement between quality-related indicators: for each approach, either all tend to be good (as in ~) or all tend to be bad (as in S and O)
● Interesting indications from single folds
  – Fold 3: peak in runtime and number of exceptions for S and O
    ● Runtime is a consequence of the unsuccessful search for specializations, which in turn may have some connection with the quality of the theory
    ● S is a kind of evolution of O
  – Fold 8: accuracy increases from 82% to 94% for ~ and L
    ● The content-based approach improves the quality of the theory in difficult situations
Evaluation: Understanding
[Table of pairwise statistical comparisons omitted: the strategy on the row is marked equal (=), better (+), or worse (–) than the one on the column; the best strategy is shown in bold]
Evaluation: Understanding
● Results quite different, albeit obtained on the same data
  – When the difference is significant, L and ~ generally behave analogously for accuracy (which is greater), number of components, and number of negative exceptions; as expected, ~ has a longer runtime
  – O and S also generally behave analogously, but do not reach results as good as those of L and ~
  – N not outstanding for performance, but better than O (the baseline) on all parameters except runtime
  – On the classification task, the average length of components using L is significantly larger than with all the others
    ● L returns more balanced components, with greater accuracy and fewer negative exceptions
Conclusions & Future Work
● Disjunctive concept definitions are tricky
  – Each component covers a subset of the positive examples, while ensuring consistency with all negative examples
● In incremental learning, when a new positive example is not recognized by the current theory, one component must be generalized
  – Omission error (no specific component is responsible)
    ● The system must decide the order in which the components are to be considered when attempting a generalization
  – 5 strategies proposed
● The outcomes confirm some of the expectations for the various strategies, but we need
  – More extensive experimentation for confirmation and additional detail
  – Identification of further strategies and refinement of the proposed ones