A Retrospective Look at Classifier System Research

© 2006 The MITRE Corporation. All rights reserved.


Lashon B. Booker
The MITRE Corporation


Early Motivations for Learning Classifier System (LCS) Research

Design symbolic problem solvers that avoid brittleness in realistic (uncertain and continually varying) domains involving
– On-line, adaptive control of behaviors
  Representations and procedures must adjust without unnecessarily disrupting existing capabilities
– Discovering relevant categories in a complex and unlabeled stream of input
  Inputs must be incrementally grouped together into plausible classes

This is especially difficult when behavior requires more knowledge representation and processing capability than is available with simple empirical associations between inputs and outputs


Requirements for Non-Brittle Rule-Based Behavior

Need to identify and take advantage of the exploitable regularities in the environment

Generalizations must be selective, pragmatic and subject to exceptions

Learning must be incremental and closely coupled with performance and with unfolding reality

Rules must be treated as tentative hypotheses (not logical assertions) subject to testing and confirmation
– Hypothesis “strength” is derived from experience-based predictions of performance
– Strength is used to determine rule fitness and infer plausibility
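One way to read "strength derived from experience-based predictions" is as a running estimate of payoff that is nudged toward each observed outcome. The sketch below is only an illustration under that assumption: the `Rule` class, the learning rate `beta`, and the payoff values are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A tentative hypothesis: a condition, an action, and a strength estimate."""
    condition: str
    action: str
    strength: float = 0.0


def update_strength(rule: Rule, payoff: float, beta: float = 0.2) -> float:
    """Move the rule's strength a fraction beta toward the observed payoff.

    Assumption: a simple error-correcting update; real classifier systems
    use a variety of credit assignment schemes.
    """
    rule.strength += beta * (payoff - rule.strength)
    return rule.strength


# Hypothetical experience: the rule pays off three times out of four.
rule = Rule(condition="1#0", action="LEFT")
for payoff in [100.0, 100.0, 0.0, 100.0]:
    update_strength(rule, payoff)
```

Because the strength keeps moving with experience, the rule is never treated as a settled assertion; a run of poor payoffs lowers its standing again.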


Observations about early research

The Holland and Reitman collaboration placed a strong emphasis on cognition and characterized the problems of interest

Viewed classifier systems as symbolic problem solvers that avoid brittle behavior (an alternative to expert systems)
– Treat rule set as a model and rules as parts in a context
– Evaluation of parts is context dependent (i.e., aspects are non-stationary)

Learning emphasized policy search and value estimation
– Rules are policy elements along with performance estimators
– Adjust policy via natural selection among rule types
– The Pitt approach preserved this idea, using the GA for direct policy search

Included provisions for motivation, affect and introspection

These ideas provided the foundation for a comprehensive theory of induction (rule clusters, distributed representations, associations, spreading activation, etc.)


Influence of reinforcement learning

Reinforcement learning problems are faced by agents that must learn action sequences from trial-and-error
– Framework provides attractive formalisms based on estimating value functions (with key contributions from Sutton and Barto)
– Algorithms provide useful benchmarks for comparisons

Emphasis on value functions has had a strong influence on LCS research
– The primary niche is learning compact value function representations for off-policy temporal difference methods
– But, the RL community has good alternatives

It is not clear if we are learning the best generalizations, or giving sufficient emphasis to policy improvement

Solution strategies:

• Search the space of possible behaviors

• Estimate utility of taking actions in world states

[Figure: agent-environment loop. Labels: Learning Agent, Environment, State, Action, input, scalar feedback.]


Value-based generalizations aren’t often intuitive

Grefenstette’s 9x32 abstract state space

There are many obvious intuitive solution strategies:
– E.g., move left or right to the column with the highest reward, then go straight

Classifier systems tend to learn piecemeal strategies rather than coherent ones
– Many narrowly-focused general rules are needed to get the overall solution
– Generalizations correspond to symmetries in the reward distribution, e.g., (Row = 111) (Column = #011#) → RIGHT, not the key attribute-based concepts
– This distinction has been irrelevant in most classifier system test problems (e.g., multiplexor and Woods problems)
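To make the generalization notation concrete, the sketch below shows how a ternary condition such as `#011#` matches binary inputs, with `#` read as "don't care". The helper names `matches` and `covered_inputs` are hypothetical, and the 5-bit column encoding is taken from the example rule only.

```python
def matches(condition: str, bits: str) -> bool:
    """Return True if every non-'#' position of the condition equals the input bit."""
    if len(condition) != len(bits):
        raise ValueError("condition and input must have the same length")
    return all(c == "#" or c == b for c, b in zip(condition, bits))


def covered_inputs(condition: str) -> int:
    """Count the distinct bit strings that the condition matches."""
    return 2 ** condition.count("#")


column_condition = "#011#"
print(matches(column_condition, "00110"))   # True: the middle bits are 011
print(matches(column_condition, "01010"))   # False: the middle bits are 101
print(covered_inputs(column_condition))     # 4
```

A condition like this groups columns that happen to share bit positions, which is why such rules track regularities in the encoding and the reward distribution, not a concept such as "the column with the highest reward".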

[Figure: reward labels across the columns of the state space, entered at "Start". The column rewards repeat the sequence 0, 0, 50, 75, 125, 250, 500, 1000, 1000, 500, 250, 125, 75, 50, 0, 0.]


Off-policy Methods Learn Different Behaviors

Since Q-learning is an off-policy method (i.e., behavior policy may differ from estimation policy), it does not suffer negative consequences for exploration

Sarsa (i.e., the bucket brigade) is an on-policy method, so its solution accounts for the consequences of exploration

In real problems where on-line errors are costly, this distinction is important

This also has architectural implications (e.g., how to approximate the value function)

Bottom line: we need to identify and build on the strengths of the LCS approach. The key may be in specifying a set of organizing principles that go beyond implementation diagrams
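The off-policy versus on-policy distinction on this slide comes down to which next action the update bootstraps from. The following is a minimal tabular sketch using the standard textbook forms of the two targets; the tiny Q-table, the action names, and the parameter values are hypothetical.

```python
from typing import Dict, List, Tuple

QTable = Dict[Tuple[int, str], float]


def q_learning_target(q: QTable, reward: float, next_state: int,
                      actions: List[str], gamma: float) -> float:
    """Off-policy target: bootstrap from the best next action,
    whatever the behavior policy actually does."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def sarsa_target(q: QTable, reward: float, next_state: int,
                 next_action: str, gamma: float) -> float:
    """On-policy target: bootstrap from the action the agent really takes next."""
    return reward + gamma * q[(next_state, next_action)]


def td_update(q: QTable, state: int, action: str,
              target: float, alpha: float) -> None:
    """Move Q(state, action) a fraction alpha toward the target."""
    q[(state, action)] += alpha * (target - q[(state, action)])


# Hypothetical values: in state 1 the exploratory action is very costly.
actions = ["safe", "risky"]
q: QTable = {(0, "safe"): 0.0, (0, "risky"): 0.0,
             (1, "safe"): 10.0, (1, "risky"): -100.0}

off_policy = q_learning_target(q, reward=0.0, next_state=1,
                               actions=actions, gamma=0.9)
on_policy = sarsa_target(q, reward=0.0, next_state=1,
                         next_action="risky", gamma=0.9)
td_update(q, state=0, action="safe", target=off_policy, alpha=0.5)
```

With these numbers the Q-learning target ignores the exploratory "risky" step and stays positive, while the Sarsa target is strongly negative, so an on-policy learner would come to avoid states where exploration is expensive.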


Soar Architecture of Intelligent Rule-based Behavior

[Figure: Soar architecture diagram. Labels: Reaction, Deliberation, Reflection; Faster to Slower; Low Intelligence to High Intelligence; I/O; Learning.]

Derived by Newell and his students (~1980), also as a response to the expert system phenomenon

Based on a theory of problem solving (i.e., problem spaces), along with a companion view of learning (i.e., chunking)

The theory was operationalized as an architecture that has served that community well


What kind of architecture makes sense for classifier systems?

The key role of policy improvement suggests that an actor-critic structure may be a good start

The idea is to intermix value iteration and policy improvement continually (state by state, action by action, sample by sample)

Is there an organizing principle that extends this concept to cover many forms of induction at different scales? (including perception, reasoning, and action)

[Figure: actor-critic loop. The actor holds the policy π and the critic holds the value functions V, Q. Policy evaluation (value learning) and policy improvement (greedification) alternate, leading toward π* and V*, Q*.]
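As an illustration of intermixing evaluation and improvement sample by sample, here is a minimal tabular actor-critic sketch in the common textbook form. It is not taken from the slides: the `ActorCritic` class, the softmax policy, the step sizes, and the two-action example are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict


class ActorCritic:
    """Tabular actor-critic: the critic learns V(s), the actor learns preferences."""

    def __init__(self, actions, alpha_actor=0.1, alpha_critic=0.1, gamma=0.9):
        self.actions = list(actions)
        self.alpha_actor = alpha_actor
        self.alpha_critic = alpha_critic
        self.gamma = gamma
        self.value = defaultdict(float)        # critic: V(s)
        self.preference = defaultdict(float)   # actor: preference for (s, a)

    def policy(self, state):
        """Softmax over the action preferences in this state."""
        weights = [math.exp(self.preference[(state, a)]) for a in self.actions]
        total = sum(weights)
        return [w / total for w in weights]

    def act(self, state, rng=random):
        """Sample an action from the current policy."""
        return rng.choices(self.actions, weights=self.policy(state))[0]

    def learn(self, state, action, reward, next_state, done):
        """One sample: the same TD error drives both evaluation and improvement."""
        bootstrap = 0.0 if done else self.gamma * self.value[next_state]
        td_error = reward + bootstrap - self.value[state]
        self.value[state] += self.alpha_critic * td_error                # policy evaluation
        self.preference[(state, action)] += self.alpha_actor * td_error  # policy improvement
        return td_error


# Hypothetical one-step task: action "a" pays 1.0, action "b" pays 0.0.
agent = ActorCritic(actions=["a", "b"])
agent.learn(state=0, action="a", reward=1.0, next_state=None, done=True)
agent.learn(state=0, action="b", reward=0.0, next_state=None, done=True)
print(agent.policy(0))  # "a" now has the higher probability
```

Each call to `learn` updates the critic and the actor together, which is the sample-by-sample interleaving described above; neither step waits for the other to converge.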


DARPA/IPTO Focus on Cognitive Systems

DARPA views a cognitive system as one that
– can reason, using substantial amounts of appropriately represented knowledge
– can learn from its experience so that it performs better tomorrow than it did today
– can explain itself and be told what to do
– can be aware of its own capabilities and reflect on its own behavior
– can respond robustly to surprise

Learning is ubiquitous. Different forms operate at different times and places

What niche is the LCS community best suited to fill?


Some Open Problems for Reinforcement Learning (Sutton) - and Classifier Systems

Incomplete state information

Exploration

Structured states and actions

Incorporating prior knowledge

Using teachers

Theory of RL with function approximators

Modular and hierarchical architectures

Integration with other problem-solving and planning methods
