Automatic Induction of MAXQ Hierarchies
Neville Mehta, Michael Wynkoop, Soumya Ray, Prasad Tadepalli, Tom Dietterich
School of EECS, Oregon State University
Funded by the DARPA Transfer Learning Program
Hierarchical Reinforcement Learning
Exploits domain structure to facilitate learning
Task hierarchy: a directed acyclic graph of subtasks
Leaves are the primitive MDP actions
Traditionally, the task structure is provided as prior knowledge to the learning agent
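For concreteness, here is a minimal sketch of how such a task hierarchy might be represented; the Taxi-style subtask names are illustrative assumptions, not from the slides.

```python
# A minimal sketch of a MAXQ-style task hierarchy as a DAG.
# Subtask names below are illustrative, not from the slides.

class Task:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []  # empty => primitive MDP action (leaf)

    @property
    def is_primitive(self):
        return not self.children

# Primitive MDP actions form the leaves.
north, south, east, west = (Task(a) for a in ["North", "South", "East", "West"])
pickup, putdown = Task("Pickup"), Task("Putdown")

# Composite subtasks share children, so the hierarchy is a DAG, not a tree.
navigate = Task("Navigate(t)", [north, south, east, west])
get = Task("Get", [navigate, pickup])
put = Task("Put", [navigate, putdown])
root = Task("Root", [get, put])
```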
Model Representation
Dynamic Bayesian Networks for the transition and reward models
Symbolic representation of the conditional probabilities/reward values as decision trees
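A rough sketch of what such a factored action model could look like, assuming binary state variables; the `Pickup` action and its variable names are hypothetical examples.

```python
# A sketch of a factored (DBN) action model with decision-tree CPTs.
# Each next-state variable and the reward get their own tree over the
# current state; leaf values are probabilities or expected rewards.

from dataclasses import dataclass

@dataclass
class Leaf:
    value: float                      # P(var'=1 | state) or expected reward

@dataclass
class Split:
    var: str                          # state variable tested at this node
    if_true: "Leaf | Split"
    if_false: "Leaf | Split"

def evaluate(tree, state):
    """Walk the decision tree for a given state dict."""
    while isinstance(tree, Split):
        tree = tree.if_true if state[tree.var] else tree.if_false
    return tree.value

# DBN for one action: variable -> tree predicting that variable at t+1.
pickup_model = {
    "has_passenger": Split("at_passenger", Leaf(1.0), Leaf(0.0)),
    "reward":        Split("at_passenger", Leaf(-1.0), Leaf(-10.0)),
}
print(evaluate(pickup_model["reward"], {"at_passenger": True}))  # -1.0
```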
Avoid the significant manual engineering of task decomposition
This engineering requires a deep understanding of the purpose and function of subroutines, as in computer science
Frameworks for learning exit-option hierarchies:
HEXQ: determine exit states through random exploration
VISA: determine exit states by analyzing DBN action models
Focused Creation of Subtasks
HEXQ & VISA: create a separate subtask for each possible exit state
This can generate a large number of subtasks
Claim: defining good subtasks requires maximizing state abstraction while identifying "useful" subgoals
Our approach: selectively define subtasks with single abstract exit states
Transfer Learning Scenario
Working hypothesis:
MAXQ value-function learning is much quicker than non-hierarchical (flat) Q-learning
Hierarchical structure is more amenable to transfer from source tasks to the target than value functions
Transfer scenario:
Solve a "source problem" (no CPU time limit)
Solve a "target problem" under the assumption that the same hierarchical structure applies
We will relax this constraint in future work
MaxNode State Abstraction
Y is irrelevant within this action: it affects the dynamics but not the reward function
In HEXQ, VISA, and our work, we assume there is only one terminal abstract state, hence no pseudo-reward is needed
As a side effect, this enables "funnel" abstractions in parent tasks
[DBN figure for one action: nodes X_t, Y_t, A_t with arcs to X_{t+1}, Y_{t+1}, and the reward R_{t+1}; Y has no arc into the reward node]
Our Approach: AI-MAXQ
Learn DBN action models via random exploration (other work)
Apply Q-learning to solve the source problem
Generate a good trajectory from the learned Q-function
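The trajectory-generation step might look like the following sketch, which rolls out the greedy policy of a learned tabular Q-function; the `env` interface (reset/step) and the `Q` dictionary keyed by (state, action) are assumed, not from the slides.

```python
# A sketch of step 3: roll out the greedy policy of a learned tabular
# Q-function to produce a (near-)optimal demonstration trajectory.

def greedy_trajectory(env, Q, actions, max_steps=1000):
    trajectory = []                      # list of (state, action) pairs
    state, done = env.reset(), False
    for _ in range(max_steps):
        if done:
            break
        action = max(actions, key=lambda a: Q.get((state, a), 0.0))
        trajectory.append((state, action))
        state, _, done = env.step(action)  # assumed (state, reward, done)
    return trajectory
```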
A variable v is relevant to an action if the DBN for that action tests or changes that variable (this includes both the variable nodes and the reward nodes)
Create an arc from A to B labeled with variable v iff v is relevant to A and B but not to any intermediate actions.
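This arc-creation rule translates almost directly into code. Below is a sketch under the assumption that the trajectory is a list of actions and `relevant(a)` returns the set of variables the DBN for action `a` tests or changes (including reward parents); the helper name is ours.

```python
# A sketch of CAT construction from a demonstration: connect action i to a
# later action j with an arc for variable v iff v is relevant to both and
# to no action in between.

def build_cat(actions, relevant):
    arcs = []                                 # (i, j, variable) triples
    for i, a in enumerate(actions):
        for v in relevant(a):
            for j in range(i + 1, len(actions)):
                if v in relevant(actions[j]):
                    arcs.append((i, j, v))    # first later use of v
                    break                     # no arc past an intermediate use
    return arcs
```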
CAT Scan
[CAT figure: trajectory Start → Goto MG → Goto Dep → Goto CW → Goto Dep → End, with arcs labeled by the variables relevant to each pair of actions]
An action is absorbed regressively as long as:
It does not have an effect beyond the trajectory segment, preventing exogenous effects
It does not increase the state abstraction
Note that because each subtask has a unique terminal state, Result Distribution Irrelevance applies
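One possible reading of the absorption test, as a rough sketch: using the CAT arcs from the sketch above, an action is absorbable if none of its arcs escape the segment and merging it does not enlarge the segment's relevant-variable set (our interpretation of "does not increase the state abstraction").

```python
# A rough sketch of the regressive-absorption test; this is our reading of
# the rule, not code from the paper.  `segment` is a set of trajectory
# indices, `arcs` the (i, j, v) triples from build_cat above.

def can_absorb(k, segment, arcs, relevant, actions):
    extended = segment | {k}
    # 1. No effect beyond the segment: every arc leaving k stays inside.
    no_escape = all(j in extended for (i, j, v) in arcs if i == k)
    # 2. Abstraction preserved: no new relevant variables are introduced.
    old_vars = set().union(*(relevant(actions[i]) for i in segment))
    new_vars = old_vars | relevant(actions[k])
    return no_escape and new_vars <= old_vars
```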
Claims
The resulting hierarchy is unique
Does not depend on the order in which goals and trajectory sequences are analyzed
All state abstractions are safe
There exists a hierarchical policy within the induced hierarchy that will reproduce the observed trajectory
Extends MaxQ Node Irrelevance to the induced structure
The learned hierarchical structure is "locally optimal"
No local change in the trajectory segmentation can improve the state abstractions (very weak)
Experimental Setup
Randomly generate pairs of source-target resource-gathering maps in Wargus
Learn the optimal policy in the source
Induce the task hierarchy from a single (near-)optimal trajectory
Transfer this hierarchical structure to the MAXQ value-function learner for the target
Compare to direct Q-learning, and to MAXQ learning on a manually engineered hierarchy, within the target
VISA only uses DBNs for causal information
Globally applicable across the state space, without focusing on the pertinent subspace
Problems:
Global variable coupling might prevent concise abstraction
Exit states can grow exponentially: one for each path in the decision-tree encoding
The modified bitflip domain exposes these shortcomings
Modified Bitflip Domain
State space: b0, …, bn-1
Action space: Flip(i), 0 ≤ i ≤ n-1
If b0 ∧ … ∧ bi-1 = 1 then bi ← ¬bi (vacuously true for i = 0)
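A minimal sketch of this dynamics rule in code; the slide does not say what happens when the precondition fails, so treating the action as a no-op is an assumption on our part.

```python
# A minimal sketch of the modified bitflip dynamics as described above.

def flip(bits, i):
    """Toggle bit i iff bits 0..i-1 are all 1 (vacuously true for i=0).
    If the precondition fails, do nothing (assumed, not in the slides)."""
    if all(bits[:i]):
        bits[i] = 1 - bits[i]
    return bits

bits = [0] * 5
for i in range(5):           # setting the bits in order reaches all-ones
    flip(bits, i)
print(bits)                  # [1, 1, 1, 1, 1]
```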
Causality analysis is the key to our approach
Enables us to find concise subtask definitions from a demonstration
The CAT scan is easy to perform
Need to extend to learn from multiple demonstrations
Disjunctive goals