Hierarchical Reinforcement Learning: Survey and Comparison of HRL Techniques
Mausam
Dec 30, 2015
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Personal Printerbot
States (S) : {loc, has-robot-printout, user-loc, has-user-printout}, map
Actions (A) : {moven, moves, movee, movew, extend-arm, grab-page, release-pages}
Reward (R) : +20 if h-u-po, else -1
Goal (G) : All states with h-u-po true.
Start state : A state with h-u-po false.
Episodic Markov Decision Process
⟨S, A, P, R, G, s0⟩
S : Set of environment states.
A : Set of available actions.
P : Probability transition model, P(s′|s,a).*
R : Reward model, R(s).*
G : Absorbing goal states.
s0 : Start state.
γ : Discount factor.**
* Markovian assumption.   ** Bounds R for an infinite horizon.
Episodic MDP ≡ MDP with absorbing goals
Goal of an Episodic MDP
Find a policy π : S → A which maximises expected discounted reward
for a fully observable* Episodic MDP,
if the agent is allowed to execute for an indefinite horizon.
* Noise-free, complete-information percepts.
Solution of an Episodic MDP
Define V*(s) : Optimal reward starting in state s.
Value Iteration : Start with an estimate of V*(s) and successively re-estimate it to converge to a fixed point.
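As a concrete illustration, here is a minimal tabular value-iteration sketch; the dictionary-based P, R, and goal representations are assumptions for illustration, not the talk's code.

```python
# Minimal tabular value iteration (illustrative names and data layout).
# Assumes P[s][a] is a list of (next_state, prob) pairs, R[s] is the reward
# for state s, goals is the set of absorbing goal states, and gamma < 1.

def value_iteration(states, actions, P, R, goals, gamma=0.95, eps=1e-6):
    V = {s: 0.0 for s in states}              # initial estimate of V*
    while True:
        delta = 0.0
        for s in states:
            if s in goals:                     # absorbing goals keep value R(s)
                V[s] = R[s]
                continue
            # Bellman backup: best one-step lookahead over actions
            best = max(sum(p * V[s2] for s2, p in P[s][a]) for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < eps:                        # converged to (near) fixed point
            return V
```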
Complexity of Value Iteration
Each iteration – polynomial in |S|.
Number of iterations – polynomial in |S|.
Overall – polynomial in |S|.
But |S| is exponential in the number of features in the domain*.
* Bellman’s curse of dimensionality
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Learning
(Figure: the agent gathers data from its environment.)
• Gain knowledge
• Gain understanding
• Gain skills
• Modification of behavioural tendency
Decision Making while Learning*
(Figure: the agent receives percepts/data from the environment and sends back actions.)
What action next?
* Known as Reinforcement Learning.
Reinforcement Learning
Unknown transition model P and reward R.
Learning component : Estimate P and R from data observed in the environment.
Planning Component : Decide which actions to take that will maximise reward.
Exploration vs. Exploitation
GLIE (Greedy in the Limit with Infinite Exploration)
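One common GLIE scheme is ε-greedy with a decaying exploration rate; the sketch below is illustrative (the Q-table layout and the 1/t schedule are assumptions).

```python
import random

# Epsilon-greedy with epsilon decaying to zero (e.g. eps_t = 1/t): every action
# is tried infinitely often, yet the policy becomes greedy in the limit.
# Q is a dict mapping (state, action) -> value.

def glie_action(Q, s, actions, t):
    eps = 1.0 / max(t, 1)                                   # decaying exploration rate
    if random.random() < eps:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit
```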
Learning
Model-based learning : Learn the model, then plan. Requires less data, more computation.
Model-free learning : Plan without learning an explicit model. Requires a lot of data, less computation.
Q-Learning
Instead of learning P and R, learn Q* directly.
Q*(s,a) : Optimal reward starting in s, if the first action is a and the optimal policy is followed thereafter.
Q* directly defines the optimal policy:
In each state, the optimal action is the one with the maximum Q* value.
Q-Learning
Given an experience tuple ⟨s, a, s′, r⟩, update:
Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_a′ Q(s′, a′) ]
Under suitable assumptions and GLIE exploration, Q-Learning converges to the optimal policy.
(The update blends the old estimate of the Q value with a new sample estimate, r + γ max_a′ Q(s′, a′).)
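A tabular sketch of this update; the dictionary-based Q table and names are assumptions for illustration.

```python
# Tabular Q-learning update for one experience tuple (s, a, s2, r).
# Q is a dict mapping (state, action) -> value; alpha is the learning rate.

def q_update(Q, s, a, s2, r, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)   # new sample estimate
    old = Q.get((s, a), 0.0)                                   # old estimate
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)
```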
Semi-MDP: When actions take time.
The Semi-MDP Bellman equation: Q*(s,a) = E[ r + γ^N max_a′ Q*(s′, a′) ].
Semi-MDP Q-Learning update: Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ^N max_a′ Q(s′, a′) ],
where the experience tuple is ⟨s, a, s′, r, N⟩, N is the number of time steps a took, and
r is the accumulated discounted reward while action a was executing.
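The only change from the flat update is discounting by γ^N; a minimal sketch under the same assumed Q-table layout.

```python
# SMDP Q-learning update for one experience tuple (s, a, s2, r, N), where r is
# the discounted reward accumulated while a executed and N is its duration.

def smdp_q_update(Q, s, a, s2, r, N, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma ** N * best_next)
```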
Printerbot
The Paul G. Allen Center has 85,000 sq ft of space.
Each floor ≈ 85,000/7 ≈ 12,000 sq ft.
Discretise location on a floor into 12,000 parts.
State space (without the map): 2 × 2 × 12,000 × 12,000 ≈ 5.8 × 10^8 states --- very large!
How do humans do the decision making?
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
1. The Mathematical Perspective
A Structure Paradigm
S : Relational MDP
A : Concurrent MDP
P : Dynamic Bayes Nets
R : Continuous-state MDP
G : Conjunction of state variables
V : Algebraic Decision Diagrams
π : Decision List (RMDP)
2. Modular Decision Making
Humans plan modularly at different granularities of understanding.
Going out of one room is similar to going out of another room.
Navigation steps do not depend on whether we have the print out or not.
3. Background Knowledge
Classical Planners using additional control knowledge can scale up to larger problems.
(E.g., HTN planning, TLPlan.)
What forms of control knowledge can we provide to our Printerbot?
First pick up the printouts, then deliver them.
Navigation – consider rooms, hallways, etc. separately.
A mechanism that exploits all three avenues : Hierarchies
1. Way to add a special (hierarchical) structure on different parameters of an MDP.
2. Draws from the intuition and reasoning in human decision making.
3. Way to provide additional control knowledge to the system.
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Hierarchy
Hierarchy of: Behaviour, Skill, Module, SubTask, Macro-action, etc.
Examples: picking the pages, collision avoidance, fetch-pages phase, walk in hallway.
HRL ≡ RL with temporally extended actions
Hierarchical Algorithms ≡ Gating Mechanism*
(Figure: a gate g selects among behaviours b_i.)
Hierarchical Learning:
• Learning the gating function
• Learning the individual behaviours
• Learning both
* Can be a multi-level hierarchy.
Option : Movee until end of hallway
Start : Any state in the hallway.
Execute : policy as shown.
Terminate : when s is end of hallway.
Options [Sutton, Precup, Singh’99]
An option is a well-defined behaviour: o = ⟨ I_o, π_o, β_o ⟩
I_o : Set of states (I_o ⊆ S) in which o can be initiated.
π_o : Policy (S → A*) followed while o is executing.
β_o(s) : Probability that o terminates in s.
* Can be a policy over lower-level options.
Learning
An option is a temporally extended action with a well-defined policy.
The set of options (O) replaces the set of actions (A).
Learning occurs outside options.
Learning over options ≡ Semi-MDP Q-Learning.
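A minimal sketch of the option triple and of executing an option to termination, producing exactly the ⟨s, a, s′, r, N⟩ pieces the Semi-MDP update needs; the env.step interface and all names are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

# Sketch of the option triple <I_o, pi_o, beta_o> (illustrative only).

@dataclass
class Option:
    init_set: Set          # I_o: states where the option may be initiated
    policy: Callable       # pi_o(s) -> primitive action
    beta: Callable         # beta_o(s) -> probability of terminating in s

def run_option(env, s, option, gamma=0.95):
    """Execute an option until it terminates.

    Returns (s2, r, N): resulting state, discounted reward accumulated while
    the option ran, and its duration, ready for the SMDP Q-learning update.
    """
    assert s in option.init_set
    total, discount, steps = 0.0, 1.0, 0
    while True:
        a = option.policy(s)
        s, r = env.step(s, a)        # assumed environment interface
        total += discount * r
        discount *= gamma
        steps += 1
        if random.random() < option.beta(s):
            return s, total, steps
```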
Machine: Movee + Collision Avoidance
(Figure: a finite-state machine that executes Movee and, at a Choose node, either Returns at the end of the hallway or, on encountering an obstacle, calls sub-machine M1 (Movew, Moven, Moven, Return) or M2 (Movew, Moves, Moves, Return) to avoid it.)
Hierarchies of Abstract Machines [Parr, Russell '97]
A machine is a partial policy represented by a Finite State Automaton.
Node types:
• Execute a ground action.
• Call a machine as a subroutine.
• Choose the next node.
• Return to the calling machine.
Learning
Learning occurs within machines, as machines are only partially defined.
Flatten all machines out and consider joint states [s,m], where s is a world state and m a machine node ≡ MDP.
reduce(S∘M) : Consider only states where the machine node is a choice node ≡ Semi-MDP.
Learning ≈ Semi-MDP Q-Learning.
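A sketch of the choice-point update over joint states [s,m]; the surrounding machine-execution machinery is omitted, and the dict layout and names are illustrative, not the original HAM code.

```python
# HAM-style learning: Q-values live over joint states [s, m] (world state,
# machine node), and updates happen only between successive *choice* nodes,
# i.e. a Semi-MDP over choice points.

def choice_point_update(Q, prev_sm, prev_choice, r_acc, N, sm, choices,
                        alpha=0.1, gamma=0.95):
    """Apply the SMDP update on reaching the next choice node.

    prev_sm, prev_choice : joint state and choice taken at the previous choice node
    r_acc, N             : discounted reward accumulated and steps elapsed since then
    sm, choices          : current joint state (a choice node) and its available choices
    """
    best = max(Q.get((sm, c), 0.0) for c in choices)
    old = Q.get((prev_sm, prev_choice), 0.0)
    Q[(prev_sm, prev_choice)] = (1 - alpha) * old + alpha * (r_acc + gamma ** N * best)
```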
Task Hierarchy: MAXQ Decomposition [Dietterich '00]
(Figure: task hierarchy for the Printerbot.
Root → Fetch, Deliver;
Fetch → Navigate(loc), Take;  Deliver → Navigate(loc), Give;
Take → Extend-arm, Grab;  Give → Extend-arm, Release;
Navigate(loc) → Movee, Movew, Moves, Moven.)
Children of a task are unordered.
MAXQ Decomposition
Augment the state s by adding the subtask i : [s,i].
Define C([s,i],j) as the reward received in i after j finishes.
Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))*
Express V in terms of C.
Learn C instead of learning Q.
* Observe the context-free nature of the Q-value.
(The V term is the reward received while navigating; the C term is the reward received after navigation.)
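A sketch of the recursive MAXQ evaluation, where only C (plus the one-step values of primitive actions) is learned; the dictionary layout and helper names are assumptions, not Dietterich's code.

```python
# MAXQ value decomposition: the value of doing subtask j inside parent task i
# splits into V (reward earned while j runs) plus the completion function C
# (reward earned in i after j finishes). V of a composite task is computed
# recursively from its children's Q-values.

def q_value(s, i, j, C, V_primitive, children, is_primitive):
    return v_value(s, j, C, V_primitive, children, is_primitive) + C.get((s, i, j), 0.0)

def v_value(s, j, C, V_primitive, children, is_primitive):
    if is_primitive(j):
        return V_primitive.get((s, j), 0.0)      # learned one-step reward
    # composite task: best child under the greedy hierarchical policy
    return max(q_value(s, j, k, C, V_primitive, children, is_primitive)
               for k in children[j])
```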
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
1. State Abstraction
Abstract state : A state with fewer state variables; different world states map to the same abstract state.
If we can drop some state variables, we can reduce the learning time considerably!
We may use different abstract states for different macro-actions.
State Abstraction in MAXQ
Relevance : Only some variables are relevant for the task.
  Fetch : user-loc is irrelevant.
  Navigate(printer-room) : h-r-po, h-u-po, user-loc are irrelevant.
  Fewer parameters for V at the lower levels.
Funnelling : A subtask maps many states to a smaller set of states.
  Fetch : All states map to h-r-po = true, loc = printer-room.
  Fewer parameters for C at the higher levels.
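A toy illustration of the relevance abstraction for the Printerbot subtasks; the RELEVANT table and helper are hypothetical, using the state variables named earlier.

```python
# Relevance abstraction: each subtask stores values over only the state
# variables relevant to it, so many world states share one table entry.

RELEVANT = {
    "Fetch":    ("loc", "has-robot-printout"),   # user-loc is irrelevant
    "Navigate": ("loc",),                        # printout flags are irrelevant
}

def abstract(state, subtask):
    """Project a full state dict onto the variables relevant to `subtask`."""
    return tuple(state[v] for v in RELEVANT[subtask])

# e.g. abstract({"loc": (3, 7), "has-robot-printout": False,
#                "has-user-printout": False, "user-loc": (1, 2)}, "Fetch")
# -> ((3, 7), False)   -- the same entry serves every user location
```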
State Abstraction in Options, HAM
Options : Learning required only in states that are terminal states for some option.
HAM : The original work has no abstraction.
Extension: three-way value decomposition*: Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])
Similar abstractions are employed.
* [Andre, Russell '02]
2. Optimality
Options :
  Hierarchical.
  Use (A ∪ O) : Global.**
  Interrupt options.
HAM : Hierarchical.*
MAXQ : Recursive.*
  Interrupt subtasks.
  Use pseudo-rewards.
  Iterate!
* Can define equations for both optimalities.   ** Advantage of using macro-actions may be lost.
3. Language Expressiveness
Options :
  Can only input a complete policy.
HAM :
  Can input a complete policy.
  Can input a task hierarchy.
  Can represent "amount of effort".
  Later extended to partial programs.
MAXQ :
  Cannot input a policy (full or partial).
4. Knowledge Requirements
Options :
  Requires complete specification of policies.
  One could learn option policies, given subtasks.
HAM : Medium requirements.
MAXQ : Minimal requirements.
5. Models Advanced
Options : Concurrency.
HAM : Richer representations, concurrency.
MAXQ : Continuous time, state, and actions; multi-agent settings; average reward.
In general, more researchers have followed MAXQ:
  Less input knowledge.
  Value decomposition.
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Directions for Future Research
Bidirectional state abstractions.
Hierarchies over other RL research.
Model-based methods.
Function approximators.
Probabilistic planning.
Hierarchical P and hierarchical R.
Imitation learning.
Directions for Future Research
Theory:
  Bounds (goodness of a hierarchy).
  Non-asymptotic analysis.
Automated discovery:
  Discovery of hierarchies.
  Discovery of state abstractions.
Apply…
Applications
Toy Robot
Flight Simulator
AGV Scheduling
Keepaway soccer
(Figure: AGV scheduling domain with a warehouse, assembly stations, part pickup points P1–P4, and drop-off points D1–D4.)
Images courtesy various sources
Thinking Big…
"... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*.” -- David Andre
Use planners, theorem provers, etc. as components in a big hierarchical solver.
The Outline of the Talk
MDPs and Bellman’s curse of dimensionality.
RL: Simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
How to choose appropriate hierarchy
Look at the available domain knowledge:
  If some behaviours are completely specified – Options.
  If some behaviours are partially specified – HAM.
  If little domain knowledge is available – MAXQ.
We can use all three to specify different behaviours in tandem.