Structured Exploration for Reinforcement Learning

Nicholas K. Jong
Department of Computer Sciences, The University of Texas at Austin

December 1, 2010 / PhD Final Defense

Outline

1 Introduction
2 Exploration and Approximation
3 Exploration and Hierarchy
4 Conclusion

Outline

1 Introduction
  The Reinforcement Learning Problem
  Reinforcement Learning Methods
  Thesis Focus
2 Exploration and Approximation
3 Exploration and Hierarchy
4 Conclusion

One Solution to Many Problems

Reality: Many tasks mean many engineering problems
Dream: Opportunity for a single general learning algorithm

Potential Payoffs
  Reduce engineering costs
  Solve problems beyond our current abilities
  Achieve solutions robust to uncertainty

One Formalism for Many Problems

Agent
  Observes state s ∈ S
  Chooses action a ∈ A
  For arbitrary S and A

[Diagram: the agent sends action a to the environment, which returns reward r and next state s]

Environment
  Generates reward r ∈ ℝ with expected value R(s, a)
  Generates next state s′ ∈ S with probability P(s, a, s′)
  Using unknown reward and transition functions R and P

Goal: Find a policy π : S → A that maximizes future rewards

Example: A Resource Gathering Simulation

[Figure: 2-D unit-square map with resources A, B, C, D and puddle danger zones]

Simulated Robot's Task
  Gather each of n resources
  Navigate around danger zones

n + 2 State Variables
  Boolean flag for each resource: A, B, ...
  x and y coordinates

n + 4 Actions
  north, south, east, west change x and y
  pickupA sets flag A if near resource A, etc.
  Actions cost −1 generally but up to −40 in "puddles"
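
To make these dynamics concrete, here is a minimal Python sketch of such an environment. The class and method names, step size, pickup radius, and the flat puddle penalty are illustrative assumptions, not the implementation evaluated in the thesis.

    import numpy as np

    class ResourceWorld:
        def __init__(self, resources):
            self.resources = resources                        # name -> (x, y)
            self.pos = np.array([0.5, 0.5])                   # x, y coordinates
            self.flags = {name: False for name in resources}  # n Boolean flags

        def in_puddle(self):
            return False  # placeholder for the danger-zone test

        def step(self, action):
            moves = {"north": (0.0, 0.05), "south": (0.0, -0.05),
                     "east": (0.05, 0.0), "west": (-0.05, 0.0)}
            if action in moves:                               # change x and y
                self.pos = np.clip(self.pos + moves[action], 0.0, 1.0)
            elif action.startswith("pickup"):                 # e.g. "pickupA"
                name = action[len("pickup"):]
                if np.linalg.norm(self.pos - np.array(self.resources[name])) < 0.1:
                    self.flags[name] = True                   # only if near
            reward = -40.0 if self.in_puddle() else -1.0      # flat penalty here
            done = all(self.flags.values())                   # all n gathered
            state = (tuple(self.pos), frozenset(k for k, v in self.flags.items() if v))
            return state, reward, done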

Evaluating Policies with Value Functions

The Bellman Equation
  State value depends on policy and action values
  Long-term value equals present value plus future value

  Vπ(s) = Qπ(s, π(s))
  Qπ(s, a) = R(s, a) + γ ∑s′ P(s, a, s′) Vπ(s′)

In matrix-vector form:

  Vπ = πQπ
  Qπ = R + γPVπ

[Diagram: dataflow from (s, a) through the policy π, reward R, and transition P to Q and V]
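
As a concrete instance of these equations, here is a short sketch of iterative policy evaluation for a finite MDP; the array layout (R[s, a], P[s, a, s′], and a deterministic pi[s]) is an assumption for illustration.

    import numpy as np

    def evaluate_policy(R, P, pi, gamma=0.99, tol=1e-8):
        n_states, n_actions = R.shape
        V = np.zeros(n_states)
        while True:
            # Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s')
            Q = R + gamma * (P @ V)
            # V(s) = Q(s, pi(s))
            V_new = Q[np.arange(n_states), pi]
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q
            V = V_new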

Example: An Optimal Value Function

Some policy achieves the maximal value function V∗
Planning algorithms compute V∗ from R, P
But RL algorithms don't know R and P

[Figures: resource-gathering map alongside value surfaces V∗(x, y, {C}), V∗(x, y, {C, D}), and V∗(x, y, {D}) over the unit square, with values ranging from −70 to 0]

Standard Approach: Learn the Value Function

Temporal Difference Learning (Sutton and Barto, 1998)
  Estimate Vπ directly from data.
  Given each piece of data ⟨s, a, r, s′⟩:
    r + γV̂π(s′) is an estimate of Vπ(s).
    Update V̂π(s) towards this estimate.
    Improve π.
  Converges to the optimal policy in the limit, given appropriate data.
  In practice, converges very slowly!

Most RL research focuses on ways to compute value functions more efficiently from data.

[Diagram: value function estimation maps data to Q and π, which drive action selection]
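
A minimal sketch of the tabular TD(0) update just described; the dictionary representation, the step size alpha, and the function name are illustrative assumptions.

    def td0_update(V_hat, s, r, s_next, alpha=0.1, gamma=0.99):
        # r + gamma * V_hat(s') is a sample estimate of V_pi(s)
        target = r + gamma * V_hat.get(s_next, 0.0)
        # Update V_hat(s) towards this estimate
        V_hat[s] = V_hat.get(s, 0.0) + alpha * (target - V_hat.get(s, 0.0))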

Scaling to Real-World Problems

Theory: Eventual convergence to optimal behavior
Practice: Too slow for interesting problems

Branches of RL Research
  Function Approximation
  Hierarchical RL
  Relational RL
  Inverse RL
  Etc.

[Diagram: Venn diagram placing Hierarchical Decomposition and Function Approximation within Reinforcement Learning]

Exploration and Exploitation

Exploitation
  How to estimate Q∗ from data
  Focus of most RL research

Exploration
  How to gather better data
  Emphasized by model-based RL
  Focus of this thesis

[Diagram: the model-based loop; data from the environment feeds model estimation of R, P; planning produces Q and π, which drive action selection]

Thesis Contributions

Merging Branches of RL
  Previously studied in isolation
  Demonstration of synergies

[Diagram: Venn diagram of Model-Based Exploration, Hierarchical Decomposition, and Function Approximation within Reinforcement Learning; the pairwise overlaps are Hierarchical Exploration and Approximate Exploration, and the three-way overlap is Structured Exploration]

Efficient exploration in continuous state spaces
Efficient exploration given hierarchical knowledge
Framework for combining algorithmic ideas
Publicly available implementation of final agent

Outline

1 Introduction
2 Exploration and Approximation
  Model-Based Exploration
  Generalization in Large State Spaces
  The Fitted R-MAX Algorithm
3 Exploration and Hierarchy
4 Conclusion

Model-Based Reinforcement Learning

[Diagram: the model-based loop; data feeds model estimation of R, P; planning produces Q and π for action selection]

Indirection Permits Simplicity
  R, P predict only one time step
  R, P involve only one action at a time
  Direct training data permits supervised learning

Uncertainty Guides Exploration
  Use model of known states to reach the unknown
  First polynomial-time sample-complexity bounds (Kearns and Singh, 1998; Kakade, 2003)

Simple and Efficient Learning with R-MAX (Moore and Atkeson, 1993; Brafman and Tennenholtz, 2002)

Maximum-Likelihood Estimation
  Straightforward in finite state spaces
  Unreliable with small sample sizes

Small sample sizes: use an optimistic model, treating each unknown state-action as if it earned reward rmax and reached a maximally valued state (value Vmax).
Given enough data: use the MLE model, with empirical transition probabilities (e.g., 0.8 / 0.2 / 0.0) and mean observed rewards.

A minimal sketch of this known/unknown switch follows.
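
A minimal sketch of R-MAX's model substitution, assuming tabular visit counts and maximum-likelihood estimates; the construction here (one absorbing, maximally rewarding state standing in for all unknown state-actions) is one common way to realize the optimism, and all names are illustrative.

    import numpy as np

    def rmax_model(counts, R_mle, P_mle, m, r_max):
        n_states, n_actions = counts.shape
        # Append one absorbing, maximally rewarding "optimistic" state
        R = np.full((n_states + 1, n_actions), r_max)
        P = np.zeros((n_states + 1, n_actions, n_states + 1))
        P[:, :, n_states] = 1.0                   # default: unknown -> optimism
        known = counts >= m                       # enough data: trust the MLE
        R[:n_states][known] = R_mle[known]
        P[:n_states, :, n_states][known] = 0.0    # drop the optimistic jump
        P[:n_states, :, :n_states][known] = P_mle[known]
        return R, P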

Challenges for Model-Based Reinforcement Learning

Computational Complexity
  MDP planning can be expensive...
  But CPU cycles are cheaper than data

Representational Complexity
  State distributions are harder to represent than scalar values...
  But simple approximations may suffice

Exhaustive Exploration
  Exploring every unknown state seems unnecessary...
  But intuitive domain knowledge can constrain exploration

Function Approximation

Problem: Exact representation of V requires a parameter for each state. Many environments have infinitely many states!

Key Idea: Represent Vπ using a small number of parameters.

Examples
  The weights of a neural network
  Coefficients of some basis functions: Vπ = ∑i wπi φi

Generalization of values
  Changing Vπ(s) changes one or more parameters.
  Each parameter influences the value of several states.
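
For instance, a basis-function representation of Vπ over a one-dimensional state space might look like the sketch below; the Gaussian bases, their centers, and their width are illustrative assumptions.

    import numpy as np

    centers = np.linspace(0.0, 1.0, 10)   # 10 basis-function centers
    width = 0.2
    w = np.zeros(10)                      # one parameter per basis function

    def phi(s):
        # Basis functions phi_i evaluated at state s
        return np.exp(-((centers - s) ** 2) / width ** 2)

    def V(s):
        # V_pi(s) = sum_i w_i phi_i(s); each w_i influences many states
        return w @ phi(s)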

Fitted Value Iteration (Gordon, 1995)

Averagers
  Parameterize Vπ with values Vπ(X) on a finite sample X ⊂ S
  Vπ(s) is a weighted average ∑x∈X φ(s, x)Vπ(x)

Discrete Planning in Continuous State Spaces
  Approximate planning with an exact MDP
  = Exact planning with an approximate MDP

[Diagram: the averager Φ maps successor states s′ back onto the sample states x inside the Bellman backup through R and P]
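
A sketch of fitted value iteration over a sampled model, assuming each sample state x and action a come with a reward R[x, a] and a single sampled successor S_next[x][a], and that Phi(s) returns normalized averager weights over the sample states; all names are illustrative.

    import numpy as np

    def fitted_value_iteration(R, S_next, Phi, gamma=0.99, sweeps=100):
        n, n_actions = R.shape
        V = np.zeros(n)                   # values at the sample states X
        for _ in range(sweeps):
            Q = np.empty((n, n_actions))
            for a in range(n_actions):
                # Project each successor back onto X: V(s') = Phi(s') . V
                W = np.array([Phi(S_next[x][a]) for x in range(n)])
                Q[:, a] = R[:, a] + gamma * (W @ V)
            V = Q.max(axis=1)             # greedy backup at the sample states
        return V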

Model Approximation (Jong and Stone, 2007b)

Approximate sa using instances i = ⟨si, ai, ri, s′i⟩
  Ψ(sa, i): Model averager weighting sa against siai
  DP(si, s′): Empirical effect of applying the transition at i to s

[Diagram: the predicted successor distribution for sa averages the observed transitions of nearby instances i1, i2, i3, e.g. with weights 0.44, 0.38, and 0.18]
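
A sketch of this instance-based prediction: the model for sa averages the empirical effects of nearby instances that used the same action. The kernel K and the relative-effect reading of DP are assumptions spelled out for illustration; at least one matching instance is assumed.

    import numpy as np

    def predict(s, a, instances, K):
        # instances: list of (s_i, a_i, r_i, s_next_i); K is a state kernel
        matches = [(si, ri, sn) for (si, ai, ri, sn) in instances if ai == a]
        w = np.array([K(s, si) for (si, _, _) in matches])
        w = w / w.sum()                           # normalized Psi(sa, i)
        # D_P: apply each instance's observed effect (s_next_i - s_i) to s
        successors = [s + (sn - si) for (si, _, sn) in matches]
        mean_reward = w @ np.array([ri for (_, ri, _) in matches])
        return list(zip(w, successors)), mean_reward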

Fitted R-MAX (Jong and Stone, 2007a)

Model approximation, R-MAX exploration, value approximation

[Diagram: the model-based loop with model estimation expanded: the averagers Ψ, DP, and DR predict R and P from instances si; an "Unknown?" test substitutes Vmax and Pterm for insufficiently supported state-actions; the value averager Φ maps successors s′ back to sample states]

An Instance of Fitted R-MAX

Model Averager
  ψ(sa, si) ∝ K(s, si) δ(a, ai)
  K(s, s′) = exp(−d(s, s′)²/b²)  (radial basis functions)

R-MAX Exploration
  sa known if sufficient weight: ∑i|ai=a K(s, si) ≥ m

Value Averager
  Interpolation over a uniform grid
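
A sketch of this known-ness test with the Gaussian kernel above; the Euclidean distance and the function name are illustrative assumptions.

    import numpy as np

    def is_known(s, a, instances, b, m):
        # sa is known once matching-action instances carry total weight >= m
        weight = sum(np.exp(-np.linalg.norm(np.asarray(s) - np.asarray(si)) ** 2 / b ** 2)
                     for (si, ai, _, _) in instances if ai == a)
        return weight >= m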

Benchmark Performance

[Figure: reward per episode over 1000 episodes for Fitted R-MAX, XAI, Least-Squares Policy Iteration, R-MAX, and Neural Fitted Q Iteration]

For n = 1 resource, almost equivalent to the benchmark domain "Puddle World"
Can compare against performance data from the NIPS RL Benchmarking Workshop (2005)
State-of-the-art algorithms implemented and tuned by other researchers

Fitted Value Functions for PuddleWorld

[Figures: learned policies and value functions with 250, 500, 750, 1000, 1500, 2000, 3000, 4000, and 5000 instances; each shows a policy map and a value surface over the unit square, with values ranging roughly from −70 to 0]

Generalization and Exploration

Inductive Bias
  Model-Free: Similar states have similar values
  Model-Based: Similar states have similar dynamics

Model Generalization
  Is the effect of sa known or unknown?
  Less generalization leads to more exploration

Value Generalization
  How good is my policy π?
  Less generalization leads to more computation

Outline

1 Introduction
2 Exploration and Approximation
3 Exploration and Hierarchy
  Hierarchical Decomposition
  The R-MAXQ and Fitted R-MAXQ Algorithms
  The Utility of Hierarchy
4 Conclusion

The Appeal of Hierarchy

Realistic Problems
  Many states and many actions...
  But also deep structure
  Multiple levels of abstraction
  Local dependencies

Structured Learning and Planning
  Don't write all programs in assembly!
  Reason above the level of primitive actions.

[Diagram: task hierarchy; Drive to Campus decomposes into subtasks such as Reach Highway, which decomposes into primitives such as Turn Left]

Hierarchy in Reinforcement Learning

Options (Sutton, Precup, and Singh, 1999)
  Partial policies as macros
  An option o comprises:
    An initiation set Io ⊂ S
    An option policy πo : S → A
    A termination function T o : S → [0, 1]

MAXQ (Dietterich, 2000)
  A hierarchy of RL problems
  A task o comprises:
    A set of subtasks Ao
    A set of terminal states T o
    A goal reward function Go : T o → ℝ

MAXQ Value Function Decomposition

High-Level Rewards are Low-Level Values
  Separate Qo into components Qoa by action
  Compute Roa = V a recursively
  Learn Coa := γPoa V o directly

Unrolling the recursion through the driving hierarchy yields

  V Drive(s) = V Turn(s) + CReachTurn(s) + CDriveReach(s)

[Diagram: QDrive routes each action to a component such as QDriveReach, which decomposes into the subtask value V Reach and the completion term CDriveReach]
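
With the superscripts restored, the decomposition reads as follows in LaTeX notation, following Dietterich's (2000) formulation; for a primitive action a, V^a(s) is simply R(s, a).

    \begin{align*}
      V^o(s)   &= Q^o_{\pi^o(s)}(s) \\
      Q^o_a(s) &= \underbrace{V^a(s)}_{R^o_a(s)}
                 + \underbrace{\gamma \sum_{s'} P^o_a(s, s')\, V^o(s')}_{C^o_a(s)}
    \end{align*}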

Hierarchical Model Decomposition (Jong and Stone, 2008)

High-Level Successors are Low-Level Terminals
  PDriveReach = ΩReach
  Ωo(s, s′): Discounted probability that executing o in s terminates at s′
  Ωo(·, s′) is a value function!

[Diagram: Ωo satisfies a Bellman-style recursion through πo, the one-step transition P, and the termination function T o]
-
Reinforcement Learning
Model-Based Exploration
Hierarchical Decomposition
Function Approximation
Hierarchical Exploration
Approximate Exploration
Structured Exploration
IntroductionExploration and Approximation
Exploration and HierarchyConclusion
Hierarchical DecompositionThe R-MAXQ and Fitted R-MAXQ
AlgorithmsThe Utility of Hierarchy
The R-MAXQ Algorithm(Jong and Stone, 2008)
Planning
Action Selection
Data
Q
π
Model Estimation
R,P
Primitive Tasks
Learn primitive models from data
Splice in R-MAX optimistic exploration (a sketch follows below)
Result: V^a and Ω^a
Composite Tasks
Concatenate subtask V^a and Ω^a into R^o and P^o
Plan π^o using MAXQ goal rewards
Evaluate π^o without goal rewards
Result: V^o and Ω^o
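A minimal sketch (not the thesis code) of the R-MAX splice for a primitive model, assuming a tabular domain; the names m, r_max, and the dict layout are illustrative:

from collections import Counter, defaultdict

class SplicedPrimitiveModel:
    """Empirical model of one primitive action, with R-MAX optimism
    spliced in for state-actions visited fewer than m times."""
    def __init__(self, m=5, r_max=0.0):
        self.m, self.r_max = m, r_max
        self.visits = Counter()                  # (s, a) -> visit count
        self.reward_sum = defaultdict(float)     # (s, a) -> summed reward
        self.successors = defaultdict(Counter)   # (s, a) -> Counter of s'

    def update(self, s, a, r, s_next):
        self.visits[s, a] += 1
        self.reward_sum[s, a] += r
        self.successors[s, a][s_next] += 1

    def reward(self, s, a):
        if self.visits[s, a] < self.m:
            return self.r_max        # optimistic reward for unknown (s, a)
        return self.reward_sum[s, a] / self.visits[s, a]

    def transition(self, s, a):
        if self.visits[s, a] < self.m:
            return {s: 1.0}          # self-loop: value becomes r_max / (1 - gamma)
        n = self.visits[s, a]
        return {s2: c / n for s2, c in self.successors[s, a].items()}

Planning with the spliced model drives the agent toward under-visited state-actions; in R-MAXQ the resulting primitive V^a and Ω^a are then assembled into each composite task's model.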
-
The Fitted R-MAXQ Algorithm
Reinforcement Learning
Model-Based Exploration
Hierarchical Decomposition
Function Approximation
Hierarchical Exploration
Approximate Exploration
Structured Exploration
-
The Fitted R-MAXQ Algorithm (Jong and Stone, 2009)
[Diagram: Data (R_D, P_D) → Model Approximation (Ψ^a) with Optimistic Exploration (U^a) → V^a, Ω^a → Value Approximation (Φ^o) → R^o, P^o → Prediction and Planning (T^o, G^o) → V^o, Ω^o, π^o]
Primitive Action Models
Define V^a = U^a V_max + (I − U^a) Ψ^a R_D
Define Ω^a = (I − U^a) Ψ^a P_D
Prediction
Solve V^o = π^o(R^o + γ P^o (I − T^o) V^o)
Solve Ω^o = π^o(P^o T^o + γ P^o (I − T^o) Ω^o)
Planning
Optimize Ṽ^o = T^o G^o + (I − T^o) π^o(R^o + γ P^o Ṽ^o)
Value Approximation
Define R^o[sa] = V^a[s]
Define P^o[sa, x] = Σ_{s′} Ω^a[s, s′] Φ^o[s′, x]
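A minimal NumPy sketch of the prediction step, assuming the task's states and state-actions can be enumerated; Pi is the |S| × |SA| matrix selecting each state's action under π^o, P is the |SA| × |S| transition matrix, and T is the diagonal terminal indicator:

import numpy as np

def predict(Pi, R, P, T, gamma):
    """Solve the two fixed points from the slide:
       V     = Pi (R + gamma P (I - T) V)
       Omega = Pi (P T + gamma P (I - T) Omega)"""
    n = Pi.shape[0]
    A = np.eye(n) - gamma * Pi @ P @ (np.eye(n) - T)  # shared system matrix
    V = np.linalg.solve(A, Pi @ R)           # reward accrued until o terminates
    Omega = np.linalg.solve(A, Pi @ P @ T)   # discounted termination distribution
    return V, Omega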
-
The Software Architecture
Algorithm
1 Execute π^Root hierarchically
2 Update data: R_D and P_D
3 Propagate changes to π^Root
4 Repeat
Averagers
Φ: interpolation over a uniform grid (a sketch follows below)
Ψ: radial basis functions
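A minimal sketch of the Φ averager, assuming a uniform grid over the 2-D unit square: each continuous state becomes a convex combination of the four surrounding grid points (bilinear interpolation), which is what keeps the fitted model a well-defined MDP over grid states.

def grid_weights(x, y, resolution):
    """Bilinear interpolation weights over a uniform grid on [0, 1]^2.
    Returns {(i, j): weight} with nonnegative weights summing to 1,
    i.e. an averager in Gordon's (1995) sense."""
    gx, gy = x * resolution, y * resolution
    i = min(int(gx), resolution - 1)   # lower-left corner of the grid cell
    j = min(int(gy), resolution - 1)
    fx, fy = gx - i, gy - j            # fractional position within the cell
    return {
        (i,     j):     (1 - fx) * (1 - fy),
        (i + 1, j):     fx * (1 - fy),
        (i,     j + 1): (1 - fx) * fy,
        (i + 1, j + 1): fx * fy,
    }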
Optimizations
Memoization and DP
Prioritized sweeping
Sparse representations
Cover trees for online nearest neighbors
-
Example: A Resource Gathering Simulation
[Figure: the unit-square map with resources A, B, C, D and puddle danger zones]
Simulated Robot’s Task
Gather each of n resources
Navigate around danger zones
n + 2 State Variables
Boolean flag for each resource: A, B, ...
x and y coordinates
n + 4 Actions
north, south, east, west change x and y
pickupA sets flag A if near resource A, etc.
Actions cost −1 generally but up to −40 in “puddles” (a sketch of this domain follows below)
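A minimal sketch of such a domain; the step size, pickup radius, and puddle-penalty shape are illustrative assumptions, tuned only to match the slide's costs (−1 per step, down to −40 inside a puddle):

class ResourceGatherEnv:
    """Unit-square world: n Boolean resource flags plus (x, y)."""
    def __init__(self, resources, puddles, step=0.05, radius=0.1):
        self.resources = resources    # name -> (x, y) location
        self.puddles = puddles        # list of (cx, cy, r) danger zones
        self.step, self.radius = step, radius
        self.x, self.y = 0.0, 0.0
        self.flags = {name: False for name in resources}

    def _near(self, px, py, r):
        return ((self.x - px) ** 2 + (self.y - py) ** 2) ** 0.5 < r

    def _cost(self):
        penalty = 0.0                 # -1 normally, up to -40 inside a puddle
        for cx, cy, r in self.puddles:
            d = ((self.x - cx) ** 2 + (self.y - cy) ** 2) ** 0.5
            if d < r:
                penalty = max(penalty, 39.0 * (1.0 - d / r))
        return -1.0 - penalty

    def act(self, action):
        if action == "north":   self.y = min(1.0, self.y + self.step)
        elif action == "south": self.y = max(0.0, self.y - self.step)
        elif action == "east":  self.x = min(1.0, self.x + self.step)
        elif action == "west":  self.x = max(0.0, self.x - self.step)
        elif action.startswith("pickup"):   # pickupA, pickupB, ...
            name = action[len("pickup"):]
            if self._near(*self.resources[name], self.radius):
                self.flags[name] = True
        return self._cost()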
-
The Utility of Hierarchy and Model Generalization
[Plots: reward per episode (episodes 0–40) and cumulative reward (episodes 0–100), comparing Fitted R-MAXQ, Fitted R-MAX, R-MAXQ, and R-MAX]
Model generalization allows Fitted R-MAX to outperform R-MAX.
Hierarchical decomposition allows R-MAXQ to outperform R-MAX.
These two ideas synergize in Fitted R-MAXQ!
-
Task Hierarchies as Domain Knowledge
[Diagrams of three task hierarchies:
Flat — Root → north, south, east, west, pickupA, ...
Shallow — Root → GatherA, ...; GatherA → north, south, east, west, pickupA
Deep — Root → GatherA, ...; GatherA → Navigate, pickupA; Navigate → Location1, ...; Location1 → north, south, east, west]
The flat hierarchy only knows the model averagers.
The shallow hierarchy also knows that gathering each resource is independent.
The deep hierarchy also knows the set of resource locations (but must still associate each resource with its location); a sketch of these hierarchies as data follows below.
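A minimal sketch of these three hierarchies written down as data, assuming each task maps to its child tasks or primitive actions (the two-resource instantiation and names like Location2 are illustrative):

MOVES = ["north", "south", "east", "west"]

flat = {"Root": MOVES + ["pickupA", "pickupB"]}        # primitives only

shallow = {                                            # one subtask per resource
    "Root":    ["GatherA", "GatherB"],
    "GatherA": MOVES + ["pickupA"],
    "GatherB": MOVES + ["pickupB"],
}

deep = {                                               # shared navigation subtasks
    "Root":      ["GatherA", "GatherB"],
    "GatherA":   ["Navigate", "pickupA"],
    "GatherB":   ["Navigate", "pickupB"],
    "Navigate":  ["Location1", "Location2"],           # known resource locations
    "Location1": MOVES,
    "Location2": MOVES,
}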
-
Soft Inductive Bias in Hierarchy
[Figure: the map with current state s and an unexplored state x in a puddle near resource D; shallow hierarchy Root → GatherA, ... → primitives]
From s, explore the unknown state in the puddle or exploit the known solution?
Flat Hierarchy
Optimism about the unknown effects of pickupD at x outweighs the value of the known solution: Ṽ(s) > V^π(s).
Shallow Hierarchy
The value of pickupD at x is less than the value of the known solution in the context of GatherD: Ṽ^GatherD(s) < V^π_GatherD(s).
-
Hierarchical Constraint and Reformulation
Hierarchies can find embedded structure.
Hierarchies can constrain policies and therefore exploration.
Deep Hierarchy
pickup actions are only possible before or after Navigate tasks.
[Diagrams: the deep task hierarchy (Root → GatherA, ...; GatherA → Navigate, pickupA; Navigate → Location1, ...; Location1 → movement primitives) and the unit-square map with resources A, B, C, D]
-
Outline
1 Introduction
2 Exploration and Approximation
3 Exploration and Hierarchy
4 Conclusion
Future Work
Summary
-
Discovering Abstractions
[Diagram: task hierarchy Drive to Campus → Reach Highway → Turn Left, ...]
We can now use task hierarchies to efficiently explore continuous environments.
Can we discover composite tasks automatically?
What makes a good subtask?
Other research: “bottleneck states”
My conjecture: “sets of relevant features” (Jong and Stone, 2005)
-
Prior Distributions Over Inductive Biases
No Free Lunch
No algorithm can learn or discover efficiently in all possible worlds!
Bayesian Reinforcement Learning
Begin with a prior distribution over environments
Plan over “belief states”
Update the belief distribution given data (a sketch of the update follows below)
Key question: What is the right prior distribution?
Conjecture: Distributions over task hierarchies
Goal: Efficient approximation of the optimal Bayesian solution
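A minimal sketch of the belief update for a discrete environment, assuming an independent Dirichlet prior over each state-action's next-state distribution; conjugacy reduces the Bayesian update to a count increment:

from collections import Counter

class DirichletBelief:
    """Belief over transition models: Dirichlet(prior + counts) per (s, a)."""
    def __init__(self, states, prior=1.0):
        self.states, self.prior = states, prior
        self.counts = {}                           # (s, a) -> Counter of s'

    def update(self, s, a, s_next):
        self.counts.setdefault((s, a), Counter())[s_next] += 1.0

    def mean_transition(self, s, a):
        """Posterior mean of P(s' | s, a)."""
        c = self.counts.get((s, a), Counter())
        total = self.prior * len(self.states) + sum(c.values())
        return {s2: (self.prior + c[s2]) / total for s2 in self.states}

Planning over belief states means treating the pair (state, counts) as the state; the conjecture above is that priors structured by task hierarchies could make this tractable.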
-
Natural Knowledge Representations
Model-free methods learn a monolithic value function.
Models are a natural form of domain knowledge.
Models are modular: piecewise independent.
Don’t Reinvent the Wheel
Exploit a known reward function
Exploit known dynamics of some actions
Exploit known dynamics of some state variables (a sketch of such composition follows below)
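A minimal sketch of that modularity, assuming the model factors per state variable so that hand-coded components and learned components can be mixed freely (all component names here are illustrative):

def compose_model(known, learned):
    """Predict the next state variable-by-variable, preferring a
    hand-coded component when one exists."""
    def predict(state, action):
        return {var: (known.get(var) or learned[var])(state, action)
                for var in state}
    return predict

# E.g., movement geometry is known; resource-flag dynamics are learned:
# model = compose_model(known={"x": move_x, "y": move_y},
#                       learned={"flagA": flag_a_model})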
-
Connections to Other Fields of Artificial Intelligence