Lecture 1: Introduction to Reinforcement Learning 1
Emma Brunskill
CS234 Reinforcement Learning
Winter 2019
1. Today the 3rd part of the lecture is based on David Silver's introduction to RL slides.
Emma Brunskill (CS234 Reinforcement Learning), Winter 2019
Today’s Plan
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty
Reinforcement Learning
Learn to make good sequences of decisions
Repeated Interactions with World
Learn to make good sequences of decisions
Reward for Sequence of Decisions
Learn to make good sequences of decisions
Don’t Know in Advance How World Works
Learn to make good sequences of decisions
Fundamental challenge in artificial intelligence and machine learning is learning to make good decisions under uncertainty
RL, Behavior & Intelligence
Figure: Example from Yael Niv
Childhood: primitive brain & eye, swims around, attaches to a rock
Adulthood: digests brain, sits
Suggests brain is helping guide decisions (no more decisions, no need for brain?)
Atari
Figure: DeepMind Nature, 2015
Loss functions, derivatives, and gradient descent should be familiar
Have heard of Markov decision processes and RL before in an AI or ML class
We will cover the basics, but quickly
End of Class Goals
Define the key features of reinforcement learning that distinguish it from AI and non-interactive machine learning (as assessed by the exam)
Given an application problem (e.g. from computer vision, robotics, etc.), decide if it should be formulated as an RL problem; if yes, be able to define it formally (in terms of the state space, action space, dynamics and reward model), state what algorithm (from class) is best suited to addressing it, and justify your answer (as assessed by the project and the exam)
Implement (in code) common RL algorithms including a deep RL algorithm (as assessed by the homeworks)
Describe (list and define) multiple criteria for analyzing RL algorithms and evaluate algorithms on these metrics: e.g. regret, sample complexity, computational complexity, empirical performance, convergence, etc. (as assessed by homeworks and the exam)
Describe the exploration vs exploitation challenge and compare and contrast at least two approaches for addressing this challenge (in terms of performance, scalability, complexity of implementation, and theoretical guarantees) (as assessed by an assignment and the exam)
Grading
Assignment 1: 10%
Assignment 2: 20%
Assignment 3: 15%
Midterm: 25%
Quiz: 5%
Final Project: 25%
    Proposal: 1%
    Milestone: 3%
    Poster presentation: 5%
    Final Report: 16%
Communication
We believe students often learn an enormous amount from each otheras well as from us, the course staff.
We will use Piazza to facilitate discussion and peer learning
Please use Piazza for all questions related to lectures, homeworks, and projects
Grading
Late policy
6 free late days. See webpage for details on how many may be used per assignment/project and penalties for using more
Collaboration: see webpage and reach out to us if you have any questions about what is considered permissible collaboration
Today’s Plan
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty
Sequential Decision Making
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Example: Web Advertising
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Example: Robot Unloading Dishwasher
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Example: Blood Pressure Control
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
May require strategic behavior to achieve high rewards
Sequential Decision Process: Agent & the World (Discrete Time)
Each time step t:
Agent takes an action a_t
World updates given action a_t, emits observation o_t and reward r_t
Agent receives observation o_t and reward r_t
History: Sequence of Past Observations, Actions & Rewards
History h_t = (a_1, o_1, r_1, . . . , a_t, o_t, r_t)
Agent chooses action based on history
State is information assumed to determine what happens next
Function of history: s_t = f(h_t)
World State
This is true state of the world used to determine how world generatesnext observation and reward
Often hidden or unknown to agent
Even if known may contain information not needed by agent
Agent State: Agent’s Internal Representation
What the agent / algorithm uses to make decisions about how to act
Generally a function of the history: s_t = f(h_t)
Could include meta information like state of algorithm (how many computations executed, etc.) or decision process (how many decisions left until an episode ends)
Markov Assumption
Information state: sufficient statistic of history
State s_t is Markov if and only if:
p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)
Future is independent of past given present
Why is Markov Assumption Popular?
Can always be satisfied
Setting state as history is always Markov: s_t = h_t
In practice often assume most recent observation is sufficient statistic of history: s_t = o_t
State representation has big implications for:
Full Observability / Markov Decision Process (MDP)
Environment and world state: s_t = o_t
Partial Observability / Partially Observable MarkovDecision Process (POMDP)
Agent state is not the same as the world state
Agent constructs its own state, e.g.
Use history s_t = h_t, or beliefs of world state, or RNN, ...
Partial Observability Examples
Poker player (only sees own cards)
Healthcare (don’t see all physiological processes)
Agent state is not the same as the world state
Agent constructs its own state, e.g.
Use history s_t = h_t, or beliefs of world state, or RNN, ...
Types of Sequential Decision Processes: Bandits
Bandits: actions have no influence on next observations
No delayed rewards
Types of Sequential Decision Processes: MDPs andPOMDPs
Actions influence future observations
Credit assignment and strategic actions may be needed
Types of Sequential Decision Processes: How the World Changes
Deterministic: Given history and action, single observation & reward
Common assumption in robotics and controls
Stochastic: Given history and action, many potential observations & rewards
Common assumption for customers, patients, hard to model domains
RL Agent Components
Often include one or more of
Model: Agent's representation of how the world changes in response to agent's action
Policy: function mapping agent's states to actions
Value function: future rewards from being in a state and/or action when following a particular policy
Model
Agent's representation of how the world changes in response to agent's action
Transition / dynamics model predicts next agent state
p(s_{t+1} = s' | s_t = s, a_t = a)
Reward model predicts immediate reward
r(s, a) = E[r_t | s_t = s, a_t = a]
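The two model components above can be written down concretely. Below is a minimal Python sketch, not from the slides: the two-state chain, its transition probabilities, and its rewards are hypothetical, and the tabular dictionary representation is just one possible choice.

```python
import random

# Hypothetical tabular model: transition model p(s' | s, a) and
# reward model r(s, a) stored as plain dictionaries.
P = {
    ("s1", "right"): {"s1": 0.5, "s2": 0.5},
    ("s2", "right"): {"s2": 1.0},
}
R = {("s1", "right"): 0.0, ("s2", "right"): 1.0}

def sample_next_state(s, a, rng=random):
    """Draw s' ~ p(. | s, a) from the model."""
    states, probs = zip(*P[(s, a)].items())
    return rng.choices(states, weights=probs, k=1)[0]

def expected_reward(s, a):
    """r(s, a) = E[r_t | s_t = s, a_t = a]."""
    return R[(s, a)]
```

A dictionary model only works for small, discrete problems; later lectures replace these tables with learned function approximators.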
Policy
Policy π determines how the agent chooses actions
π : S → A, mapping from states to actions
Deterministic policy: π(s) = a
Stochastic policy:
π(a|s) = Pr(a_t = a | s_t = s)
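The two policy types above can be sketched in a few lines of Python; the state names, action set, and probabilities here are hypothetical, chosen only to illustrate the definitions.

```python
import random

# Hypothetical action set for illustration.
ACTIONS = ["left", "right"]

def deterministic_policy(s):
    """Deterministic pi: S -> A; here pi(s) = "right" for every state."""
    return "right"

def stochastic_policy(s, rng=random):
    """Stochastic pi(a|s): sample an action; here pi(right|s) = 0.7."""
    return rng.choices(ACTIONS, weights=[0.3, 0.7], k=1)[0]
```

The key difference: calling the deterministic policy twice on the same state always returns the same action, while the stochastic policy returns a fresh sample from π(·|s) each time.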
Value
Value function V^π: expected discounted sum of future rewards under a particular policy π
Numbers show the value V^π(s) for this policy and this discount factor
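One way to make "expected discounted sum of future rewards" concrete is a Monte Carlo sketch: average discounted returns over rollouts. This is an illustration, not a specific algorithm from class; the environment `step` function and policy are supplied by the caller.

```python
# V^pi(s) ~ average over rollouts of sum_k gamma^k * r_k, starting from s.
def rollout_return(start, policy, step, gamma, horizon=50):
    """Discounted return of one truncated rollout from `start`."""
    s, total, discount = start, 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, policy(s))  # environment transition and reward
        total += discount * r
        discount *= gamma
    return total

def estimate_value(start, policy, step, gamma, n_rollouts=100):
    """Average return over rollouts approximates V^pi(start)."""
    return sum(rollout_return(start, policy, step, gamma)
               for _ in range(n_rollouts)) / n_rollouts
```

For example, in a toy environment that pays reward 1 every step, the estimate approaches 1/(1 - γ), the geometric series the discount factor induces.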
Example: Mars Rover Model
[Figure: Mars rover states s1 through s7, each labeled with the agent's reward estimate r̂ = 0]
Agent can construct its own estimate of the world models (dynamics and reward)
In the above the numbers show the agent’s estimate of the rewardmodel
Agent’s transition model
0.5 = P(s1|s1, right) = P(s2|s1, right) = · · ·
Model may be wrong
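The agent's estimated model on this slide can be simulated directly. A sketch in Python: from s1, action "right" leads to s1 or s2 with probability 0.5 each, and the estimated reward is 0 in every state; extending the 0.5/0.5 split to the other states is an assumption made here for illustration.

```python
import random

STATES = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]

def model_step(s, a, rng=random):
    """Sample (next state, estimated reward) from the agent's model."""
    i = STATES.index(s)
    j = min(i + 1, 6) if a == "right" else max(i - 1, 0)
    next_s = s if rng.random() < 0.5 else STATES[j]  # stay put with prob 0.5
    return next_s, 0.0  # r_hat = 0 everywhere in this estimate
```

Because this is the agent's estimate rather than the true dynamics, rollouts of `model_step` can disagree with the real world, which is exactly the "model may be wrong" caveat.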
Types of RL Agents: What the Agent (Algorithm) Learns
Value-based
Explicit: Value function
Implicit: Policy (can derive a policy from the value function)
Policy-based
Explicit: Policy
No value function
Actor-Critic
Explicit: Policy
Explicit: Value function
Types of RL Agents
Model-based
Explicit: Model
May or may not have policy and/or value function
Model-free
Explicit: Value function and/or policy function
No model
RL Agents
Figure: From David Silver's RL course
Key Challenges in Learning to Make Sequences of Good Decisions
Planning (agent's internal computation)
Given model of how the world works
Dynamics and reward model
Algorithm computes how to act in order to maximize expected reward
With no interaction with real environment
Reinforcement learning
Agent doesn't know how world works
Interacts with world to implicitly/explicitly learn how world works
Agent improves policy (may involve planning)
Planning Example
Solitaire: single player card game
Know all rules of game / perfect model
If take action a from state s
Can compute probability distribution over next state
Can compute potential score
Can plan ahead to decide on optimal action
E.g. dynamic programming, tree search, ...
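The core operation behind such planning can be sketched as a one-step lookahead with a known model: pick the action that maximizes immediate reward plus the discounted expected value of the next state. This is an illustrative sketch of that update, not a full solitaire planner, and the model used in the usage example is hypothetical.

```python
# One-step lookahead planning with a known model:
#   pi(s) = argmax_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * V(s') ]
def greedy_action(s, actions, P, R, V, gamma):
    """Best action at state s given model (P, R) and value estimates V."""
    def q(a):
        return R[(s, a)] + gamma * sum(p * V[sp]
                                       for sp, p in P[(s, a)].items())
    return max(actions, key=q)
```

Dynamic programming repeats exactly this backup over all states until the values converge; note that no interaction with the real environment is ever needed.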
Reinforcement Learning Example
Solitaire with no rule book
Learn directly by taking actions and seeing what happens
Try to find a good policy over time (that yields high reward)
Exploration and Exploitation
Agent only experiences what happens for the actions it tries
Mars rover trying to drive left learns the reward and next state for trying to drive left, but not for trying to drive right
Obvious! But leads to a dilemma
Exploration and Exploitation
Agent only experiences what happens for the actions it tries
How should an RL agent balance its actions?
Exploration: trying new things that might enable the agent to make better decisions in the future
Exploitation: choosing actions that are expected to yield good reward given past experience
Often there may be an exploration-exploitation tradeoff
May have to sacrifice reward in order to explore & learn about a potentially better policy
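One simple way, among many, to balance the two is epsilon-greedy action selection: with probability ε take a random action (explore), otherwise take the action with the best reward estimate so far (exploit). A sketch, with a hypothetical dictionary of action-value estimates:

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """estimates: dict mapping action -> current reward estimate."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))    # explore: random action
    return max(estimates, key=estimates.get)  # exploit: best estimate
```

Setting ε = 0 is pure exploitation and ε = 1 is pure exploration; later lectures cover more principled approaches to this tradeoff.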
Exploration and Exploitation Examples
Movies
Exploitation: Watch a favorite movie you've seen before
Exploration: Watch a new movie
Advertising
Exploitation: Show most effective ad so far
Exploration: Show a different ad
Driving
Exploitation: Try fastest route given prior experience
Exploration: Try a different route
Evaluation and Control
Evaluation
Estimate/predict the expected rewards from following a given policy
Control
Optimization: find the best policy
Example: Mars Rover Policy Evaluation
[Figure: Mars rover states s1 through s7, with the policy's arrows]
Policy represented by arrows
π(s1) = π(s2) = · · · = π(s7) = right
Discount factor, γ = 0
What is the value of this policy?
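With γ = 0 only the immediate reward counts, so evaluation collapses to V^π(s) = r(s, π(s)). A sketch of that special case; the reward function in the usage example is a placeholder, since the figure's reward numbers are not reproduced in this transcript.

```python
# Policy evaluation in the special case gamma = 0: V^pi(s) = r(s, pi(s)).
def evaluate_policy_gamma0(states, policy, reward):
    """Return {state: immediate reward of following pi for one step}."""
    return {s: reward(s, policy(s)) for s in states}
```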
Example: Mars Rover Policy Control
[Figure: Mars rover states s1 through s7]
Discount factor, γ = 0
What is the policy that optimizes the expected discounted sum ofrewards?
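Likewise, with γ = 0 control reduces to picking, in each state independently, the action with the highest immediate reward. A sketch of that reduction; the reward function in the usage example is again a placeholder, not the figure's actual numbers.

```python
# Control in the special case gamma = 0: pi*(s) = argmax_a r(s, a).
def greedy_policy_gamma0(states, actions, reward):
    """Return {state: immediately-best action} for discount factor 0."""
    return {s: max(actions, key=lambda a: reward(s, a)) for s in states}
```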
Course Outline
Markov decision processes & planning
Model-free policy evaluation
Model-free control
Value function approximation & Deep RL
Policy Search
Exploration
Advanced Topics
See website for more details
Summary
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty