Page 1
CS343: Artificial Intelligence, Advanced Applications: Robotics
Prof. Scott Niekum, The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Page 2
Robotic Helicopters
Page 3
Motivating Example
■ How do we execute a task like this?
Page 4
Autonomous Helicopter Flight
▪ Key challenges:
▪ Track helicopter position and orientation during flight
▪ Decide on control inputs to send to helicopter
Page 5
Autonomous Helicopter Setup
On-board inertial measurement unit (IMU)
Send out controls to helicopter
Position
Page 6
HMM for Tracking the Helicopter
▪ State:
▪ Measurements: [observation update]
▪ 3-D coordinates from vision, 3-axis magnetometer, 3-axis gyro, 3-axis accelerometer
▪ Transitions (dynamics): [time elapse update]
▪ s_{t+1} = f(s_t, a_t) + w_t, where f encodes helicopter dynamics and w is noise
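A common way to realize these two updates is a particle filter. Below is a 1-D toy sketch, assuming hypothetical scalar dynamics and Gaussian noise; the real tracker estimates 3-D position and orientation from all four sensor streams:

```python
import math
import random

def time_elapse(particles, a, f, noise_std):
    """Time-elapse update: push each particle through the dynamics
    s_{t+1} = f(s_t, a_t) + w_t, with w_t ~ N(0, noise_std^2)."""
    return [f(s, a) + random.gauss(0.0, noise_std) for s in particles]

def observation_update(particles, z, obs_std):
    """Observation update: reweight particles by the measurement
    likelihood p(z | s), then resample proportionally to weight."""
    weights = [math.exp(-0.5 * ((z - s) / obs_std) ** 2) for s in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    return random.choices(particles, weights=weights, k=len(particles))

# Toy 1-D "altitude" example with hypothetical dynamics f:
f = lambda s, a: s + 0.1 * a
particles = [random.gauss(10.0, 1.0) for _ in range(1000)]
particles = time_elapse(particles, a=2.0, f=f, noise_std=0.05)
particles = observation_update(particles, z=10.2, obs_std=0.1)
estimate = sum(particles) / len(particles)
```

After the precise measurement, the particle mean concentrates near the observed value.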
Page 7
Helicopter MDP
▪ State:
▪ Actions (control inputs):
▪ a_lon: Main rotor longitudinal cyclic pitch control (affects pitch rate)
▪ a_lat: Main rotor latitudinal cyclic pitch control (affects roll rate)
▪ a_coll: Main rotor collective pitch (affects main rotor thrust)
▪ a_rud: Tail rotor collective pitch (affects tail rotor thrust)
▪ Transitions (dynamics):
▪ s_{t+1} = f(s_t, a_t) + w_t [f encodes helicopter dynamics] [w is a probabilistic noise model]
▪ Can we solve the MDP yet?
Page 8
Problem: What's the Reward?
▪ Reward for hovering:
Page 9
Hover
[Ng et al., 2004]
Page 10
Problem: What's the Reward?
▪ Rewards for "Flip"?
▪ Problem: what's the target trajectory?
▪ Just write it down by hand?
▪ Penalize for deviation from trajectory
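One concrete form of "penalize for deviation from trajectory" is a quadratic penalty around hand-written waypoints. A minimal sketch; the state dimensions, waypoints, and weights below are illustrative, not the actual cost used for the flips:

```python
def trajectory_reward(s, t, target, weights):
    """Reward = negative weighted squared deviation from the target
    trajectory. s: current state vector; target[t]: desired state at
    time t; weights: per-dimension penalty weights (all hypothetical)."""
    return -sum(w * (si - ti) ** 2
                for w, si, ti in zip(weights, s, target[t]))

# Hand-written 2-D waypoints (toy stand-in for a flip trajectory):
target = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
r = trajectory_reward((0.1, 0.9), 1, target, weights=(1.0, 1.0))
```

States close to the waypoint get reward near zero; large deviations are penalized quadratically.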
Page 12
Helicopter Apprenticeship?
Page 13
Unaligned Demonstrations
Page 14
Probabilistic Alignment using a Bayes' Net
▪ Intended trajectory satisfies dynamics.
▪ Expert trajectory is a noisy observation of one of the hidden states.
▪ But we don't know exactly which one.
[Figure: Bayes' net with the intended trajectory as hidden states, expert demonstrations as noisy observations, and hidden time indices linking the two]
[Coates,Abbeel&Ng,2008]
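The Bayes'-net model treats each demo as a time-warped, noisy observation of one intended trajectory. As a simpler stand-in for that inference (not the paper's method), here is a classic dynamic-time-warping alignment cost over 1-D trajectories:

```python
def dtw_align(demo, reference):
    """Dynamic time warping: cost of the best monotone mapping of demo
    frames onto reference frames (a simpler stand-in for the Bayes'-net
    alignment of Coates, Abbeel & Ng 2008)."""
    n, m = len(demo), len(reference)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(demo[i - 1] - reference[j - 1])
            # extend the cheapest of: skip a reference frame, repeat a
            # demo frame, or advance both in lockstep
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Two time-warped versions of the same 1-D trajectory align at zero cost:
a = [0, 1, 2, 3, 3, 4]
b = [0, 1, 1, 2, 3, 4]
cost = dtw_align(a, b)
```

Once demos are aligned to a common time base, averaging them yields a clean target trajectory.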
Page 15
Aligned Demonstrations
Page 16
Final Behavior
[Abbeel, Coates, Quigley, Ng, 2010]
Page 18
Quadruped
▪ Low-level control problem: moving a foot into a new location → search with successor function ≈ moving the motors
▪ High-level control problem: where should we place the feet?
▪ Reward function R(s) = w · f(s) [25 features]
[Kolter, Abbeel & Ng, 2008]
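A sketch of the linear footstep reward R(s) = w · f(s). The paper uses 25 hand-designed terrain features; this toy version uses three hypothetical ones with illustrative weights:

```python
def footstep_reward(w, features):
    """Linear reward over terrain features: R(s) = w . f(s).
    w: weight vector (learned via apprenticeship in the paper);
    features: f(s), here three hypothetical terrain features
    (height variance, slope, foothold size)."""
    return sum(wi * fi for wi, fi in zip(w, features))

w = (-2.0, -1.0, 0.5)                      # illustrative learned weights
r_flat  = footstep_reward(w, (0.0, 0.0, 1.0))   # flat, large foothold
r_rocky = footstep_reward(w, (0.8, 0.5, 0.2))   # bumpy, sloped, small
```

The planner then prefers footholds that score higher under the learned weights.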
Page 19
Experimental setup
▪ Demonstrate path across the "training terrain"
▪ Run apprenticeship to learn the reward function
▪ Receive "testing terrain": a height map
▪ Find the optimal policy with respect to the learned reward function for crossing the testing terrain
[Kolter, Abbeel & Ng, 2008]
Page 20
Learning task objectives: Inverse reinforcement learning
Reinforcement learning basics:
MDP: (S, A, T, γ, D, R), where S: states, A: actions, T: transition dynamics, γ: discount rate, D: start state distribution, R: reward function
Policy: π(s, a) → [0, 1]
Value function: V^π(s_0) = E[Σ_{t=0}^∞ γ^t R(s_t) | π]
What if we have an MDP\R, i.e., an MDP whose reward function is missing?
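A single rollout gives one Monte-Carlo sample of this value function; a minimal sketch with toy rewards and an illustrative γ:

```python
def discounted_return(rewards, gamma):
    """One Monte-Carlo sample of V^pi(s0): the discounted sum of
    rewards collected along a single rollout under policy pi."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
v = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging this quantity over many rollouts from s_0 estimates V^π(s_0).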
Page 21
Learning task objectives: Inverse reinforcement learning
1. Collect a user demonstration (s_0, a_0), (s_1, a_1), …, (s_n, a_n) and assume it is sampled from the expert's policy π_E.
2. Explain the expert demos by finding R* such that:
E[Σ_{t=0}^∞ γ^t R*(s_t) | π_E] ≥ E[Σ_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
i.e.,  E_{s_0∼D}[V^{π_E}(s_0)] ≥ E_{s_0∼D}[V^π(s_0)]  ∀π
How can search be made tractable?
[Abbeel and Ng 2004]
Page 22
Learning task objectives: Inverse reinforcement learning
Define R* as a linear combination of features:
R*(s) = w^T φ(s), where φ: S → R^n
Then,
E[Σ_{t=0}^∞ γ^t R*(s_t) | π] = E[Σ_{t=0}^∞ γ^t w^T φ(s_t) | π]
= w^T E[Σ_{t=0}^∞ γ^t φ(s_t) | π]
= w^T μ(π)
Thus, the expected value of a policy can be expressed as a weighted sum of the expected features μ(π).
[Abbeel and Ng 2004]
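μ(π) can be estimated by averaging discounted feature sums over sampled rollouts; a sketch using a hypothetical two-feature map:

```python
def feature_expectations(rollouts, phi, gamma):
    """Monte-Carlo estimate of mu(pi) = E[sum_t gamma^t phi(s_t) | pi].
    rollouts: list of state sequences sampled from pi;
    phi: feature map S -> R^n (toy version below)."""
    n = len(phi(rollouts[0][0]))
    mu = [0.0] * n
    for states in rollouts:
        for t, s in enumerate(states):
            f = phi(s)
            for k in range(n):
                mu[k] += gamma ** t * f[k]
    return [m / len(rollouts) for m in mu]

phi = lambda s: (1.0, float(s))          # hypothetical two-feature map
mu = feature_expectations([[0, 1], [2, 3]], phi, gamma=0.5)
# value of pi under R*(s) = w^T phi(s) is then just w . mu
```

With μ(π) in hand, comparing policies under any candidate w costs a single dot product.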
Page 23
Learning task objectives: Inverse reinforcement learning
Originally: explain expert demos by finding R* such that:
E[Σ_{t=0}^∞ γ^t R*(s_t) | π_E] ≥ E[Σ_{t=0}^∞ γ^t R*(s_t) | π]  ∀π
Use expected features: E[Σ_{t=0}^∞ γ^t R*(s_t) | π] = w^T μ(π)
Restated: find w* such that:
w*^T μ(π_E) ≥ w*^T μ(π)  ∀π
[Abbeel and Ng 2004]
Page 24
Learning task objectives: Inverse reinforcement learning
Goal: find w* such that w*^T μ(π_E) ≥ w*^T μ(π)  ∀π

1. Initialize π_0 to any policy.
Iterate for i = 1, 2, …:
2. Find w* such that the expert maximally outperforms all previously examined policies π_0 … π_{i−1}:
   max_{ε, w*: ‖w*‖_2 ≤ 1} ε   s.t.   w*^T μ(π_E) ≥ w*^T μ(π_j) + ε  ∀j
3. Use RL to compute the optimal policy π_i associated with w*.
4. Stop if ε ≤ threshold.
[Abbeel and Ng 2004]
The max-margin optimization in step 2 can be solved with an SVM solver.
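The max-margin step is a quadratic program, hence the SVM solver. Abbeel & Ng (2004) also describe a simpler projection variant that needs no QP solver; a sketch of one projection update on toy 2-D feature vectors:

```python
def projection_step(mu_E, mu_bar, mu_new):
    """One step of the projection variant of apprenticeship learning
    (Abbeel & Ng 2004): move mu_bar, the closest point to mu_E among
    combinations of examined policies' features, toward mu_new, then
    return the new weight vector w = mu_E - mu_bar and margin ||w||."""
    d = [a - b for a, b in zip(mu_new, mu_bar)]   # direction toward new policy
    e = [a - b for a, b in zip(mu_E, mu_bar)]
    step = sum(x * y for x, y in zip(d, e)) / sum(x * x for x in d)
    step = max(0.0, min(1.0, step))               # stay on the segment
    mu_bar = [b + step * x for b, x in zip(mu_bar, d)]
    w = [a - b for a, b in zip(mu_E, mu_bar)]
    margin = sum(x * x for x in w) ** 0.5
    return w, mu_bar, margin

mu_E = [1.0, 1.0]                                  # expert's feature expectations
w, mu_bar, margin = projection_step(mu_E, mu_bar=[0.0, 0.0], mu_new=[1.0, 0.0])
```

The margin ‖w‖ shrinks as examined policies close in on the expert's feature expectations, giving the stopping test in step 4.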
Page 27
With learned reward function
Page 28
Robotic manipulation
Page 29
Demonstration
[RSS 2013, IJRR 2015]
Page 30
High-level task modeling
Unsegmented demonstrations of multi-step tasks → Finite-state task representation
Questions:
• How many skills?
• Parameters of skills / controllers?
• How to sequence intelligently?
Why?
• Superior generalization of skills
• Handle contingencies
• Adaptively sequence skills
Page 31
System overview
[IROS 2012]
Page 32
Joint angles
Gripper pose
Stereo data
System overview
[IROS 2012]
Page 33
Task demos
Joint angles → Forward kinematics
Object recognition
Gripper pose
Stereo data
System overview
[IROS 2012]
Page 34
Task demos
Joint angles → Forward kinematics
Object recognition
Preprocessing / BP-AR-HMM segmentation → Segmented motions
Gripper pose
Stereo data
System overview
[IROS 2012]
Page 35
Segmenting demonstrations
Standard Hidden Markov Model
[Figure: hidden motion categories x1 … x8 emitting observations y1 … y8]
[IROS 2012]
Page 36
Segmenting demonstrations
Autoregressive Hidden Markov Model
[Figure: hidden motion categories emitting observations y1 … y8, with each observation also depending on previous observations]
y_t^(i) = Σ_{j=1}^r A_{j,z_t^(i)} y_{t−j}^(i) + e_t^(i)(z_t^(i))
[IROS 2012]
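One step of this autoregressive observation model can be sketched directly. Here the lag matrices A_j are scalars, and the two motion categories and their coefficients are hypothetical:

```python
import random

def ar_step(history, A_list, noise_std):
    """One step of the AR observation model
    y_t = sum_{j=1..r} A_j * y_{t-j} + e_t  (scalar toy version; in the
    BP-AR-HMM, the A_j are matrices selected by the active motion
    category z_t)."""
    r = len(A_list)
    y = sum(A_list[j] * history[-(j + 1)] for j in range(r))
    return y + random.gauss(0.0, noise_std)

# Hypothetical categories: "reach" = smooth second-order motion,
# "hold" = stay put.
A_by_category = {"reach": [1.8, -0.8], "hold": [1.0, 0.0]}
ys = [0.0, 0.1]
for _ in range(5):
    ys.append(ar_step(ys, A_by_category["reach"], noise_std=0.0))
```

Segmentation then amounts to inferring which category's dynamics best explain each stretch of the observed trajectory.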
Page 37
Segmenting demonstrations
Autoregressive Hidden Markov Model
[Figure: inferred motion-category labels 6 6 3 1 1 3 11 10 over observations y1 … y8]
y_t^(i) = Σ_{j=1}^r A_{j,z_t^(i)} y_{t−j}^(i) + e_t^(i)(z_t^(i))
[IROS 2012]
Page 39
Segmenting demonstrations
Autoregressive Hidden Markov Model with a Beta Process prior (Fox et al. 2011), which handles the unknown number of motion categories
[Figure: inferred motion-category labels 6 6 3 1 1 3 11 10 over observations y1 … y8]
y_t^(i) = Σ_{j=1}^r A_{j,z_t^(i)} y_{t−j}^(i) + e_t^(i)(z_t^(i))
[IROS 2012]
Page 40
Task demos
Joint angles → Forward kinematics
Object recognition
Preprocessing / BP-AR-HMM segmentation → Segmented motions
Coordinate frame detection → Frame-labeled segments
Gripper pose
Stereo data
Red object coord. frame
System overview
[IROS 2012]
Page 41
Task demos
Joint angles → Forward kinematics
Object recognition
Preprocessing / BP-AR-HMM segmentation → Segmented motions
Coordinate frame detection → Frame-labeled segments
Gripper pose
Stereo data
Learning from demonstration → DMPs with frame-relative goals
Red object coord. frame
System overview
[IROS 2012]
Page 42
Task demos
Joint angles → Forward kinematics
Object recognition
Preprocessing / BP-AR-HMM segmentation → Segmented motions
Coordinate frame detection → Frame-labeled segments
Gripper pose
Stereo data
Learning from demonstration → DMPs with frame-relative goals
Real-time data from a novel task: joint angles → forward kinematics; gripper pose; stereo data → object recognition → red object coord. frame
System overview
[IROS 2012]
Page 43
Task demos
Joint angles → Forward kinematics
Object recognition
Preprocessing / BP-AR-HMM segmentation → Segmented motions
Coordinate frame detection → Frame-labeled segments
Gripper pose
Stereo data
Learning from demonstration → DMPs with frame-relative goals
DMP planning / inverse kinematics → Joint trajectory spline controller
Real-time data from a novel task: joint angles → forward kinematics; gripper pose; stereo data → object recognition → red object coord. frame
System overview
[IROS 2012]
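The pipeline's planning stage executes dynamic movement primitives (DMPs). A minimal 1-D DMP integration sketch, assuming the standard transformation-system form with a zero forcing term and illustrative gains (not the deck's implementation):

```python
def dmp_rollout(y0, goal, forcing, tau=1.0, alpha=25.0, beta=6.25, dt=0.001):
    """Minimal 1-D dynamic movement primitive:
      tau * zdot = alpha * (beta * (g - y) - z) + f(x),  tau * ydot = z.
    'forcing' is the learned shape term (zero here -> pure attractor).
    Frame-relative goals mean 'goal' is expressed in an object's
    coordinate frame. Gains are illustrative (beta = alpha/4 gives
    critical damping)."""
    y, z, x = y0, 0.0, 1.0
    for _ in range(int(2.0 / dt)):              # integrate ~2*tau seconds
        f = forcing(x)
        zdot = (alpha * (beta * (goal - y) - z) + f) / tau
        ydot = z / tau
        y += ydot * dt
        z += zdot * dt
        x += (-4.0 * x / tau) * dt              # canonical system decay
    return y

y_end = dmp_rollout(y0=0.0, goal=1.0, forcing=lambda x: 0.0)
```

Because the goal is an argument, re-planning for a moved object just means re-evaluating the same primitive with a new frame-relative goal.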
Page 44
Learning a task plan: Finite state automata
[RSS 2013, IJRR 2015]
Page 45
Learning a task plan: Finite state automata
Controller built from motion category examples
Classifier built from robot percepts
[RSS 2013, IJRR 2015]
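The finite-state representation pairs each node with a controller built from motion-category examples and a classifier over robot percepts that selects the next node. A toy sketch; all skill and percept names are hypothetical:

```python
def run_task(fsm, start, classify, max_steps=10):
    """Sketch of FSM-based skill sequencing: at each node, execute the
    node's controller, then let a percept classifier choose the next
    node. Returns the sequence of visited nodes."""
    node, trace = start, []
    for _ in range(max_steps):
        trace.append(node)
        if node == "done":
            break
        percepts = fsm[node]["controller"]()    # execute skill, observe outcome
        node = classify(node, percepts)         # pick next state from percepts
    return trace

# Hypothetical two-skill task: reach, then grasp.
fsm = {
    "reach": {"controller": lambda: "at_object"},
    "grasp": {"controller": lambda: "holding"},
}
classify = lambda node, p: {"at_object": "grasp", "holding": "done"}[p]
trace = run_task(fsm, "reach", classify)
```

Because transitions are chosen from percepts at run time, the same automaton can branch to a recovery skill (e.g., after a missed grasp) instead of replaying a fixed sequence.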
Page 46
Interactive corrections
[RSS 2013, IJRR 2015]
Page 47
Replay with corrections: missed grasp
[RSS 2013, IJRR 2015]
Page 48
Replay with corrections: too far away
[RSS 2013, IJRR 2015]
Page 49
Replay with corrections: full run
[RSS 2013, IJRR 2015]