Reinforcement Learning – Overview of Recent Progress and Implications for Process Control and Beyond (Integrated Multi-Scale Decision-making)
October 4, 2018 CMU EWO Webinar
Jay H. Lee 1 (with Thomas Badgwell 2)
1 Korea Advanced Institute of Science and Technology, Daejeon, Korea
2 ExxonMobil Research & Engineering Company, Clinton, NJ
Introduction to KAIST – 47 Years Old

• 1971 – KAIS, Korea Advanced Institute of Science: graduate school in Seoul, established under a new law granting special privileges such as exemption from compulsory military service
• 1984 – KIT, Korea Institute of Technology: undergraduate school in Daejeon for students gifted in math and science
• 1989 – KAIST, Korea Advanced Institute of Science and Technology: established through the merging of KAIS and KIT (main campus in Daejeon; business school in Seoul)
KAIST Today – Brief Statistics
Overall Structure of This Talk

• Part I: Introduction to Reinforcement Learning and Implications for Process Control (Acknowledgment: Thomas Badgwell)
• Part II: And Beyond (Integrated Multi-Scale Decision-making)
Data-Driven Decision-Making & Control in the Engineering Domain

[Figure: a learning-and-decision loop around the target system (environment). Data acquisition collects $\mathcal{D} = \{x_{1:n}, y_{1:n}, \theta_{1:n}\}$; learning fits a model $y = f(x; \theta)$; decision-making computes $x^*|\theta = \arg\max_{x} f(x; \theta)$, which is applied back to the system as the next input.]

In a dynamic and stochastic environment, data can help us model more realistically and derive more accurate solutions. But are we building the right model? Does the algorithm capture all the essential aspects of the model?
Data analytics
• Bayesian Statistics
• Machine learning
• Bayesian Network
Modeling
• Optimization
• Markov Decision Process
• Game Theory
Decision Making
• Mathematical Programming
• Dynamic Programming
• Reinforcement Learning
Agenda

• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
In Reinforcement Learning, an agent learns a decision policy by taking actions and observing the response ('reward' or 'penalty') from the environment, a framing abstracted from animal psychology [2].

The agent learns a policy $\pi(A_t = a \mid S_t = s)$ that maximizes a long-term value function:

$$v_\pi(s) = E\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\right]$$
The properties of an optimal policy are described by Bellman's optimality equation (from Optimal Control theory).

• Bellman's optimality equation answers the question: when is the value function $v_*(s)$ maximized? It enforces consistency of the optimal value function as the state of the environment changes [6]:

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$

• In practice, Bellman's optimality equation usually cannot be solved because:
➢ We often don't know the environment model $p(s', r \mid s, a)$
➢ Solution complexity explodes with the state dimension (which may be infinite)

• All RL algorithms can be regarded as approximate solutions to Bellman's optimality equation, dealing in various ways with these two limitations [2].
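As a concrete (non-slide) illustration, the sketch below solves Bellman's optimality equation exactly by value iteration for a tiny finite MDP; the random transition and reward arrays are purely illustrative stand-ins for $p(s', r \mid s, a)$, since no specific model appears in the talk.

```python
import numpy as np

# A minimal value-iteration sketch for a small finite MDP (illustrative only;
# the random transition/reward model below stands in for p(s', r | s, a)).
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_actions)]  # P[a][s, s']
R = [rng.uniform(0.0, 1.0, size=n_states) for _ in range(n_actions)]             # R[a][s]

v = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: v(s) <- max_a { R(s,a) + gamma * E[v(s') | s, a] }
    q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])
    v_new = q.max(axis=0)
    if np.max(np.abs(v_new - v)) < 1e-8:  # converged to the fixed point v*
        break
    v = v_new

policy = q.argmax(axis=0)  # greedy policy with respect to v*
print("v* =", np.round(v, 3), " policy =", policy)
```

When the model is unknown or the state space is too large to enumerate, as the bullets above note, this exact backup must be replaced by sampled, approximate versions — which is precisely what RL algorithms do.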
Example: Bipedal Robot (OpenAI Gym [6])

Reward/penalty:
• +x every time it moves forward
• −0.1 for applying leg motor torque
• −100 for falling down

State = [hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints, joint angular speeds, legs' contact with the ground, 10 LIDAR rangefinder measurements]

Action = [leg joint torques]

[Figure: snapshots of the learned gait after 3k episodes and after 40k episodes.]

A surprising policy! Better algorithms for tougher tasks.
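For orientation, here is a minimal interaction loop with the bipedal walker in OpenAI Gym [6]; the environment id ("BipedalWalker-v3") and the classic four-tuple step API are assumptions about the installed Gym version, and the random actions are a placeholder for a trained policy.

```python
import gym  # OpenAI Gym [6]; requires: pip install gym[box2d]

# Minimal agent-environment interaction loop (sketch; env id and API version assumed).
env = gym.make("BipedalWalker-v3")
obs = env.reset()  # 24-dim state: hull angles/speeds, joints, contacts, 10 LIDAR readings
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()           # random leg-joint torques; an RL policy goes here
    obs, reward, done, info = env.step(action)   # reward: + forward progress, small torque cost, -100 on falling
    episode_return += reward
env.close()
print("episode return:", round(episode_return, 2))
```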
Agenda

• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning has advantages and disadvantages when compared with Model Predictive Control.

RL (model-free) advantages vs. MPC:
• No need to develop a process model (the policy is developed directly from data)
• Able to work with complex nonlinear, stochastic environments
• Fast on-line execution
• Can adapt to changing environments

RL (model-free) disadvantages vs. MPC:
• Extensive trial-and-error learning is required
• Must be allowed to fail during training (simulation can be used)
• Training may not be stable or repeatable
• May get stuck in local minima during training
• Extensive goal engineering may be required
• Must re-do training if the goal is changed
• No closed-loop stability guarantees
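To make the contrast concrete, here is a minimal receding-horizon MPC sketch for a known scalar linear model (the model x⁺ = a·x + b·u, the horizon, and the weights are illustrative assumptions): MPC re-solves a model-based optimization at every step, whereas a trained RL policy would map the state to an action directly.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal receding-horizon MPC sketch; model and tuning are illustrative assumptions.
a, b, N, q, r = 0.9, 0.5, 10, 1.0, 0.1   # model x+ = a*x + b*u, horizon N, weights q, r

def horizon_cost(u_seq, x0):
    # Simulate the (assumed known) model over the horizon with quadratic stage costs.
    x, cost = x0, 0.0
    for u in u_seq:
        cost += q * x**2 + r * u**2
        x = a * x + b * u
    return cost + q * x**2               # simple terminal penalty

x = 5.0
for t in range(15):
    res = minimize(horizon_cost, np.zeros(N), args=(x,))  # on-line optimization at every step
    u = res.x[0]                          # apply only the first move (receding horizon)
    x = a * x + b * u                     # here the plant equals the model (no mismatch)
    print(f"t={t:2d}  u={u:+.3f}  x={x:+.3f}")
```

RL's "fast on-line execution" advantage amounts to replacing the `minimize` call with a single policy evaluation; its disadvantages stem from having to learn that policy by trial and error instead of optimizing against an explicit model.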
Agenda

• What is Reinforcement Learning?
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
MPC is the current state of the art for chemical plants. Reinforcement Learning has the potential to complement it and expand its capability.

Potential process control applications of RL technology include:
• Directly replace existing process controllers with RL agents
• Use RL agents to help manage process control systems (see the sketch after this list)
✓ Switch controllers ON/OFF and adjust limits and tuning parameters as appropriate
✓ Compensate for common disturbances such as weather events or feed-rate changes
✓ Supervisory/optimizing control, especially for problems involving significant uncertainty
• Use RL agents to advise operators during unusual situations (process upsets, startup/shutdown)
• Use a hierarchy of RL agents to simplify operation of a chemical plant/refinery
✓ Process safety agents
✓ Environmental compliance agents
✓ Reliability agents
✓ Economic optimization agents
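As a hypothetical sketch of the controller-management idea (in the spirit of [7], but not their method), here is a gym-style environment in which the RL action nudges PI gains and the reward penalizes tracking error; the first-order process, the state features, and all names are illustrative assumptions.

```python
import numpy as np

# Hypothetical gym-style environment: an RL agent tunes PI gains on-line.
# The toy first-order process, state features, and reward are illustrative only.
class PIDTuningEnv:
    def __init__(self):
        self.kp, self.ki = 1.0, 0.1                # current PI gains (D term omitted)
        self.x, self.integral, self.sp = 0.0, 0.0, 1.0

    def step(self, action):
        self.kp = max(self.kp + action[0], 0.0)    # agent adjusts the gains
        self.ki = max(self.ki + action[1], 0.0)
        err = self.sp - self.x
        self.integral += err
        u = self.kp * err + self.ki * self.integral
        self.x += 0.1 * (-self.x + u)              # toy first-order process response
        reward = -err**2                           # penalize squared tracking error
        return np.array([err, self.integral]), reward

env = PIDTuningEnv()
rng = np.random.default_rng(0)
for _ in range(200):
    action = 0.01 * rng.standard_normal(2)         # placeholder for a learned policy
    state, reward = env.step(action)
print("gains after exploration: kp=%.3f  ki=%.3f" % (env.kp, env.ki))
```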
Agenda

• What is Reinforcement Learning? (Bipedal Robot)
• RL vs. Model Predictive Control
• Implications for Process Control
• Future Research Directions
Reinforcement Learning research opportunities

Potential RL research opportunities include:
• RL methods with "disciplined learning"
• Integrating aspects of RL technology with MPC
➢ Lee and Wong [8], Morinelly and Ydstie [9], Kamthe and Deisenroth [10]
• Balancing exploration vs. exploitation
• Finding the class of function approximations for which a state-of-the-art RL algorithm (e.g., A3C [11]) converges, i.e., for which the value and policy function approximations converge to their optimal values
• Proving closed-loop stability for a state-of-the-art RL algorithm (e.g., A3C [11])
• Developing a robust RL algorithm by training in parallel on a set of environments that each represent a realization of the uncertainty set (a toy sketch follows this list)
• Developing RL technology that allows a hierarchy of prioritized RL agents to cooperate to accomplish a complex task
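A toy sketch of the robust-training bullet above: every rollout draws a new realization of the uncertain process parameters, so the policy is scored against the whole uncertainty set rather than one nominal model. The first-order process, the scalar proportional policy, and the crude grid search are illustrative assumptions standing in for a full RL algorithm.

```python
import numpy as np

# Toy sketch of robust training across an uncertainty set: each rollout draws a
# realization of the uncertain parameters; the policy is scored on the whole set.
rng = np.random.default_rng(1)

def sample_realization():
    # Uncertainty set: gain and time constant of a toy first-order process.
    return rng.uniform(0.8, 1.2), rng.uniform(0.5, 2.0)

def rollout(k, gain, tau, steps=50, dt=0.1):
    # Proportional policy u = -k*x on x_dot = (-x + gain*u)/tau; return = -sum x^2.
    x, ret = 1.0, 0.0
    for _ in range(steps):
        x += dt / tau * (-x + gain * (-k * x))
        ret -= x**2
    return ret

# Crude grid search over the scalar policy parameter, averaged over realizations
# (a full RL algorithm would train in parallel on these sampled environments).
candidates = np.linspace(0.0, 5.0, 26)
scores = [np.mean([rollout(k, *sample_realization()) for _ in range(64)]) for k in candidates]
print("robust policy parameter:", candidates[int(np.argmax(scores))])
```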
Exemplary RL with "Disciplined" Learning: Integrated Reaction–Separation System with Recycle (Tosukhowong and Lee, AIChE J. 2009)

• The state is defined as: [equation omitted in transcript]
• Stage-wise cost: [equation omitted in transcript]
Operating mode          1      2     3     4
Product conc. (x2B)     0.886  0.85  0.91  0.82
Production rate (B)     100    115   80    125
[Figure: flowsheet of the integrated system. Fresh feed F0 with composition x1 enters a reactor (holdup MR) carrying out the consecutive reactions 1 →(k1)→ 2 →(k2)→ 3; the reactor effluent F feeds a distillation column (condenser holdup MD, reboiler holdup MB) with reflux L, boil-up V, distillate D (recycled), and bottoms product B with composition x2B.]

Weights and scaling: $Q_u = 10000$, $Q_y = 6000$, $R = 20 I_{6\times 6}$; $u_0 = [F,\ M_R^{sp},\ M_D^{sp},\ M_B^{sp},\ L,\ B]^T / 100$.
Stochastic Disturbances and Variations

[Figure: one realization of the on-line stochastic disturbance, and an on-line performance comparison over 12 new disturbance realizations: the RL controller vs. 7 NMPC controllers with different tunings. The RL controller was learned starting from closed-loop data generated by the 7 NMPC controllers.]
[Figure: RL controller result for one realization (x2B_sp = 0.85, B_sp = 115): product variables and manipulated variables.]

[Figure: result of NMPC 1 (the best NMPC controller in this case): product variables and manipulated variables.]
Reinforcement Learning with Mathematical Programming for Multi-Scale Dynamic Decision-Making in an Uncertain Environment
Dynamic decision-making in an uncertain environment

[Figure: a sequential decision-making loop under uncertainty. The decision-maker issues a decision, which the uncertain environment executes; the environment's response returns state information to the decision-maker, enabling iterative improvement.]

Application areas: industrial & manufacturing systems, financial engineering, robotics, power systems, medical applications, computing & communications, game playing, …
Multi-scale decision-making
How do we integrate between the layers?

Grossmann (2005). Enterprise-wide optimization: A new frontier in PSE. AIChE Journal, 51(7), 1846-1857.
Math programming over time-scale multiplicity: renewable (wind) energy example

[Figure: a three-layer decision hierarchy over hours, days, and years. Yearly capacity planning (sizing) fixes the design (capacity) for the lower layers and receives year-ahead operation information; daily production planning passes operating constraints down and receives day-ahead operation/uncertainty information; hourly dispatch scheduling acts on day-ahead predictions. Accompanying plots show the daily operating cost over the year against its yearly average, and wind profiles for summer vs. winter.]
Temporally-integrated mathematical programming (MP):
❖ At the fine scale: 1 year = 24 × 365 = 8760 hours ➔ computationally infeasible
❖ At the coarse scale: coarse-graining and "averaging" of hourly dynamics and uncertainty ➔ optimistic estimation
➔ "A gap between the layers"
Uncertainty handling: MP vs. MDP

Math programming (MP): solution "over" a time horizon
❖ Stochastic data: scenario tree
❖ Solution structure: decision tree $x_1, x_2(\omega_1), \ldots, x_T(\omega_{T-1})$

Markov Decision Process (MDP): "stage-wise" solution
❖ Stochastic data: probability distribution (state transition)
❖ Solution structure: decision policy $\pi: S \to X$

Birge & Louveaux (2011). Introduction to Stochastic Programming. Springer.
❖ Five suppliers, ranging from unreliable but cheap to reliable but expensive:

Supplier        1          2          3           4      5
Order space     {0,5,10}   {0,5,10}   {0,10,15}   {0,5}  {0,10,15}
Fixed cost      1.5        1.7        1.4         2      2
Variable cost   0.5        0.5        0.5         0.5    0.5

❖ Three policies compared (VI = value iteration):

                             Policy 1                Policy 2             Policy 3
Planning model               MDP with safety stock   MDP                  MDP
Uncertainty accounted for    Demand                  Demand & lead time   Demand & lead time
Integration with scheduling  No                      MILP model           Heuristic approach
Solution algorithm           Exact VI*               Exact VI             Approximate VI

❖ Two cases:

                                       Case 1: Moderate size   Case 2: Large size
Decision horizon - scheduling          10                      30 (one month)
Decision horizon - planning            10                      12 (one year)
Number of suppliers                    3                       5
Unit cost/penalty - inventory holding  0.1                     0.1
Unit cost/penalty - unloading          3                       5
Unit cost/penalty - lost demand        10                      10
Unit cost/penalty - lost volume        5                       5
Unit cost/penalty - safety stock       10                      10

❖ MILP model: 705 variables (310 integer), 1395 constraints
❖ MDP model: state space size 4424, decision space size 136

❖ Results of Case 2 (average over 100 simulations; Policy 2 is computationally infeasible at this size):

                               Policy 1   Policy 2   Policy 3
Average cost                   606.75     -          519.46
Improvement over Policy 1 (%)  -          -          14.46
CPU time (s)                   21.13      -          1247.10
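To ground the tables above, here is a deliberately simplified, single-supplier version of the planning MDP solved by exact value iteration ("Exact VI"); the demand distribution, storage capacity, and the mapping of the unit costs/penalties onto this toy model are illustrative assumptions.

```python
import numpy as np

# Deliberately simplified single-supplier version of the planning MDP
# (demand distribution, storage capacity, and the cost mapping are illustrative),
# solved by exact value iteration, as in the "Exact VI" entries above.
max_inv = 20                                   # state: on-hand inventory, 0..max_inv
orders = [0, 5, 10]                            # action: order quantity
demand = {0: 0.3, 5: 0.5, 10: 0.2}             # stochastic demand distribution
h, c_fix, c_var, p_lost = 0.1, 1.5, 0.5, 10.0  # holding / fixed / variable / lost-demand
gamma = 0.95                                   # discount factor

def q_value(s, a, v):
    # Stage cost plus expected discounted cost-to-go for ordering a in state s.
    cost = (c_fix + c_var * a) if a > 0 else 0.0
    for d, pd in demand.items():
        avail = min(s + a, max_inv)            # storage capacity caps inventory
        s_next = max(avail - d, 0)
        lost = max(d - avail, 0)               # unmet demand is lost and penalized
        cost += pd * (h * s_next + p_lost * lost + gamma * v[s_next])
    return cost

v = np.zeros(max_inv + 1)
for _ in range(500):                           # exact value iteration to a fixed point
    v_new = np.array([min(q_value(s, a, v) for a in orders) for s in range(max_inv + 1)])
    if np.max(np.abs(v_new - v)) < 1e-6:
        break
    v = v_new

policy = [min(orders, key=lambda a, s=s: q_value(s, a, v)) for s in range(max_inv + 1)]
print("optimal order quantity by inventory level:", policy)
```

Exact VI enumerates every (state, action, outcome) combination in each sweep, which is exactly what becomes infeasible as the state and decision spaces grow (Policy 2 above) and what motivates the approximate VI used for Policy 3.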
Summary

• Reinforcement Learning has its technical roots in animal psychology and optimal control. Recent advances in RL algorithms, deep neural networks, and hardware have enabled superhuman performance in some applications.
• Reinforcement Learning has advantages and disadvantages relative to Model Predictive Control, mostly because it emphasizes the development of a control policy rather than a process model.
✓ Extensive off-line learning is possible if a good simulator is available; for on-line learning, "disciplined learning" is needed.
• Potential process control applications of Reinforcement Learning include:
✓ Use RL agents to manage control systems or optimize under uncertainty
✓ Use RL agents to advise operators during unusual situations
✓ Use a hierarchical network of RL agents to simplify operation of a plant
✓ Use RL agents to integrate strategic business decisions with plant operation decisions

We believe that Reinforcement Learning has the potential to significantly impact both the theory and practice of Process Control and, more generally, Integrated Strategic/Operational Decision-Making!
Thank you for listening

Acknowledgments: Matthew Realff (GT), Joohyun Shin (KAIST), Thomas Badgwell (ExxonMobil)
References

[1] D. Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529, 484-489, (2016).
[2] R. Sutton and A. Barto, Reinforcement Learning, Second Edition draft, (2016).
[3] D. Silver, Lecture 1: Introduction to Reinforcement Learning, Google DeepMind, (2015).
[4] S. Levine, Deep Reinforcement Learning, Berkeley CS294-112, (2017).
[5] A. Turing, Computing machinery and intelligence, Mind, 59, 433-460, (1950).
[6] OpenAI Gym, A toolkit for developing and comparing reinforcement learning algorithms, (2018).
[7] T. Badgwell, K. Liu, N. Subrahmanya, W. Liu, and M. Kovalski, Adaptive PID Controller Tuning via Deep Reinforcement Learning, U.S. provisional patent application filed, (2017).
[8] J. Lee and W. Wong, Approximate dynamic programming approach for process control, Journal of Process Control, 20, 1038-1048, (2010).
[9] J. Morinelly and E. Ydstie, Dual MPC with Reinforcement Learning, IFAC-PapersOnLine, doi:10.1016/j.ifacol.2016.07.276, (2016).
[10] S. Kamthe and M. Deisenroth, Data-Efficient Reinforcement Learning with Probabilistic Model Predictive Control, arXiv:1706.06491v1, (2017).
[11] V. Mnih et al., Asynchronous Methods for Deep Reinforcement Learning, Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, (2016).
[12] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, (2014).
[13] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley & Sons, (2007).
[14] F. L. Lewis and D. Vrabie, Reinforcement learning and adaptive dynamic programming for feedback control, IEEE Circuits and Systems Magazine, 9(3), (2009).
[15] J. H. Lee, J. Shin, and M. J. Realff, Machine learning: Overview of the recent progresses and implications for the process systems engineering field, Computers & Chemical Engineering, 114, 111-121, (2018).