ONLINE TRAJECTORY PLANNING ALGORITHMS FOR ROBOTIC SYSTEMS
UNDER UNCERTAINTY IN INTERACTIVE ENVIRONMENTS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF AERONAUTICS AND
ASTRONAUTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Haruki Nishimura
August 2021
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License: http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: https://purl.stanford.edu/jp145pv0596
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mac Schwager, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Grace Gao
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mykel Kochenderfer
Approved for the Stanford University Committee on Graduate Studies.
Stacey F. Bent, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
The mission of this thesis is to develop algorithms for planning and control of intelligent mobile
robots that operate autonomously in open, interactive environments. The presence of other agents and objects in such environments makes planning significantly more challenging, as they inevitably introduce environmental and dynamic uncertainty that the robot must properly handle. Despite recent advances in perception, planning, and control, many existing robotic systems still lack the capability to account for uncertainty, which demands that the robots be caged or confined to a dedicated, structured workspace. For example, the success of the thousands of mobile robots deployed in logistics centers today relies heavily on their closed and controlled operating environments. In this thesis, we propose a series of computationally efficient algorithms that can
collectively overcome uncertainty from various sources towards reliable autonomy for “cage-free” robotic
operations. The methods presented in the thesis leverage probability theory to quantify the amount
of present and future uncertainty. Based on the quantification, we develop planning and control
algorithms that either mitigate, avoid the risk of, or are robust against uncertainty so that the robot
can successfully accomplish a given task. We take a model-based approach in developing those
algorithms, which allows us to exploit physical properties of dynamical systems and onboard sensors
when possible. Another crucial aspect of the proposed methods is their online nature, meaning that
control signals are computed in situ based on the currently available information. This is enabled
by fast, efficient computation of our algorithms, and is advantageous in that the robot can quickly
react to rapidly changing environments.
In the first part of the thesis, we address challenges associated with state uncertainty, which
represents unknowns about the current state of the system of interest. This can include unknown
intent of other interacting agents, or positions of targets to locate. We propose and employ recursive
Bayesian inference frameworks to keep track of evolving state uncertainty over time. The proposed
planning algorithms further assist the inference frameworks to actively mitigate state uncertainty as
appropriate, so that the robot can execute suitable control actions with certainty. We leverage tools
from sequential decision-making and optimal control to develop those algorithms. We demonstrate
the effectiveness of our approach in a multitude of tasks that involve state uncertainty, with different
combinations of dynamical systems and sensing modalities. This includes vision-based active intent
inference, active target tracking with range-only observations, and simultaneous object manipulation
and parameter estimation.
We then turn our attention to transition uncertainty, which governs the unpredictability of future
states of the system. We especially focus on safety-critical problems where transition uncertainty
must not be ignored. For instance, a robot navigating in close proximity to humans has to carefully
perform planning so that collisions are avoided with high confidence. We take a risk-aware planning
approach, in which a risk metric that takes into account the variance of uncertainty is to be optimized.
While being computationally efficient, our proposed method does not require knowledge of the
analytical form of the underlying probability distribution that quantifies transition uncertainty, nor
is it limited to a certain class of distributions such as Gaussians. This distinguishing feature enables us
to leverage modern data-driven generative models for uncertainty quantification. We demonstrate
the applicability of our approach to the aforementioned robot navigation task, where we show that
the proposed framework can safely navigate the robot towards its goal while interacting with more
than 50 humans simultaneously in real time. Moreover, our risk-aware formulation is demonstrated
to promote safety in both simulation and a real-world experiment, by inducing a proactive robot
behavior that avoids risky situations where high variance of uncertainty could lead to imminent
collision.
The last part of this thesis considers model uncertainty, which is attributed to imperfect modeling
of the underlying stochastic phenomena. Our approach makes the planner distributionally robust,
in which the planner selects a control policy that acts against a worst-case distribution within an
offline-computed set of plausible distributions that could quantify transition uncertainty. We de-
velop a tractable algorithm leveraging mathematical equivalence between risk-aware planning and
distributionally robust planning. We show in simulation that the proposed planning framework can
safely avoid collision despite imperfect knowledge of the stochastic human motion model. Further-
more, our approach lets the risk-aware planner dynamically adjust the level of risk-sensitivity online,
which further improves the flexibility of conventional risk-aware planning methods.
The algorithms developed in this thesis will ultimately allow intelligent mobile robots to oper-
ate in considerably more uncertain and dynamic workspaces than the current industrial standard.
This will open up possibilities for various practical applications, including autonomous field robots
for persistent environmental monitoring, fully-automated driving on urban roads, and autonomous
drone flights in densely populated areas for logistics services. We believe that such “cage-free”
robotic operations will be enabled by proper consideration and treatment of uncertainty, and that
our methods will pave the way towards more reliable robotic autonomy in open and interactive
environments.
Acknowledgments
This thesis would not have been possible without the tremendous support of my family, friends,
colleagues, and mentors.
First and foremost, I thank my research advisor, Professor Mac Schwager, for his generous and
invaluable guidance over the course of my doctoral and master’s study. I would not have decided to
pursue a doctoral degree without his advice, to always trust and follow one’s own scientific curiosity
in conducting research. Ever since I joined the Multi-Robot Systems Lab (MSL), his precise and
insightful feedback let me overcome many obstacles in solving challenging robotics problems. Not
only is he a great academic supervisor, but he is always caring and respectful of fellow researchers,
students and their families. I also thank the rest of my thesis reading committee members, Professor
Grace Gao and Professor Mykel Kochenderfer, for their willingness to serve on the committee and
continued guidance in organizing and writing this thesis. I have had the fortunate opportunity
to learn from them inside and outside classrooms as well as at conferences. I additionally thank
Professor Dorsa Sadigh and Professor Monroe Kennedy, for being on my thesis defense committee
and taking their valuable time to evaluate my research.
I have had the pleasure of working with highly skilled and talented researchers through a mul-
titude of collaborative research projects. I thank Professor Marco Pavone and Boris Ivanovic in
the Autonomous Systems Lab for all the constructive feedback and discussion on utilizing data-
driven trajectory prediction methods in planning. I especially thank Boris Ivanovic for developing
and improving Trajectron++ to seamlessly integrate it with the downstream planner. The project
would not have been completed without his unparalleled expertise in deep generative modeling, cod-
ing, and mathematical analysis. I also thank Professor Negar Mehr at the University of Illinois at
Urbana-Champaign and Doctor Adrien Gaidon at Toyota Research Institute, for their tremendous
guidance and support throughout multiple projects. Their unique and fascinating perspectives on
control theory and machine learning have greatly helped to enrich the scientific quality and impact
of the research.
None of the research would have been possible without generous financial support that I re-
ceived over the years. I am grateful to Toyota Research Institute, ONR grants N00014-16-1-2787 and N00014-18-1-2830, NSF grants CMMI-1562335 and NRI-1830402, the Japan Student Services Organization Scholarship, and finally the Masason Foundation Fellowship.
I extend my sincere gratitude to all the current and former MSL members, who are not only
extremely skillful and diligent but always open-minded and friendly. I thank Doctor Zijian Wang
and Doctor Eric Cristofalo for being great mentors. I thank Doctor David Fridovich-Keil for giving
me professional advice on research as well as career choice. I thank Doctor Alex Koufas, Adam
Caccavale, Kunal Shah, Ravi Haksar, and Jun En Low for working together to make quadrotors more
reliable, and hosting enjoyable lab happy hours. I thank Mingyu Wang and Simon Le Cleac’h for
fruitful discussions on the Toyota project. I thank Preston Culbertson, Keiko Nagami, and Joe
Vincent for sharing their expertise on reinforcement learning and deep learning theory. I thank
Trevor Halsted for answering the many silly questions I asked from the next desk. I thank the rest
of the MSL for all the interesting discussions and conversations over the years.
I am extremely fortunate to have such a caring and wonderful family. I thank my father, Michi-
nari, for encouraging me to apply to graduate schools outside Japan and always supporting my life
decisions. I thank my mother, Kazuyo, for being all ears whenever I needed someone to talk to
about study abroad problems. I thank my brother, Yuki, for his constant support and always taking
care of the cats while I was away from home. I also thank Carin Pacifico, Yvonne Schmidt, Brian
Schmidt, Cindy Fulchiron, Peter Fulchiron, Nancy Sherlock, Leo Redmond, Gayle Keck, and Paul
Herman, for helping me settle down in the Bay Area and getting accustomed to the life in the

Chapter 1

Introduction

Autonomous robots have made a huge leap in the first two decades of the 21st century. In warehouses
and logistics centers, we now see thousands of robots that coordinate with each other to fulfill
part of the shipment tasks [155]. Robots have even served as kitchen staff in some restaurants,
equipped with skills to pick up and flip burger patties without detailed human instructions [128].
There is no doubt that these achievements are attributable to the advanced integration of perception,
prediction, motion planning and control, as they all provide modern robots with essential capabilities
to fulfill complex real-world tasks. Nevertheless, the current reliable applications of autonomous
robots are largely limited to dedicated, closed spaces. If we look on public roads, for example, most
of the cars are still driven by humans. In fact, a recent white paper [140] even projects that it may not be until the 2060s that fully autonomous vehicles completely replace human-driven ones. This in turn highlights that cutting-edge perception, planning, and control are not yet mature enough for such applications. What are the major differences between warehouse robots and self-driving vehicles?
Why is the current technology not mature enough to deploy autonomous robots into general open
environments?
One key factor is uncertainty, which prevents reliable robotic applications in open and interactive
environments. In the case of warehouse robots, there is little or no uncertainty about the operating
space or the other robots to coordinate with, as the robots are usually monitored and operated by a
central computing system. On the other hand, uncertainty is inevitable for autonomous vehicles on
public roads; vehicles are not explicitly communicating, have to deal with partial and full occlusions
in the surrounding environment, and must interact with pedestrians and bicyclists whose states of
mind or future motion are not completely known. The fact that those systems are safety-critical
further complicates the situation, as we cannot simply deploy an existing autonomy stack and hope
that it works on average. Therefore, uncertainty has to be systematically addressed in designing
perception, prediction, planning, and control algorithms, as well as throughout their integration into
a unified autonomy stack.
In this thesis, we identify and address challenges associated with the environmental and dynamic
uncertainty in order to effectively operate autonomous robotic systems in open environments, where
those systems must also interact with other agents or objects. These include scenarios that are
both safety-critical and non-safety-critical. In particular, this thesis focuses on designing planning
and control algorithms for intelligent mobile robots that can both 1) incorporate different types
of uncertainty that are present in real-world applications, and 2) either mitigate, avoid the risk
of, or be robust against such uncertainty to increase the possibility of successfully accomplishing
a given task. As will become clear later in the thesis, designing such algorithms often requires us
to also consider perception and prediction as part of the entire autonomy stack. This is because
the information about the uncertainty must be first provided to the downstream planner by the
upstream perception/prediction modules. In this sense, perception and prediction are both tightly
coupled with planning and control.
1.1 Definition and Taxonomy of Uncertainty
The term uncertainty is used in this thesis to refer to two distinct types of phenomena. For one,
uncertainty refers to quantities that a robot could in principle know perfectly but does not in practice, for example
due to a limited sensing capability of the robot. For the other, uncertainty signifies outcomes of
inherently random or noisy phenomena, no matter whether the robot could perfectly observe them
or not. In the machine learning and statistical inference literature, those two types of uncertainties
are often referred to as epistemic uncertainty and aleatoric uncertainty, respectively [34, 141]. For
the purposes of this thesis, however, we choose to categorize uncertainty based on its source in a robotic
system rather than its nature. Specifically, we define and hereafter use the following terminology.
1. State Uncertainty. The robot does not perfectly know all of the system state at the current
time, which possibly includes the state of the surrounding environment as well as the state
of mind or intent of other agents. This is due to limited knowledge, lack of communication,
or imperfect sensing capability, which collectively leads to epistemic uncertainty about the
current state.
2. Transition Uncertainty. Even if the state of a system of interest is completely known at
the current time, uncertainty is incurred as the state transitions to a different one over time,
due to aleatoric uncertainty that exists in the dynamics of the system.
3. Model Uncertainty. The robot may be aware that the state uncertainty and/or the tran-
sition uncertainty exist, but the robot’s internal model to quantify the uncertainty is still
imprecise, due to its limited capability to describe the behavior of complex dynamical sys-
tems.
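To make this taxonomy concrete, the three uncertainty types can each be located in a minimal discrete-time stochastic system. The Python sketch below is purely illustrative; all names and noise values are hypothetical and not taken from any system studied in this thesis:

```python
import random

rng = random.Random(0)

# True noise levels of the world, unknown to the robot.
TRUE_PROCESS_STD = 0.05   # source of transition uncertainty
TRUE_SENSOR_STD = 0.20    # source of state uncertainty

def step(x, u):
    """Transition uncertainty: even if x is known exactly, the next
    state is random because of noise in the dynamics."""
    return x + u + rng.gauss(0.0, TRUE_PROCESS_STD)

def observe(x):
    """State uncertainty: the robot never sees x directly, only a
    noisy measurement of it."""
    return x + rng.gauss(0.0, TRUE_SENSOR_STD)

# Model uncertainty: the robot's internal model of the noise is
# imprecise (here it underestimates the true process noise std).
ASSUMED_PROCESS_STD = 0.01

x_next = step(0.0, 0.1)  # true state after one control input
y = observe(x_next)      # what the robot actually measures
print(x_next, y)
```

In this toy setting, a filter that trusts `ASSUMED_PROCESS_STD` would be overconfident about its state estimate, which is precisely the failure mode that model uncertainty names.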
A similar taxonomy is employed in [79] to explain how uncertainty arises in sequential decision-
making problems. For us, this categorization is convenient, since we can introduce different objectives
and problem formulations to address each root cause of uncertainty. For instance, state uncertainty
could be actively mitigated by strategically interacting with other agents and/or utilizing onboard
sensors. While transition uncertainty is inherent in the dynamics, the robot could still proactively
avoid risky situations where it is found in a high-stakes state that could damage itself or surrounding
agents. For model uncertainty, we could attempt to replace a low-fidelity model with a higher-fidelity
one, or make our planning algorithms robust to such incompleteness.
1.2 Approach and Related Work
The objective of this thesis is to propose and analyze novel algorithms for planning and control of mo-
bile robots such that they collectively address the aforementioned challenges about uncertainty, espe-
cially in open and interactive environments. We take a probabilistic approach to model and quantify
uncertainty. This is in contrast to conventional robust control approaches [37] or Hamilton-Jacobi-Bellman reachability methods [157, 26], wherein uncertainty is treated as a bounded disturbance
term that does not necessarily have an associated probability distribution. Utilizing probability
theory is advantageous in several aspects. First, probability theory lets us formally estimate and
quantify state uncertainty, which we can then reduce through sequential decision making and control
as appropriate. We will discuss this in detail in Chapters 2 and 3. Second, probabilistic treatment of
uncertainty naturally leads to a notion of risk that can be represented as a mathematical expression,
which is useful for the robot in order to recognize and avoid risky situations. We will address this
in Chapters 4 and 5.
Throughout the thesis, we are primarily focused on problems with continuous states and con-
trols, and optionally continuous time. This is in contrast to finite discrete formulations that are
often assumed in sequential decision-making problems [93, 30, 142]. We have made this choice in-
tentionally, since the open and interactive environments that we are concerned with in this thesis are most naturally modeled as a continuous world, especially when they are highly dynamic. Although
some discrete algorithms can be extended to handle continuous spaces, as we will detail in Chapter
2, we also demonstrate in Chapter 3 that our continuous formulation leads to a solution method
that outperforms such an extended discrete algorithm.
We take a model-based approach in developing our planning/control algorithms so that we can
exploit physical properties of dynamical systems and onboard sensors when possible. This is contrary
to black-box data-driven methods to derive a control policy, such as model-free reinforcement learn-
ing (RL) [106, 28]. Although such RL methods are agnostic to the underlying analytical properties
of the system, the derived policy is fixed after training and cannot be easily modified at run-time
without re-training. In Chapter 4, we demonstrate that it is trivial for our model-based control
algorithm to induce various control policies that have different risk-sensitivity levels as appropri-
ate, without changing cost function definitions (or re-training). It is also worth noting that our
algorithms are developed to run online with frequent re-planning. Compared to offline model-based
methods, our online approach is capable of considering large state spaces and adjusting behavior of
the robot to rapidly changing environments. For instance, a pedestrian that suddenly appears in
the robot’s sensor footprint could be handled by online methods via re-planning, whereas it is more
difficult for offline methods to address such a case and avoid collision in advance. Of course, this
capability of online methods comes at some computation cost during deployment due to frequent
re-planning, as opposed to RL approaches or offline model-based methods. To address this potential
issue, we develop our algorithms so that they are efficient enough for real-time or near-real-time execution. Specifically, our proof-of-concept implementations already run in at most two to three times the computation time allowed for real-time deployment.
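As a sketch of this online scheme, the following generic receding-horizon loop re-solves a short-horizon problem at every step and executes only the first control. The planner and dynamics here are hypothetical placeholders, not the algorithms developed in this thesis:

```python
def receding_horizon_loop(plan, dynamics, observe_state, n_steps, horizon):
    """Generic online re-planning loop: at every step, re-solve a
    short-horizon problem from the latest state and execute only the
    first control of the resulting plan."""
    x = observe_state()
    trajectory = [x]
    for _ in range(n_steps):
        # Re-planning in situ lets the robot react to rapid changes,
        # e.g. a pedestrian suddenly entering the sensor footprint.
        controls = plan(x, horizon)
        u = controls[0]      # execute only the first planned control
        x = dynamics(x, u)   # the environment advances by one step
        trajectory.append(x)
    return trajectory

# Toy usage: a scalar state driven toward zero by a proportional "planner".
plan = lambda x, h: [-0.5 * x] * h
dynamics = lambda x, u: x + u
print(receding_horizon_loop(plan, dynamics, lambda: 4.0, 5, 10))
# → [4.0, 2.0, 1.0, 0.5, 0.25, 0.125]
```

The trade-off noted above is visible in the structure: `plan` is invoked every step, so its per-call cost must fit within one control period.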
The methods proposed in the thesis are related to and developed based on the following domains
from probabilistic inference, sequential decision-making, and control.
1. Recursive Bayesian Inference. We use techniques from Bayesian inference to recursively
update the posterior distribution over unknown states, in order to quantify the state uncer-
tainty given a history of sensory observations. It is known that exact posterior inference is
intractable in many practical cases [154], since it involves marginalization and normalization.
Thus, we employ existing approximate inference techniques to tractably compute the posterior
distribution, including the Extended Kalman Filter (EKF) [177], the Unscented Kalman Filter
(UKF) [154], and Multi-hypothesis Extended Kalman Filter (MHEKF) [115, 101].
2. Belief Space Planning. Given a recursive Bayesian inference algorithm, we can quantify
the state uncertainty with a posterior distribution, which is also referred to as a belief state
[153]. Belief space planning, wherein a control policy is sought that optimizes a user-defined
cumulative cost objective in the belief space, is a principled framework to make optimal se-
quential decisions and reduce state uncertainty as necessary [114]. We show in Chapter 2 and
Chapter 3 that our belief space planning algorithms, coupled with various perception models,
can effectively mitigate the state uncertainty as well as balance uncertainty reduction and task
accomplishment, depending on the problem specification.
3. Stochastic Optimal Control. Some problems may come with an assumption that the
perception module is so precise that state uncertainty can be effectively ignored. We use tools
from stochastic optimal control [20, 14] to address such problems wherein transition uncertainty
is now a major concern. Furthermore, we show in Chapter 3 that stochastic optimal control
is also applicable to belief space planning problems where state uncertainty does exist.
4. Risk-Aware and Distributionally Robust Planning. In Chapters 4 and 5, we further
extend stochastic optimal control methods to consider risk associated with underlying stochas-
ticity. Risk-aware planning is a mathematical scheme that endows robots with the capability
of quantifying and managing risk. This is achieved either through optimizing a risk metric or
enforcing a risk-aware constraint [97]. This thesis focuses on the former, in particular on a
risk metric known as the entropic risk measure [64, 174, 97]. Furthermore, in Chapter 5 we
employ a connection between entropic risk and distributional robustness [117], wherein the
robot makes robust decisions against imprecise characterization of distributions that govern
stochastic phenomena. This enables the robot to handle model uncertainty. We note that
connections between other risk metrics and distributional robustness are also known in the
literature [30, 97, 159].
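For reference, the entropic risk measure cited above [64, 174, 97] has the standard form below, written with risk-sensitivity parameter θ > 0 and a scalar cost random variable J (generic textbook notation, not necessarily the thesis's own):

```latex
% Entropic risk measure and its small-theta expansion, which makes
% explicit the variance penalty that renders the planner risk-aware:
R_\theta(J) \;=\; \frac{1}{\theta} \log \mathbb{E}\!\left[ e^{\theta J} \right]
\;\approx\; \mathbb{E}[J] \;+\; \frac{\theta}{2}\,\operatorname{Var}(J)
\qquad (\text{for small } \theta > 0).

% Donsker--Varadhan variational formula, the standard bridge between
% entropic risk and KL-divergence-based distributional robustness:
\frac{1}{\theta} \log \mathbb{E}_P\!\left[ e^{\theta J} \right]
\;=\; \sup_{Q \ll P} \left\{ \mathbb{E}_Q[J]
\;-\; \frac{1}{\theta}\, D_{\mathrm{KL}}(Q \,\|\, P) \right\}.
```

The second identity explains the equivalence invoked above: optimizing the entropic risk under a nominal model P is the same as planning against an adversary that perturbs P, paying a KL-divergence penalty scaled by 1/θ.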
1.3 Applications
The methods presented in this thesis have broad applications in mobile robotics and beyond. Our
belief space planning methods presented in Chapters 2 and 3 enable reduction of state uncertainty,
and thus can be applied to robots tasked with gathering information about surrounding environments
using onboard sensors. For instance, our algorithms can be used to derive a desirable trajectory for a
field robot performing a persistent monitoring task to keep track of a spatio-temporal phenomenon.
Such information gathering is also applicable to active intent inference of other interacting agents
that are not explicitly communicating, as we will discuss in detail in Chapter 2. Furthermore,
belief space planning can also tackle problems where information gathering is necessary but not the
primary goal. As an example, a manipulation robot attempting to move an object to a given goal
configuration may not know about its mass and moments of inertia beforehand, which is likely in
practice. In Chapter 3, we will show that our belief space planning algorithm is capable of handling
this scenario by simultaneously estimating those initially unknown properties and manipulating the
object. Similar issues appear in autonomous driving and aerial robot navigation as well where the
primary objective is to move from a start to a goal location safely, for which uncertainty reduction
about the surrounding environment is crucial.
In particular, autonomous driving and aerial robot navigation are two representative applications
where we must not ignore the safety-critical aspect of the tasks. This is where our risk-aware and
distributionally robust planning methods discussed in Chapters 4 and 5 become applicable and can
complement the aforementioned belief space planning approaches. As a specific problem instance,
Chapter 4 will focus on ground robot navigation in close proximity to humans, where collision risk
with humans has to be quantified and managed. Chapter 5 will add further complexity to the problem by introducing uncertainty in the (already stochastic) motion model for humans. Our
distributionally robust planner can accomplish the navigation task despite the imperfect knowledge
of the human model.
1.4 Contributions
The novel contributions of this thesis are summarized as follows. The references accompanying each
item represent the author’s own work published or under review in refereed conference proceedings
and journals.
1. Active vision-based intent inference [110]. We employ monocular vision as the percep-
tion model, which is widely used for various robotic applications such as autonomous drone
navigation [94] and simultaneous localization and mapping (SLAM) [44]. In particular, we con-
sider an active vision [33, 179] scenario, in which a robot observing the motion of another robot
is tasked with inferring or “decoding” the intent of that robot. Such a problem can be viewed
as a motion-based communication [125, 126, 127, 8, 73] task, where messages between robots
are sent and received using their whole-body motion. We propose a novel belief space planning
approach to actively reducing the state uncertainty about the other interacting robot’s intent,
even when there is no direct (e.g. wireless) communication between the robots. The proposed
framework is based on the Multi-Hypothesis Extended Kalman Filter (MHEKF) [115, 101] for
tracking a multi-modal belief, as well as a variant of Monte Carlo Tree Search (MCTS) [19]
for reducing the entropy of the belief over time. In simulation with two interacting drones, we
demonstrate that our method can correctly identify the hidden intent of the other robot, with
higher accuracy and confidence than baseline policies.
2. Fast approximate belief space planning [111, 112]. We propose a novel online algorithm
for fast, general-purpose belief space planning. Unlike other existing works, our approach is
based on the perturbation theory of differential equations. Specifically, we extend Sequential
Action Control (SAC) [6], a framework for approximate optimal control of nonlinear deter-
ministic systems, to stochastic dynamics. The resulting Stochastic SAC algorithm is run in
the belief space and is named SACBP. Despite being an approximate method, SACBP does
not require discretization of spaces or time and synthesizes control signals in near real-time.
Thus, SACBP is suitable for various tasks that require agile robot control with high-frequency
observations, which are common in open and interactive environments. Our algorithm is useful in different application domains when coupled with various perception models. In continuous robot
control problems where state uncertainty is a major obstacle, we show that the algorithm
significantly outperforms other online and offline methods including an MCTS approach.
3. Fast risk-sensitive planning with probabilistic generative models [108]. We propose a
fast online algorithm for risk-sensitive optimal control, which adopts the entropic risk measure
as the objective function. Entropic risk considers the variance of the underlying distribution
and thus can mitigate “rare but catastrophic” events. Our method extends the Stochastic SAC
algorithm to this risk-sensitive setting, and the algorithm is named Risk-Sensitive Sequential
Action Control (RSSAC). Crucially, RSSAC does not require knowledge of the analytical form
of the distribution, nor is it limited to a certain class of distributions such as Gaussian; a
black-box probabilistic generator can be used to model transition uncertainty. Leveraging this
desirable property, we apply RSSAC to safe autonomous robot navigation in human crowds,
by incorporating a deep-learned generative model for probabilistic human motion prediction
recently proposed by Salzmann, Ivanovic et al. [134]. Our robot navigation framework con-
stitutes a prediction-control pipeline, which is shown to be capable of promoting safety of
crowd-robot interaction by inducing a robot behavior that proactively avoids risky situations.
4. Distributionally Robust Planning under Model Uncertainty [109]. We present a
novel online method for distributionally robust control, which plans a locally optimal feedback
policy against a worst-case distribution within a given set of plausible distributions. This
set is defined by an offline-estimated KL divergence bound between a user-specified Gaussian
model and the unknown ground-truth distribution. Leveraging mathematical equivalence be-
tween distributionally robust control and risk-sensitive optimal control, the proposed approach
achieves efficient local optimization to tractably compute the control policy for nonlinear sys-
tems. This equivalence also suggests that our method serves as an algorithm to dynamically
adjust the risk-sensitivity level online for risk-sensitive control. The benefits of the distribu-
tional robustness as well as the automatic risk-sensitivity adjustment are demonstrated in a
dynamic collision avoidance scenario with model uncertainty, in which the predictive distribu-
tion of human motion is erroneous.
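To make the sample-based setting of contribution 3 concrete, the entropic risk objective can be estimated directly from costs sampled out of a black-box generative model, using the log-sum-exp trick for numerical stability. The sketch below is illustrative only and is not the RSSAC implementation:

```python
import math

def entropic_risk(costs, theta):
    """Monte Carlo estimate of (1/theta) * log E[exp(theta * J)] from a
    list of sampled costs, using a max-shift (log-sum-exp trick) so the
    exponentials cannot overflow."""
    if theta == 0.0:
        return sum(costs) / len(costs)  # risk-neutral limit: the mean
    m = max(theta * c for c in costs)
    log_mean_exp = m + math.log(
        sum(math.exp(theta * c - m) for c in costs) / len(costs)
    )
    return log_mean_exp / theta

# A larger theta weights high-cost ("rare but catastrophic") samples more.
samples = [1.0, 1.0, 1.0, 10.0]     # hypothetical sampled trajectory costs
print(entropic_risk(samples, 0.0))  # → 3.25 (the risk-neutral mean)
print(entropic_risk(samples, 1.0))  # larger than the mean
```

Nothing about the estimator requires an analytical form for the cost distribution, which is the property that lets a deep generative predictor supply the samples; the same log-mean-exp quantity also appears in the risk-sensitivity/distributional-robustness equivalence used by contribution 4.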
1.5 Organization
The organization of this thesis is summarized in Figure 1.1. As can be seen, each of the subsequent
chapters addresses one or more of the uncertainty types discussed in Section 1.1. We begin with
problems where state uncertainty is a major concern. In Chapter 2, we specifically focus on the
problem of active intent inference, wherein a robot equipped with a monocular camera is tasked
to reduce intent uncertainty of the other interacting agent by observing its motion. We derive a
recursive Bayesian inference algorithm for simultaneous estimation of the initially unknown relative
pose and the intent. We then build on top of the inference algorithm a non-myopic active vision
framework; a belief space planning approach is taken to derive a control policy that aims at reducing
the entropy of the belief. In Chapter 3, we abstract the belief space planning framework and discuss
its theory rather than a specific application. An extensive literature review of belief space planning
is given. Then, we address the challenge of performing fast decision-making with high-frequency
observations, a crucial capability for autonomous robots in open and interactive environments. To
that end, a novel online framework for efficient, general-purpose belief space planning is developed
based on Stochastic SAC. Note that Stochastic SAC itself is a contribution of this thesis, whose
rigorous analysis is given in Appendix A. Overall, the methods presented in Chapters 2 and 3 are
CHAPTER 1. INTRODUCTION 8
Figure 1.1: Roadmap illustrating the organization of this thesis. The uncertainty types of Section 1.1 (state, transition, and model uncertainty) are mapped to the approaches and applications: active intent inference (Chapter 2), general belief space planning applications (Chapter 3), risk-sensitive planning (Chapter 4), and distributionally robust planning (Chapter 5), all in the context of autonomous navigation and human-robot interaction.
suited for reduction of state uncertainty through onboard perception.
Chapter 4 shifts focus from state uncertainty to transition uncertainty. We discuss how belief
space planning alone is insufficient to address difficulties arising from transition uncertainty, espe-
cially in safety-critical scenarios; for the robotic system to remain safe under transition uncertainty, it
is crucial to take “rare but catastrophic events” into consideration. To achieve this computationally,
we formulate risk-sensitive optimal control and propose an efficient solution method. This is derived
as a natural extension of the Stochastic SAC framework to the entropic risk measure. The resulting
RSSAC algorithm is applied to safe autonomous robot navigation in human crowds, to demonstrate
the superior performance of the algorithm and the effectiveness of risk-sensitivity in safety-critical problems.
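For reference, the entropic risk measure mentioned above is conventionally defined as follows; this definition is standard in risk-sensitive control and is stated here for intuition rather than quoted from a later chapter. For a scalar random cost $J$ and risk-sensitivity parameter $\theta > 0$,

```latex
\mathcal{R}_\theta(J) \;\triangleq\; \frac{1}{\theta}\,\log \mathbb{E}\!\left[e^{\theta J}\right]
\;\approx\; \mathbb{E}[J] \;+\; \frac{\theta}{2}\,\mathrm{Var}[J] \qquad (\text{for small } \theta),
```

so that a larger $\theta$ penalizes cost variance and, through the exponential weighting, rare but catastrophic outcomes.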
Chapter 5 considers model uncertainty that arises due to imperfect modeling of the stochastic
robotic system. Model uncertainty leads to a mismatch between an assumed system model and the
ground-truth, which is often inevitable. To address this model mismatch, we formulate a distribu-
tionally robust control problem, which seeks a feedback policy that acts against a worst-case
distribution within a given set of plausible models. We then show that such a problem can be made
equivalent to risk-sensitive optimal control, referring to prior work in control theory [117]. Leverag-
ing this equivalence, we propose an efficient online algorithm for distributionally robust control of
nonlinear stochastic systems. Similarly to Chapter 4, we apply our algorithm to safe autonomous
robot navigation.
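The equivalence invoked above rests on a well-known variational characterization of the KL divergence; one common statement (given here for intuition, with $J$ a cost, $P$ the nominal model, and $d$ the KL radius) is

```latex
\sup_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, P) \,\le\, d} \mathbb{E}_{Q}[J]
\;=\; \inf_{\tau > 0} \left\{ \tau \log \mathbb{E}_{P}\!\left[e^{J/\tau}\right] + \tau d \right\},
```

so the worst-case expectation over the KL ball equals an entropic-risk objective whose risk-sensitivity level $1/\tau$ is selected by the inner minimization; this is the sense in which distributional robustness can dynamically adjust the risk-sensitivity online.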
Finally, Chapter 6 concludes this thesis, presenting lessons learned as well as future research
directions.
Chapter 2
Active Vision-Based Intent
Inference
In this chapter, we consider a motion-based communication [125, 126, 127, 8, 73] problem wherein
whole-body motion is used as a means of sending messages between robots. We focus on an active
vision [33, 179] scenario in which a receiving robot equipped with a monocular camera is tasked
with decoding a message (or more generally, hidden intent) that is encoded in a sending robot’s
trajectory. The difficulty in this problem arises from the monocular vision model of the receiver
and the unknown relative pose between robots, which brings inherent ambiguity into the trajectory
identification, and hence the message decoding. Assuming that there is a finite set of possible
messages that the sender can express, we first introduce an online Bayesian inference method for the
receiving robot to simultaneously estimate its relative pose to the sender, and the trajectory class
of the sender. We show that the Multi-hypothesis Extended Kalman Filter (MHEKF) [115, 101]
becomes a natural choice to tractably implement this inference algorithm. Furthermore, we formulate
a belief space planning problem to derive a non-myopic active vision policy for the receiver to actively
disambiguate similar trajectory class hypotheses. The policy is constructed online by a variant
of the Monte Carlo Tree Search (MCTS) algorithm [19, 23] and aims at reducing the entropy of the
trajectory class distribution. Our simulation results using two interacting quadrotors demonstrate
that the proposed control policy results in an accurate trajectory classification and thus effective
intent inference. The proposed framework for active vision has broad applications where intent
ambiguity of other agents is a concern for the robot. This includes autonomous driving as well
as autonomous drone operations in open and interactive environments. The materials presented in
this chapter are also reported in [110].
CHAPTER 2. ACTIVE VISION-BASED INTENT INFERENCE 10
2.1 Introduction and Related Work
Robots that interact with one another are often assumed to directly communicate over a wireless
network. However, in many instances a wireless network is not available, or cannot be relied upon.
This introduces state uncertainty about the intent of other robots or agents, which could significantly
impede reliable operation of robots in interactive environments. To discuss and address such an issue,
this chapter considers motion-based communication, in which a sender robot encodes a message in
its trajectory, which is decoded by a receiver robot. The receiver robot has only a monocular camera
with which to observe the motion, and the relative pose between the two robots is unknown. This can
make the identification of the trajectory difficult or impossible without an active vision strategy. We
present a recursive Bayesian inference algorithm by which the receiving robot classifies the trajectory
from the sending robot, and simultaneously estimates the relative pose between the two robots. We
also propose a non-myopic active vision algorithm based on Monte Carlo Tree Search (MCTS) with
Double Progressive Widening (DPW) [32]. The planning is performed in the belief space and drives
the receiving robot to move around so that it can disambiguate between similar trajectory classes.
We use entropy as the measure of uncertainty in the trajectory class distribution, which the control
policy seeks to reduce over time in a non-myopic manner. We present theoretical results as well as
experimental results evaluated in a realistic simulation environment.
Our estimation and control framework may be used for robots to literally exchange messages
through their trajectories, or more generally, as an abstraction for motion classification, prediction,
and intent inference. The sender (whether intentionally or not) sends a “message” with its trajectory,
which the receiving robot must “decode.” For example, an autonomous car must categorize an
observed vehicle as aggressive or defensive based on its observed trajectory. Thus our algorithm has
applications to autonomous driving, where vehicles have to infer the intent and predict the motion
of other vehicles, pedestrians, and bikers, and to autonomous drones, which have to avoid collisions
with other drones and pedestrians in their airspace by observing and predicting their trajectories.
The algorithm may also model motion-based communication in animals, for example the “waggle
dance” of honeybees studied in zoology [163].
In robotics, previous work [8, 73] has employed control theoretic approaches for the sending
robot to design a set of distinguishable and energy-optimal trajectories. In contrast, here we consider
algorithms for the receiver to decode and infer the message. This problem is particularly complicated
by the monocular camera of the receiver (which gives only a 2D projection of the sender’s trajectory)
and the unknown relative pose between the two robots. In this scenario, two different trajectories can
look identical from the point of view of the receiver, e.g., an “L” trajectory looks like a “V” trajectory
from an oblique angle (see Figure 2.1). Although recent work [8, 73, 125, 126, 127] has explored the
emerging field of motion-based communication, an effective way to resolve the ambiguity inherent
due to the monocular vision and unknown pose has not yet been addressed to our knowledge.
More specifically, the main contributions of this study are twofold. First, we present an online
Figure 2.1: (Left) A sample 2D trajectory codebook with potential ambiguities between entries. (Right) A noisy camera projection of a complete trajectory. The trajectory may be either an “L” or a “V” from the image. Further inference requires the receiver to move to obtain a more accurate estimate.
estimation framework in which both the sender’s message and the receiver’s relative pose are se-
quentially estimated. Based on the Markov assumption, we provide a recursive Bayesian inference
algorithm to handle the multi-modal distribution over the joint state of the receiver and the sender.
A Gaussian approximation and a model linearization yield a Gaussian mixture representation of the
robots’ joint state, naturally leading to the Multi-hypothesis Extended Kalman Filter (MHEKF)
[115, 101] algorithm. Second, we actively control the receiver to move around the sender in order to
disambiguate between similar trajectory hypotheses. An information-theoretic approach is taken in
which we control the receiver to maximally decrease the entropy of the trajectory class distribution.
Furthermore, we plan over multiple future poses of the receiver by formulating the optimization
problem as belief space planning and employing the MCTS algorithm with DPW. The performance
of this control policy is compared to that of a greedy Monte Carlo algorithm and a random policy
in simulations.
Our algorithms draw inspiration from several existing works in the areas of state estimation,
computer vision, and sequential decision-making under uncertainty. For example, general nonlinear
filtering with a Gaussian mixture model is introduced by Alspach and Sorenson in [4]. Jochmann et
al. discuss its applicability to robot localization [72]. Chen and Liu [27] introduce mixture Kalman
filters to handle conditional dynamic linear models. The MHEKF algorithm specifically is applied
to various state estimation problems such as spacecraft attitude estimation [101] and bearings-only
object tracking [115]. In computer vision, simultaneous active vision classification and camera pose
estimation is proposed by Sipe and Casasent in [144]. Although their estimation scheme has some
similarities with ours, it does not deal with motion-based communication, and it does not use an
entropy-based control objective. Denzler et al. [33] and Zobel et al. [179] present an entropy-based
camera control framework, and derive a closed-form control objective for Gaussian distributions.
Seekircher et al. [139] employ Monte Carlo exploration to implement an entropy-based active vision
control. These works also do not address communication through motion, but active vision. In
terms of the planning algorithms, MCTS is an online method for finding optimal decisions [19] and
is widely used in various decision processes that involve uncertainty, ranging from game-playing [50]
to robotics [15]. The algorithm is anytime, and can also be applied to continuous or large state and
action spaces by introducing DPW [32]. There are several existing works that apply Monte Carlo
Tree Search to active perception and belief space planning [15, 145]. Our work has similar goals,
but we are particularly interested in applications in motion-based communication and active intent
inference.
The rest of the chapter is organized as follows. In Section 2.2 we define the robot models
and formalize the problem. In Section 2.3 we derive the recursive Bayesian inference formula and
provide the MHEKF algorithm. A belief initialization algorithm is discussed in Section 2.4. The
active-vision control policy for the receiver is formulated in Section 2.5. Simulation results in a ROS-
Gazebo environment are presented in Section 2.6 and conclusions with future directions are given
in Section 2.7.
2.2 Preliminaries
The sender robot chooses a pre-specified trajectory to encode the corresponding message. The
sender then executes this trajectory, while the receiver observes the motion and controls its viewing
positions. In this work we let the receiver move and estimate the message simultaneously while the
sender executes a single trajectory.
This leads to a sequential state estimation and decision making problem represented by the
Bayesian network structure in Figure 2.2. We formalize this stochastic process in what follows,
dropping the time index for convenience when it does not cause confusion.
2.2.1 Trajectory generation
A trajectory of the sender is represented by an initial position x_0 ∈ R^3 and a set of n stochastic
transitions

x_{k+1} = f_k(x_k, m) + v_k,  k = 0, …, n − 1,

where f_k is a deterministic, time-indexed state transition function of the sender, m is a message or
a trajectory class, and v_k is zero-mean Gaussian white noise. The initial position x_0 is also assumed
to be Gaussian-distributed. Given a trajectory class m ∈ {1, …, M}, the knowledge of the corresponding
noiseless trajectory can be completely described by a set of n + 1 waypoints x_0, x_1^{(m)}, …, x_n^{(m)}.
This is expressed in a
frame of reference named the trajectory frame and is fixed in inertial space. We assume that the
receiver has already acquired the set of ideal trajectories as a “trajectory codebook.” This can be
done through modeling or learning the deterministic transition function fk of the sending agent, and
is an interesting topic for future research. We also assume that the receiver can identify the time
Figure 2.2: Bayesian network structure for our trajectory classification and pose estimation problem. The receiver robot observes the camera image z of the sender’s position x to estimate the relative pose r and the sender’s message m, and applies control u.
when the sender initiates the trajectory execution.
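As a concrete illustration, the stochastic trajectory generation above can be sketched in a few lines of Python; the codebook contents, noise level, and function names below are our own illustrative choices, not values from the experiments.

```python
import random

def sample_trajectory(codebook, m, sigma=0.05):
    """Roll out the sender's stochastic transitions x_{k+1} = f_k(x_k, m) + v_k.

    Here f_k simply returns the next ideal waypoint of class m, and v_k is
    zero-mean isotropic Gaussian noise with standard deviation sigma
    (the initial position is taken noiseless for simplicity).
    """
    waypoints = codebook[m]      # noiseless waypoints x_0, x_1^(m), ..., x_n^(m)
    traj = [list(waypoints[0])]  # start from the initial position
    for k in range(1, len(waypoints)):
        traj.append([c + random.gauss(0.0, sigma) for c in waypoints[k]])
    return traj

# Hypothetical 3-waypoint planar codebook: an "L" and a "V" shape (z = 0).
codebook = {
    "L": [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    "V": [(0.0, 0.0, 0.0), (0.5, 1.0, 0.0), (1.0, 0.0, 0.0)],
}
traj = sample_trajectory(codebook, "L", sigma=0.0)  # noiseless rollout
```

With sigma = 0 the rollout reproduces the ideal waypoints; the receiver is assumed to hold exactly this codebook for inference.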
2.2.2 State transition model
Let s_k ≜ (r_k^T, x_k^T)^T ∈ R^9 be the joint robot state of the receiver and the sender at timestep k, where
r denotes the pose. Conditional on a trajectory class m, the state transition for the sender is as
described in the previous section. The receiver’s pose r specifies its relative position and attitude in
the trajectory frame as
r ≜ (ω^T, t^T)^T ∈ R^6, (2.1)

where ω = (ω_x, ω_y, ω_z)^T ∈ R^3 gives the angle-axis representation of the 3D rotation group SO(3).
The direction of ω specifies the axis around which the receiver’s body frame is rotated with respect to
the trajectory frame, and the ℓ_2 norm ‖ω‖_2 gives the magnitude of the rotation angle [89]. Similarly,
t = (t_x, t_y, t_z)^T ∈ R^3 represents the translation of the receiver’s body in the trajectory frame. The
pose r is not known to the receiver or the sender.
The new position t_{k+1} is given by the current position t_k, attitude ω_k, and the translational
control input u_trans ∈ R^3 as

t_{k+1} = t_k + exp([ω_k]_×) u_trans, (2.2)
CHAPTER 2. ACTIVE VISION-BASED INTENT INFERENCE 14
where [·]_× : R^3 → so(3) is the skew-symmetric matrix operator:

[ω]_× ≜ [  0    −ω_z   ω_y
          ω_z    0    −ω_x
         −ω_y   ω_x    0  ],    (2.3)
and so(3) is the Lie algebra of SO(3). The matrix exponential in (2.2) can be evaluated using the
Rodrigues formula [89]:
exp([ω]_×) = I_{3×3} + (sin θ / θ) [ω]_× + ((1 − cos θ) / θ^2) [ω]_×^2, (2.4)

where θ = ‖ω‖_2.
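The operator (2.3), the Rodrigues formula (2.4), and the position update (2.2) can be sketched in plain Python as follows; this is a minimal illustration using only the standard library, with function names of our own choosing (process noise is omitted).

```python
import math

def skew(w):
    """[w]_x as in (2.3): the skew-symmetric matrix of w = (wx, wy, wz)."""
    wx, wy, wz = w
    return [[0.0, -wz, wy],
            [wz, 0.0, -wx],
            [-wy, wx, 0.0]]

def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rodrigues(w):
    """exp([w]_x) via the Rodrigues formula (2.4); near theta = 0 the
    formula reduces to I + [w]_x, which we use to avoid division by zero."""
    theta = math.sqrt(sum(c * c for c in w))
    K = skew(w)
    I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    if theta < 1e-9:
        return [[I[i][j] + K[i][j] for j in range(3)] for i in range(3)]
    K2 = matmul(K, K)
    a = math.sin(theta) / theta
    b = (1.0 - math.cos(theta)) / theta ** 2
    return [[I[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
            for i in range(3)]

def translate(t, w, u_trans):
    """Noiseless position update (2.2): t' = t + exp([w]_x) u_trans."""
    R = rodrigues(w)
    return [t[i] + sum(R[i][j] * u_trans[j] for j in range(3)) for i in range(3)]
```

For example, with attitude ω = (0, 0, π/2), i.e., a 90-degree yaw, a body-frame step u_trans = (1, 0, 0) moves the receiver along the trajectory frame’s y-axis.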
The new attitude ω_{k+1} is related to the current attitude ω_k and the rotational control input
u_rot ∈ R^3 as

exp([ω_{k+1}]_×) = exp([ω_k]_×) exp([u_rot]_×). (2.5)
The concatenated control input u_k = (u_rot^T, u_trans^T)^T ∈ R^6 is represented in the body coordinates of
the receiver.
Finally, the state transition for the receiver is also corrupted by zero-mean Gaussian white noise.
Along with the state transition noise for the sender, we denote by Q ∈ S_+^{9×9} the symmetric, positive
semi-definite covariance matrix for the joint state transition.
Note that the receiver will apply an input u_k only after the sender has moved from x_{k−1} to x_k,
and the state transition to r_{k+1} is assumed to be finished by the time the sender reaches the next
position x_{k+1}.
2.2.3 Observation model
The monocular vision of the receiver can be approximated by a simple pinhole camera model. We
will provide a brief review of the model that relates a sender’s position to an observation. Interested
readers are referred to [89, 60] for further details.
Pinhole camera projection
Let z ∈ R^2 represent the observation expressed in the pixel coordinates on the image plane and
x ∈ R^3 be the Cartesian coordinates representing the sender’s position. The relationship between a
3D point x and the corresponding 2D point z is given by

λ (z^T, 1)^T = P (x^T, 1)^T, (2.6)
where λ is a scale factor and P ∈ R^{3×4} is the projection matrix. P can be further decomposed into a
product of the camera calibration matrix K ∈ R^{3×3} and the external parameters matrix T ∈ R^{3×4}.
The camera calibration matrix contains intrinsic parameters of the camera and we assume that it is
known to the receiver.
External parameters matrix
The external parameters matrix T defines the rotation and the translation of the camera frame with
respect to the trajectory frame:
T ≜ (R^c_traj   t^c_traj), (2.7)

where t^c_traj is the translational vector and R^c_traj is the 3D rotation matrix represented in the camera
coordinates. These parameters can be expressed in terms of the receiver’s state r = (ω^T, t^T)^T as

R^c_traj = (R^b_c)^T [exp([ω]_×)]^T, (2.8)
t^c_traj = −(R^b_c)^T [exp([ω]_×)]^T t − (R^b_c)^T t^b_c, (2.9)

where R^b_c and t^b_c define the relative pose between the body and the camera coordinates, and are
determined by the physical position and attitude of the camera on the receiver robot.
The pinhole camera projection (2.6) along with the external parameters matrix (2.7) comprise the
nonlinear observation function h(s). The actual observation is also subject to zero-mean Gaussian
white noise with covariance R ∈ S_+^{2×2}, hence the distribution is given by

p(z | s) = N(z; h(s), R). (2.10)
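A minimal sketch of the resulting projection is given below; the calibration values are hypothetical placeholders, and the external parameters are passed directly as a rotation matrix and translation rather than computed from r via (2.8)–(2.9).

```python
def project(x, K, R, t):
    """Pinhole projection (2.6): lambda [z; 1] = K [R | t] [x; 1].

    K: 3x3 calibration matrix; (R, t): rotation and translation of the
    trajectory frame expressed in the camera frame, as in (2.7).
    """
    # Point in camera coordinates: x_c = R x + t.
    xc = [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous pixel coordinates K x_c; divide out the scale factor lambda.
    h = [sum(K[i][j] * xc[j] for j in range(3)) for i in range(3)]
    return [h[0] / h[2], h[1] / h[2]]

# Hypothetical intrinsics: 500 px focal lengths, principal point (320, 240).
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity attitude
t = [0.0, 0.0, 2.0]  # sender 2 m in front of the camera
z = project([0.5, 0.0, 0.0], K, R, t)  # -> [445.0, 240.0]
```

Adding zero-mean Gaussian pixel noise to the returned z realizes the observation model (2.10).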
2.2.4 Repeated trajectory execution
The stochastic process depicted in Figure 2.2 represents a single execution of a trajectory. In fact, the
process is repeated multiple times, with r_{n+1} in the current iteration being r_1 in the next iteration.
Similarly, the sender transitions from x_n to x_1 through x_0. Note that the posterior at the end of
the current iteration becomes the prior for the next iteration.
2.3 Bayesian Filtering
As illustrated in Figure 2.2, the trajectory class m and the receiver’s state r are correlated via the
observation z and the sender’s position x. We therefore need to simultaneously estimate m and
s = (r^T, x^T)^T, even though we are eventually more interested in m than in s. Formally,
this is equivalent to describing the joint distribution of s and m in the Bayesian framework, which
is discussed in this section.
2.3.1 Recursive Bayesian Inference Formula
We are now set to derive the recursive Bayesian inference formula to determine the posterior distri-
bution over the joint robot state s and the trajectory class m.
Theorem 2.1 (Recursive Bayesian Update). The posterior of the joint distribution p(s_{k+1}, m |
z_{1:k+1}, u_{1:k}) can be obtained recursively as a function of the class-dependent state transition model
p(s_{k+1} | s_k, u_k, m) and the observation model p(z_{k+1} | s_{k+1}), given the Bayesian network structure
depicted in Figure 2.2 and the prior belief p(s_k, m | z_{1:k}, u_{1:k−1}).
Proof. From the definition of conditional probability, the posterior can be factored into a product
of a continuous probability density function p of the joint state s and a discrete probability mass
We first need to define the state transition model for the belief b and the reward model related to
the state transition. Recall from Algorithm 2.1 that the belief transition is completely described
by the Multi-hypothesis Extended Kalman Filter equations once given the control input uk and the
observation zk+1. There is stochasticity involved in this transition due to the observation. However,
we can still specify the distribution from which a new observation is drawn based on the current
prior belief b that has z_{1:k}, u_{1:k} as evidence variables. Namely,

p(z_{k+1} | z_{1:k}, u_{1:k}) = Σ_{m=1}^{M} p(z_{k+1} | m, z_{1:k}, u_{1:k}) P(m | z_{1:k}, u_{1:k}), (2.24)

where p(z_{k+1} | m, z_{1:k}, u_{1:k}) = N(z_{k+1}; h(μ_{k+1}^{(m)}), H_{k+1}^{(m)}) and P(m | z_{1:k}, u_{1:k}) = φ_k^{(m)}. Therefore, for
the sake of planning we can simulate stochastic belief transitions by drawing an observation sample
from the following Gaussian mixture distribution:

z_{k+1} ∼ Σ_{m=1}^{M} φ_k^{(m)} N(z_{k+1}; h(μ_{k+1}^{(m)}), H_{k+1}^{(m)}). (2.25)
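Sampling from the mixture (2.25) is straightforward; the sketch below (with diagonal covariances assumed for brevity, and names of our own choosing) first draws a hypothesis m according to the weights φ^{(m)} and then a Gaussian observation.

```python
import math
import random

def sample_observation(phi, means, covs_diag):
    """Draw z ~ sum_m phi^(m) N(z; h(mu^(m)), H^(m)), as in (2.25).

    phi: mixture weights phi^(m); means[m]: predicted observation h(mu^(m));
    covs_diag[m]: diagonal of H^(m) (diagonality is a simplifying assumption).
    """
    # Pick a hypothesis m with probability phi^(m) ...
    m = random.choices(range(len(phi)), weights=phi)[0]
    # ... then sample each pixel coordinate from the selected Gaussian.
    return [random.gauss(mu, math.sqrt(var))
            for mu, var in zip(means[m], covs_diag[m])]
```

Each call simulates the stochastic part of one belief transition for planning, as used in Algorithm 2.2.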
Algorithm 2.2 Sample a new belief and a reward.
INPUT: current belief b, control input u.
OUTPUT: new belief b′, reward r
1: Compute μ^{(m)} and H^{(m)} for each m from b.
2: Sample z ∼ Σ_{m=1}^{M} φ^{(m)} N(z; h(μ^{(m)}), H^{(m)}).
3: Obtain b′ from u and z with Algorithm 2.1.
4: Compute r = Reward(b, b′) (2.27).
Reward model
The reward needs to be designed so that the receiver robot exhibits desired behavior. In the motion-
based communication problem, we would like to reduce the uncertainty in the trajectory class dis-
tribution. Entropy provides a measure of uncertainty, which is defined as follows:
H[M] = −Σ_{m=1}^{M} P(m) log P(m), (2.26)
where M is the random variable representing the trajectory class.
Since our MDP is defined in the belief space, we can directly use the entropy as our reward. The
reward function Reward(b, b′) is determined by the change in entropy between the current belief b
and a new belief b′:

Reward(b, b′) = l(H[M | b]) − l(H[M | b′]), (2.27)

where l : R_+ → R is a strictly monotonically increasing function. Note that the reward is positive
when the new belief b′ results in a smaller entropy than b, since the receiver aims at maximizing the
reward.
In the simulation experiments we chose l(·) = log(·). This is to prevent the reward from effectively
getting discounted as the entropy is reduced, because the entropy of a categorical distribution (2.26)
is always lower-bounded by zero.
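The entropy (2.26) and the reward (2.27) with l = log can be written compactly; a minimal sketch:

```python
import math

def entropy(p):
    """Categorical entropy H[M] = -sum_m P(m) log P(m), as in (2.26)."""
    return -sum(pm * math.log(pm) for pm in p if pm > 0.0)

def reward(p_before, p_after):
    """Reward (2.27) with l = log: positive iff the entropy decreased."""
    return math.log(entropy(p_before)) - math.log(entropy(p_after))
```

For example, sharpening a uniform belief over three classes into (0.8, 0.1, 0.1) yields a positive reward, and the log stretching keeps rewards from vanishing as the entropy approaches zero.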
The resulting sampling algorithm is summarized in Algorithm 2.2.
2.5.2 Upper Confidence Trees
The Upper Confidence Trees algorithm approximates the state-action value function using Monte
Carlo simulations with a generative model that gives reward and state transition samples. It is an
online method that only considers a certain depth from the current state. The depth and the number
of iterations are limited by the amount of computational resources available before selecting the next
action. Starting from the current state as the root node, in each iteration the algorithm gradually
grows a search tree. A node in the tree represents a reachable state and an edge corresponds to an
action. In our belief MDP, a state is a belief b and an action is a control input u.
One iteration consists of a simulation with four different stages listed below [23].
1. Selection. This stage recursively selects a child node in the tree according to a child selection
policy called UCB1 until a leaf node is reached. A detailed explanation on UCB1 is provided
below.
2. Expansion. When a new state is observed that did not exist in the tree at the selection step,
the state is added to the tree as a new leaf node.
3. Simulation. If a new leaf node is added in the expansion stage, a simulation is run for the
rest of the depth based on a rollout policy to compute the initial approximate value of the
node.
4. Backpropagation. Lastly, the simulation result is back-propagated from the leaf node to
the root node through the selected nodes, updating the visit counts and the approximated
state-action values for each node.
UCB1
The key aspect of the algorithm is the use of a search heuristic called UCB1, which balances
the exploration-exploitation trade-off. At node b, the control input u is chosen so that it maximizes
the following score [32]:
score_t(b, u) = totalReward_t(b, u) / (n_t(b, u) + 1) + k_ucb √( log(t) / (n_t(b, u) + 1) ),    (2.28)

where t is the number of times the node b has been visited, n_t(b, u) is the number of times the control
input u has been applied at b, and totalReward_t(b, u) is the total cumulative reward obtained so far
by executing u at b. k_ucb is a parameter that controls the amount of exploration.
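A minimal sketch of UCB1 action selection with the score (2.28); the node representation (a dictionary of visit counts and cumulative rewards) is our own illustrative choice.

```python
import math

def ucb1_score(total_reward, n_bu, t, k_ucb):
    """UCB1 score (2.28) for applying control u at a node visited t times,
    where u itself was tried n_bu times with cumulative reward total_reward."""
    return (total_reward / (n_bu + 1)
            + k_ucb * math.sqrt(math.log(t) / (n_bu + 1)))

def select_action(node, k_ucb=10.0):
    """Pick the available control input maximizing the UCB1 score."""
    return max(node["actions"],
               key=lambda u: ucb1_score(node["total_reward"][u],
                                        node["n"][u], node["t"], k_ucb))
```

Here k_ucb defaults to the value in Table 2.1; an untried action (n = 0) still receives a finite score because of the +1 in both denominators.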
Double Progressive Widening
The conventional Upper Confidence Trees algorithm assumes small discrete action and state spaces.
Nevertheless, the belief MDP defined above consists of stochastic transitions in a continuous domain
and the probability of reaching the same belief state twice is zero. Couetoux et al. [32] point out
that the algorithm fails in such a case and propose a modification as follows. Instead of always
transitioning to a new state using the generative model, a new state is added to the tree only
if the number of children is less than or equal to ⌈C n_t(b, u)^α⌉, where C > 0 and α ∈ ]0, 1[ are
constants; otherwise an existing child is chosen with probability n_t(b, u, b′)/n_t(b, u), where n_t(b, u, b′)
is the number of transitions observed from b to b′ with u. A similar strategy is used to limit the
Figure 2.3: (Left) Simulation environment with the receiver in the front and the sender in the back. (Right) An image obtained by the forward-facing camera mounted on the receiver. The smooth yellow trajectory is overlaid for clarity.
number of actions available if the action space is also large. The k-th action becomes available
when k = ⌈C_u t^{α_u}⌉ with constants C_u > 0 and α_u ∈ ]0, 1[. This restriction is known as Progressive
Widening, and the algorithm is called Double Progressive Widening as it is applied to both the
(belief) states and the actions.
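The state-side widening test can be sketched as follows; the node layout and the `generate` callback (standing in for the belief sampling of Algorithm 2.2) are illustrative assumptions, with C and α defaulting to the values listed in Table 2.1.

```python
import math
import random

def dpw_next_state(node, u, generate, C=3.0, alpha=0.1):
    """Progressive widening on the belief-state side of the tree.

    A new child of (b, u) is sampled only while the number of children is
    at most ceil(C * n_t(b, u)^alpha); otherwise an existing child b' is
    revisited with probability n_t(b, u, b') / n_t(b, u).
    """
    children = node["children"][u]  # maps child state b' -> count n_t(b, u, b')
    n_bu = node["n"][u]             # visit count n_t(b, u)
    if len(children) <= math.ceil(C * n_bu ** alpha):
        b_new = generate(node["belief"], u)  # sample b' from the generative model
        children.setdefault(b_new, 0)
        return b_new
    states = list(children)
    return random.choices(states, weights=[children[b] for b in states])[0]
```

The same widening test, with constants C_u and α_u applied to the visit count t, limits the number of available actions.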
2.6 Experimental Results
2.6.1 Simulation setup
We designed our experiment in Gazebo [82], a multi-robot simulator compatible with ROS [122].
Two Hummingbird [66] quadrotors are used in the simulation, whose dynamics are simulated with
RotorS [49]. Figure 2.3 illustrates the simulation environment.¹
The trajectory codebook presented in Figure 2.4 consists of three different elliptic trajectories.
A trajectory is represented as an ordered collection of 6 waypoints. The sender takes 7 seconds to
transition between successive waypoints, which corresponds to a single timestep. The codebook was
deliberately designed to make the classification challenging. Indeed, the prior initialization with the
Levenberg-Marquardt algorithm resulted in an ambiguous classification as depicted in Figure 2.5.
The parameters used in the Multi-hypothesis Extended Kalman Filter and the Double Progressive
Widening algorithms are summarized in Table 2.1. The horizontal and the vertical fields of view
(FoVs) of the camera are 80.1 [deg] and 64.5 [deg], respectively. Initially the two quadrotors are 4.2
[m] apart in the horizontal direction. In controlling the receiver, we only allow lateral translational
motion and yaw rotation as possible control inputs: u_trans = (0, u_y, 0)^T and u_rot = (0, 0, u_yaw)^T, with
¹The supplemental video is available at https://youtu.be/u4HxiA2d2IQ.
Figure 2.4: 2D trajectory codebook used in the experiment. The sender robot moves clockwise. The trajectories are intentionally ambiguous from different observer angles, to make trajectory classification difficult for the receiver.
Figure 2.5: (Left) A set of waypoint measurements used to initialize the prior. The smooth yellow trajectory and the waypoints are overlaid for clarity. (Right) Prior distribution obtained from the left image.
Table 2.1: Parameter values for MHEKF and MCTS.

Parameter | Value                               | Parameter | Value
Q         | diag(Q_r, Q_t, Q_x)                 | k_ucb     | 10.0
Q_r       | diag(1×10^−3, 1×10^−3, 1×10^−2)     | C         | 3.0
Q_t       | diag(5×10^−2, 5×10^−2, 1×10^−3)     | α         | 0.1
Q_x       | diag(1×10^−4, 1×10^−3, 1×10^−3)     | C_u       | 10.0
R         | diag(1×10^1, 1×10^1)                | α_u       | 0.3
Figure 2.6: Classification error (left) and entropy (right) averaged over 30 cases. The MCTS active vision policy outperforms greedy and random in both criteria.
u_y ∈ {−1, −0.5, 0, 0.5, 1} [m] and u_yaw ∈ {−45, −25, −10, 0, 10, 25, 45} [deg]. The resulting size of the
control space is |U| = 35.
We compared the performance of the proposed control policy with a greedy algorithm and a
random policy. The greedy algorithm still employs the Monte Carlo simulation to estimate the
reward, but it only considers the immediate next action and performs Algorithm 2.2 ten times per
action to select one that has the maximum average reward.
2.6.2 Simulation results
The classification and the pose estimation performance were evaluated through 30 simulation episodes
with different trajectory classes and priors. In each episode the sender repeated the trajectory
execution 6 times, with an interval time of 7 seconds between each execution. We observed that the
MCTS and the greedy policies both actively controlled the yaw rotation as well as the horizontal
position, trying to keep the sender inside the FoV of the camera. As can be seen in Figure 2.6,
the MCTS control policy resulted in the lowest classification error, which is defined as 1 − P(m_true),
where m_true is the true trajectory class. Even though all the policies achieved similar errors after
Figure 2.7: Attitude error (left) and position error (right) averaged over 30 cases. The MCTS active vision policy performs the best for pose estimation, even though it is not explicitly trying to optimize the relative pose estimate.
step 22, there are still notable differences in the entropy, indicating that with the proposed approach
the receiver is most confident in its classification.
However, this performance improvement was obtained at the expense of computation time. On
average, MCTS with Double Progressive Widening took 4.0 seconds to compute the next action,
with the maximum tree depth of 3 and 400 iterations. The greedy algorithm only took 1.6 seconds
to perform 350 iterations. Obviously, the random policy ran much faster than the other two, taking
0.0024 seconds. All the computation was performed on a laptop with Intel Core i7-6700HQ CPU
and 16GB RAM. Considering this runtime and the classification performance discussed above, there
is a trade-off between the proposed approach and the greedy policy, since the performance of the
greedy policy is not significantly worse than that of the proposed method.
The MCTS active vision policy also outperformed the other two policies in the pose estimate,
even though the filtering algorithm was not able to correct the pose error within 30 steps regardless
of the control policy. In Figure 2.7, the attitude error is represented by the magnitude of the rotation
angle between the true attitude and the mean estimate. The position error is given by the distance
between the true position and the mean estimate, which is divided by the initial distance of the
two robots to give a normalized unit-less distance error. Interestingly, the results suggest that the
classification can become correct even while the pose estimate is still erroneous.
In order to confirm that the filter eventually improves the pose estimate over time, we ran an
additional episode for 100 steps with the proposed control policy. The result is presented in Figure
2.8. Both the attitude error and the position error converged to near-zero values after 100 steps.
Figure 2.8: Attitude error (top) and position error (bottom) for a single long episode with class 1 trajectories. The MCTS active vision policy was used to move the receiver. Eventually, the relative pose estimation converges even though the receiver's controller optimizes for trajectory class, not pose error.
2.7 Conclusions
This chapter presents an online active vision algorithm via belief space planning for motion-based
communication between two robots. Intent inference difficulties arising from the monocular vision
model and the unknown relative pose are addressed. We present a recursive Bayesian inference algo-
rithm for simultaneous trajectory classification and relative pose estimation. We further propose an
information-theoretic approach to active vision, in which an entropy-based objective is optimized via
the belief MDP formulation and the MCTS algorithm with DPW. Through an extensive simulation study, we have demonstrated that the proposed method significantly improves both the classification accuracy and confidence compared to baseline policies. Future research directions include extending our active vision method to take into account constraints such as collision avoidance. We are also interested in generalizing this vision-based belief space planning framework and applying it to other applications such as autonomous driving.
Chapter 3
Fast Approximate Belief Space
Planning
In Chapter 2, we discussed how to formulate a belief space planning problem to derive an active
vision policy for inferring intent of other interacting agents. In this chapter, we focus on the theory
of belief space planning rather than a specific application, which includes a holistic review of various
belief space planning algorithms that exist in the literature. We address the challenge of performing
fast decision-making in continuous spaces with high-frequency observations, a crucial capability for
autonomous robots operating in open and interactive environments.
Specifically, we propose a novel online framework for fast, general-purpose belief space planning.
Unlike other existing works, our approach presented in this chapter is based on the perturbation
theory of differential equations and extends Sequential Action Control (SAC) [6] to stochastic belief
dynamics. The resulting algorithm, which we name SACBP, is a stochastic optimal control algo-
rithm for belief space planning and does not require discretization of spaces or time. SACBP is an
anytime algorithm that synthesizes control signals in near real-time, and can handle general parametric¹ belief dynamics under certain assumptions. Despite being an approximate (i.e., suboptimal)
method, we demonstrate the effectiveness of SACBP in an active sensing scenario and a model-based
Bayesian reinforcement learning problem. In these challenging problems, we show that the algorithm
significantly outperforms other online and offline solution methods including approximate dynamic
programming and local trajectory optimization. The materials reported in this chapter are based
on our conference paper [111] and the extended journal version [112].
¹ In this chapter, we use the term “parametric” to mean that the functional form of the belief distribution is fixed. Many of the known belief dynamics (i.e., Bayesian filters) are parametric, except the particle filter [154], which is usually considered non-parametric. Section 3.2.1 gives a detailed explanation of the possible choices of filters.
3.1 Introduction and Related Work
Planning under uncertainty remains a challenge for robotic systems. Various sources of
uncertainty, including unknown states, stochastic disturbances, and imperfect sensing, significantly
complicate problems that are otherwise easy. For example, suppose that a robot needs to manipulate
an object from some initial state to a desired goal. If the mass properties of the object are not known
beforehand, the robot needs to simultaneously estimate these parameters and perform control, while
taking into account the effects of their uncertainty; the exploration and exploitation trade-off needs to
be resolved [145]. On the other hand, uncertainty is quite fundamental in motivating some problems.
For instance, a noisy sensor may encourage the robot to carefully plan a trajectory so the observations
taken along it are sufficiently informative. This type of problem concerns pure information gathering
and is often referred to as active sensing [103], active perception [9], or informative motion planning
[63], wherein the sole objective is to reduce state uncertainty using onboard perception.
A principled approach to address all those problems is to form plans in the belief space, where
the planner chooses sequential control inputs based on the evolution of the belief state, to optimize a
belief-based cost accumulated over time. This approach enables the robot to appropriately execute
controls under state and transition uncertainties since they are both naturally incorporated into the
belief. Belief space planning is well suited also for generating information gathering actions [119].
This chapter proposes a novel online algorithm for fast, general-purpose belief space planning.
It does not require discretization of the state space or the action space, and can directly han-
dle continuous-time system dynamics. The algorithm provides an approximate solution to belief
space planning; instead of seeking a globally optimal control policy, it constantly perturbs a
given nominal policy to optimize the expected value of a first-order cost reduction, proceeding in
a receding-horizon fashion. We are inspired by the Sequential Action Control (SAC) algorithm
recently proposed in [6] for model-based deterministic optimal control problems. SAC is an online
method to synthesize control signals in real time for challenging (but deterministic) physical systems
such as a cart pendulum and a spring-loaded inverted pendulum. Based on the concept of SAC, this
chapter develops an algorithmic framework to control stochastic belief systems whose dynamics are
governed by parametric Bayesian filters.
3.1.1 Related Work in Belief Space Planning
There is a large body of literature in belief space planning. Below we briefly review relevant work
in three major types of solution methods: greedy strategies, trajectory optimization methods, and
belief MDP and POMDP approaches. We then change our perspective and discuss advantages and
drawbacks of closed-loop and open-loop planning schemes, while also introducing other relevant works that do not necessarily fall into the three categories mentioned above.
Greedy Strategies. Belief space planning is known to be challenging for a couple of reasons.
First, the belief state is continuous and can be high-dimensional even if the underlying state space
is small or discrete. Second, the dynamics that govern the belief state transitions are stochastic due
to unknown future observations. Greedy approaches alleviate the complexity by ignoring long-term
effects and solve single-shot decision making problems sequentially. Despite their suboptimality for
long-term planning, these methods are often employed to find computationally tractable solutions
and achieve reasonable performance in different problems [18, 139, 138], especially in the active
sensing domain.
Trajectory Optimization Methods. In contrast to the greedy approaches, trajectory optimiza-
tion methods take into account multiple timesteps at once and find non-myopic solutions. In doing
so, it is often assumed that the maximum likelihood observation (MLO) will always occur during
planning [119, 45, 114]. This heuristic assumption yields a deterministic optimal control problem,
to which various nonlinear trajectory optimization algorithms are applicable. However, ignoring the
effects of stochastic future observations can degrade the performance [161]. Other methods [161, 123]
that do not rely on the MLO assumption are advantageous in that regard. In particular, belief iLQG
[161] performs iterative local optimization in a Gaussian belief space by quadratically approximat-
ing the value function and linearizing the dynamics to obtain a time-varying affine feedback policy.
However, this method as well as many other solution techniques in this category result in multiple
iterations of intensive computation and can require a significant amount of time until convergence.
Belief MDP and POMDP Approaches. Belief space planning can be modeled as a Markov
decision process (MDP) in the belief space, given that the belief state transition is Markovian. If
the reward (or cost) is defined as an explicit function of the state and the control, the problem
is equivalent to a partially observable Markov decision process (POMDP) [74]. A key challenge in
POMDPs and belief MDPs has been to address problems with large state spaces. This is particularly
important in belief MDPs since the state space for a belief MDP is a continuous belief space. To
handle continuous spaces, Couetoux et al. [32] introduce double progressive widening (DPW) for
MCTS [19]. In [145], this MCTS-DPW algorithm is run in a Gaussian belief space to solve the object
manipulation problem mentioned earlier. We too have applied MCTS-DPW to active vision-based
intent inference in Chapter 2, in which a non-myopic policy is derived to reduce the entropy of a
multi-modal belief. Sunberg and Kochenderfer [150] study suboptimality of DPW in POMDPs to
propose POMCPOW, which effectively handles continuous state, action, and observation spaces by
maintaining weighted particle mixture beliefs.
While MCTS-DPW as well as other general purpose POMDP methods [150, 147] are capable
of handling continuous state spaces, their algorithmic concepts are rooted in dynamic programming
and tree search, requiring a sufficient amount of exploration in the tree. (Indeed, our MCTS-DPW
implementation in Chapter 2 assumes that the robot has up to 7 seconds to compute its next action.)
The tree search technique also implicitly assumes discrete-time transition models. In fact, most
prior works reviewed in this section are intended for discrete-time systems. A notable exception
is [24], in which a sequence of discrete-time POMDP approximations is constructed such that it
converges to a true continuous-time model. However, this approach still relies on an existing POMDP
solver [84] that is designed for discrete systems. There still remains a need for an efficient and
high-performance belief space planning algorithm that is capable of directly handling systems with
inherently continuous-space, continuous-time dynamics, such as maneuvering micro-aerial vehicles,
or autonomous cars at freeway speeds.
Closed-loop and Open-loop Planning. An important aspect of belief space planning is the
stochastic belief dynamics, which essentially demand that one plan over closed-loop (i.e. feedback)
policies. This is a primary assumption in belief MDP and POMDP frameworks, where an optimal mapping from beliefs to control actions is sought [65, 79]. However, computing exact optimal policies is in general intractable due to the curse of dimensionality and the curse of history [113, 96, 147].
Therefore, in practice we need to make certain approximations to achieve a tractable solution within
a reasonable computation time, whether the solution method is online or offline. For POMDP
methods, those approximations are often done by discretization and/or sampling [84, 24, 147, 150].
Belief iLQG [161] makes an approximation by restricting the policy to a time-varying affine feedback
controller and performing local optimization. Other methods that provide closed-loop policies in-
clude Feedback-based Information RoadMap (FIRM) [2], its extension with online local re-planning
[1], and T-LQG [123]. FIRM-based methods, which are developed primarily for motion planning
problems, construct a graph in the belief space whose edges correspond to local LQG controllers.
T-LQG approximately decouples belief space planning into local trajectory optimization and LQG
tracking via the separation principle. The local trajectory optimization part achieves one of
the lowest asymptotic computational complexities among existing trajectory optimization methods,
and thus can be solved efficiently with a commercial nonlinear programming (NLP) solver [123].
Even though closed-loop methods are appealing, their computation cost can be prohibitive in
some challenging planning and control problems, especially when the state space is large and yet
real-time performance is required. In such cases, a practical strategy to alleviate the computa-
tional burden is open-loop planning, wherein one seeks to optimize a static sequence of control
actions. Oftentimes, feedback is provided through re-planning of a fixed-horizon, open-loop control
sequence after making a new observation. This combination of open-loop planning and feedback
through re-planning is called receding horizon control (RHC) or model predictive control (MPC)
[65]. Although they do not consider feedback at each planning time, MPC methods in general
have successfully solved challenging planning and control problems with high efficiency where fast
closed-loop policy computation can be impractical. Examples of such real-time applications include
trajectory tracking with quadrotors [10] and agile autonomous driving on dirt [175], although they
are both in the optimal control literature. Within belief space planning, some methods based on the
MLO assumption [118, 114] are run as MPC.
3.1.2 Contributions
Our approach presented in this chapter takes the form of MPC, which is reasonable for high-
dimensional and continuous belief space planning problems that also demand highly-dynamic, real-
time maneuvers with high frequency observation updates. However, the proposed method is signif-
icantly different from the previous approaches discussed in Section 3.1.1. We view the stochastic
belief dynamics as a hybrid system with time-driven switching [61], where the controls are applied in
continuous time and the observations are made in discrete time. A discrete-time observation creates
a jump discontinuity in the belief state trajectory due to a sudden Bayesian update of the belief
state. This view of belief space planning yields a continuous-time optimal control problem of a high-
dimensional hybrid system. We then propose a model-based control algorithm to efficiently compute
control signals in a receding-horizon fashion. The algorithm is based on Sequential Action Control
(SAC) [6]. SAC in its original form is a deterministic, model-based hybrid control algorithm, which
“perturbs” a nominal control trajectory in a structured way so that the cost functional is optimally
reduced up to the first order. The key to this approach is a careful use of the perturbation theory
of differential equations that is often discussed in the mode scheduling literature [41, 172]. As a
result, SAC derives the optimal perturbation in closed form and synthesizes control signals at a high
frequency to achieve a significant improvement over other optimal control methods that are based
on local trajectory optimization [6].
We apply the perturbation theory to parametric Bayesian filters and derive the optimal control
perturbation using the framework of SAC. Even though each control perturbation is small, high-
frequency control synthesis and online re-planning yield a series of control actions that is significantly
different than the nominal control, reacting to stochastic observations collected during execution or
online changing conditions. Furthermore, we extend the original SAC algorithm to also take account
of stochasticity in the future observations during planning, by incorporating Monte Carlo sampling
of nominal belief trajectories. Our key contribution is the resulting continuous belief space planning
algorithm, which we name SACBP. The algorithm has the following desirable properties:
1. Although the form of control perturbation is open-loop with on-line re-planning, the perturba-
tion computed by SACBP in near real-time is optimized for better average performance over
the planning horizon than a given nominal control, whether it is open-loop or closed-loop.
2. SACBP does not require discretization of the state space, the observation space, or the control
space. It also does not require discretization of time other than for numerical integration
purposes.
3. General nonlinear parametric Bayesian filters can be used for state estimation as long as the
system is control-affine and the control cost is quadratic.
4. Stochasticity in the future observations is fully considered.
5. SACBP is an anytime algorithm. Furthermore, the Monte Carlo sampling part of the algorithm
is naturally parallelizable.
6. Even though SACBP is inherently suboptimal for the original stochastic optimal control prob-
lem, empirical results suggest that it is highly sample-efficient and outperforms other open-loop
and closed-loop methods when near real-time performance is required.
There exists prior work [98] that uses SAC for active sensing, but its problem formulation re-
lies on the ergodic control framework, which is significantly different from the belief space planning
framework we propose here. We show that our SACBP outperforms projection-based ergodic trajec-
tory optimization, MCTS-DPW, T-LQG, and a greedy method on an active multi-target tracking
example. We also show that SACBP outperforms belief iLQG, MCTS-DPW, and T-LQG on a
manipulation scenario.
In the next section we derive relevant equations and present the SACBP algorithm along with
a discussion on computational complexity. Section 3.3 provides key results of our mathematical
analysis that lead to a performance guarantee for SACBP. Section 3.4 summarizes the simulation
results. Conclusions and future work are presented in Section 3.5.
3.2 SACBP Algorithm
We first consider the case where some components of the state are fully observable. We begin with
this mixed observability case as it is simpler to explain, yet still practically relevant. For example,
this is a common assumption in various active sensing problems [138, 86, 121] where the state of
the robot is perfectly known, but some external variable of interest (e.g. a target’s location) is
stochastic. In addition, deterministic state transitions are often assumed for the robot. Therefore,
in Section 3.2.1 we derive the SACBP control update formulae for this case. The general belief space
planning where none of the state is fully observable or deterministically controlled is discussed in
Section 3.2.2. An extension to use a closed-loop policy as the nominal control is presented in Section
3.2.3. The computation time complexity is discussed in Section 3.2.4.
3.2.1 Problems with Mixed Observability
Suppose that a robot can fully observe and deterministically control its own state $p(t) \in \mathbb{R}^{n_p}$. Other external states over which the robot does not have direct control are not known and are estimated with the belief vector $b(t) \in \mathbb{R}^{n_b}$. This belief vector characterizes a probability distribution that the robot uses for state estimation. If the belief is Gaussian, for example, the covariance matrix can be vectorized column-wise and stacked together with the mean to form the belief vector. We define the augmented state as $s \triangleq (p^\mathsf{T}, b^\mathsf{T})^\mathsf{T} \in \mathbb{R}^{n_s}$.
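For a Gaussian belief, this vectorization can be sketched as follows (a minimal illustration; the column-wise stacking convention and function names here are our own, not from the thesis):

```python
import numpy as np

def gaussian_belief_to_vector(mean, cov):
    # Stack the mean with the column-wise vectorized covariance:
    # b = (mu^T, vec(Sigma)^T)^T, so n_b = n + n^2.
    return np.concatenate([mean, cov.flatten(order="F")])

def vector_to_gaussian_belief(b, n):
    # Inverse map: recover (mu, Sigma) from the belief vector.
    mean = b[:n]
    cov = b[n:].reshape((n, n), order="F")
    return mean, cov

mu = np.array([1.0, 2.0])
Sigma = np.array([[0.5, 0.1], [0.1, 0.3]])
b = gaussian_belief_to_vector(mu, Sigma)
assert b.shape == (6,)   # n_b = n + n^2 = 2 + 4
m2, S2 = vector_to_gaussian_belief(b, 2)
assert np.allclose(m2, mu) and np.allclose(S2, Sigma)
```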
Dynamics Model
The physical state p is described by the following ordinary differential equation (ODE):
$$\dot{p}(t) = f(p(t), u(t)), \tag{3.1}$$
where u(t) ∈ Rm is the control signal. On the other hand, suppose that the belief state only changes
in discrete time upon arrival of a new observation from the sensors. In the case of target tracking,
for example, this means that the change of the location of the target is described by a discrete
time model from the robot’s perspective. The behavior of such a system can be estimated by a
discrete-time Bayesian filter. We will discuss the more general continuous-discrete time filtering
case in Section 3.2.2. Let tk be the time when the k-th observation becomes available to the robot.
The belief state transition is given by
$$b(t_k) = g\left(p(t_k^-), b(t_k^-), y_k\right), \qquad b(t) = b(t_k) \quad \forall t \in [t_k, t_{k+1}), \tag{3.2}$$
where $t_k^-$ is infinitesimally smaller than $t_k$. The nonlinear function g corresponds to a discrete-time Bayesian filter (e.g., a Kalman filter, EKF, UKF, etc.²) that forward-propagates the belief for prediction, takes the new observation $y_k \in \mathbb{R}^q$, and returns the updated belief state. The concrete choice of the filter depends on the instance of the problem. Note that the belief state stays constant for most of the time in (3.2). This is because we are viewing a discrete-time model in a continuous-time framework.
Equations (3.1) and (3.2) constitute a hybrid system with time-driven switching [61]. This hybrid
system representation is practical since it captures the fact that the observation updates occur less
frequently than the control actuation in general, due to expensive information processing of sensor
readings. Furthermore, with this representation one can naturally handle agile robot dynamics as
they are without coarse discretization in time.
Given the initial state $s_0 \triangleq (p(t_0)^\mathsf{T}, b(t_0)^\mathsf{T})^\mathsf{T}$ and some control u(t) for $t \in [t_0, t_f]$, the system evolves stochastically according to the hybrid dynamics equations. The stochasticity is due to a sequence of stochastic future observations that will be taken by the end of the planning horizon $t_f$. In this chapter we assume that the observation interval $t_{k+1} - t_k \triangleq \Delta t_o$ is fixed, and the control signals are recomputed when a new observation is incorporated in the belief, although one can also use a variable observation interval.
² We do not assume the particle filter here, which is non-parametric, since we later require g to be differentiable, whereas the re-sampling process [154] in the particle filter is non-differentiable. Note, however, that a recent work [76] makes it differentiable by replacing the re-sampling process with a soft approximation. Consideration of such differentiable approximations in SACBP is an interesting future research direction.
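The hybrid forward simulation implied by (3.1) and (3.2) can be sketched as follows (a minimal illustration with Euler integration; the dynamics, filter, and observation models below are hypothetical placeholders, not the systems studied later in this chapter):

```python
import numpy as np

def simulate_hybrid(p0, b0, u_sched, f, g, sample_obs, t0, tf, dt_c, dt_o):
    # Physical state p evolves continuously (Euler steps of size dt_c);
    # the belief b jumps only at observation times spaced dt_o apart, per (3.2).
    p, b, t = p0.copy(), b0.copy(), t0
    next_obs = t0 + dt_o
    while t < tf - 1e-12:
        p = p + dt_c * f(p, u_sched(t))      # continuous physical dynamics (3.1)
        t += dt_c
        if t >= next_obs - 1e-12:            # discrete Bayesian update (3.2)
            b = g(p, b, sample_obs(p, b))
            next_obs += dt_o
    return p, b

# Toy single-integrator robot with a scalar Gaussian belief (mean, variance)
# over a static quantity; the filter and models are illustrative placeholders.
f = lambda p, u: u
g = lambda p, b, y: np.array([b[0] + 0.5 * (y - b[0]), 0.5 * b[1]])
sample_obs = lambda p, b: b[0]               # noiseless observation, for the sketch
p, b = simulate_hybrid(np.zeros(1), np.array([1.0, 4.0]),
                       lambda t: np.ones(1), f, g, sample_obs,
                       t0=0.0, tf=1.0, dt_c=0.01, dt_o=0.25)
print(p[0], b)   # p near 1.0 after 1 s of unit velocity; variance shrinks at each update
```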
Perturbed Dynamics
The control synthesis of SACBP begins with a given nominal control schedule u(t) for t ∈ [t0, tf ]. For
simplicity we assume here that the nominal control schedule is open-loop, but we remind the reader
that SACBP can also handle closed-loop nominal policies, which we discuss in Section 3.2.3. Suppose
that the nominal control is applied to the system and a sequence of T observations (y1, . . . , yT ) is
obtained. Conditioned on the observation sequence, the augmented state evolves deterministically.
Let s = (pT, bT)T be the nominal trajectory of the augmented state induced by (y1, . . . , yT ).
Now let us consider perturbing the nominal trajectory at a fixed time τ < t1 for a short duration
ε. The perturbed control $u^\varepsilon$ is defined as
$$u^\varepsilon(t) \triangleq \begin{cases} v & \text{if } t \in (\tau - \varepsilon, \tau] \\ u(t) & \text{otherwise.} \end{cases} \tag{3.3}$$
Therefore, the control perturbation is determined by the nominal control u(t), the tuple (τ, v), and ε. Given (τ, v), the resulting perturbed system trajectory can be written as
$$p^\varepsilon(t) \triangleq p(t) + \varepsilon\Psi_p(t) + o(\varepsilon), \qquad b^\varepsilon(t) \triangleq b(t) + \varepsilon\Psi_b(t) + o(\varepsilon), \tag{3.4}$$
where $\Psi_p(t)$ and $\Psi_b(t)$ are the state variations that are linear in the perturbation duration ε:
$$\Psi_p(t) = \frac{\partial^+}{\partial\varepsilon} p^\varepsilon(t)\bigg|_{\varepsilon=0} \triangleq \lim_{\varepsilon\to 0^+} \frac{p^\varepsilon(t) - p(t)}{\varepsilon} \tag{3.5}$$
$$\Psi_b(t) = \frac{\partial^+}{\partial\varepsilon} b^\varepsilon(t)\bigg|_{\varepsilon=0} \triangleq \lim_{\varepsilon\to 0^+} \frac{b^\varepsilon(t) - b(t)}{\varepsilon}. \tag{3.6}$$
The notation $\frac{\partial^+}{\partial\varepsilon}$ represents the right derivative with respect to ε. The state variations at perturbation time τ satisfy
$$\Psi_p(\tau) = f(p(\tau), v) - f(p(\tau), u(\tau)), \qquad \Psi_b(\tau) = 0. \tag{3.7}$$
The initial belief state variation Ψb(τ) is zero because the control perturbation uε has no effect
on the belief state until the first Bayesian update is performed at time t1, according to the hybrid
system model (3.1)(3.2). For t ≥ τ, the physical state variation $\Psi_p$ evolves according to the following first-order ODE:
$$\dot{\Psi}_p(t) = \frac{d}{dt}\left(\frac{\partial^+}{\partial\varepsilon} p^\varepsilon(t)\bigg|_{\varepsilon=0}\right) \tag{3.8}$$
$$= \frac{\partial^+}{\partial\varepsilon} \dot{p}^\varepsilon(t)\bigg|_{\varepsilon=0} \tag{3.9}$$
$$= \frac{\partial^+}{\partial\varepsilon} f(p^\varepsilon(t), u(t))\bigg|_{\varepsilon=0} \tag{3.10}$$
$$= \frac{\partial}{\partial p} f(p(t), u(t))\,\Psi_p(t), \tag{3.11}$$
where the chain rule of differentiation and $p^\varepsilon(t)|_{\varepsilon=0} = p(t)$ are used in (3.11). For a more rigorous analysis, see Appendix A. The dynamics of the belief state variation $\Psi_b$ in the continuous region $t \in [t_k, t_{k+1})$ satisfy $\dot{\Psi}_b(t) = 0$, since the belief vector b(t) is constant according to (3.2). However, across the jumps the belief state variation $\Psi_b$ changes discontinuously and satisfies
$$\Psi_b(t_k) = \frac{\partial^+}{\partial\varepsilon} b^\varepsilon(t_k)\bigg|_{\varepsilon=0} \tag{3.12}$$
$$= \frac{\partial^+}{\partial\varepsilon} g\left(p^\varepsilon(t_k^-), b^\varepsilon(t_k^-), y_k\right)\bigg|_{\varepsilon=0} \tag{3.13}$$
$$= \frac{\partial}{\partial p} g\left(p(t_k^-), b(t_k^-), y_k\right)\Psi_p(t_k^-) + \frac{\partial}{\partial b} g\left(p(t_k^-), b(t_k^-), y_k\right)\Psi_b(t_k^-). \tag{3.14}$$
Perturbed Cost Functional
Let us consider a total cost of the form
$$\int_{t_0}^{t_f} c(p(t), b(t), u(t))\, dt + h(p(t_f), b(t_f)), \tag{3.15}$$
where c is the running cost and h is the terminal cost. Following the discussion above on the
perturbed dynamics, let J denote the total cost of the nominal trajectory conditioned on the given
observation sequence (y1, . . . , yT ). Under the fixed (τ, v), we can represent the perturbed cost Jε in
terms of J as
$$J^\varepsilon \triangleq J + \varepsilon\nu(t_f) + o(\varepsilon), \tag{3.16}$$
where $\nu(t_f) \triangleq \frac{\partial^+}{\partial\varepsilon} J^\varepsilon\big|_{\varepsilon=0}$ is the variation of the total cost with respect to the perturbation. For further analysis it is convenient to express the running cost in the Mayer form [92]. Let $\bar{s}(t)$ be a new state variable defined by $\dot{\bar{s}}(t) = c(p(t), b(t), u(t))$ and $\bar{s}(t_0) = 0$. Then the total cost is a function of the appended augmented state $\tilde{s} \triangleq (\bar{s}, s^\mathsf{T})^\mathsf{T} \in \mathbb{R}^{1+n_s}$ at time $t_f$, which is given by
$$J = \bar{s}(t_f) + h(s(t_f)). \tag{3.17}$$
Using this form of the total cost J, the perturbed cost (3.16) becomes
$$J^\varepsilon = J + \varepsilon \begin{pmatrix} 1 \\ \frac{\partial}{\partial p} h(p(t_f), b(t_f)) \\ \frac{\partial}{\partial b} h(p(t_f), b(t_f)) \end{pmatrix}^{\!\mathsf{T}} \tilde{\Psi}(t_f) + o(\varepsilon), \tag{3.18}$$
where $\tilde{\Psi}(t_f) \triangleq \left(\bar{\Psi}(t_f), \Psi_p(t_f)^\mathsf{T}, \Psi_b(t_f)^\mathsf{T}\right)^\mathsf{T}$ and $\bar{\Psi}$ is the variation of $\bar{s}$. Note that the dot product in (3.18) corresponds to $\nu(t_f)$ in (3.16). The variation $\bar{\Psi}$ follows the variational equation for t ≥ τ:
$$\dot{\bar{\Psi}}(t) = \frac{d}{dt}\left(\frac{\partial^+}{\partial\varepsilon} \bar{s}^\varepsilon(t)\bigg|_{\varepsilon=0}\right) \tag{3.19}$$
$$= \frac{\partial^+}{\partial\varepsilon} \dot{\bar{s}}^\varepsilon(t)\bigg|_{\varepsilon=0} \tag{3.20}$$
$$= \frac{\partial^+}{\partial\varepsilon} c(p^\varepsilon(t), b^\varepsilon(t), u(t))\bigg|_{\varepsilon=0} \tag{3.21}$$
$$= \frac{\partial}{\partial p} c(p(t), b(t), u(t))^\mathsf{T}\Psi_p(t) + \frac{\partial}{\partial b} c(p(t), b(t), u(t))^\mathsf{T}\Psi_b(t), \tag{3.22}$$
where the initial condition is given by $\bar{\Psi}(\tau) = c(p(\tau), b(\tau), v) - c(p(\tau), b(\tau), u(\tau))$.
The perturbed cost equation (3.18), especially the dot product expressing ν(tf ), is consequential;
it tells us how the total cost changes due to the perturbation applied at some time τ , up to the first
order with respect to the perturbation duration ε. At this point, one could compute the value of
ν(tf ) for a control perturbation with a specific value of (τ, v) by simulating the nominal dynamics
and integrating the variational equations (3.11)(3.14)(3.22) from τ up to tf .
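As a sanity check of this forward procedure, the following toy sketch integrates the variational equations for a scalar deterministic system and compares the predicted first-order cost change against a finite-difference estimate (the system, cost, and step sizes are illustrative, and the belief component is omitted for brevity):

```python
import numpy as np

# Toy 1-D deterministic system p' = -p + u with running cost c = 0.5 p^2,
# terminal cost h = 0, and the belief component omitted for brevity.
f = lambda p, u: -p + u
dt, T, v = 1e-3, 1000, 1.0           # dt also serves as the duration eps
k_tau = 200                          # perturbation step (tau = 0.2)

def total_cost(perturb):
    p, J = 1.0, 0.0
    for k in range(T):
        u = v if (perturb and k == k_tau) else 0.0
        J += dt * 0.5 * p**2         # accumulate running cost
        p += dt * f(p, u)            # Euler step of (3.1)
    return J

# Forward-integrate the variational equations (3.7), (3.11), (3.22) from tau:
p = 1.0
for k in range(k_tau):               # nominal rollout up to tau
    p += dt * f(p, 0.0)
Psi_p = f(p, v) - f(p, 0.0)          # initial condition (3.7)
Psi_bar = 0.0                        # running-cost variation (c has no u-term here)
for k in range(k_tau, T):
    Psi_bar += dt * p * Psi_p        # (3.22): dot(Psi_bar) = (dc/dp) Psi_p
    Psi_p += dt * (-1.0) * Psi_p     # (3.11): dot(Psi_p) = (df/dp) Psi_p
    p += dt * f(p, 0.0)
nu = Psi_bar                         # h = 0, so nu(tf) = Psi_bar(tf)

fd = (total_cost(True) - total_cost(False)) / dt   # finite-difference check
print(abs(fd - nu) < 1e-2)           # first-order prediction matches
```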
Adjoint Equations
Unfortunately, this forward integration of ν(tf ) is not so useful by itself since we are interested
in finding the value of (τ, v) that achieves the largest possible negative ν(tf ), if it exists; it would
be computationally intensive to apply control perturbation at different application times τ with
different values of v and re-simulate the state variation Ψ. To avoid this computationally expensive
search, we mirror the approach presented in [6] and introduce the adjoint system $\tilde{\rho} \triangleq (\bar{\rho}, \rho_p^\mathsf{T}, \rho_b^\mathsf{T})^\mathsf{T}$, with which the dot product remains invariant:
$$\frac{d}{dt}\left(\tilde{\rho}(t)^\mathsf{T}\tilde{\Psi}(t)\right) = 0 \quad \forall t \in [t_0, t_f]. \tag{3.23}$$
If we let
$$\tilde{\rho}(t_f) \triangleq \left(1,\; \frac{\partial}{\partial p} h(p(t_f), b(t_f))^\mathsf{T},\; \frac{\partial}{\partial b} h(p(t_f), b(t_f))^\mathsf{T}\right)^\mathsf{T} \tag{3.24}$$
so that its dot product with $\tilde{\Psi}(t_f)$ equals $\nu(t_f)$ as in (3.18), the time invariance gives
$$\nu(t_f) = \tilde{\rho}(t_f)^\mathsf{T}\tilde{\Psi}(t_f) \tag{3.25}$$
$$= \tilde{\rho}(\tau)^\mathsf{T}\tilde{\Psi}(\tau) \tag{3.26}$$
$$= \tilde{\rho}(\tau)^\mathsf{T} \begin{pmatrix} c(p(\tau), b(\tau), v) - c(p(\tau), b(\tau), u(\tau)) \\ f(p(\tau), v) - f(p(\tau), u(\tau)) \\ 0 \end{pmatrix}. \tag{3.27}$$
Therefore, we can compute the first-order cost change $\nu(t_f)$ for different values of τ once the adjoint trajectory is derived. For $t \in [t_k, t_{k+1})$ the time derivative of $\tilde{\Psi}$ exists, and the invariance property leads to the following equation:
$$\dot{\tilde{\rho}}(t)^\mathsf{T}\tilde{\Psi}(t) + \tilde{\rho}(t)^\mathsf{T}\dot{\tilde{\Psi}}(t) = 0. \tag{3.28}$$
It can be verified that the following system satisfies (3.28) with $\tilde{\rho}(t) = \left(\bar{\rho}(t), \rho_p(t)^\mathsf{T}, \rho_b(t)^\mathsf{T}\right)^\mathsf{T}$:
$$\begin{aligned} \dot{\bar{\rho}}(t) &= 0 \\ \dot{\rho}_p(t) &= -\frac{\partial}{\partial p} c(p(t), b(t), u(t)) - \frac{\partial}{\partial p} f(p(t), u(t))^\mathsf{T}\rho_p(t) \\ \dot{\rho}_b(t) &= -\frac{\partial}{\partial b} c(p(t), b(t), u(t)). \end{aligned} \tag{3.29}$$
Analogously, across discrete jumps we can still enforce the invariance by setting $\tilde{\rho}(t_k)^\mathsf{T}\tilde{\Psi}(t_k) = \tilde{\rho}(t_k^-)^\mathsf{T}\tilde{\Psi}(t_k^-)$, which holds for the following adjoint equations:
$$\begin{aligned} \bar{\rho}(t_k^-) &= \bar{\rho}(t_k) \\ \rho_p(t_k^-) &= \rho_p(t_k) + \frac{\partial}{\partial p} g\left(p(t_k^-), b(t_k^-), y_k\right)^\mathsf{T}\rho_b(t_k) \\ \rho_b(t_k^-) &= \frac{\partial}{\partial b} g\left(p(t_k^-), b(t_k^-), y_k\right)^\mathsf{T}\rho_b(t_k). \end{aligned} \tag{3.30}$$
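The backward pass of (3.29) can be illustrated on a self-contained scalar toy problem (system ṗ = −p + u, running cost ½p², terminal cost h = 0; the jump equations (3.30) vanish here because the toy has no belief state, and all models are illustrative):

```python
import numpy as np

# Toy scalar system p' = -p + u, running cost c = 0.5 p^2, terminal cost h = 0.
dt, T, k_tau, v = 1e-3, 1000, 200, 1.0   # tau = 0.2, tf = 1.0

# Nominal rollout under u = 0, storing the trajectory.
p_traj = np.empty(T + 1)
p_traj[0] = 1.0
for k in range(T):
    p_traj[k + 1] = p_traj[k] + dt * (-p_traj[k])

# Backward adjoint pass of (3.29): rho_p' = -dc/dp - (df/dp)^T rho_p
# = -p + rho_p, with boundary condition rho_p(tf) = dh/dp = 0.
rho_p = 0.0
for k in range(T, k_tau, -1):
    rho_p -= dt * (-p_traj[k] + rho_p)   # Euler step backward in time

# First-order cost change (3.27) for a perturbation (tau, v):
p_tau = p_traj[k_tau]
nu = rho_p * ((-p_tau + v) - (-p_tau))   # rho_p(tau) * (f(p,v) - f(p,u))
print(nu)   # close to the analytic 0.5 * (exp(-0.2) - exp(-1.8))
```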
Note that the adjoint system integrates backward in time, as it has the boundary condition (3.24) defined at $t_f$. More importantly, the adjoint dynamics (3.29)(3.30) only depend on the nominal trajectory of the system (p, b) and the observation sequence $(y_1, \ldots, y_T)$. Considering that $\bar{\rho}(t) = 1$ at all times, the cost variation term $\nu(t_f)$ is finally given by
$$\nu(t_f) = c(p(\tau), b(\tau), v) - c(p(\tau), b(\tau), u(\tau)) + \rho_p(\tau)^\mathsf{T}\left(f(p(\tau), v) - f(p(\tau), u(\tau))\right). \tag{3.31}$$
In order to efficiently optimize (3.31) with respect to (τ, v), we assume hereafter that the control cost is additive quadratic $\frac{1}{2}u^\mathsf{T} C_u u$ and the dynamics model f(p, u) is control-affine with linear term
H(p)u, where $H : \mathbb{R}^{n_p} \to \mathbb{R}^{n_p \times m}$ can be any nonlinear mapping. Although the control-affine assumption may appear restrictive, many physical systems possess this property in engineering practice. As a
result of these assumptions, (3.31) becomes
$$\nu(t_f) = \frac{1}{2}v^\mathsf{T} C_u v + \rho_p(\tau)^\mathsf{T} H(p(\tau))\left(v - u(\tau)\right) - \frac{1}{2}u(\tau)^\mathsf{T} C_u u(\tau). \tag{3.32}$$
So far we have treated the observation sequence (y1, . . . , yT ) as given and fixed. However, in
practice it is a random process that we have to take into account. Fortunately, our control opti-
mization is all based on the nominal control schedule u(t), with which we can both simulate the
augmented dynamics and sample the observations. To see this, let us consider the observations as
a sequence of random vectors (Y1, . . . , YT ) and rewrite ν(tf ) in (3.32) as ν(tf , Y1, . . . , YT ) to clarify
the dependence on it. The expected value of the first order cost variation is given by
$$\mathbb{E}[\nu(t_f)] = \int \nu(t_f, Y_1, \ldots, Y_T)\, dP, \tag{3.33}$$
where P is the probability measure associated with these random vectors. Although we do not know P explicitly, we have the generative model: we can simulate the augmented state trajectory using the nominal control and sequentially sample the stochastic observations from the belief states along the trajectory.
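One way to realize this generative sampling, sketched for a scalar Gaussian belief over a static quantity with a hypothetical linear-Gaussian measurement model (all models and names here are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def kalman_update(mu, var, y, r=0.1):
    # Standard scalar Kalman measurement update for y = x + noise, noise var r.
    k = var / (var + r)
    return mu + k * (y - mu), (1 - k) * var

def sample_obs_sequence(mu0, var0, T, r=0.1):
    # Sample one observation sequence (y_1, ..., y_T) along the nominal belief:
    # hypothesize a state from the current belief, then a measurement given it,
    # and propagate the nominal belief with that sampled measurement.
    mu, var = mu0, var0
    ys = []
    for _ in range(T):
        x = rng.normal(mu, np.sqrt(var))
        y = rng.normal(x, np.sqrt(r))
        ys.append(y)
        mu, var = kalman_update(mu, var, y, r)
    return ys, (mu, var)

ys, (mu, var) = sample_obs_sequence(0.0, 1.0, T=5)
print(len(ys), var)   # 5 sampled observations; the nominal variance shrinks
```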
Using the linearity of expectation for (3.32), we have
$$\mathbb{E}[\nu(t_f)] = \frac{1}{2}v^\mathsf{T} C_u v + \mathbb{E}[\rho_p(\tau)]^\mathsf{T} H(p(\tau))\left(v - u(\tau)\right) - \frac{1}{2}u(\tau)^\mathsf{T} C_u u(\tau). \tag{3.34}$$
Notice that only the adjoint trajectory is stochastic. We can employ Monte Carlo sampling to
sample a sufficient number of observation sequences to approximate the expected adjoint trajectory.
Now (3.34) becomes a convex quadratic in v for a positive definite Cu. Assuming that Cu is also
diagonal, analytical solutions are available to the following convex optimization problem with an
input saturation constraint:
$$\begin{aligned} \underset{v}{\text{minimize}} \quad & \mathbb{E}[\nu(t_f)] \\ \text{subject to} \quad & a \preceq v \preceq b, \end{aligned} \tag{3.35}$$
where $a, b \in \mathbb{R}^m$ are some saturation vectors and $\preceq$ is an elementwise inequality. This optimization
is solved for different values of τ ∈ (t0 + tcalc + ε, t0 + tcalc + ∆to), where tcalc is a pre-allocated
computation time budget and ∆to is the time interval between two successive observations as well
as control updates. We then search over (τ, v∗(τ)) for the optimal perturbation time τ∗ to globally
minimize E[ν(tf )]. There is only a finite number of such τ to consider since in practice we use
numerical integration such as the Euler scheme with some step size ∆tc to compute the trajectories.
In [6] the finite perturbation duration ε is also optimized using line search, but in this work we set ε
as a tunable parameter to reduce the computation time. The complete algorithm is summarized in
CHAPTER 3. FAST APPROXIMATE BELIEF SPACE PLANNING 40
Algorithm 3.1 SACBP Control Update for Problems with Mixed Observability
INPUT: Current augmented state s0 = (p(t0)T, b(t0)T)T, nominal control schedule u(t) for t ∈ [t0, tf ], perturbation duration ε
OUTPUT: Optimally perturbed control schedule uε(t) for t ∈ [t0, tf ]
1: for i = 1:N do
2:   Forward-simulate nominal augmented state trajectory (3.1)(3.2) and sample observation sequence (yi1, . . . , yiT) along the augmented state trajectory.
where ρ is the adjoint system that follows the dynamics:
ρ(tk−) = (∂/∂b) g(b(tk−), yk)T ρ(tk)
ρ̇(t) = −(∂/∂b) c(b(t), u(t)) − (∂/∂b) f(b(t), u(t))T ρ(t) (3.41)
with the boundary condition ρ(tf ) = (∂/∂b) h(b(tf )). Under the control-affine assumption for f and the
additive quadratic control cost, the expected first order cost variation (3.40) yields
E[ν(tf )] = (1/2)vTCuv + E[ρ(τ)]TH(b(τ))(v − u(τ)) − (1/2)u(τ)TCuu(τ), (3.42)
where H(b(τ)) is the control coefficient term in f .
Although it is difficult to state general conditions under which this control-affine assumption holds, one can verify, for instance, that the continuous-discrete EKF [177] satisfies this property if
the underlying system dynamics fsys is control-affine:
µ̇(t) = fsys(µ(t), u(t))
Σ̇(t) = AΣ + ΣAT + Q (3.43)
In the above continuous-time prediction equations, A is the Jacobian of the dynamics function
fsys(x(t), u(t)) evaluated at the mean µ(t), and Q is the process noise covariance. If fsys is control-affine, so is A, and therefore so is Σ̇; clearly µ̇ is control-affine as well. As a result, the dynamics for the belief vector b = (µT, vec(Σ)T)T satisfy the control-affine assumption.
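To make this concrete, the sketch below instantiates the prediction equations (3.43) for an illustrative scalar control-affine system (fsys(x, u) = −x³ + xu, so H(x) = x, is an assumption for the example, not taken from the dissertation) and checks numerically that the resulting belief dynamics are affine in u:

```python
import numpy as np

def belief_dot(b, u, Q=0.1):
    """Continuous-time EKF prediction (3.43) for the scalar control-affine
    system x' = -x**3 + x*u; the belief vector is b = (mu, Sigma)."""
    mu, Sigma = b
    A = -3.0 * mu**2 + u                 # Jacobian of f_sys at the mean; affine in u
    return np.array([-mu**3 + mu * u,    # mu-dot
                     2.0 * A * Sigma + Q])  # Sigma-dot (scalar case of A*S + S*A^T + Q)

# Affinity in u: the midpoint control gives the midpoint belief derivative.
b = np.array([0.7, 2.0])
u1, u2 = -1.3, 0.8
mid = belief_dot(b, 0.5 * (u1 + u2))
avg = 0.5 * (belief_dot(b, u1) + belief_dot(b, u2))
assert np.allclose(mid, avg)
```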
Mirroring the approach in Section 3.2.1, we can use Monte Carlo sampling to estimate the
expected value in (3.42). The resulting algorithm is presented in Algorithm 3.2.
3.2.3 Closed-loop Nominal Policy
In Sections 3.2.1 and 3.2.2 we assumed that the nominal control was an open-loop control schedule.
However, one can think of a scenario where a nominal control is a closed-loop policy computed
offline, such as a discrete POMDP policy that maps beliefs to actions [84]. Indeed, SACBP can also
handle closed-loop nominal policies. Let π be a closed-loop nominal policy, which is a mapping from
Algorithm 3.2 SACBP Control Update for General Belief Space Planning Problems
INPUT: Current belief state b0 = b(t0), nominal control schedule u(t) for t ∈ [t0, tf ], perturbation duration ε
OUTPUT: Optimally perturbed control schedule uε(t) for t ∈ [t0, tf ].
1: for i = 1:N do
2:   Forward-simulate nominal belief state trajectory (3.36) and sample observation sequence (yi1, . . . , yiT) along the belief trajectory.
3:   Backward-simulate nominal adjoint trajectory ρi (3.41) with sampled observations.
4: end for
5: Compute Monte Carlo estimate: E[ρ] ≈ (1/N) ∑_{i=1}^{N} ρi.
6: for (τ = t0 + tcalc + ε; τ ≤ t0 + tcalc + ∆to; τ ← τ + ∆tc) do
7:   Solve quadratic minimization (3.35) with (3.42). Store optimal value ν∗(τ) and optimizer v∗(τ).
(3.2d) for function g, there exist finite non-negative constants K2,K3,K4,K5 and positive integers
L1, L2 such that ∀x ∈ Rnx and ∀y ∈ Rny , the following relations hold:
‖g(x, y)‖2 ≤ K2 + K3‖x‖2^L1 + K4‖y‖2^L2 + K5‖x‖2^L1 ‖y‖2^L2 (3.50)
‖(∂/∂x) g(x, y)‖2 ≤ K2 + K3‖x‖2^L1 + K4‖y‖2^L2 + K5‖x‖2^L1 ‖y‖2^L2 (3.51)
Remark 3.2. Assumptions (3.2a) and (3.2c) are used to show existence and uniqueness of the solu-
tion to the differential equation (3.48) as well as the variational equation under control perturbation.
(See Propositions A.1 and A.14 in Appendix A.) Note that Assumption (3.2c) is essentially a Lips-
chitz continuity condition, which is a quite common assumption in the analysis of nonlinear ODEs,
as can be seen in Assumption 5.6.2 in [43] and Theorem 2.3 in [77]. Assumptions (3.2b) and (3.2d)
are the growth conditions on x across adjacent modes. Recall that in belief space planning where
the system state x is the belief state b, the jump function g corresponds to the observation update of
the Bayesian filter. The form of the bound in (3.50) and (3.51) allows a broad class of continuous
functions to be considered as g, and is inspired by a few examples of the Bayesian update equations
as presented below.
Proposition 3.1 (Bounded Jump for Univariate Gaussian Distribution). Let b = (µ, s)T ∈ R2
be the belief state, where µ is the mean parameter and s > 0 is the variance. Suppose that the
observation y is the underlying state x ∈ R corrupted by additive Gaussian white noise v ∼ N (0, 1).
Then, the Bayesian update function g for this belief system satisfies Assumption (3.2d).
Proof. The Bayesian update formula for this system is given by g(b, y) = b̄ ≜ (µ̄, s̄)T, where
µ̄ = µ + (s/(s + 1))(y − µ) (3.52)
s̄ = s − s^2/(s + 1) (3.53)
is the update step of the Kalman filter. Rearranging the terms, we have
g(b, y) = (1/(s + 1)) (µ + sy, s)T (3.54)
and consequently,
(∂/∂b) g(b, y) = (1/(s + 1)^2) [[s + 1, y], [0, 1]]. (3.55)
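The algebraic rearrangement above can be sanity-checked numerically. The following sketch (illustrative, not part of the dissertation's implementation) confirms that the compact form (3.54) reproduces the Kalman update (3.52)–(3.53) over random beliefs and observations:

```python
import numpy as np

def g_kalman(b, y):
    """Univariate Kalman update (3.52)-(3.53); b = (mu, s) with s > 0."""
    mu, s = b
    return np.array([mu + s / (s + 1.0) * (y - mu), s - s**2 / (s + 1.0)])

def g_rearranged(b, y):
    """The rearranged form (3.54): g(b, y) = (mu + s*y, s) / (s + 1)."""
    mu, s = b
    return np.array([mu + s * y, s]) / (s + 1.0)

rng = np.random.default_rng(0)
for _ in range(100):
    mu, s, y = rng.normal(), rng.uniform(0.01, 5.0), rng.normal()
    assert np.allclose(g_kalman((mu, s), y), g_rearranged((mu, s), y))
```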
We will show that the function g satisfies Assumption (3.2d). For the bound on g(b, y),
‖g(b, y)‖2^2 = (1/(s + 1)^2) ((µ + sy)^2 + s^2) (3.56)
≤ (µ + sy)^2 + s^2 (3.57)
≤ ‖b‖2^2 + (bT(1, y)T)^2 (3.58)
≤ ‖b‖2^2 (2 + ‖y‖2^2), (3.59)
where we have used (s+ 1)2 ≥ 1 and the Cauchy-Schwarz inequality. Thus,
‖g(b, y)‖2 ≤ ‖b‖2 √(2 + ‖y‖2^2) (3.60)
≤ √2 ‖b‖2 + ‖b‖2‖y‖2. (3.61)
Similarly, the bound on the Jacobian yields
‖(∂/∂b) g(b, y)‖2^2 ≤ ‖(∂/∂b) g(b, y)‖F^2 (3.62)
= (1/(s + 1)^4) ((s + 1)^2 + y^2 + 1) (3.63)
= 1/(s + 1)^2 + (1/(s + 1)^4)(y^2 + 1) (3.64)
≤ 2 + ‖y‖2^2. (3.65)
Therefore,
‖(∂/∂b) g(b, y)‖2 ≤ √(2 + ‖y‖2^2) ≤ √2 + ‖y‖2. (3.66)
This shows that the jump function g for the above univariate Gaussian model satisfies Assumption (3.2d) with (K2, K3, K4, K5) = (√2, √2, 1, 1) and (L1, L2) = (1, 1).
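As a quick empirical check of the final bound (3.61), the following sketch (illustrative, not from the dissertation) samples random beliefs and observations, applies the posterior update (3.54), and verifies ‖g(b, y)‖2 ≤ √2‖b‖2 + ‖b‖2‖y‖2:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(10_000):
    mu, s, y = rng.normal(scale=10), rng.uniform(1e-3, 50.0), rng.normal(scale=10)
    b = np.array([mu, s])
    g = np.array([mu + s * y, s]) / (s + 1.0)   # posterior belief, eq. (3.54)
    bound = np.sqrt(2) * np.linalg.norm(b) + np.linalg.norm(b) * abs(y)
    assert np.linalg.norm(g) <= bound + 1e-9    # bound (3.61) holds
```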
Proposition 3.2 (Bounded Jump for Categorical Distribution). Let b = (b1, . . . , bn)T ∈ Rn be the n-dimensional belief state representing the categorical distribution over the underlying state x ∈ {1, . . . , n}. We choose the unnormalized form where the probability of x = i is given by bi/∑_{j=1}^{n} bj. Let the observation y ∈ {1, . . . , m} be modeled by a conditional probability mass function p(y | x) ∈ [0, 1]. Then, the Bayesian update function g for this belief system satisfies Assumption (3.2d).
Proof. The Bayes rule gives g(b, y) = b̄ ≜ (b̄1, . . . , b̄n)T, where
(b̄1, b̄2, . . . , b̄n)T = (p(y | 1)b1, p(y | 2)b2, . . . , p(y | n)bn)T. (3.67)
Therefore, we can easily bound the norm of the posterior belief b̄ by
‖g(b, y)‖2 = ‖b̄‖2 ≤ ‖b‖2, (3.68)
as p(y | x) ≤ 1. The Jacobian is simply the diagonal matrix diag(p(y | 1), . . . , p(y | n)), and hence
‖(∂/∂b) g(b, y)‖2 ≤ 1. (3.69)
This shows that the jump function g for the categorical belief model above satisfies Assumption (3.2d) with (K2, K3, K4, K5) = (1, 1, 0, 0) and (L1, L2) = (1, 1).
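The categorical bound is just as easy to check empirically. This illustrative sketch draws random unnormalized beliefs and likelihood tables and verifies the norm contraction (3.68) and the unit Jacobian bound (3.69):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4
for _ in range(1000):
    b = rng.uniform(0.0, 10.0, size=n)    # unnormalized categorical belief
    lik = rng.uniform(size=(m, n))        # p(y | x), each entry in [0, 1]
    y = rng.integers(m)
    b_post = lik[y] * b                   # elementwise Bayes update, eq. (3.67)
    assert np.linalg.norm(b_post) <= np.linalg.norm(b) + 1e-12
    # Jacobian is diag(p(y|1), ..., p(y|n)); its spectral norm is max_i p(y|i) <= 1.
    assert np.max(lik[y]) <= 1.0
```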
Assumption 3.3 (Cost Model). The instantaneous cost c : Rnx ×Rm → R is continuous. It is also
continuously differentiable in x. The terminal cost h : Rnx → R is differentiable. Furthermore, we
assume that there exist finite non-negative constants K6,K7 and a positive integer L3 such that for
all x ∈ Rnx and u ∈ B(0, ρmax), the following relations hold:
|c(x, u)| ≤ K6 + K7‖x‖2^L3 (3.70)
‖(∂/∂x) c(x, u)‖2 ≤ K6 + K7‖x‖2^L3 (3.71)
|h(x)| ≤ K6 + K7‖x‖2^L3 (3.72)
‖(∂/∂x) h(x)‖2 ≤ K6 + K7‖x‖2^L3. (3.73)
Remark 3.3. Assumption 3.3 is to guarantee that the cost function is integrable with respect to
stochastic observations, which are introduced in Assumption 3.4. Note that even though the above
bound is not general enough to apply to all analytic functions, it does include all finite order poly-
nomials of ‖x(t)‖2 and ‖u(t)‖2, for example, since ‖u(t)‖2 is bounded by Assumption 3.1.
Assumption 3.4 (Stochastic Observations). Let (Ω,F ,P) be a probability space. Let (Y1, . . . , YT ) be
a sequence of random vectors in Rny defined on this space, representing the sequence of observations.
Assume that for each Yi all the moments of the ℓ2 norm are finite. That is,
∀i ∈ {1, . . . , T} ∀k ∈ N: E[‖Yi‖2^k] < ∞. (3.74)
Definition 3.1 (Perturbed Control). Let u ∈ U be a control. For τ ∈ (0, 1) and v ∈ B(0, ρmax), define the perturbed control uε by
uε(t) ≜ v if t ∈ (τ − ε, τ], and uε(t) ≜ u(t) otherwise, (3.75)
where ε ∈ [0, τ ]. By definition, if ε = 0 then uε is the same as u. We assume that the nominal control u(t) is left-continuous in t at t = τ.
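Definition (3.75) translates directly into code. A minimal sketch (the nominal schedule below is an arbitrary illustrative choice):

```python
def perturbed_control(u, tau, eps, v):
    """Construct u_eps from (3.75): replace u on (tau - eps, tau] by the constant v."""
    def u_eps(t):
        return v if tau - eps < t <= tau else u(t)
    return u_eps

u_nom = lambda t: 0.5 * t                            # an arbitrary nominal schedule
u_eps = perturbed_control(u_nom, tau=1.0, eps=0.2, v=3.0)
assert u_eps(0.9) == 3.0 and u_eps(1.0) == 3.0       # inside (tau - eps, tau]
assert u_eps(0.79) == u_nom(0.79) and u_eps(1.01) == u_nom(1.01)

# eps = 0 recovers the nominal control everywhere, as noted in Definition 3.1.
u0 = perturbed_control(u_nom, tau=1.0, eps=0.0, v=3.0)
assert all(u0(t) == u_nom(t) for t in (0.0, 0.5, 1.0, 1.5))
```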
3.3.2 Main Results
The main result of the analysis is the following theorem.
Theorem 3.1 (Mode Insertion Gradient). Suppose that Assumptions 3.1 – 3.4 are satisfied. For a
given (τ, v), let uε denote the perturbed control of the form (3.75). The perturbed control uε and the
stochastic observations (Y1, . . . , YT ) result in the stochastic perturbed state trajectory xε. For such
uε and xε, let us define the mode insertion gradient of the expected total cost as
(∂+/∂ε) E[ ∫0^T c(xε(t), uε(t)) dt + h(xε(T)) ] |ε=0 . (3.76)
Then, this right derivative exists and we have
(∂+/∂ε) E[ ∫0^T c(xε(t), uε(t)) dt + h(xε(T)) ] |ε=0
= c(x(τ), v) − c(x(τ), u(τ)) + E[ ∫τ^T (∂/∂x) c(x(t), u(t))T Ψ(t) dt + (∂/∂x) h(x(T))T Ψ(T) ], (3.77)
where Ψ(t) = (∂+/∂ε) xε(t)|ε=0 is the state variation.
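Theorem 3.1 can be sanity-checked numerically on a deterministic special case with no jumps or noise, where the expectation disappears. In the illustrative toy instance below (all numbers are our own, not from the dissertation): ẋ = u, c(x, u) = x² + u², h = 0, and nominal u(t) = u0, so Ψ(t) = v − u0 for t ≥ τ and (3.77) reduces to (v² − u0²) + 2(v − u0)∫τ^T x(t) dt. A finite difference of the total cost in ε should match this value:

```python
import numpy as np

# Toy deterministic instance of Theorem 3.1: x' = u, c(x, u) = x**2 + u**2, h = 0.
x0, u0, v, tau, T = 1.0, 0.2, 1.0, 0.5, 1.0
dt = 1e-5
ts = np.arange(0.0, T, dt)
k_tau = int(round(tau / dt))          # grid index of the perturbation time

def total_cost(eps):
    """Euler-integrate x' = u_eps and the running cost, with u = v on (tau-eps, tau]."""
    u = np.full_like(ts, u0)
    u[k_tau - int(round(eps / dt)) + 1 : k_tau + 1] = v
    x = x0 + np.concatenate(([0.0], np.cumsum(u[:-1]) * dt))
    return np.sum(x**2 + u**2) * dt

# Analytic mode insertion gradient from (3.77) for this instance.
x_nom = x0 + u0 * ts
mig_analytic = (v**2 - u0**2) + 2.0 * (v - u0) * np.sum(x_nom[k_tau:]) * dt

# Finite-difference approximation of the right derivative (3.76).
eps = 1e-3
mig_fd = (total_cost(eps) - total_cost(0.0)) / eps
assert abs(mig_fd - mig_analytic) < 1e-2 * mig_analytic
```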
The proof of the theorem is deferred to Appendix A (see Theorem A.1). One can see that the
mode insertion gradient (3.76) is a natural generalization of the ones discussed in [41, 172, 6] to
stochastic hybrid systems. Furthermore, by comparing (3.77) with (3.39) it is apparent that the
right hand side of (3.77) is mathematically equivalent to E[ν(tf )], the quantity to be optimized with
the SACBP algorithm in Section 3.2.
The fact that SACBP optimizes (3.76) leads to a certain performance guarantee of the algorithm.
In the open-loop nominal control case, the term E[ν(tf )] as in (3.34) or (3.42) becomes 0 if the
control perturbation v is equal to the nominal control u(τ). Therefore, as long as u(τ) is a feasible
solution to (3.35) the optimal value is guaranteed to be less than or equal to zero. Furthermore, in
expectation the actual value of E[ν(tf )] matches the one approximated with samples, since the Monte
Carlo estimate is unbiased. In other words, the perturbation (τ∗, v∗) computed by the algorithm
is expected to result in a non-positive mode insertion gradient. If the mode insertion gradient is
negative, there always exists a sufficiently small ε > 0 such that the expected total cost is decreased
by the control perturbation. In the corner case that the mode insertion gradient is zero, one can
set ε = 0 to not perturb the control at all. Therefore, for an appropriate choice of ε the expected
performance of the SACBP algorithm over the planning horizon is at least as good as that of the
nominal control.
The same discussion holds for the case of closed-loop nominal control policies, when the expression
for E[ν(tf )] is given by (3.44) or (3.45). This is because Theorem 3.1 still holds if the nominal control
u(t) is a closed-loop policy as stated in Remark A.2 (Appendix A). Therefore, the expected worst-
case performance of the algorithm is lower-bounded by that of the nominal policy. This implies that
if a reasonable nominal policy is known, at run-time SACBP is expected to further improve it while
synthesizing continuous-time control inputs efficiently.
3.4 Simulation Results
We evaluated the performance of SACBP in the following two simulation studies: (i) active multi-
target tracking with range-only observations; (ii) object manipulation under model uncertainty.
SACBP as well as other baseline methods were implemented in Julia3, except for T-LQG [123]
whose NLP problems were modeled by CasADi [5] in Python and then solved by Ipopt [164], a
standard NLP solver based on interior-point methods. All the computation was performed on a
desktop computer with Intel® Core™ i7-8750H CPU and 32.1GB RAM. The Monte Carlo sampling
of SACBP was parallelized on multiple cores of the CPU.
3.4.1 Active Multi-Target Tracking with Range-only Observations
This problem focuses on pure information gathering, namely identifying where the moving targets
are in the environment. In doing so, the surveillance robot modeled as a single integrator can only
use relative distance observations. The robot’s position p is fully observable and the transitions are
deterministic. Assuming perfect data association, the observation for target i is di = ||qi− p+ vi||2,
where qi is the true target position and vi is zero-mean Gaussian white noise with state-dependent
covariance R(p, qi) = R0 + ||qi − p||2R1. We use 0.01I2×2 for the nominal noise R0. The range-
dependent noise R1 = 0.001I2×2 degrades the observation quality as the robot gets farther from the
target. The discrete-time UKF is employed for state estimation in tracking 20 independent targets.
The target dynamics are modeled by a 2D Brownian motion with covariance Q = 0.1I2×2. Similar
to [148], an approximated observation covariance R(p, µi) is used in the filter to obtain tractable
3The code for SACBP as well as the baseline methods is publicly available at https://github.com/StanfordMSL/
Figure 3.1: (Left) Simulation environment with 20 targets and a surveillance robot. (Middle) The history of the worst entropy value among the targets averaged over 20 random runs, plotted with the standard deviation. With the budget of 10 Monte Carlo samples, SACBP showed small variance for the performance curve and resulted in the fastest reduction of the worst entropy value compared to every other baseline. (Right) Computation times. Greedy, MCTS-DPW, and SACBP achieved real-time performance, taking less time than the simulated tcalc = 0.15 s.
estimation results, where µi is the most recent mean estimate of qi.
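The observation model above can be sketched in a few lines; the values R0 = 0.01 I and R1 = 0.001 I are those given in the text, while the function name and the target placement are illustrative:

```python
import numpy as np

def sample_range_obs(p, q, rng):
    """Range-only observation d = ||q - p + v||_2 with state-dependent noise
    v ~ N(0, R(p, q)), where R(p, q) = R0 + ||q - p||_2 * R1 (Section 3.4.1)."""
    R = 0.01 * np.eye(2) + 0.001 * np.linalg.norm(q - p) * np.eye(2)
    v = np.linalg.cholesky(R) @ rng.standard_normal(2)
    return np.linalg.norm(q - p + v)

rng = np.random.default_rng(3)
p, q = np.zeros(2), np.array([30.0, 40.0])         # true range is 50 m
d_mean = np.mean([sample_range_obs(p, q, rng) for _ in range(2000)])
assert abs(d_mean - 50.0) < 1.0                    # per-axis noise std is ~0.25 m here
```

Note how the noise covariance, and hence the observation quality, degrades with the robot-target distance, which is what makes the placement of the robot an active decision.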
The SACBP algorithm produces a continuous robot trajectory over 200 s with planning horizon
tf − t0 = 2 s, update interval ∆to = 0.2 s, perturbation duration ε = 0.16 s, and N = 10 Monte
Carlo samples. The Euler scheme is used for integration with ∆tc = 0.01 s. The Jacobians and
the gradients are computed either analytically or using an automatic differentiation tool [129] to
retain both speed and precision. In this simulation tcalc = 0.15 s is assumed no matter how long
the actual control update takes. We use c(p, b, u) = 0.05uTu for the running cost and h(p, b) = ∑_{i=1}^{20} exp(entropy(bi)) for the terminal cost, with the intention of reducing the worst-case uncertainty among the targets. This expression for h(p, b) is equivalent to:
h(p, b) = ∑_{i=1}^{20} √det(2πeΣi), (3.78)
where Σi is the covariance for the i-th target. The nominal control u(t) is constantly zero.
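The equivalence of the two forms follows from the Gaussian differential entropy, entropy(bi) = (1/2) log det(2πeΣi), so exp(entropy(bi)) = √det(2πeΣi). A short numeric check (the covariance below is an arbitrary illustrative choice):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma): 0.5 * log det(2*pi*e*Sigma)."""
    return 0.5 * np.linalg.slogdet(2.0 * np.pi * np.e * Sigma)[1]

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 2))
Sigma = A @ A.T + 0.1 * np.eye(2)                  # a random positive definite covariance
lhs = np.exp(gaussian_entropy(Sigma))
rhs = np.sqrt(np.linalg.det(2.0 * np.pi * np.e * Sigma))
assert np.isclose(lhs, rhs)                        # exp(entropy) == sqrt(det(2*pi*e*Sigma))
```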
We compared SACBP against four baseline methods: (i) a greedy algorithm based on gradient descent of the terminal cost h, similar to [138]; (ii) MCTS-DPW [32, 42] in the Gaussian belief space; (iii) T-LQG [123]; and (iv) ergodic exploration. We also attempted to implement the belief iLQG [161] algorithm, but the policy did not converge for this problem. We suspect that the non-convex terminal cost h contributed to this behavior, which in fact violates one of the underlying assumptions made in the paper [161].
MCTS-DPW uses the same planning horizon as SACBP, however it draws N = 25 samples from
the belief tree so the computation time of the two algorithms matches approximately.
Figure 3.2: Sample robot trajectories (depicted in red) generated by each algorithm. Greedy, MCTS-DPW, and Ergodic did not result in a trajectory that fully covers the two groups of the targets.Ergodic (Open-Loop) mostly strayed around one group because the updated beliefs were not reflectedinto the robot’s trajectory. T-LQG failed to reduce the estimation uncertainty even after 200 s,due to insufficient time to solve the NLP with high-dimensional joint states in an online manner.SACBP successfully explored the space and periodically revisited both of the two target groups.With SACBP, the robot traveled into one of the four diagonal directions for most of the time. Thisis due to the fact that SACBP optimizes a convex quadratic under a box saturation constraint,which tends to find optimal solutions at the corners. In all the figures, the blue lines represent thetarget trajectories and the shaded ellipses are 99% error ellipses at t = 200 s.
For T-LQG, an NLP is formulated so that the optimization objective is a discrete-time equivalent
of the SACBP objective, with the discrete time interval of ∆to = 0.2 s. Unfortunately, the Ipopt
solver takes a long time to solve this NLP; on average the optimization over 10 timesteps (thus 2
s) takes 43.9 s to converge to a local minimum, with the worst case of over 250 s, due to the high
dimensionality of the joint state space. This high computation cost makes it prohibitive to derive
an offline policy over the entire 200 s simulation horizon prior to the execution. Therefore, we only
test T-LQG in an online manner for this problem, with the same planning horizon as SACBP. For
a fair comparison with the other methods, we adjust the max cpu time parameter of the solver so
that the Ipopt iterations terminate within a computation time that is the same order of magnitude
as SACBP, MCTS-DPW, and ergodic exploration. Note also that T-LQG is a closed-loop planning
method that comes with a local LQG controller to track the optimized nominal belief trajectory.
Designed to require minimal online computation, re-planning for T-LQG only happens if the end of
the planning horizon is reached or the symmetric KL-divergence between the nominal belief state
and the actual belief state exceeds a certain threshold dth. We set dth to 1e6 for this problem, since
we noticed that a smaller value would result in re-planning after almost each observation update.
With our choice, re-planning almost only happens at the end of the planning horizon, which allows
for efficient execution of the policy. More details about the re-planning for T-LQG can be found in
[124, 123].
Ergodic exploration is not a belief space planning approach but has been used in the active
sensing literature. Beginning with the nominal control of zero, it locally optimizes the ergodicity
of the trajectory with respect to the spatial information distribution based on Fisher information.
This optimization is open-loop since the effect of future observations is not considered. As a new
observation becomes available, the distribution and the trajectory are recomputed.
All the controllers were saturated at the same limit. The results presented in Figure 3.1 clearly indicate the superior performance of SACBP while achieving real-time computation. More notably,
SACBP generated a trajectory that periodically revisited the two groups whereas other methods
failed to do so (Figure 3.2). With SACBP the robot was moving into one of the four diagonal
directions for most of the time. This is plausible, as SACBP solves the quadratic program with
a box input constraint (3.35), which tends to find optimal solutions at the corners. MCTS-DPW
resulted in a highly non-smooth trajectory and failed to fully explore the environment. The greedy
approach improved the smoothness, but the robot eventually followed a cyclic trajectory in a small
region of the environment. To our surprise, the ergodic method did not generate a trajectory that
covers the two groups of the targets. This is likely due to the use of a projection-based trajectory
optimization method, which has been recently found to perform rather poorly with rapid re-planning
[39]. To verify our hypothesis, we performed an additional simulation run for the ergodic method, in which
the entire trajectory over 200 s was computed and executed without any re-planning. The result
is presented in Figure 3.2 as Ergodic (Open-Loop). With this open-loop computation, the robot no longer got stuck. However, it strayed into the lower-left corner, where no targets existed. This is because the updated, refined beliefs were not reflected in the trajectory. As a
result, the worst entropy value among the targets at t = 200 s was 2.37, which is worse than SACBP
according to Figure 3.1. Among all the baselines implemented, T-LQG performed the worst, leaving
large estimation covariances even after the full 200 s. This is due to the insufficient time budget to
solve the NLP, which indicates that T-LQG is not suited for high-dimensional belief space planning
Figure 3.3: (Left) The robot is attached to the rectangular object. (Middle) The history of the ℓ2 norm of the residual between the goal state and the true object state averaged over 20 runs. SACBP with N = 10 samples successfully brought the object close to the goal. The reduction of the residual norm was much slower for MCTS-DPW. T-LQG was not as successful either, regardless of whether the policy was derived offline (without re-planning) or online (with re-planning), although it eventually achieved similar residual norms to SACBP. Belief iLQG resulted in large overshoots at around 2 s and 11 s. (Right) Computation time of SACBP was increased from the multi-target tracking problem due to increased complexity related to the continuous-discrete belief dynamics, but still achieved a reasonable value. Note that the computation times of the offline algorithms were significantly longer and are not shown in this plot.
problems such as this active multi-target tracking task.
3.4.2 Object Manipulation under Model Uncertainty
This problem is identical to the model-based Bayesian reinforcement learning problem studied in [145]; therefore, a detailed description of the nonlinear dynamics and the observation models is omitted. See Figure 3.3 for an illustration of the environment. A 2D robot attached to a rigid body
object applies forces and torques to move the object to the origin. The object’s mass, moment of
inertia, moment arm lengths, and linear friction coefficient are unknown. These parameters, as well as the object's 2D state, need to be estimated using an EKF, with noisy sensors which measure the
robot’s position, velocity, and acceleration in the global frame. The same values for tf − t0, ∆to,
∆tc, tcalc as in the previous problem are assumed. Each simulation is run for 20 s. SACBP uses
ε = 0.04 s and N = 10. The nominal control for SACBP is a position controller whose input is the
mean x-y position and the rotation estimates of the object. The cost function is quadratic in the
true state x and control u, given by (1/2)xTCxx + (1/2)uTCuu. Taking expectations yields the equivalent cost in the Gaussian belief space c(b, u) = (1/2)µTCxµ + (1/2)tr(CxΣ) + (1/2)uTCuu, where Σ is the covariance matrix. We let the terminal cost h be the same as c except that it is without the control term.
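The belief-space cost follows from the standard identity E[xTCx] = µTCµ + tr(CΣ) for x ∼ N(µ, Σ), which a quick Monte Carlo check confirms (the dimensions and weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3
Cx = np.diag([1.0, 2.0, 0.5])                      # an illustrative state weight matrix
mu = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Sigma = A @ A.T + 0.1 * np.eye(n)                  # a random positive definite covariance

# Monte Carlo estimate of E[x^T Cx x] for x ~ N(mu, Sigma).
xs = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.mean(np.einsum('ij,jk,ik->i', xs, Cx, xs))
closed_form = mu @ Cx @ mu + np.trace(Cx @ Sigma)
assert abs(mc - closed_form) < 0.05 * closed_form
```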
We compared SACBP against (i) MCTS-DPW in the Gaussian belief space, (ii) belief iLQG, and
(iii) T-LQG. MCTS-DPW uses the same planning horizon as SACBP, and is set to draw N = 240
samples so that the computation time is approximately equal to that of SACBP. Furthermore,
MCTS-DPW uses the position controller mentioned above as the rollout policy, which is suggested
in [145].
The policy computation with belief iLQG is performed offline over the entire simulation horizon
of 20 s. The solver is initialized with a nominal trajectory generated by the same position controller.
The entire computation of the policy takes 5945 s, a significant amount of time until convergence
to a locally optimal affine feedback policy. Note however that the online policy execution can be
performed instantaneously.
For T-LQG, we test the algorithm in both the online and the offline modes. The online mode is
equivalent to the implementation for the active multi-target tracking problem. Re-planning happens
when the end of the planning horizon is reached or the symmetric KL-divergence surpasses dth = 25,
which was rarely exceeded during the simulation for this problem. Furthermore, the max cpu time
parameter of Ipopt is adjusted so the computation time is comparable to SACBP and MCTS-DPW.
This is because the full optimization over 10 timesteps (i.e. 2 s) takes 2.5 s on average and 7 s in the
worst case, which is better than in the active multi-target tracking problem but still prohibitively
slow for online control computation, taking about 15× tcalc to 45× tcalc with tcalc = 0.15 s. On the
other hand, the offline mode computes the closed-loop policy once for the entire simulation horizon
without online re-planning or limiting max cpu time. This setup is identical to belief iLQG. In this
mode, it takes T-LQG 1065 s to compute the policy, which is about 5 to 6 times faster than belief
iLQG. This improved efficiency is congruous with the complexity analysis provided in [123]. Note
also that we use the aforementioned position controller to initialize the NLP solver in both the online
and the offline modes.
Overall, the results presented in Figure 3.3 demonstrate that SACBP outperformed all the base-
lines in this task with only 10 Monte Carlo samples, bringing the object close to the goal within 10 s.
Although the computation time increased from the previous problem due to the continuous-discrete
filtering, it still achieved near real-time performance with less than 3 × tcalc on average. Compared
to SACBP, reduction of the residual norm was slower for MCTS-DPW and online T-LQG. The two
offline algorithms tested, belief iLQG and offline T-LQG, both had a large overshoot at around 2 s
and 3 s, respectively. We suppose that this was caused by the offline nature of the policies, as well
as a mismatch between the discrete-time model used for planning and the continuous-time model
employed for dynamics simulation. Another large overshoot for belief iLQG at 11 s was likely due
to a locally optimal behavior of the iLQG solver.
3.5 Conclusions
In this chapter we present SACBP, a novel online algorithm for fast, general-purpose belief
space planning. We view the stochastic belief dynamics as a hybrid system with time-driven switch-
ing and derive the optimal control perturbation based on the perturbation theory of differential
equations. The resulting algorithm extends the framework of SAC to stochastic belief dynamics and
is highly parallelizable to run in near real-time. The rigorous mathematical analysis shows that the
notion of mode insertion gradient can be generalized to stochastic hybrid systems, which leads to
the property that SACBP is expected to perform at least as well as the nominal
policy with an appropriate choice of the perturbation duration. Through an extensive simulation
study, we have confirmed that SACBP outperforms other online and offline algorithms, including
a greedy approach and non-myopic closed-loop planners that are based on approximate dynamic
programming and/or local trajectory optimization. In future work, we are interested in considering
a distributed multi-robot version of SACBP as well as problems with hard state constraints. Other
research directions include providing additional case studies for more complex systems with efficient
implementation.
Chapter 4
Fast Risk-Sensitive Planning with
Probabilistic Generative Models
In Chapters 2 and 3, we addressed challenges concerning state uncertainty for robots operating in
open and interactive workspaces. Our coherent strategy was to form plans in the belief space that
optimally reduce state uncertainty through sequential control actions and recursive Bayesian infer-
ence. The belief space planning methods proposed in the previous chapters are effective to overcome
challenges arising from epistemic uncertainty about the current state of the system. However, belief
space planning alone does not address all the difficulties of the future state uncertainty that origi-
nates from transition uncertainty, as discussed in Chapter 1 (Section 1.1); transition uncertainty is
difficult to diminish since it is often inherent in the dynamics, especially when the system involves
other decision-making agents (e.g. humans or human-driven vehicles). Even worse, such systems are
often deemed safety-critical, meaning that any failure of the robot can cause severe physical damage to itself or the other agents. Nevertheless, they are of practical importance in the robotics industry, including self-driving vehicles at highway speeds and autonomous drones performing agile
outdoor flights. Therefore, it is crucial for planning and control algorithms to be designed so that
the transition uncertainty is properly addressed and that the system remains safe under various
operating conditions.
In this chapter, we shift our attention to transition uncertainty in safety-critical systems. We
propose an online stochastic control algorithm that approximately optimizes a risk-sensitive objective
called the entropic risk measure. Our method is an extension of the Stochastic SAC algorithm
presented in Chapter 3, therefore it inherits many desirable properties of Stochastic SAC, such
as fast computation and high performance. Compared to Stochastic SAC that only considers the
expected objective value, however, the proposed method is aware of the variance of the underlying
CHAPTER 4. FAST RISK-SENSITIVE PLANNING WITH GENERATIVE MODELS 57
distribution and thus can mitigate “rare but catastrophic” events. We name this new algorithm Risk-
Sensitive Sequential Action Control (RSSAC). RSSAC achieves this extension without increasing
computational complexity from Stochastic SAC. Furthermore, RSSAC does not require knowledge
of the analytical form of the distribution, nor is it limited to a certain class of distributions such as
Gaussian; Monte Carlo samples from a probabilistic generative model suffice to run this algorithm.
As a representative application, we choose to focus on an autonomous navigation task for a ground
robot among many mutually-interacting humans, where the robot must smoothly reach its goal
without a collision. To model transition uncertainty in the humans, we employ Trajectron++ [134],
a deep-learned generative model that performs multimodal probabilistic trajectory forecasting. We
demonstrate that the combined RSSAC-Trajectron++ framework can promote safety by inducing
a proactive robot behavior that avoids risky situations. The methods proposed in this chapter are
based on our prior work [108].
4.1 Introduction
As autonomous robots expand their workspace to cage-free, social environments, they must be
designed as safety-critical systems; failures in avoiding collisions with humans sharing the workspace
result in catastrophic accidents. Safe and efficient robot navigation alongside many humans is still a
challenging problem in robotics, especially due to the potentially unpredictable and uncooperative
nature of human motion. We propose an effective solution to this problem via risk-sensitive stochastic
optimal control, wherein desired collision avoidance and goal reaching motion is achieved by a
cost function, a risk-sensitivity parameter, and dynamic optimization. Specifically, we extend the
Stochastic Sequential Action Control (SAC) algorithm [111, 112] to a risk-sensitive setting through
the use of exponential disutility [174], the objective often referred to as the entropic risk measure [97].
The proposed sampling-based algorithm, which we name Risk-Sensitive Sequential Action Control
(RSSAC), is a stochastic nonlinear model predictive control (NMPC) algorithm that optimally
improves upon a given nominal control with a series of control perturbations. Unlike many other
control-theoretic stochastic NMPCs [156, 52, 175], RSSAC is not limited to a particular class of
distributions such as Gaussian, nor does it need to know the analytical form; it only requires a
black-box probabilistic generative model. Leveraging this property and recent advances in machine
learning for modeling multi-agent behavior, we combine RSSAC with Trajectron++ [134], a state-of-
the-art generative model for predicting the many possible future trajectories of multiple interacting
agents accurately and efficiently. The overall framework constitutes a prediction-control pipeline for
safe robot navigation, which is shown to be capable of interacting with more than 50 pedestrians
simultaneously while avoiding collisions and steering the robot towards its goal.
The contributions of the present work are two-fold. First, the extension of the Stochastic SAC
to risk-sensitive optimal control is presented as a theoretical contribution. This is achieved by
CHAPTER 4. FAST RISK-SENSITIVE PLANNING WITH GENERATIVE MODELS 58
Figure 4.1: The proposed RSSAC-Trajectron++ framework is effective for safe robot navigation in a social environment densely populated with humans. (Left) A simulation environment with real human trajectories from the UCY/UNIV scene [90], overlaid with predictions sampled from Trajectron++. (Right) A risk-sensitive robot safely navigating itself alongside 5 humans.
generalizing the mode insertion gradient [172, 6] to the entropic risk measure. Second, the combined
RSSAC-Trajectron++ framework is presented as a novel modular approach to safe crowd-robot
interaction with dynamic collision avoidance. RSSAC takes advantage of the parallelizability of
Monte Carlo sampling, leading to an efficient GPU implementation with real-time performance.
We conduct simulation studies as well as a real-world experiment to demonstrate that the explicit
probabilistic prediction of Trajectron++ and risk-sensitive optimization of RSSAC allow a robot to
interact with many humans safely and efficiently. They also reveal that the risk sensitivity parameter
provides an additional degree of freedom in determining the type of interaction behavior exhibited by the robot. In particular, high risk sensitivity induces a proactive robot behavior that avoids high-stakes states, for example by yielding to oncoming humans. We further show that such a behavioral shift is not achieved by tuning conventional cost function parameters.
The rest of the chapter is organized as follows. Section 4.2 summarizes the relevant prior work.
We define the crowd-robot interaction problem for safe navigation in Section 4.3. The RSSAC
algorithm is described in Section 4.4 along with implementation details and a run-time analysis.
Section 4.5 presents simulation and experimental results. The chapter concludes in Section 4.6 with a discussion of future work.
4.2 Related Work
We first review relevant work in dynamic collision avoidance and safe robot navigation in Section
4.2.1. We then focus on Stochastic SAC (Section 4.2.2) and data-driven trajectory forecasting
methods (Section 4.2.3), on which our RSSAC-Trajectron++ framework is built.
4.2.1 Dynamic Collision Avoidance and Safe Robot Navigation
There exists a vast body of literature in the field of dynamic collision avoidance and safe robot
navigation. Coordinated collision avoidance is widely studied in the multi-robot systems literature,
with methods including mixed integer programming [102], reciprocal velocity obstacles [160], buffered
Voronoi cells [178, 170], Nash-equilibria-based differential games [107], and safety barrier certificates
[168]. These methods do not explicitly model uncoordinated agents or motion uncertainty, which
are both crucial aspects in human motion; later in Section 4.5 we observe collisions for [170] due
to humans breaking the coordination assumption. A notable exception is [95], where the authors
propose probabilistic safety barrier certificates for bounded motion uncertainty. An alternative
approach suited to bounded uncertainty is the Hamilton-Jacobi (HJ) reachability analysis [157, 26],
although it suffers from the curse of dimensionality in solving the Hamilton-Jacobi-Isaacs equation.
Another common approach to modeling motion uncertainty is via the use of probability distribu-
tions. MDP and POMDP frameworks have been applied to robot navigation and collision avoidance
[152, 21] and are best suited to discrete action spaces. Stochastic Optimal Control (SOC), especially
sampling-based model predictive control (MPC), is also gaining increasing attention due to its abil-
ity to handle stochastic uncertainty and incorporate prediction of possible future outcomes. Recent
methods applied to safe robot navigation and dynamic collision avoidance include exhaustive search
with motion primitives [137], path-integral control [52], and information-theoretic control [176].
While the latter two approaches can find approximately optimal solutions under certain assump-
tions, they require the form of distribution to be Gaussian. In addition, it is often not satisfactory to
only optimize for the expected cost, especially for safety-critical systems. There exist many possible
ways to endow robots with risk-awareness [97], such as chance-constrained optimization [16], condi-
tional value at risk (CVaR) [136], and entropic risk measure [99]. All of these methods come with
advantages and drawbacks, but in general there is a trade-off between the complexity of modeling
assumptions and the statistical safety assurances. Our work employs the entropic risk measure which
has been extensively studied in control theory [173, 51]. Albeit not yielding statistical safety assur-
ances, our algorithm can compute control inputs in real-time for highly dynamic environments with
multimodal predictive distributions, which is quite challenging for chance constrained optimization
or CVaR-based methods.
Lastly, we briefly highlight similarities and differences of our work from policy learning methods
such as imitation learning [120] or end-to-end deep reinforcement learning [46, 25]. Similar to these
methods, our framework applies machine learning to extract features from multi-agent interactions.
However, we use these features to make explicit probabilistic predictions about other agents which
are then incorporated into optimization by the model-based controller. Our modular approach has
two advantages. First, explicit prediction brings about better interpretability and explainability
of the resulting interaction. Second, our model-based controller allows the robot’s behavior to be
changed at run time by simply modifying the cost function design or the risk sensitivity parameter,
while other methods would need re-training to do so as the policy itself is data-driven.
4.2.2 Stochastic Sequential Action Control
SAC in its original form is a deterministic model predictive control algorithm for nonlinear hybrid
systems [6, 158]. In contrast to local trajectory optimization, at each time the algorithm perturbs
a given (open-loop or closed-loop) nominal control to optimize a local improvement of the total
receding-horizon cost, a quantity known as the mode insertion gradient [172]. SAC is shown to yield
highly efficient control computation with high-frequency re-planning, which often achieves better
performance than local trajectory optimization methods. In Chapter 3, we generalized the mode
insertion gradient to stochastic systems and solved challenging belief space planning problems, where
our approach significantly outperformed other state-of-the-art methods under both state and state
transition uncertainties. The proposed RSSAC algorithm is an extension of this algorithm to risk-
sensitive optimal control in which the cost functional is the entropic risk measure. This makes the
algorithm more suitable for problems where safety is concerned, such as the crowd-robot interaction
one that we address in this chapter.
4.2.3 Multi-Agent Trajectory Modeling from Data
The problem of multi-agent trajectory forecasting has been substantially studied in the literature.
Many earlier works focus on producing a single best output trajectory from a history of state
observations. They include the earliest works that view humans as physical objects whose mo-
tions are determined by imaginary Newtonian forces [62], as well as other approaches that solve
a time-series regression problem by Gaussian processes [167], inverse reinforcement learning [88],
or recurrent neural networks (RNNs) [3]. However, since human behavior is rarely deterministic
or unimodal, deep learning-based generative approaches have emerged as state-of-the-art trajectory
forecasting methods, due to recent advancements in deep generative models [146, 53]. Notably, they
have caused a shift from focusing on predicting the single best trajectory to producing a distribution
of potential future trajectories. This is advantageous in autonomous systems as full distribution
information is more useful for downstream tasks, e.g. motion planning and decision making where
information such as variance can be used to make safer decisions. Most works in this category
use a recurrent neural network architecture with a latent variable model, such as a Conditional
Variational Autoencoder (CVAE) [146], to explicitly encode multimodality (e.g. [87, 69, 130]), or a
Generative Adversarial Network (GAN) [53] to do so implicitly (e.g. [56, 133, 83]). Trajectron++
[134] belongs to the former, making use of a CVAE to explicitly model the multimodality of human
behavior with a latent variable. Specifically, it uses a discrete Categorical latent variable which aids
in performance as well as interpretability, since different behavior modes are explicitly represented
and able to be visualized. Designed for downstream robotic motion planning and control tasks,
Trajectron++ can also produce predictions that are conditioned on candidate future motion plans
of the ego robot. We leverage this capability in a real-world experiment in Section 4.5 and show
that the robot-future-conditional prediction significantly improves both safety and efficiency of the
robot navigation.
4.3 Problem Statement
4.3.1 Dynamics Model
Mirroring Chapter 3 (Section 3.2.1 and 3.4.1), we assume a discrete-time dynamical system model
for other agents (i.e. humans) while employing a continuous-time one for the robot. Recall that this
modeling assumption is practical, as 1) in general the robot receives information of its surrounding
environment at a much lower frequency than actuation; and 2) with this representation one can
naturally handle potentially agile robot dynamics without coarse time discretization. Let xr(t) ∈ Rn denote the fully-observable state of the robot at time t, whose dynamics are modeled by

ẋr(t) = fr(xr(t)) + H(xr(t))u(t), (4.1)
where u(t) ∈ U ⊆ Rm is the control and U is a bounded convex set. We assume that the dynamics
are deterministic and control-affine. As the robot navigates in the environment, it receives position
information on humans in the scene and updates it at discrete times {tk}k≥0, with interval time ∆to. Similar to related works [178, 170], we do not assume velocity information to be available to the robot. Let pi(tk) ∈ R2 denote the position of the human with label i ∈ {1, . . . , N}. From tk−1 to tk, the position pi(tk−1) changes to pi(tk) with difference yik. Note that in this chapter we assume full observability of the human positions. From the robot's continuous-time perspective, this is viewed as a periodic jump discontinuity given by

pi(tk) = pi(tk−) + yik,
pi(t) = pi(tk) ∀t ∈ [tk, tk+1), (4.2)

where tk− is infinitesimally smaller than tk. The dynamics of the joint state:
where β ≥ 0 determines the relative weight between the terminal and instantaneous costs.
Having specified the cost functional J , we define the (risk-neutral) stochastic optimal control
problem as follows.
Problem 4.1 (Risk-Neutral Stochastic Optimal Control).

minimize_u    ED[J]
subject to    (4.1), (4.2) ∀i ∈ {1, . . . , N}, ∀k ∈ {0, . . . , T},
              xr(t0) = xr0, pi(t0) = pi0 ∀i ∈ {1, . . . , N},
              u(t) ∈ U ∀t ∈ [t0, tT)
Remark 4.1. Note that the definition above yields an open-loop optimal control problem. Even
though this is suboptimal compared to closed-loop planning that seeks a feedback policy, in practice
we will solve Problem 4.1 repeatedly as MPC, which has been found to work well in prior work
(e.g. [137]). This open-loop assumption also makes the subsequent analysis presented in Section 4.4
straightforward.
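Forward simulation of the constraints in Problem 4.1, integrating the continuous robot dynamics (4.1) between observation instants and applying the jump updates (4.2), can be sketched as follows. The thesis implementation is in Julia; this Python sketch assumes the double-integrator robot model of Section 4.4.5, and the callables `u_fn` and `sample_jump` are illustrative stand-ins for the control schedule and for draws of the transition increments yik from D.

```python
import numpy as np

def simulate_joint(xr0, p0, u_fn, sample_jump, dt_c=0.02, dt_o=0.4, T=12):
    """Roll out the hybrid model: the robot state xr = (xp, xv) integrates
    continuously (Euler, step dt_c) per (4.1), while human positions are held
    constant between observations and jump by sampled increments y_k every
    dt_o seconds per (4.2)."""
    xr, p = np.asarray(xr0, float).copy(), np.asarray(p0, float).copy()
    steps_per_obs = int(round(dt_o / dt_c))
    traj = [(xr.copy(), p.copy())]
    for k in range(T):
        for _ in range(steps_per_obs):      # continuous robot dynamics (4.1)
            u = u_fn(xr)
            xp, xv = xr[:2], xr[2:]
            xr = np.concatenate([xp + dt_c * xv, xv + dt_c * u])
        p = p + sample_jump(p)              # periodic jump discontinuity (4.2)
        traj.append((xr.copy(), p.copy()))
    return traj
```

For example, one observation interval of 0.4 s at zero control moves a unit-velocity robot 0.4 m while the human positions remain fixed between sampled jumps.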
4.3.3 Entropic Risk Measure
The formulation of Problem 4.1 ignores the safety-critical aspect of collision avoidance, as simply
optimizing the expected value fails to take into account the shape of the distribution. A remedy is
to introduce the following entropic risk measure with risk sensitivity parameter θ > 0:

RD,θ(J) ≜ (1/θ) log(ED[e^{θJ}]). (4.7)
As long as the robot trajectory xr is bounded for all admissible control, J becomes a bounded random
variable and RD,θ(J) is finite. It is known that this transformation approximately decouples the
mean and variance:
RD,θ(J) ≈ ED[J] + (θ/2) VarD(J) (4.8)
for small θVarD(J) [174]. The meaning of θ is now clear; it is a parameter that determines how much
we care about the variability of the cost in addition to the mean. Larger θ increases risk sensitivity,
while the risk-neutral objective ED[J ] is recovered as θ → 0+.
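The entropic risk and the approximation (4.8) are easy to check numerically. Below is a Python sketch (the thesis code is in Julia) estimating (4.7) from Monte Carlo cost samples, with a max-shift for numerical stability:

```python
import numpy as np

def entropic_risk(J, theta):
    """Monte Carlo estimate of R_{D,theta}(J) = (1/theta) log E_D[e^{theta J}]
    (eq. 4.7); subtracting max(J) before exponentiating avoids overflow."""
    J = np.asarray(J, dtype=float)
    m = J.max()
    return m + np.log(np.mean(np.exp(theta * (J - m)))) / theta
```

For cost samples J = [1, 2, 3] and θ = 0.01, the estimate is within 10⁻³ of mean(J) + (θ/2)·Var(J), matching (4.8); as θ → 0+ it approaches the sample mean, and it increases monotonically in θ.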
Replacing the expectation in Problem 4.1 with (4.7), we obtain the following risk-sensitive optimal
control problem:
Problem 4.2 (Risk-Sensitive Stochastic Optimal Control).

minimize_u    RD,θ(J)
subject to    (4.1), (4.2) ∀i ∈ {1, . . . , N}, ∀k ∈ {0, . . . , T},
              xr(t0) = xr0, pi(t0) = pi0 ∀i ∈ {1, . . . , N},
              u(t) ∈ U ∀t ∈ [t0, tT)
4.4 Risk-Sensitive Sequential Action Control
4.4.1 Review of Stochastic Sequential Action Control
Even solving Problem 4.1 is intractable due to potential non-convexity in ccol and complexity in
D. An efficient, approximate MPC solution can be obtained via Stochastic SAC introduced in
Chapter 3. In this framework, we seek the optimal perturbation of a given nominal control such
that the expected value of the mode insertion gradient [172, 6], which quantifies local effects of
the perturbation on the cost functional J , is minimized. We assume that the nominal control is
an open-loop control schedule u (although it is straightforward to extend our method to deal with
closed-loop nominal control policies [111, 112]). The perturbed control is defined as
uε(t) ≜ { v       if t ∈ (τ − ε, τ],
        { u(t)    otherwise, (4.9)
where v ∈ U , τ ∈ (t0, tT ), ε ≥ 0 are the perturbation parameters. The perturbation uε yields
deterministic, perturbed state trajectory xε and cost Jε under a specific sample from D. The mode
insertion gradient is the sensitivity of the cost J to the perturbation duration ε, with v and τ fixed:
∂+J/∂ε |ε=0 ≜ lim_{ε→0+} (Jε − J)/ε. (4.10)
The value of the mode insertion gradient is given by
∂+J/∂ε |ε=0 = (1/2)vᵀRv + ρr(τ)ᵀH(xr(τ))(v − u(τ)) − (1/2)u(τ)ᵀRu(τ), (4.11)
where xr is the robot state trajectory induced by nominal control u, and ρr is the adjoint variable
matching the dimensionality of the robot state. Specifically, it follows the ordinary differential
equation (ODE):
ρ̇r(t) = −(∂/∂xr)c(x(t), u(t)) − ((∂/∂xr)fr(xr(t)) + (∂/∂xr)H(xr(t))u(t))ᵀ ρr(t), (4.12)
with boundary condition ρr(tT) = (∂/∂xr)h(x(tT)). One can obtain (4.11) and (4.12) by following the
same derivation as presented in Chapter 3 (Section 3.2.1). As the joint state x(t) is in fact a random
vector due to stochasticity in human motion, so is ρr(t) and the deterministic ODE (4.12) is only
valid under a specific sample from D. Taking the expectation of (4.11) yields the expected mode
insertion gradient:
ED[∂+J/∂ε |ε=0] = (1/2)vᵀRv + ED[ρr(τ)]ᵀH(xr(τ))(v − u(τ)) − (1/2)u(τ)ᵀRu(τ). (4.13)
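A sketch of the Monte Carlo estimate of (4.13), where each row of `rho_samples` is one backward-integrated adjoint ρr(τ) under a sampled sequence of human transitions (Python for illustration; the names `v`, `u_tau`, `R`, `H_tau` follow the notation of the text):

```python
import numpy as np

def expected_mig(v, u_tau, R, H_tau, rho_samples):
    """Risk-neutral expected mode insertion gradient (4.13): average the
    adjoint samples, then evaluate the quadratic-plus-linear expression."""
    rho_bar = rho_samples.mean(axis=0)           # E_D[rho_r(tau)]
    return (0.5 * v @ R @ v
            + rho_bar @ H_tau @ (v - u_tau)
            - 0.5 * u_tau @ R @ u_tau)
```

Choosing v = u(τ) gives exactly zero, consistent with a null perturbation leaving the expected cost unchanged to first order.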
4.4.2 Generalized Mode Insertion Gradient
Under a weak regularity condition on the dynamics function (4.1), and for the cost J defined by
(4.4), (4.5), (4.6), one can show that the robot trajectory xr is bounded for all admissible control,
and that the perturbed cost Jε is globally Lipschitz continuous in ε ≥ 0:
|Jε − J | ≤ εK, (4.14)
under the same random sample from D for J and Jε. The value of the constant K is not dependent on
the samples. Therefore, the dominated convergence theorem allows us to interchange the derivative
and the expectation:
ED[∂+J/∂ε |ε=0] = ∂+ED[J]/∂ε |ε=0. (4.15)
The statement of the regularity condition and the proof of the key results (i.e. (4.14) and (4.15))
are deferred to Appendix B. The right-hand side of (4.15) is the mode insertion gradient generalized
to the case of risk-neutral stochastic optimal control (i.e. Problem 4.1). If the optimal value with
respect to v is negative for some τ , then there exists a sufficiently small ε for which the perturbation
defined by (v, τ, ε) will reduce the expected cost. If instead it is zero, we can think of the nominal
control as satisfying a local optimality condition.
4.4.3 Extension to Entropic Risk Measure
Sections 4.4.1 and 4.4.2 have provided a brief summary of the Stochastic SAC framework presented
in Chapter 3. Now we are set to derive the generalized mode insertion gradient for the entropic risk
measure RD,θ(J), which is a novel contribution of this chapter.
Lemma 4.1. Suppose that the total cost J satisfies (4.14). Then, for θ > 0 the following relation
holds:
ED[∂+e^{θJ}/∂ε |ε=0] = ∂+ED[e^{θJ}]/∂ε |ε=0. (4.16)
Proof. Let Jε and J be the perturbed and the nominal cost, respectively. We have
|e^{θJε} − e^{θJ}| / e^{θJ} = |e^{θ(Jε−J)} − 1| (4.17)
                            ≤ e^{θ|Jε−J|} − 1, (4.18)
where we used |ex − 1| ≤ e|x| − 1 and eθJ ≥ 1. Substituting (4.14) and dividing by ε > 0, we obtain
|e^{θJε} − e^{θJ}| / ε ≤ (e^{θKε} − 1) / ε. (4.19)
Let g(ε) denote the right-hand side of this inequality, which is a strictly monotone increasing function
for ε > 0. Now let |fε(J)| denote the left-hand side as a function of J parameterized by ε. Take some positive ε0 and a sequence {εn}n≥0 converging to 0+ as n → ∞. It follows that ∀n ≥ 0, |fεn(J)| ≤ g(ε0). The dominated convergence theorem applies as g(ε0) is a finite, deterministic constant:
ED[ lim_{n→∞} fεn(J)] = lim_{n→∞} ED[fεn(J)], (4.20)
which is equivalent to (4.16).
Theorem 4.1 (Mode Insertion Gradient of Entropic Risk). Suppose that the regularity condition
mentioned in Section 4.4.2 is met so the robot trajectory x is bounded under admissible control and
the cost J satisfies (4.14). Then, for fixed v and τ the mode insertion gradient of the entropic risk
measure, ∂+/∂ε RD,θ(J)|ε=0, exists and is given by

∂+/∂ε RD,θ(J)|ε=0 = (1/2)vᵀRv + (ED[e^{θJ}ρr(τ)]ᵀ / ED[e^{θJ}]) H(xr(τ))(v − u(τ)) − (1/2)u(τ)ᵀRu(τ), (4.21)
where J inside the expectations is the cost value under the nominal control u.
Proof. As the robot trajectory is bounded, J is a bounded random variable and the value of RD,θ(J)
is finite. The chain rule gives
∂+/∂ε RD,θ(J)|ε=0 = (1/(θ ED[e^{θJ}])) ∂+ED[e^{θJ}]/∂ε |ε=0 (4.22)
                  = (1/ED[e^{θJ}]) ED[e^{θJ} ∂+J/∂ε |ε=0], (4.23)
where we also used Lemma 4.1. Substituting (4.11) and simplifying the terms complete the proof.
The mode insertion gradient of entropic risk (4.21) is a generalization of the stochastic mode
insertion gradient (4.15). Indeed, for the risk neutral case (i.e. θ = 0) the two equations match.
This enables us to extend the Stochastic SAC algorithm to risk-sensitive optimal control problems
without changing the structure of the algorithm.
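In implementation terms, the only change relative to the risk-neutral estimate (4.13) is that the sampled adjoints are averaged under normalized weights proportional to e^{θJj}. A Python sketch with hypothetical variable names (the thesis code is in Julia):

```python
import numpy as np

def risk_sensitive_mig(v, u_tau, R, H_tau, J_samples, rho_samples, theta):
    """Mode insertion gradient of the entropic risk (4.21): adjoint samples
    are reweighted by normalized exponentiated costs, so high-cost draws
    dominate as theta grows; theta -> 0+ recovers the plain average (4.13)."""
    w = np.exp(theta * (J_samples - J_samples.max()))   # shift for stability
    w /= w.sum()
    rho_tilt = w @ rho_samples        # E[e^{theta J} rho] / E[e^{theta J}]
    return (0.5 * v @ R @ v
            + rho_tilt @ H_tau @ (v - u_tau)
            - 0.5 * u_tau @ R @ u_tau)
```

With a large θ the tilted mean collapses onto the adjoint of the worst-cost sample, which is what drives the proactive, risk-averse perturbations observed in Section 4.5.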
4.4.4 RSSAC Algorithm
The core of RSSAC is the following optimization problem, which substitutes Problem 4.1 (if θ = 0)
and 4.2 (if θ > 0).
Problem 4.3 (Mode Insertion Gradient Optimization).
minimize_{v,τ}    ∂+/∂ε RD,θ(J)|ε=0
subject to        τ ∈ (t0 + tcalc, tT),
                  v(τ) ∈ U ∀τ ∈ (t0, tT),
where tcalc is a computation time budget. The objective can be evaluated by Monte Carlo
sampling the joint dynamics (4.1), (4.2) under the nominal control and backward integration of the
adjoint dynamics (4.12). Fixing τ , Problem 4.3 is a quadratic minimization over v under a convex
constraint. The optimal value v∗(τ) can be obtained analytically for simple constraints for U such as
a box constraint or a norm inequality constraint with R being a scaled identity matrix. Optimizing
τ is achieved by solving v∗(τ) multiple times and searching for the minimum value. There is only
a finite number of τ to consider1 since in practice we use numerical integration, such as the Euler
scheme with some discrete step size, to integrate the robot dynamics. The optimal mode insertion
gradient is always non-positive, and if it is negative then there exists some ε > 0 such that the
entropic risk measure is reduced as a result of the control perturbation. The value of ε can be either
specified or searched. If instead the mode insertion gradient is zero, we set ε = 0 and choose not to
perturb.
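For instance, when U is the norm ball ‖v‖2 ≤ umax and R = rI, the fixed-τ minimization reduces to radially scaling the unconstrained minimizer back onto the ball. In this Python sketch, `c` stands for the linear-term vector H(xr(τ))ᵀ applied to the (risk-weighted) expected adjoint:

```python
import numpy as np

def optimal_perturbation(c, r, u_max):
    """Minimize 0.5 * r * ||v||^2 + c . v over the ball ||v|| <= u_max
    (Problem 4.3 at fixed tau with R = r * I): the unconstrained minimizer
    -c / r is scaled back onto the ball if it lies outside."""
    v = -c / r
    n = np.linalg.norm(v)
    return v if n <= u_max else v * (u_max / n)
```

By symmetry the constrained minimizer lies along −c, so for this isotropic quadratic the radial scaling is exact rather than a heuristic projection.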
The pseudo-code of the RSSAC algorithm is presented in Algorithm 4.1. Importantly, the Monte
Carlo sampling part is naturally parallelizable. Note that the risk sensitivity has no effect unless we
draw at least M = 2 samples. The re-planning happens at every ∆tr seconds, allowing for a variable
number of humans to be considered over time. The complexity of the algorithm is O(NMT ) where
N denotes the number of humans, M the number of samples, and T the planning horizon.
4.4.5 Implementation Details
In our implementation, we use the double integrator model for the robot's dynamics: xr(t) = (xp(t)ᵀ, xv(t)ᵀ)ᵀ ∈ R4, ẋp(t) = xv(t), ẋv(t) = u(t). These dynamics are integrated using the Euler
scheme with step size ∆tc = 0.02 s. The cost weight matrices are Q = Diag(0.5, 0.5, 0, 0) and
R = 0.2I2×2. The collision cost is the following sum of Gaussian functions centered at each human:
ccol(x(t)) = ∑_{i=1}^{N} α exp(−‖xp(t) − pi(t)‖₂² / (2λ)), (4.24)
¹ In this chapter, our search space for τ extends to tT, unlike t0 + tcalc + ∆to in Chapter 3. That is, we allow the controller to perturb the system at a much later time than assumed in the original Stochastic SAC algorithm. We note that this larger search space is in agreement with the SAC algorithm presented in [6], and have found that this modification improves the performance in practice.
Algorithm 4.1 Control Computation with RSSAC

INPUT: Initial joint state x(t0), reference trajectory r, nominal control schedule u(t) for t ∈ [t0, tT].
OUTPUT: Optimally perturbed control schedule uε

1: for j = 1 : M do
2:   Forward-simulate the joint dynamics (4.1), (4.2) while sampling human transitions {yik}j, 1 ≤ i ≤ N, 1 ≤ k ≤ T, from D
3:   Compute sampled cost Jj
4:   Backward-simulate adjoint robot trajectory ρjr with sampled human transitions (4.12)
5: end for
6: Compute Monte Carlo estimates: ED[e^{θJ}ρr] ≈ (1/M) ∑_{j=1}^{M} e^{θJj} ρjr and ED[e^{θJ}] ≈ (1/M) ∑_{j=1}^{M} e^{θJj}
7: Solve Problem 4.3
8: Specify ε or search by re-simulating the dynamics
9: uε ← PerturbControlSchedule(u, v∗, τ∗, ε) (4.9)
10: return uε
with peak parameter α = 100 and bandwidth parameter λ = 0.2. The relative weight of the terminal
cost is set to β = 0.1. The reference trajectory r is a straight line connecting the initial robot position
to the goal position at a constant target speed, and is re-planned whenever ‖xr(t) − r(t)‖2 > 2 m.
The control constraint is ‖u(t)‖2 ≤ umax, with umax = 5.0 m s−2 in the simulation and 2.5 m s−2 in the real-world experiment. Monte Carlo sampling is parallelized on a GPU with sample
size M = 30. The planning horizon is 4.8 s, which corresponds to T = 12 steps of human motion
prediction with measurement interval ∆to = 0.4 s.
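With the stated parameters, the collision cost (4.24) is only a few lines. A Python sketch (the thesis code is in Julia; `xp` is the robot position and `P` an N × 2 array of current human positions):

```python
import numpy as np

def collision_cost(xp, P, alpha=100.0, lam=0.2):
    """Sum-of-Gaussians collision cost (4.24) with peak alpha and
    bandwidth lam: each human contributes alpha * exp(-d^2 / (2 * lam))."""
    d2 = np.sum((P - xp) ** 2, axis=1)    # squared distances to each human
    return alpha * np.sum(np.exp(-d2 / (2.0 * lam)))
```

A human coincident with the robot contributes the full peak α = 100, while a human 10 m away contributes essentially nothing, so the cost is sharply localized around the crowd.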
In contrast to prior work [111, 112] and Chapter 3, we searched for ε from {0, 1×10−3, 2×10−3, 4×10−3, 8×10−3, 1.6×10−2, 2×10−2, 4×10−2, 8×10−2} s by re-simulating the perturbed dynamics. Furthermore, the nominal control u is determined by a simple MPC-style search algorithm and takes the
form
u(t) = { us(t)     if t ∈ [t0 + tcalc, t0 + tcalc + ∆to],
       { upr(t)    otherwise, (4.25)
where upr is the perturbed control from the previous iteration, and us is either the same as upr or chosen from a set of constant control inputs (a cos(b), a sin(b)) with a ∈ {0.4umax, 0.8umax} and b ∈ {0, π/4, π/2, . . . , 7π/4}. The best us is chosen based on the evaluation of RD,θ(J) for each nominal
control candidate using the Monte Carlo samples. This nominal search is similar to the iterative
update scheme presented in [158] in that the previously-computed perturbation is used in the next
iteration, but is different in that we also insert us prior to running Algorithm 4.1. The nominal
search can be alternatively considered as a simplified version of a tree search with motion primitives
[137] with only 17 control choices and tree depth 1 (i.e. us only lasts for one discrete-time interval
∆to). We found this simple nominal search to be effective compared to constant nominal control
[6, 158] while retaining low computational cost. Note that if robot-future-conditional prediction is
[Figure 4.2: three panels, (a) ETH, (b) HOTEL, and (c) UNIV, plotting minimum distance to human [m] against normalized distance to goal for Nominal Search, Exhaustive Search, BIC, and RSSAC (Ours), with a collision line marked.]
Figure 4.2: Quantitative results from 100 runs show that risk-neutral (i.e. θ = 0) RSSAC further improves the performance of Nominal Search Only as the theory suggests, achieving both safety and efficiency. Note that the farther up and to the right, the better, as the x-axis is flipped. Exhaustive Search could not scale to the UNIV scene with more than 50 humans. BIC resulted in multiple collisions. Error bars show standard deviation.
used, the distribution D is different for each of the 17 candidate nominal controls (4.25), allowing
the robot to consider the effect of various future robot trajectories on reactive human behavior.
We implemented all the control code in Julia and achieved real-time performance with re-planning
interval ∆tr = tcalc = 0.1 s.
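The 17-candidate nominal search of (4.25) can be enumerated as below (a Python sketch of the Julia implementation; the eight headings b = 0, π/4, . . . , 7π/4 are an assumption consistent with the stated count of 17 candidates):

```python
import numpy as np

def nominal_candidates(u_prev, u_max):
    """Candidate constant controls u_s for the nominal search (4.25): the
    previous perturbed control plus (a cos b, a sin b) for two magnitudes
    and eight headings, 1 + 2 * 8 = 17 candidates in total."""
    cands = [np.asarray(u_prev, dtype=float)]
    for a in (0.4 * u_max, 0.8 * u_max):
        for b in np.arange(8) * np.pi / 4:
            cands.append(np.array([a * np.cos(b), a * np.sin(b)]))
    return cands
```

Each candidate is then scored by the Monte Carlo estimate of RD,θ(J), and the minimizer is inserted for one observation interval ∆to before Algorithm 4.1 runs.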
4.5 Results
4.5.1 Simulation Results
We evaluated the performance of RSSAC in simulation and compared it against three baselines: 1) Buffered Input Cells (BIC) [170], 2) Nominal Search (Section 4.4.5) Only, and 3) Exhaustive Tree Search with Motion Primitives [137]. BIC is a reciprocal
collision avoidance method that has similar computational complexity to the Velocity Obstacle (VO)
approaches. Unlike VO, BIC does not require the velocity information of other agents to be available
and is applicable to high-order linear dynamics. As our robot is a double-integrator and only obtains
[Figure 4.3: three panels, (a) ETH, (b) HOTEL, and (c) UNIV, plotting minimum distance to human [m] against normalized distance to goal for RSSAC with θ = 0.0 and θ = 1.0.]
Figure 4.3: Compared to the risk-neutral case in Fig. 4.2, RSSAC with θ = 1.0 significantly reduces the standard deviation of the minimum robot-human distance by 11%, 12%, and 24% in (a), (b), and (c), respectively. The risk sensitivity trades off the stochastic collision cost and the deterministic tracking cost, which results in increased standard deviation in the x-axis in (a) and (b), and overall distance increase in (c) where the scene was most densely populated.
position measurements, BIC is a more suitable reciprocal collision avoidance approach than VO. BIC
is minimally invasive in that an arbitrary control input is projected onto a convex polygon in the
control space. Collision avoidance is guaranteed as long as all the agents respect their own BIC
constraint. We use the LQ-Tracking cost (i.e. (4.5) without ccol) to solve a deterministic optimal
control problem and the BIC constraint adjusts the control input based on the current positions of
humans. The robot is said to be in collision if it is within 40 cm from a human, but we set this
threshold to 80 cm for planning with BIC only to give it an extra safety margin. The exhaustive
search is similar to the nominal search in that it uses samples from D, but the tree depth is T = 4 and at each depth the constant control is chosen from (a cos(b), a sin(b)) with a ∈ {0, 0.6umax} and b ∈ {0, π/4, π/2, . . . , 7π/4}. This results in 9⁴ = 6561 sequences of control inputs to be considered.
Despite parallelization on a GPU and a long replanning interval of ∆tr = tcalc = 0.4 s, Exhaustive
Search never achieved real-time performance. RSSAC, Nominal Search, and Exhaustive Search all
used M = 30 prediction samples (per human) drawn from the Trajectron++ model [134].
For this evaluation, we used three distinct sequences from the publicly available ETH [116]
and UCY [90] pedestrian motion datasets. They consist of real pedestrian trajectories with rich
multi-human interaction scenarios. Each sequence that we used is a series of consecutive frames
clipped from the ETH/ETH Test scene, the ETH/HOTEL Test scene, and the UCY/UNIV Test
scene. For brevity, we refer to these as the “ETH,” “HOTEL,” and “UNIV” sequences, respectively.
In each scene, the robot starts from a collision free configuration and moves towards a specified
goal while avoiding collisions. The ETH sequence is 10 s long and has 16 total pedestrians. The
HOTEL sequence is also 10 s long and has 8 total pedestrians. The UNIV sequence is the most
challenging with 95 total pedestrians over a horizon of 20 s; there are always 36 to 54 pedestrians present simultaneously in each frame (see Fig. 4.1). Note that none of these pedestrians recognize or react
to the robot as their motion is a replay from the data. This is the so-called “invisible robot” setting
[25] where the robot has to predict Human-Human interaction only. For this reason, we did not
condition the prediction of Trajectron++ on the nominal control candidates of the robot.
Trajectron++ was trained following the same procedure as detailed in [134]. The model was
trained for 2000 steps with the Adam [78] optimizer and a leave-one-out approach. That is, the
model was trained on all datasets but the one used for evaluation. This training was done off-line
prior to running RSSAC. At run time, RSSAC fetches prediction samples from the trained model
every ∆to = 0.4 s and uses them for the mode insertion gradient optimization. All the online
computation was performed on a desktop computer with an Intel Xeon(R) E5-2650 v3 CPU, 62.7 GiB of memory, and a GeForce GTX 1080 graphics card.
Fig. 4.2 shows the statistical simulation results of RSSAC and the baseline methods. RSSAC,
Nominal Search, and Exhaustive Search were all run with θ = 0. The x-axis shows the final
distance between the robot and its goal normalized by the initial distance. The y-axis is the minimum
distance between the robot and a human. These performance metrics measure the safety and
efficiency of robot navigation, and are often in conflict. For each method we performed 100 runs
with different random seeds. The goal of the robot was also randomized. As can be seen, RSSAC
was both the safest and the most efficient among those methods that ran in real-time. BIC ended up
in multiple collisions in all three cases, as the humans violated the reciprocity assumption. We
did not test Exhaustive Search for the UNIV case due to its poor scalability; the computation took
about 10× longer than the allocated time budget. We also note that RSSAC further improved the
performance of Nominal Search, which itself already achieved reasonably safe navigation.
4.5.2 Effects of Risk Sensitivity
Next, we studied the effects of risk sensitivity in the context of safe robot navigation among humans.
Fig. 4.3 compares RSSAC with θ = 0 (the same data as in Fig. 4.2) and θ = 1.0. Our risk-sensitive
optimization significantly reduced the standard deviation of the minimum robot-human distance
by 11%, 12%, and 24% for the ETH, HOTEL, and UNIV scenes, respectively. The risk sensitivity
trades off the stochastic collision cost against the deterministic tracking cost, as reflected in the ETH
and HOTEL results, where the standard deviation of the normalized goal distance slightly increased
CHAPTER 4. FAST RISK-SENSITIVE PLANNING WITH GENERATIVE MODELS 72
Figure 4.4: Qualitative comparison of RSSAC with θ = 0 (left) and θ = 1.0 (right) in the HOTEL scene. These results differ in the minimum robot-human distance by only 3 cm and the normalized goal distance by 0.01, but the risk-sensitive robot (right) yields to potentially conflicting humans as opposed to the risk-neutral robot (left). Both simulations used the same random seed. Sampled predictions from Trajectron++ are also depicted.
by 3.9% and 2.3%. On the other hand, the UNIV scene was so densely populated with humans
that the risk-sensitive robot was unable to approach the goal. Although not necessarily captured by
the quantitative metrics, the risk-sensitive robot also exhibited another intriguing behavioral
difference, as depicted in Fig. 4.4: it tends to yield to potentially conflicting
humans.
To gain a better understanding of the effects of risk-sensitivity, we designed a simplified
intersection scenario as illustrated in Fig. 4.5. The human in this toy example followed a linear Gaussian
model with a constant mean velocity and was in a potential collision course with the robot. We ran
100 simulations for each θ ∈ {0.0, 0.5, 1.0}. The goal position was not randomized in this study. We
also ran two additional sets of 100 simulations wherein we kept θ = 0.0 but varied the collision cost
peak parameter α and the bandwidth λ in (4.24), respectively. This was to elucidate the difference
between risk sensitivity tuning and cost function design. The results are summarized in Fig. 4.6 and
confirm that the more risk-sensitive the robot becomes, the more likely it is to yield to the human.
This consistent trend was not present when changing α or λ alone. On the other hand, there was a
positive correlation between the minimum robot-human distance and all three parameters. This
Figure 4.5: A synthetic intersection scenario with a human. The prediction is drawn from a linear Gaussian model with a constant mean velocity.
Figure 4.6: Minimum robot-human distance (top) and empirical probability of yielding (bottom) for the synthetic intersection scenario. Changing the risk-sensitivity (left) consistently affected whether or not the robot yields, while the other two cost tuning parameters (middle and right) did not.
observation suggests that the risk sensitivity parameter θ affects the global behavior of the robot
that determines the “type” of interaction, whereas cost function tuning affects the local behavior
only (i.e. minimum robot-human distance). Thus, the risk sensitivity parameter can be considered
as an additional degree of freedom in choosing desirable robot behavior.
4.5.3 Real-World Experiment
Our simulation study was accompanied by a real-world experiment with a holonomic robot [171] and
5 human subjects in an indoor environment (see Fig. 4.1). Each subject was assigned specific
start and goal positions and was instructed to walk to their individual goal at a normal speed.
Although the start-goal pairs remained the same, we changed their assignment to the subjects after
each run so the type of interaction remained diverse. The robot started at a known position, but
its exact goal was randomized and not known to the subjects. The positions of all the subjects as
well as the robot were measured by a motion capture system. We used the same parameters as in
the simulation study except umax. A major difference between the simulation and the real-world
experiment is that the robot is “visible” to humans in the experiment, which requires the robot to
take into account the resulting human-robot interaction in addition to human-human interaction. This
Figure 4.7: Quantitative results of the real-world experiment with 5 human subjects. Using robot-future-conditional predictions with θ = 1.0 achieves the best average performance. Error bars show the standard deviation of 5 runs with a randomized robot goal and human start-goal assignment.
was achieved by conditioning the prediction of Trajectron++ on the nominal control candidates
for RSSAC, similar to [137]. As explained in Section 4.4.5, this robot-future-conditional prediction
lets the robot reason about the effect of different nominal control candidates on human behavior,
prior to the RSSAC perturbation of the best one. We compared the performance of the unconditional
prediction (as in the simulation study) to the conditional prediction for both risk-neutral and risk-
sensitive cases.
The results of the experiment are presented in Fig. 4.7. For each setting we performed 5 runs.
The robot with unconditional predictions was either too conservative (θ = 1.0) or unsafe (θ = 0.0).
Using conditional predictions improved overall performance; we did not observe a single collision
with conditional predictions. This supports our hypothesis that the robot-future-conditional pre-
diction facilitates appropriate human-robot interaction. Of the four cases tested, using conditional
predictions with θ = 1.0 achieved the best average performance, as well as the smallest variance in
minimum robot-human distance.
4.6 Conclusions
This chapter focuses on transition uncertainty in safety-critical systems, in particular systems that
involve other decision-making agents such as humans. We propose a novel online algorithm for
risk-sensitive optimal control that accounts for the risk associated with uncertain human motion. Our
theoretical contribution is a derivation of the mode insertion gradient for the entropic risk measure
under nonlinear, non-Gaussian dynamics. This theory leads to an efficient and high-performing
implementation of RSSAC, an extension of Stochastic SAC that is more suited for safety-critical
applications. Crucially, RSSAC does not require knowledge of the analytical form of the distribu-
tion. To demonstrate its effectiveness, we combine RSSAC with Trajectron++ and apply it to a
crowd-robot interaction problem with dynamic collision avoidance. As a major practical
contribution, we achieve safe and efficient robot navigation among many mutually interacting humans by the
probabilistic, robot-future-conditional prediction of Trajectron++ and the risk-sensitive optimiza-
tion of RSSAC. Furthermore, the risk sensitivity parameter is found to play a key role in determining
the type of interaction behavior, such as yielding. In future work, we plan to apply the RSSAC-
Trajectron++ framework to vehicle dynamics and scenes with heterogeneous agents. We are also
interested in automatically adapting the risk sensitivity parameter so that any specifically-desired
robot behavior emerges.
Chapter 5
Distributionally Robust Planning
under Model Uncertainty
In the previous chapters, we presented online planning and control methods that address two types
of uncertainty—state uncertainty and transition uncertainty—defined in Chapter 1 (Section 1.1).
It remains to discuss the third type, namely model uncertainty. Model
uncertainty arises from imperfect knowledge of complex dynamical systems, which affects the reliability
of robot autonomy. In the stochastic environments considered in this thesis, this limited knowledge
leads to inaccurate characterization of the underlying probability distribution that describes, for
example, future motion of other interacting agents. In practice, such a mismatch between the model
employed by the robot and the ground-truth most likely exists, and can have serious consequences for
safety-critical systems [29]. This chapter presents a control algorithm that is capable of handling the
distributional mismatch. Specifically, we propose a novel nonlinear MPC for distributionally robust
control, which plans a locally optimal feedback policy against a worst-case distribution within a
given set of plausible distributions. This set is defined by an offline-estimated KL divergence bound
between a user-specified Gaussian distribution and the unknown ground-truth model. Leveraging
mathematical equivalence between distributionally robust control and risk-sensitive optimal control,
we propose an efficient local optimization method to tractably compute the control policy. This
equivalence also indicates that our method serves as an algorithm to dynamically adjust the
risk-sensitivity level online for risk-sensitive control. The benefits of the distributional robustness as
well as the automatic risk-sensitivity adjustment are demonstrated in a dynamic collision avoidance
scenario where the predictive distribution of human motion is erroneous. The materials presented
in this chapter are also reported in [109].
CHAPTER 5. DISTRIBUTIONALLY ROBUST PLANNING 77
5.1 Introduction
Proper modeling of a stochastic system of interest is a key step towards successful control and
decision making under uncertainty. In particular, accurate characterization of the underlying probability
distribution is crucial, as it encodes how we expect the system to deviate from its nominal behavior over time.
However, such a modeling process can pose significant challenges in real-world problems. On the
one hand, we may have only limited knowledge of the underlying system, which would force us to
use an erroneous model. On the other hand, even if we can perfectly model a complicated stochastic
phenomenon, such as a complex multi-modal distribution, it may still not be appropriate for the
sake of real-time control or planning. Indeed, many model-based stochastic control methods require
a Gaussian noise assumption, and many of the others need computationally intensive sampling. In
Chapter 4, we circumvented this computational issue by proposing the RSSAC algorithm, which can
efficiently process samples drawn from data-driven, probabilistic prediction models. Although the
method has been shown effective, recent work [29] also suggests that even state-of-the-art methods
for probabilistic prediction are not accurate enough to be deployed on safety-critical systems.
Therefore, we must design our planner/controller so that the inevitable model uncertainty is considered
appropriately.
The present work addresses this problem via distributionally robust control, wherein a
potential distributional mismatch is considered between a baseline Gaussian process noise and the true,
unknown model that lies within a certain Kullback-Leibler (KL) divergence bound. The use of
the Gaussian distribution is advantageous to retain computational tractability without the need for
sampling in the state space. Our contribution is a novel model predictive control (MPC) method
for nonlinear, non-Gaussian systems with non-convex costs. This controller would be useful, for
example, to safely navigate a robot among human pedestrians when the stochastic transition model
for humans is imperfect.
It is important to note that our contribution is built on the mathematical equivalence between
distributionally robust control and risk-sensitive optimal control [117]. As reviewed in Chapter 4,
risk-sensitive optimal control seeks to optimize the following entropic risk measure [97]:
R_{p,\theta}(J) \triangleq \frac{1}{\theta} \log \mathbb{E}_p\left[\exp(\theta J)\right], \qquad (5.1)
where p is a probability distribution characterizing any source of randomness in the system, θ > 0
is a user-defined scalar parameter called the risk-sensitivity parameter, and J is an optimal control
cost. The risk-sensitivity parameter θ determines a relative weight between the expected cost and
other higher-order moments such as the variance [174]. Loosely speaking, the larger θ becomes, the
more the objective cares about the variance and is thus more risk-sensitive.
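Since the entropic risk measure (5.1) reappears throughout this and the next chapter, a small Monte Carlo sketch may help build intuition. The numbers and the log-sum-exp implementation below are ours, not part of RSSAC: for a Gaussian cost J ~ N(μ, σ²), the risk evaluates exactly to μ + θσ²/2, so increasing θ penalizes variance while leaving the mean term untouched.

```python
import numpy as np

def entropic_risk(cost_samples, theta):
    """Monte Carlo estimate of R_{p,theta}(J) = (1/theta) log E_p[exp(theta J)].

    Uses the log-sum-exp trick for numerical stability; assumes theta > 0.
    """
    z = theta * np.asarray(cost_samples, dtype=float)
    m = z.max()
    log_mean_exp = m + np.log(np.mean(np.exp(z - m)))
    return log_mean_exp / theta

rng = np.random.default_rng(0)
# Two hypothetical cost distributions with the same mean, different variance.
low_var = rng.normal(10.0, 1.0, 100_000)    # J ~ N(10, 1)
high_var = rng.normal(10.0, 3.0, 100_000)   # J ~ N(10, 9)

# theta -> 0 recovers the expected cost; larger theta penalizes variance:
# for Gaussian J the closed form is mu + theta * sigma^2 / 2.
for theta in (1e-6, 0.5):
    print(theta, entropic_risk(low_var, theta), entropic_risk(high_var, theta))
```

With θ = 0.5 the two estimates separate to roughly 10.25 and 12.25, even though both distributions have the same mean, which is precisely the variance weighting described above.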
Our distributionally robust control algorithm can alternatively be viewed as an algorithm for
Figure 5.1: Model-based stochastic control methods often require a Gaussian noise assumption, such as the one on the left, which represents process noise in pedestrian motion under a collision avoidance scenario (see Section 5.5). However, the true stochastic model can be highly multi-modal and better captured by a more complex distribution as shown on the right, which we may not exactly know. The proposed MPC effectively handles such a model mismatch without knowledge of the true distribution, except for a bound on the KL divergence between the two.
automatic online tuning of the risk-sensitivity parameter in applying risk-sensitive control. Risk-
sensitive optimal control has been shown effective and successful in many robotics applications
[99, 100, 11, 108]. However, in prior work (as well as in Chapter 4) the user has to specify a fixed
risk-sensitivity parameter offline. This would require an extensive trial and error process until a
desired robot behavior is observed. Furthermore, a risk-sensitivity parameter that works in a certain
state can be infeasible in another state, as we will see in Section 5.4. Ideally, the risk-sensitivity
should be adapted online depending on the situation to obtain a specifically desired robot behavior
[100, 108], yet this is highly nontrivial as no simple general relationship is known between the risk-
sensitivity parameter and the performance of the robot. Our algorithm addresses this challenge
as a secondary contribution. Due to the fundamental equivalence between distributionally robust
control and risk-sensitive control, it serves as a nonlinear risk-sensitive control that can dynamically
adjust the risk-sensitivity parameter depending on the state of the robot as well as the surrounding
environment.
The rest of the chapter is organized as follows. Section 5.2 reviews the related work in controls
and robotics literature. Section 5.3 summarizes the theoretical results originally presented in [117]
that connect distributionally robust control to risk-sensitive optimal control. Section 5.4 develops
this theory into an algorithm that provides a locally optimal solution for general nonlinear systems
with non-convex cost functions, which is a novel contribution of our work. In Section 5.5, we test our
method in a collision avoidance scenario wherein the predictive distribution of pedestrian motion is
erroneous. We further show its benefits as a risk-sensitive optimal controller that can automatically
adjust its risk-sensitivity parameter in this section. The chapter concludes in Section 5.6 with
potential future research directions.
5.2 Related Work
5.2.1 Distributional Robustness and Risk-Sensitivity
Distributionally robust control seeks to optimize control actions against a worst-case distribution
within a given set of probability distributions, often called the ambiguity set [162, 135]. There
exist various formulations to take account of distributional robustness in optimal control. Some
works are concerned with minimizing the worst-case expectation of a cost objective [135, 143], while
others enforce risk-based or chance constraint satisfaction under a worst-case distribution [59, 162].
The present work belongs to the former class. Existing methods also differ in the formulation of
the ambiguity set. Moment-based ambiguity sets require knowledge of moments of the ground-truth
distribution up to a finite order [162, 135], which is often overly conservative [59]. Statistical distance-
based ambiguity sets are also gaining attention. The authors of [59] use a Wasserstein metric to define
the ambiguity set for motion planning with collision avoidance, but their MPC formulation is not
suited for nonlinear systems. The χ²-divergence and more general φ-divergences (to which the KL
divergence belongs) are employed in [143], similar to the present work. However, the ambiguity set considered
in [143] is restricted to categorical distributions, while our work requires no assumption on the class
of the ground-truth distributions. Furthermore, we make use of risk-sensitive optimal control to
obtain planned robot trajectories with feedback, unlike their sampling-based implementation.
Optimization of the entropic risk measure has been an active research topic in economics and
controls literature since the 1970s [70, 173, 174, 47]. The concept of risk-sensitive optimal control has
been successfully applied to robotics in various domains, including haptic assistance [99, 100], model-
based reinforcement learning (RL) [11], and safe robot navigation [108, 169], to name a few. In all
these works, the risk-sensitivity parameter is introduced as a user-specified constant, and is found
to significantly affect the behavior of the robot. For instance, our study of safe robot navigation
in Chapter 4 reveals that a robot with higher risk-sensitivity tends to yield more to oncoming
human pedestrians. However, how to find a desirable risk-sensitivity parameter still remains an
open research question; in the robot navigation problem, the robot simply freezes if it is too risk-
sensitive when the scene is crowded. As the authors of [100] point out, the robot should adapt its
risk-sensitivity level depending on the situation, yet there still does not exist an effective algorithmic
framework to automate it due to the issues discussed in Section 5.1. In this work, we provide such
an algorithm for nonlinear, non-Gaussian stochastic systems. As mentioned earlier, our approach is
built on previously-established theoretical results that link risk-sensitive and distributionally robust
control [117].
5.2.2 Approximate Methods for Optimal Feedback Control
The theory of optimal control lets us derive an optimal feedback control law via dynamic pro-
gramming (DP) [13]. For linear systems with additive Gaussian white noise and quadratic cost
functions, the exact DP solution is tractable and is known as Linear-Quadratic-Gaussian (LQG) [7]
or Linear-Exponential-Quadratic-Gaussian (LEQG) [173]. They are different in that LQG optimizes
the expected cost while LEQG optimizes the entropic risk measure, although both DP recursions
are quite similar.
However, solving general optimal control problems for nonlinear systems remains a challenge
due to lack of analytical tractability. Hence, approximate local optimization methods have been
developed, including Differential Dynamic Programming (DDP) [71], iterative Linear-Quadratic
Regulator (iLQR) [91], and iterative Linear-Quadratic-Gaussian (iLQG) [156, 151]. While both DDP
and iLQR are designed for deterministic systems with quadratic cost functions, iLQG can locally
optimize the expected cost objective for Gaussian stochastic systems with non-convex cost functions.
Similarly, the iterative Linear-Exponential-Quadratic-Gaussian (iLEQG) has been recently proposed
to locally optimize the entropic risk for Gaussian systems with non-convex costs [48, 169, 131]. Note
however that they are not designed to be robust to model mismatches that we consider in this
chapter. In fact, it is known that even LQG does not possess guaranteed robustness [36].
5.3 Problem Statement
5.3.1 Distributionally Robust Optimal Control
Consider the following stochastic nonlinear system:
x_{k+1} = f(x_k, u_k) + g(x_k, u_k)\, w_k, \qquad (5.2)
where xk ∈ Rn denotes the state, uk ∈ Rm the control, and wk ∈ Rr the noise input to the system
at time k. For some finite horizon N, let w_{0:N} \triangleq (w_0, \dots, w_N) denote the joint noise vector with
probability distribution q(w_{0:N}). This distribution is assumed to be a known Gaussian white noise
process, i.e. w_i is independent of w_j for all i \neq j, and we call (5.2) the reference system. Ideally,
we would like the model distribution q to perfectly characterize the noise in the dynamical system.
However, in reality the noise may come from a different, more complex distribution which we may
not know exactly. Let \bar{w}_{0:N} \triangleq (\bar{w}_0, \dots, \bar{w}_N) denote a perturbed noise vector that is distributed
according to p(\bar{w}_{0:N}). We define the following perturbed system that characterizes the true but
unknown dynamics:
x_{k+1} = f(x_k, u_k) + g(x_k, u_k)\, \bar{w}_k. \qquad (5.3)
Note that we make no assumptions on Gaussianity or whiteness of p. One could also attribute it
to potentially unmodeled dynamics. The true, unknown probability distribution p is contained in
the set P of all probability distributions on the support R^{r(N+1)}. We assume that p is not “too
different” from q. This is encoded as the following constraint on the KL divergence between p and
q:
D_{\mathrm{KL}}(p \,\|\, q) \leq d, \qquad (5.4)
where D_{\mathrm{KL}}(\cdot \,\|\, \cdot) is the KL divergence and d > 0 is a given constant. Note that D_{\mathrm{KL}}(p \,\|\, q) \geq 0 always
holds, with equality if and only if p \equiv q. The set of all probability distributions p \in P satisfying (5.4)
is denoted by \Xi, which we define as our ambiguity set. This set is a convex subset
of P for a fixed q (Lemma 1.4.3, [40]).
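Membership in the ambiguity set Ξ can rarely be checked exactly, since p is unknown; but in the special case where the true noise also happens to be Gaussian, the KL divergence in (5.4) has a closed form. The sketch below is purely illustrative (the function name and the numbers are our own) and tests two hypothetical perturbed models against a bound d:

```python
import numpy as np

def kl_gaussians(mu_p, cov_p, mu_q, cov_q):
    """Closed-form D_KL(p || q) between two multivariate Gaussians."""
    k = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (
        np.trace(cov_q_inv @ cov_p)
        + diff @ cov_q_inv @ diff
        - k
        + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p))
    )

# Reference model q = N(0, I); two hypothetical true noise models p.
mu_q, cov = np.zeros(2), np.eye(2)
p_near = kl_gaussians(np.array([0.1, 0.0]), cov, mu_q, cov)  # 0.5 * 0.1^2 = 0.005
p_far = kl_gaussians(np.array([2.0, 0.0]), cov, mu_q, cov)   # 0.5 * 2.0^2 = 2.0
d = 0.5
print(p_near <= d, p_far <= d)  # True False: only the nearby model is in Xi
```

For equal covariances the divergence reduces to half the squared Mahalanobis distance between the means, which is why only the slightly shifted model stays inside the ambiguity set here.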
We are interested in controlling the perturbed system (5.3) with a state feedback controller of
the form uk = K(k, xk). The operator K(k, ·) defines a mapping from Rn into Rm. The class of all
such controllers is denoted Λ.
The cost function considered in this chapter is given by
J(x_{0:N+1}, u_{0:N}) \triangleq \sum_{k=0}^{N} c(k, x_k, u_k) + h(x_{N+1}), \qquad (5.5)
where c is the stage cost function and h is the terminal cost. We assume that the above objective
satisfies the following non-negativity assumption.
Assumption 5.1 (Assumption 3.1, [117]). The functions h(·) and c(k, ·, ·) satisfy h(x) ≥ 0 and
c(k, x, u) \geq 0 for all k \in \{0, \dots, N\}, x \in R^n, and u \in R^m.
Under the dynamics model (5.3), the cost model (5.5), and the KL divergence constraint (5.4) on
p, we are interested in finding an admissible controller K ∈ Λ that minimizes the worst-case expected
value of the cost objective (5.5). In other words, we are concerned with the following distributionally
robust optimal control problem:
\inf_{K \in \Lambda} \sup_{p \in \Xi} \mathbb{E}_p\left[J(x_{0:N+1}, u_{0:N})\right], \qquad (5.6)
where Ep[·] indicates that the expectation is taken with respect to the true, unknown distribution
p. In this formulation, the robustness arises from the ability of the controller to plan against a
worst-case distribution p in the ambiguity set Ξ.
Remark 5.1. If the KL divergence bound d is zero, then p ≡ q is necessary. In this degenerate
case, (5.6) reduces to the standard stochastic optimal control problem:
\inf_{K \in \Lambda} \mathbb{E}_q\left[J(x_{0:N+1}, u_{0:N})\right]. \qquad (5.7)
5.3.2 Equivalent Risk-Sensitive Optimal Control
Unfortunately, the distributionally robust optimal control problem (5.6) is intractable as it involves
maximization with respect to a probability distribution p. To circumvent this, Petersen et al. [117]
prove that problem (5.6) is equivalent to a bilevel optimization problem involving risk-sensitive
optimal control with respect to the model distribution q. We refer the reader to [117] for the
derivation and only re-state the main results in this section for self-containedness. Before doing so,
we impose an additional assumption on the worst-case expected cost.
Assumption 5.2 (Assumption 3.2, [117]). For any admissible controller K ∈ Λ, the resulting
closed-loop system satisfies
\sup_{p \in P} \mathbb{E}_p\left[J(x_{0:N+1}, u_{0:N})\right] = \infty. \qquad (5.8)
This assumption states that, without the KL divergence constraint, some adversarially-chosen
noise could make the expected cost objective arbitrarily large in the worst case. It amounts to a
controllability-type assumption with respect to the noise input and an observability-type assumption
with respect to the cost objective [117].
Under Assumptions 5.1 and 5.2, the following theorem holds.
Theorem 5.1. Consider the stochastic systems (5.2), (5.3) with the KL divergence constraint (5.4)
and the cost model (5.5). Under Assumptions 5.1 and 5.2, the following equivalence holds for the
distributionally robust optimal control problem (5.6):
\inf_{K \in \Lambda} \sup_{p \in \Xi} \mathbb{E}_p\left[J(x_{0:N+1}, u_{0:N})\right] = \inf_{\tau \in \Gamma} \inf_{K \in \Lambda} \left\{ \tau \log \mathbb{E}_q\left[\exp\left(\frac{J(x_{0:N+1}, u_{0:N})}{\tau}\right)\right] + \tau d \right\}, \qquad (5.9)
provided that the set
\Gamma \triangleq \left\{ \tau > 0 : \inf_{K \in \Lambda} \tau \log \mathbb{E}_q\left[\exp(J/\tau)\right] \text{ is finite} \right\} \qquad (5.10)
is non-empty.
Proof. See Theorems 3.1 and 3.2 in [117].
Remark 5.2. Notice that the first term on the right-hand side of (5.9) is the entropic risk measure
R_{q, 1/\tau}(J), where the risk is computed with respect to the model distribution q and \tau > 0 serves as
the inverse of the risk-sensitivity parameter. Rewriting the equation in terms of the risk-sensitivity
parameter \theta = 1/\tau > 0, we see that the right-hand side of (5.9) is equivalent to
\inf_{\theta \in \bar{\Gamma}} \left( \inf_{K \in \Lambda} R_{q,\theta}\left(J(x_{0:N+1}, u_{0:N})\right) + \frac{d}{\theta} \right), \qquad (5.11)
where \bar{\Gamma} \triangleq \{\theta > 0 : \inf_{K \in \Lambda} R_{q,\theta}(J) \text{ is finite}\}. The non-emptiness of \bar{\Gamma} (and equivalently, \Gamma) is satisfied
if there exists some non-zero risk-sensitivity \theta that gives a finite entropic risk value. This is
almost always satisfied in practical situations where risk-sensitive optimal control can be applied,
as otherwise the problem would be ill-formed. Theorem 5.1 shows that the original distributionally
robust optimal control problem (5.6) is mathematically equivalent to a bilevel optimization prob-
lem (5.11) involving risk-sensitive optimal control. Note that the new problem does not involve any
optimization with respect to the unknown distribution p.
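To see the structure of the bilevel problem (5.11) in a case where everything is available in closed form, suppose, purely for illustration, that the cost J is Gaussian with mean \mu and variance \sigma^2 under the model q (this does not hold for general nonlinear systems):

```latex
% Entropic risk of a Gaussian cost has a closed form:
R_{q,\theta}(J) = \frac{1}{\theta}\log \mathbb{E}_q\!\left[e^{\theta J}\right]
               = \mu + \frac{\theta\sigma^2}{2}.
% The outer problem (5.11) then reduces to a scalar minimization,
\inf_{\theta > 0}\; \left(\mu + \frac{\theta\sigma^2}{2} + \frac{d}{\theta}\right),
% whose minimizer and optimal value are
\theta^* = \frac{\sqrt{2d}}{\sigma}, \qquad \mu + \sigma\sqrt{2d}.
```

In this idealized case the worst-case expected cost exceeds the nominal mean by \sigma\sqrt{2d}: a larger divergence budget d or a noisier model both call for more conservatism, and the optimal risk-sensitivity \theta^* grows with d but shrinks as the model variance grows.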
5.4 RAT iLQR Algorithm
Even though the mathematical equivalence shown in [117] and summarized in Section 5.3.2 is general,
it does not immediately lead to a tractable method to efficiently solve (5.11) for general nonlinear
systems. There are two major challenges to be addressed. First, exact optimization of the entropic
risk with a state feedback control law is intractable, except for linear systems with quadratic costs.
Second, the optimal risk-sensitivity parameter has to be searched for efficiently over the feasible space
Γ̄, which not only is unknown but also varies depending on the initial state x_0. A novel contribution
of this chapter is a tractable algorithm that approximately solves both problems for general
nonlinear systems with non-convex cost functions. In what follows, we detail how we solve both the
inner and the outer loop of (5.11) to develop a distributionally-robust, risk-sensitive MPC.
5.4.1 Iterative Linear-Exponential-Quadratic-Gaussian
Let us first consider the inner minimization of (5.11):
\inf_{K \in \Lambda} R_{q,\theta}\left(J(x_{0:N+1}, u_{0:N})\right), \qquad (5.12)
where we omitted the extra term d/θ as it is constant with respect to the controller K. This amounts
to solving a risk-sensitive optimal control problem for a nonlinear Gaussian system. Recently, a
computationally-efficient, local optimization method called iterative Linear-Exponential-Quadratic-
Gaussian (iLEQG) has been proposed for both continuous-time systems [48] and the discrete-time
counterpart [131, 169]. Both versions locally optimize the entropic risk measure with respect to a
receding horizon, affine feedback control law for general nonlinear systems with non-convex costs.
We adopt a variant of the discrete-time iLEQG algorithm [169] to obtain a locally optimal solution
to (5.12). In what follows, we assume that the noise coefficient function g(xk, uk) in (5.2) is the
identity mapping for simplicity, but it is straightforward to handle nonlinear functions in a similar
manner as discussed in [156]. The algorithm starts by applying a given nominal control sequence
l_{0:N} to the noiseless dynamics to obtain the corresponding nominal state trajectory \bar{x}_{0:N+1}. In each
iteration, the algorithm maintains and updates a locally optimal controller K of the form:
K(k, x_k) = L_k (x_k - \bar{x}_k) + l_k, \qquad (5.13)
where L_k \in R^{m \times n} denotes the feedback gain matrix. The i-th iteration of our iLEQG implementation
consists of the following four steps:
1. Local Approximation. Given the nominal trajectory (l^{(i)}_{0:N}, \bar{x}^{(i)}_{0:N+1}), we compute the following
linear approximation of the dynamics as well as the quadratic approximation of the cost functions:
A_k = D_x f(\bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.14)
B_k = D_u f(\bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.15)
q_k = c(k, \bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.16)
\mathbf{q}_k = D_x c(k, \bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.17)
Q_k = D_{xx} c(k, \bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.18)
r_k = D_u c(k, \bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.19)
R_k = D_{uu} c(k, \bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.20)
P_k = D_{ux} c(k, \bar{x}^{(i)}_k, l^{(i)}_k) \qquad (5.21)
for k = 0 to N, where D is the differentiation operator. We also let q_{N+1} = h(\bar{x}^{(i)}_{N+1}),
\mathbf{q}_{N+1} = D_x h(\bar{x}^{(i)}_{N+1}), and Q_{N+1} = D_{xx} h(\bar{x}^{(i)}_{N+1}).
2. Backward Pass. We perform approximate DP using the current feedback gain matrices
L^{(i)}_{0:N} as well as the approximated model obtained in the previous step. Suppose that the noise
vector w_k is Gaussian-distributed according to N(0, W_k) with W_k \succ 0. Let s_{N+1} \triangleq q_{N+1},
\mathbf{s}_{N+1} \triangleq \mathbf{q}_{N+1}, and S_{N+1} \triangleq Q_{N+1}. Given these terminal conditions, we recursively compute
the following quantities:
M_k = W_k^{-1} - \theta S_{k+1} \qquad (5.22)
g_k = r_k + B_k^T (I + \theta S_{k+1} M_k^{-1}) \mathbf{s}_{k+1} \qquad (5.23)
G_k = P_k + B_k^T (I + \theta S_{k+1} M_k^{-1}) S_{k+1} A_k \qquad (5.24)
H_k = R_k + B_k^T (I + \theta S_{k+1} M_k^{-1}) S_{k+1} B_k, \qquad (5.25)
and
s_k = q_k + s_{k+1} - \frac{1}{2\theta} \log\det(I - \theta W_k S_{k+1}) + \frac{\theta}{2} \mathbf{s}_{k+1}^T M_k^{-1} \mathbf{s}_{k+1} + \frac{1}{2} l^{(i)T}_k H_k l^{(i)}_k + l^{(i)T}_k g_k \qquad (5.26)
\mathbf{s}_k = \mathbf{q}_k + A_k^T (I + \theta S_{k+1} M_k^{-1}) \mathbf{s}_{k+1} + L^{(i)T}_k H_k l^{(i)}_k + L^{(i)T}_k g_k + G_k^T l^{(i)}_k \qquad (5.27)
S_k = Q_k + A_k^T (I + \theta S_{k+1} M_k^{-1}) S_{k+1} A_k + L^{(i)T}_k H_k L^{(i)}_k + L^{(i)T}_k G_k + G_k^T L^{(i)}_k, \qquad (5.28)
from k = N down to 0. Note that M_k \succ 0 is necessary for M_k to be invertible, which may not hold
if \theta is too large. This is called “neurotic breakdown,” in which the optimizer is so pessimistic that
the cost-to-go approximation becomes infinite [174]. Otherwise, the approximated cost-to-go
for this optimal control problem (under the controller (L^{(i)}_{0:N}, l^{(i)}_{0:N})) is given by s_0.
3. Regularization and Control Computation. Having derived the DP solution, we compute
new control gains L^{(i+1)}_{0:N} and offset updates dl_{0:N} as follows:
L^{(i+1)}_k = -(H_k + \mu I)^{-1} G_k \qquad (5.29)
dl_k = -(H_k + \mu I)^{-1} g_k, \qquad (5.30)
where \mu \geq 0 is a regularization parameter that prevents (H_k + \mu I) from having negative
eigenvalues. We adaptively change \mu across iterations as suggested in [151], so the algorithm
enjoys fast convergence near a local minimum while ensuring the positive definiteness of
(H_k + \mu I) at all times.
4. Line Search for Ensuring Convergence. It is known that the update could lead to
increased cost or even divergence if a new trajectory strays too far from the region where the
local approximation is valid [151]. Thus, the new nominal control trajectory l^{(i+1)}_{0:N} is computed
by backtracking line search with line search parameter \epsilon. Initially, \epsilon = 1, and we derive a new
candidate nominal trajectory as follows:
\hat{l}_k = L^{(i+1)}_k (\hat{x}_k - \bar{x}^{(i)}_k) + l^{(i)}_k + \epsilon\, dl_k \qquad (5.31)
\hat{x}_{k+1} = f(\hat{x}_k, \hat{l}_k). \qquad (5.32)
If this candidate trajectory (l_{0:N}, x_{0:N+1}) results in a lower cost-to-go than the current nominal
trajectory, then the candidate trajectory is accepted and returned as (l^{(i+1)}_{0:N}, x^{(i+1)}_{0:N+1}).
Otherwise, the trajectory is rejected and re-derived with ε ← ε/2 until it is accepted. More
details on this line search can be found in [161].
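The accept/halve loop above can be sketched as follows. This is a minimal helper of our own (not the procedure of [161] verbatim); `rollout_cost(eps)` stands in for rolling out (5.31)–(5.32) with step size `eps` and evaluating the resulting cost-to-go, with `eps = 0` reproducing the current nominal trajectory.

```python
def backtracking_line_search(rollout_cost, eps=1.0, max_halvings=16):
    """Halve the step size eps until the candidate trajectory improves.

    rollout_cost(eps) returns the cost-to-go of the candidate obtained
    with feedforward step eps. Returns the accepted eps, or 0.0 if no
    improving step is found within max_halvings halvings.
    """
    base = rollout_cost(0.0)          # cost of the current nominal trajectory
    for _ in range(max_halvings):
        if rollout_cost(eps) < base:  # accept the candidate
            return eps
        eps /= 2.0                    # reject and retry with a smaller step
    return 0.0
```

For instance, with `rollout_cost = lambda e: (e - 0.3) ** 2` the search rejects ε = 1 and accepts ε = 0.5.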
The above procedure is iterated until the change in the nominal control l_k falls below some
threshold in norm. Once converged, the algorithm returns the nominal trajectory l_{0:N}, x_{0:N+1} as well as the feedback gains L_{0:N} and the approximate cost-to-go s_0.
5.4.2 Cross-Entropy Method
Having implemented the iLEQG algorithm as a local approximation method1 for the inner-loop
optimization of (5.11), it remains to solve the outer-loop optimization for the optimal risk-sensitivity
parameter θ∗. This is a one-dimensional optimization problem in which each function evaluation is
performed by solving the corresponding risk-sensitive optimal control problem (5.12). One might consider using a
simple grid search over a fixed interval specified a priori. Unfortunately, this may not work since the
feasible space Γ is dynamically changing due to re-planning; we empirically observed that the size
of Γ varied up to an order of magnitude in the collision avoidance scenario in Section 5.5. In this
work, we choose to adapt the cross entropy method [132, 80] to derive an approximately optimal
value of θ∗. This method is favorable for online optimization due to its any-time nature and the high
parallelizability of its Monte Carlo sampling. As a byproduct, it is also effective in approximately
finding the maximum feasible θ ∈ Γ within a few iterations, which is detailed later in this section.
We note however that it is possible to use other methods for the outer-loop optimization as well.
The cross entropy method is a stochastic method that maintains an explicit probability distribution
over the design space. At each step, a set of ms Monte Carlo samples is drawn from the distribution,
out of which a subset of me “elite samples” that achieve the best performance is retained. The
parameters of the distribution are then updated according to the maximum likelihood estimate on
the elite samples. The algorithm stops after a desired number of steps M .
In our implementation we model the distribution2 as a univariate Gaussian N(µ, σ²). A remaining
issue is that iLEQG may return an infinite cost-to-go if a sampled θ is too large, due to
neurotic breakdown. Since our search space is limited to Γ, where θ yields a finite cost-to-go, we
have to ensure that each iteration has enough samples in Γ.
To address this problem, we augment the cross entropy method with rejection and re-sampling.
Out of the ms samples drawn from the univariate Gaussian, we first discard all non-positive samples.
1Although the solutions obtained by iLEQG are local optima and differ for varied risk-sensitivity θ, we have empirically observed that they do not change drastically as a function of θ. This indicates that iLEQG is a sensible choice for the inner-loop optimization problem, since the solutions do not hop around local optima that are too different as we vary θ.
2Note that this distribution is defined in the space of θ and has nothing to do with p or q that define stochastic noise in the dynamical system.
For each of the remaining samples, we evaluate the objective (5.11) by a call to iLEQG, and then
count the number of samples that obtained a finite cost-to-go. Let mv be the number of such valid
samples. If mv ≥ max(me,ms/2), we proceed and fit the distribution. Otherwise, we redo the
sampling procedure as there are not sufficiently many valid samples to choose the elites from.
In practice, re-sampling is not likely to occur after the first iteration of the cross entropy method.
At the same time, we empirically found that the first iteration has a risk of re-sampling multiple
times, hence degrading efficiency. We therefore also perform an adaptive initialization of
the Gaussian parameters µinit and σinit in the first iteration as follows. If the first iteration with
N(µinit, σ²init) results in re-sampling, we not only re-sample but also halve µinit and σinit. If,
on the other hand, all of the ms samples are valid, we accept them but double µinit and σinit, since
this implies that the initial set of samples is not spread widely enough to cover the whole feasible set
Γ. The parameters µinit and σinit are stored internally in the cross entropy solver and carried over
to the next call to the algorithm. We have empirically found that this adaptive initialization is also
useful for approximately finding the maximum feasible θ, which we exploited in a comparative study
in Section 5.5.3.
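A minimal sketch of this augmented cross entropy loop is given below. The function names are ours, not from the thesis implementation; `objective(theta)` stands in for an iLEQG call returning s_0 + d/θ, with infinity signalling neurotic breakdown. (In the actual algorithm the stored µinit, σinit persist across MPC calls; here we only return the doubled values.)

```python
import numpy as np

def cem_theta(objective, mu_init, sigma_init, ms=10, me=3, M=5, seed=0):
    """Cross entropy search over theta > 0 with rejection/re-sampling.

    Non-positive samples are discarded; a round is accepted only when at
    least max(me, ms/2) samples have a finite objective. In the first
    iteration the sampling Gaussian is halved on re-sampling and marked
    for doubling when all ms samples are valid.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = mu_init, sigma_init
    for i in range(M):
        while True:
            thetas = mu + sigma * rng.standard_normal(ms)
            thetas = thetas[thetas > 0.0]          # discard non-positive theta
            costs = np.array([objective(t) for t in thetas])
            valid = np.isfinite(costs)
            mv = int(valid.sum())
            if i == 0 and mv < max(me, ms / 2):
                mu /= 2.0                          # shrink and re-sample
                sigma /= 2.0
                continue
            if i == 0 and mv == ms:
                mu_init, sigma_init = 2 * mu, 2 * sigma  # widen for the next call
            if mv >= max(me, ms / 2):
                break
            # otherwise: not enough valid samples, re-sample as-is
        elite = thetas[valid][np.argsort(costs[valid])[:me]]
        mu, sigma = elite.mean(), elite.std()      # maximum likelihood fit
    return mu
```

Because every elite sample lies in the feasible set, the returned mean stays inside Γ even when the objective blows up for large θ.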
5.4.3 RAT iLQR as MPC
We name the proposed bilevel optimization algorithm RAT iLQR. The pseudo-code is given in
Algorithm 5.1. At run time, it is executed as an MPC in a receding-horizon fashion; the control
is re-computed after executing the first control input u0 = l0 and transitioning to a new state. The
previously computed control trajectory l0:N is reused as the initial nominal control trajectory at
the next time step to warm-start the computation.
5.5 Results
This section presents qualitative and quantitative results of the simulation study that we conducted
to show the effectiveness of the RAT iLQR algorithm. We provide the problem setup as well as
implementation details in Section 5.5.1. The goals of this study are two-fold. First, we demonstrate
that the robot controlled by RAT iLQR can successfully accomplish its task in the presence of
stochastic disturbances, without access to the ground-truth distribution but with knowledge of the KL
divergence bound. This is presented in Section 5.5.2 with comparisons to (non-robust) iLQG and a
model-based MPC with sampling from the true generative model. Second, Section 5.5.3 focuses on
the nonlinear risk-sensitive optimal control aspect of RAT iLQR to show its value as an algorithm
that can optimally adjust the risk-sensitivity parameter online, which itself is a novel contribution.
Algorithm 5.1 RAT iLQR Algorithm
INPUT: Initial state x0, controls l0:N, L0:N (can be zero), KL divergence bound d
OUTPUT: New nominal trajectory l0:N, x0:N+1, control gains L0:N, risk-sensitivity parameter θ∗
1: Compute initial nominal trajectory x0:N+1 using l0:N
2: i ← 1
3: while i ≤ M do /* outer-loop optimization */
4:   while True do
5:     if i = 1 then
6:       θsampled ← drawSamples(ms, µinit, σinit)
7:     else
8:       θsampled ← drawSamples(ms, µ, σ)
9:     end if
10:    array r ▷ Empty array of size ms
11:    for j ← 1 : ms do /* inner-loop optimization */
12:      Solve iLEQG with l0:N, x0:N+1, θsampled[j]
13:      Obtain approximate cost-to-go s0
14:      r[j] ← s0 + d/θsampled[j]
15:    end for
16:    mv ← countValidSamples(θsampled, r)
17:    if i = 1 and mv < max(me, ms/2) then
18:      µinit ← µinit/2, σinit ← σinit/2
19:    else if i = 1 and mv = ms then
20:      µinit ← 2µinit, σinit ← 2σinit
21:      break
22:    else if mv ≥ max(me, ms/2) then
23:      break
24:    end if
25:  end while
26:  θelite ← selectElite(me, θsampled, r)
27:  µ, σ ← fitGaussian(θelite)
28:  i ← i + 1
29: end while
30: θ∗ ← µ
31: Solve iLEQG with l0:N, x0:N+1, θ∗
32: return new l0:N, x0:N+1 with L0:N and θ∗
5.5.1 Problem Setup
We consider a dynamic collision avoidance problem where a unicycle robot has to avoid a pedestrian
on a collision course, as illustrated in Figure 5.2. Collision avoidance problems are often modeled by
stochastic optimal control in the autonomous systems literature [85] and the human-robot interaction
literature [137, 68], including our work presented in Chapter 4.
The state of the robot is defined by (rx, ry, v, θ) ∈ R^4, where (rx, ry) [m] denotes the position,
v [m/s] the velocity, and θ [rad] the heading angle. The robot's control input is u = (a, b) ∈ R^2,
where a [m/s²] is the acceleration and b [rad/s] is the angular velocity. The pedestrian
is modeled as a single integrator, whose position is given by (px, py) [m]. We assume that the
position of the pedestrian is known to the robot through onboard sensing. We also employ a
constant nominal velocity model (ux, uy) = (0.0, 1.0) m/s for the pedestrian. The joint system
x ∈ R^6 consists of the state of the robot and the position of the pedestrian. The dynamics
are propagated by Euler integration with time interval dt = 0.1 s and additive noise wk on
the joint state. The model distribution for wk is a zero-mean Gaussian N(0, W) with
W = diag(1×10^−10, 1×10^−10, 1×10^−3, 1×10^−4, 2×10^−2, 2×10^−2) × dt. This covariance matrix
W is chosen to encode our modeling assumption that the pedestrian motion is the main source
of uncertainty in this joint system, and that the magnitude of the slip is almost negligible for the
robot. The ground-truth distribution for the robot is the same Gaussian as in the model, but the
pedestrian’s distribution is a mixture of Gaussians that is independent of the robot’s noise. Both the
model and the true distributions for the pedestrian are illustrated in Figure 5.1. Gaussian mixtures
are favored by many recent papers in machine learning to account for multi-modality in human
decision making [22, 165, 134].
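A sketch of this joint dynamics model is given below. Function and variable names are ours, and the state ordering (rx, ry, v, θ, px, py) is an assumption for illustration.

```python
import numpy as np

DT = 0.1  # integration interval [s]
# Model noise covariance from Section 5.5.1 (pedestrian motion dominates).
W_MODEL = np.diag([1e-10, 1e-10, 1e-3, 1e-4, 2e-2, 2e-2]) * DT

def joint_dynamics(x, u, w, dt=DT):
    """One Euler step of the joint robot-pedestrian system.

    x = (rx, ry, v, heading, px, py): unicycle robot state plus pedestrian
    position; u = (a, b): acceleration and angular velocity; w: additive
    noise on the joint state. The pedestrian follows the constant nominal
    velocity (ux, uy) = (0.0, 1.0) m/s.
    """
    rx, ry, v, th, px, py = x
    a, b = u
    xdot = np.array([
        v * np.cos(th),  # robot position rates
        v * np.sin(th),
        a,               # acceleration input
        b,               # angular-velocity input
        0.0,             # pedestrian nominal velocity
        1.0,
    ])
    return x + dt * xdot + w
```

Sampling `w = rng.multivariate_normal(np.zeros(6), W_MODEL)` at each step then propagates the model distribution.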
RAT iLQR requires an upper bound on the KL divergence between the model and the true
distribution. For this work, we assume that there is a separate module that provides
such an estimate. In this specific simulation study, we performed Monte Carlo integration with samples
drawn from the true distribution offline. During the simulation, however, we did not reveal any
information on the true distribution to RAT iLQR but the estimated KL value3 of 32.02. This
offline computation was possible due to our time-invariant assumption on the Gaussian mixture.
To use more realistic data-driven prediction instead (such as Trajectron++ discussed in Chapter
4), it would be necessary to estimate the KL divergence online since the predictive distribution
may change over time as the human-robot interaction evolves. Even though RAT iLQR works with
time-varying KL bounds owing to its MPC formulation, we limit our attention to a static KL bound
in this thesis as real-time computation of KL divergence can be challenging. Note that efficient
and accurate estimation of information measures (including KL divergence) is still an active area
of research in information theory and machine learning [12, 67], which is one of our future research
directions.
3One could test RAT iLQR under a distribution that has stronger multi-modality with a much larger KL bound than the one used in this simulation study, but it could introduce over-conservatism and lead to poor mean performance as the ambiguity set becomes too large. This is a common property of distributionally robust optimization [75].
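The offline Monte Carlo integration mentioned above amounts to estimating KL(p‖q) = E_p[log p − log q] with samples drawn from the true distribution p. A generic sketch (the function names are ours, not from the thesis):

```python
import numpy as np

def kl_divergence_mc(sample_p, logpdf_p, logpdf_q, n=200_000, seed=0):
    """Monte Carlo estimate of KL(p || q) using n samples drawn from p.

    sample_p(rng, n) draws n samples from the true distribution p, while
    logpdf_p and logpdf_q evaluate the log-densities of p (e.g. a Gaussian
    mixture) and q (the Gaussian model). Only the difference of the two
    log-densities matters, so shared normalization constants may cancel.
    """
    rng = np.random.default_rng(seed)
    xs = sample_p(rng, n)
    return float(np.mean(logpdf_p(xs) - logpdf_q(xs)))
```

For two unit-variance Gaussians N(0, 1) and N(1, 1) the estimate converges to the closed-form value 0.5.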
where ctrack denotes a quadratic cost that penalizes the deviation from a given target robot trajectory,
ccoll a collision penalty that incurs high cost when the robot is too close to the pedestrian, and cctrl a
small quadratic cost on the control input. Mirroring the formulation in [169], we used the following
collision cost:
c_coll = 10 ( 0.2 / √((rx − px)² + (ry − py)²) + 0.9 )^10. (5.34)
RAT iLQR was implemented in Julia and the Monte Carlo sampling of the cross entropy method
was distributed across multiple CPU cores. Our implementation with N = 19, M = 5, ms = 10,
and me = 3 yielded the average computation time of 0.27 s. This is 2.7 times slower than real time,
with dt = 0.1 s used to measure the real-time property. We expect to achieve improved efficiency
by further parameter tuning as well as more careful parallelization.
5.5.2 Comparison with Baseline MPC Algorithms
We compared the performance of RAT iLQR against two baseline MPC algorithms, iLQG [156] and
PETS [31]. iLQG corresponds to RAT iLQR with a KL bound of d = 0, i.e., no distributional
robustness is considered. In return, it is more computationally efficient than RAT iLQR, taking only
0.01 s per control computation. PETS is a state-of-the-art, model-based stochastic MPC algorithm with sampling and is
originally proposed in a model-based reinforcement learning (RL) context. We chose PETS as our
baseline since it also relies on the cross entropy method for online control optimization and is not
limited to Gaussian distributions, similar to RAT iLQR. However, there are three major differences
between PETS and RAT iLQR. First, PETS performs the cross entropy optimization directly in the
high-dimensional control sequence space, which is far less sample-efficient than RAT iLQR, which
uses the cross entropy method only to optimize the scalar risk-sensitivity parameter. Second, PETS
does not consider feedback during planning as opposed to RAT iLQR. Third, PETS requires access
to the exact ground-truth Gaussian mixture distribution to perform sampling, while RAT iLQR
only relies on the KL divergence bound and the Gaussian distribution that we have modeled. We
let PETS perform M = 5 iterations of the cross entropy optimization, each with 25 samples for the
control sequence coupled with 50 samples for the joint state trajectory prediction, which resulted in
the average computation time of 0.67 s.
We performed 30 runs of the simulation for each algorithm, with randomized pedestrian start
positions and stochastic transitions. To measure the performance, we computed the minimum
Figure 5.2: A unicycle robot avoiding collision with a road-crossing pedestrian. (Left) When the KL bound is set to d = 0, RAT iLQR ignores this model error and reduces to iLQG. (Right) With the correct information on the KL, RAT iLQR is aware of the prediction error and optimally adjusts the risk-sensitivity parameter for iLEQG, planning a trajectory that stays farther away from the pedestrian. The figures are overlaid with predictions drawn from the model distribution and closed-loop motion plans of the robot. Note that the prediction for the pedestrian is erroneous since the actual pedestrian motion follows the Gaussian mixture distribution. The model distribution and the true Gaussian mixture are both illustrated in Figure 5.1.
separation distance between the robot and the pedestrian in each run, assuming that both
agents are circular with the same diameter. The histogram plots presented in Figure 5.3 clearly
indicate the failure of iLQG and PETS, as well as RAT iLQR's capability to maintain a sufficient
safety margin for collision avoidance despite the distributional model mismatch. As summarized
in Table 5.1, RAT iLQR achieved the largest minimum separation distance on average with the
smallest standard deviation, which contributed to safe robot navigation. Note that even iLQG had
one collision under this large model mismatch. Figure 5.2 provides a qualitative explanation of this
failure; the planned trajectories by iLQG tend to be much closer to the passing pedestrian than
those by the risk-sensitive RAT iLQR. This difference is congruous with our earlier observations in
Chapter 4 where risk-sensitivity is shown to affect the global behavior of the robot.
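The evaluation metric above can be computed with a small helper of our own (the exact footprint bookkeeping used in the thesis experiments is assumed, not documented here): the smallest center-to-center distance over a run, minus the shared agent diameter, so that a negative value signals overlap, i.e. collision.

```python
import numpy as np

def min_separation(robot_xy, ped_xy, diameter):
    """Minimum separation distance over one run.

    robot_xy, ped_xy: arrays of shape (T, 2) with the agents' positions at
    each step; diameter: the common diameter of the two circular agents.
    Returns min center distance minus diameter (negative => collision).
    """
    d = np.linalg.norm(np.asarray(robot_xy) - np.asarray(ped_xy), axis=1)
    return float(d.min() - diameter)
```

Aggregating this quantity over the 30 randomized runs yields the histograms of Figure 5.3.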
5.5.3 Benefits of Risk-Sensitivity Parameter Optimization
We also performed two additional sets of 30 simulations for RAT iLQR, with two different ground-
truth distributions that are closer to the model distribution having the estimated KL divergence of
7.78 and 1.34, respectively. This is to study how different KL bounds affect the optimal choice of the
risk-sensitivity parameter θ∗. The results are shown in Figure 5.4. As the KL bound increases from
1.34 to 32.02, the ratio between the optimal θ∗ found by RAT iLQR to the maximum feasible θ also
Figure 5.3: Histograms of the minimum separation distance between the robot and the pedestrian. A negative value indicates that a collision has occurred in that run. For each control algorithm, we performed 30 runs of the simulation with randomized pedestrian start positions. RAT iLQR consistently maintained a sufficient safety margin to avoid collision, while iLQG and PETS both failed. See Table 5.1 for the summary statistics of these data.
increases. This matches our intuition that the larger the model mismatch, the more risk-sensitive
the robot becomes. However, we also note that the robot does not saturate θ all the time even under
the largest KL bound of 32.02. This raises a fundamental question on the benefits of RAT iLQR as
a risk-sensitive optimal control algorithm: how favorable is RAT iLQR with optimal θ∗ compared
to iLEQG with the highest risk-sensitivity?
To answer this question, we performed a comparative study between RAT iLQR with θ∗ and
iLEQG with θmax (i.e. maximum feasible θ found during the cross entropy sampling of RAT iLQR)
under the same simulation setup as before. The results are reported in Table 5.2. In terms of collision
avoidance, both algorithms were equally safe with collision count of 0. However, RAT iLQR achieved
significantly more efficient robot navigation compared to iLEQG with θmax, reducing the average
tracking error by 39%, 34%, and 16% for the KL values of 1.34, 7.78, and 32.02, respectively. The
efficiency and the safety of robot navigation are often in conflict in dynamic collision avoidance (see
Figure 5.5), and prior work [108] struggles to find the right balance by manually tuning a fixed θ.
With RAT iLQR, the need for manual tuning is eliminated, since the algorithm dynamically adjusts
θ to the value best suited to handle the potential model mismatch specified by the KL bound.
Method | Min. Sep. Dist. [m] | Total Collision Count
RAT iLQR (Ours) | 1.26 ± 0.51 | 0
iLQG | 0.94 ± 0.62 | 1
PETS | 0.87 ± 0.69 | 4
Table 5.1: Statistics summarizing the histogram plots presented in Figure 5.3. RAT iLQR achieved the largest average value for the minimum separation distance with the smallest standard deviation, which contributed to safe robot navigation without a single collision. Note that PETS had multiple collisions despite its access to the true Gaussian mixture distribution.
Figure 5.4: Time-averaged ratio of the optimal θ∗ found by RAT iLQR to the maximum feasible θ before neurotic breakdown occurs, plotted for three distinct KL divergence values. As the KL bound increases from 1.34 to 32.02, the ratio also consistently increased from 0.66 to 0.93. Note also that the standard deviation decreased from 0.29 to 0.10. This suggests that the robot becomes more risk-sensitive as the KL bound increases, and yet it does not choose the maximum θ value all the time.
5.6 Conclusions
In this chapter we propose RAT iLQR, a novel nonlinear MPC algorithm for distributionally robust
control under a given KL divergence bound. Our method is based on the mathematical equiva-
lence between distributionally robust control and risk-sensitive optimal control. A locally optimal
solution to the resulting bilevel optimization problem is derived with iLEQG and the cross entropy
method. The simulation study shows that RAT iLQR successfully takes into account the distribu-
tional mismatch during collision avoidance. It also shows the effectiveness of dynamic adjustment of
the risk-sensitivity parameter by RAT iLQR, which overcomes a limitation of existing risk-sensitive
optimal control methods. Future work will focus on accurate online estimation of the KL divergence
from a stream of data. We are also interested in exploring applications of RAT iLQR, including
control of learned dynamical systems and perception-aware control.
KL Bound: d = 1.34
Method | Total Collision Count | Tracking Error [m]
RAT iLQR (Ours) | 0 | 0.22 ± 0.29
iLEQG with θmax | 0 | 0.36 ± 0.45

KL Bound: d = 7.78
Method | Total Collision Count | Tracking Error [m]
RAT iLQR (Ours) | 0 | 0.25 ± 0.35
iLEQG with θmax | 0 | 0.38 ± 0.52

KL Bound: d = 32.02
Method | Total Collision Count | Tracking Error [m]
RAT iLQR (Ours) | 0 | 0.32 ± 0.40
iLEQG with θmax | 0 | 0.38 ± 0.46
Table 5.2: Our comparative study between RAT iLQR with θ∗ and iLEQG with θmax (i.e., the maximum feasible risk-sensitivity) reveals that RAT iLQR's optimal choice of the risk-sensitivity parameter θ∗ results in more efficient robot navigation with smaller trajectory tracking errors, while still achieving collision avoidance under the model mismatch. With RAT iLQR, the average tracking error was reduced by 39%, 34%, and 16% for the three true distributions with KL divergences of 1.34, 7.78, and 32.02, respectively.
Figure 5.5: Trade-off between collision avoidance and trajectory tracking for different methods. Error bars show the standard deviations divided by the square root of the number of samples. Note that the farther up and to the right, the better, as the x-axis is flipped. All the methods were tested under the true distribution with KL divergence d = 32.02 from the Gaussian model. As expected, RAT iLQR lies between the risk-neutral iLQG and the maximally risk-sensitive iLEQG.
Chapter 6
Conclusions and Future Research
6.1 Summary of Contributions
In this thesis, we study the problem of online planning/control for intelligent mobile robots that
are operating autonomously in open and interactive environments. Challenges associated with
environmental and dynamic uncertainty have been identified and categorized, based on the source of
uncertainty, as state uncertainty, transition uncertainty, and model uncertainty. We have taken an
online, model-based approach to propose computationally efficient solution methods that overcome
those uncertainties. State uncertainty is addressed via recursive Bayesian inference and belief space
planning, which together let the robot keep track of evolving state uncertainty and appropriately
mitigate it over time. Our online belief space planning frameworks are shown effective in various
applications with different combinations of dynamical systems and sensing modalities. Transition
uncertainty is handled by our nonlinear risk-sensitive MPC algorithm, whose ability to incorporate
modern generative models and consider the variance of uncertainty is advantageous in safety-critical
applications. This is demonstrated in safe autonomous navigation of a ground robot among many
mutually interacting humans. Finally, model uncertainty is addressed by our distributionally robust
MPC framework, which extends conventional risk-sensitive optimal control by dynamically and opti-
mally adjusting the risk-sensitivity parameter. This extension not only enables the robot to operate
without precise knowledge of the underlying probability distribution that characterizes transition
uncertainty, but also prevents the planner from becoming overly conservative; in simulation, the robot
controlled by our algorithm achieved more efficient navigation compared to typical risk-sensitive
optimal control, while not sacrificing safety. The methods developed in this thesis collectively make
the robotic system more reliable in open and interactive environments, by either mitigating, avoiding
the risk of, or becoming robust against inevitable uncertainty.
CHAPTER 6. CONCLUSIONS AND FUTURE RESEARCH 96
6.2 Future Work
There exist several open questions for future research. The active vision-based intent inference
framework proposed in Chapter 2 is found to effectively resolve message ambiguity in motion-based
communication. However, the current formulation is limited in that it requires a set of hand-designed
trajectories represented by ordered collections of discrete waypoints. In order for us to apply our
active vision framework to general intent inference problems, it is of practical importance to be able
to identify how distinct intents of an agent lead to different statistical behaviors described by general
stochastic dynamical systems. This most likely requires domain specific, labeled data of the agent’s
motion, from which we need to perform regression to fit a stochastic dynamical system model that
corresponds to each intent type of the agent.
For fast belief space planning presented in Chapter 3, our method applies local perturbation to a
given nominal control policy, which requires that the system is differentiable. Thus our algorithm is
not suited for perception models that are not differentiable, such as a monocular camera with a field-
of-view limit. It is likely not effective either for problems that involve inherently discrete decisions
as part of the optimization variables. For the former issue, there exist recent works [76, 149] that
approximate non-differentiable processes with differentiable surrogate models. For the latter, we
could combine the local perturbation with a discrete search algorithm so that the search is followed
by the perturbation. This is indeed similar to the approach that we have taken in Chapter 4,
in which a discrete search method with motion primitives precedes control perturbation by our
optimization algorithm. Another interesting question is how to apply our belief space planning
algorithm to complex perception systems for which an analytical model is not available. To that
end, representation learning that encodes a series of high dimensional observations (such as a stream
of RGB images) into learned Bayesian posteriors can be useful. Learning hidden state representations
along with their belief dynamics is gaining attention in the machine learning literature [57, 55, 58],
and could be incorporated into the model-based belief space planning framework developed in this
thesis.
For risk-sensitive and distributionally robust planning presented in Chapter 4 and 5, it is worth
considering other risk metrics than the entropic risk measure employed in this thesis. In particular,
chance constrained optimization would be a beneficial framework for safety-critical applications,
since it can give a statistical safety assurance by limiting the probability of undesirable outcome.
However, chance constrained optimization is highly intractable for nonlinear systems under general
multimodal distributions, with a notable exception of recent work by Wang et al. [166]. Furthermore,
Cheng et al. [29] have suggested that modern data-driven methods for quantifying uncertainty are
too inaccurate for chance constrained optimization to the extent that the safety assurances are no
longer reliable. Therefore, the notion of distributional robustness is still necessary to take account
of model uncertainty. We think that distributionally robust, chance constrained optimization is an
interesting topic for future research.
We believe that these open research questions can eventually lead to more effective solution
methods that overcome fundamental issues associated with uncertainty. We hope that the methods
and the findings presented in this thesis will become the basis for future research towards truly
reliable robotic autonomy in open and interactive environments.
Appendix A
Mode Insertion Gradient for
Stochastic Hybrid Systems
In this appendix, we provide a thorough analysis of the stochastic hybrid systems with time-driven
switching that satisfy Assumptions 3.1 – 3.4. Our goal is to prove Theorem 3.1, which guarantees
that the mode insertion gradient exists and is well-defined. The results presented in this appendix
also appear in our journal paper [112] and serve as a generalization of the mode insertion gradient
presented in [41, 172, 6] to also take account of stochasticity in the dynamical system.
A.1 Nominal Trajectory under Specific Observations
First, we analyze the properties of the system x(t) for t ∈ [0, T ] under a given initial condition x0,
control u ∈ U , and a specific sequence of observations (y1, . . . , yT ) sampled from (Y1, . . . , YT ).
Proposition A.1 (Existence and Uniqueness of Solutions). Given a control u ∈ U and a sequence
of observations (y1, . . . , yT ), the system x(t) starting at x0 has a unique solution for t ∈ [0, T ].
Proof. We will show that each xi for i ∈ {1, . . . , T} has a unique solution, and thus x is uniquely
determined as a whole. First, by Assumptions (3.2a), (3.2c) and the Picard Lemma (Lemma 5.6.3 in
[43]), the differential equation
ẋ1(t) = f(x1(t), u(t)) (A.1)
with initial condition x1(0) = x0 has a solution for t ∈ [0, 1]. Furthermore, Proposition 5.6.5 in [43]
assures that the solution x1 is unique under Assumption (3.2a) and (3.2c). This guarantees that the
initial condition for x2 defined by x2(1) = g(x1(1), y1) is unique. Therefore, proceeding by induction,
each of x1, . . . , xT has a unique solution, which completes the proof.
APPENDIX A. MODE INSERTION GRADIENT FOR STOCHASTIC HYBRID SYSTEMS 99
Corollary A.2 (Right Continuity). Given a control u ∈ U and a sequence of observations (y1, . . . , yT ),
the system x(t) starting at x0 is right continuous in t on [0, T ].
Proof. By Proposition A.1, each xi has a unique solution that follows ẋi = f(xi, u). Clearly each
xi is continuous on [i − 1, i], which proves that x(t) ≜ xi(t) ∀t ∈ [i − 1, i) ∀i ∈ {1, 2, . . . , T} with
x(T) ≜ g(xT(T), yT) is right continuous on [0, T].
Lemma A.3. Let ξi denote the initial condition for xi. Then, there exists a constant K8 < ∞ such
that for all i ∈ {1, . . . , T},
∀t ∈ [i − 1, i]: ‖xi(t)‖2 ≤ (1 + ‖ξi‖2) e^{K8}. (A.2)
Proof. Using Assumptions (3.2a) and (3.2c), the claim follows directly from Proposition 5.6.5 in
[43].
Proposition A.4 (Lipschitz Continuity). For each i ∈ {1, . . . , T}, let ξ′i and ξ′′i be two distinct
initial conditions for xi. Furthermore, let u′ and u′′ be two controls from U. The pairs (ξ′, u′) and
(ξ′′, u′′) respectively define two solutions x′i and x′′i to the ODE ẋi = f(xi, u) over [i − 1, i]. Then,
there exists an L < ∞, independent of ξ′i, ξ′′i, u′ and u′′, such that
To bound the terminal cost, note that

|h(x(T))| ≤ K6 + K7 ‖x(T)‖2^{L3} (A.34)
= K6 + K7 ‖g(xT(T), yT)‖2^{L3} (A.35)
≤ K6 + K7 ( K2 + K4 ‖yT‖2^{L2} + ‖xT(T)‖2^{L1} (K3 + K5 ‖yT‖2^{L2}) )^{L3} (A.36)

by Assumptions (3.2d) and 3.3. Since L3 < ∞, (A.36) yields a polynomial in ‖xT(T)‖2 and ‖yT‖2
with finitely many terms, for each of which we can use Proposition A.7 and apply the multinomial
expansion formula to show

|h(x(T))| ≤ Σ_{(j1,...,jT) ∈ K′_{T+1}} α′^{(j1,...,jT)}_{T+1}(x0) Π_{m=1}^{T} ‖ym‖2^{jm}, (A.37)

for some finite set K′_{T+1} of sequences of non-negative integers as well as finite positive constants
α′^{(j1,...,jT)}_{T+1}(x0).
A.2 Perturbed Trajectory under Specific Observations
Next, we will perturb the nominal state x of the system while assuming the same initial condition x0
and the specific observations (y1, . . . , yT ) as in Section A.1. The perturbed control is an open-loop
perturbation as defined in Definition 3.1, which we restate below.
Definition A.1 (Perturbed Control). Let u ∈ U be a control. For τ ∈ (0, 1) and v ∈ B(0, ρmax),
define the perturbed control uε by
uε(t) ≜ v if t ∈ (τ − ε, τ], and u(t) otherwise, (A.38)
where ε ∈ [0, τ ]. By definition if ε = 0 then uε is the same as u. We assume that the nominal control
u(t) is left continuous in t at t = τ .
Remark A.1. It is obvious that uε(t) is piecewise continuous on [0, T ]. Therefore, for v ∈ B(0, ρmax)
we have uε ∈ U , i.e. uε is an admissible control. Thus, Proposition A.1 assures that there exists a
unique solution xε for the trajectory of the system under the control perturbation. In the remainder
of the analysis, we assume that (τ, v) is given and fixed.
Lemma A.9. Let $\varepsilon, \varepsilon' \in [0, \tau]$, and let $u^\varepsilon, u^{\varepsilon'} \in \mathcal{U}$ be two perturbed controls of the form (A.38). Let $x_1^\varepsilon, x_1^{\varepsilon'}$ be the solutions of $x_1$ for $t \in [0, 1]$ obtained by applying $u^\varepsilon$ and $u^{\varepsilon'}$, respectively, to the initial condition $x_0$. Then, there exists an $L' < \infty$, independent of $\varepsilon$, $\varepsilon'$, $x_1^\varepsilon$ and $x_1^{\varepsilon'}$, such that
\begin{equation}
\forall \varepsilon, \varepsilon' \in [0, \tau] \;\; \forall t \in [0, 1] \quad \| x_1^\varepsilon(t) - x_1^{\varepsilon'}(t) \|_2 \leq L' |\varepsilon - \varepsilon'|. \tag{A.39}
\end{equation}
Proof. By Proposition A.4, we find that
\begin{equation}
\forall t \in [0, 1] \quad \| x_1^\varepsilon(t) - x_1^{\varepsilon'}(t) \|_2 \leq L \int_0^1 \| u^\varepsilon(t) - u^{\varepsilon'}(t) \|_2 \, dt \tag{A.40}
\end{equation}
for some $L < \infty$. Let us derive an upper bound on the integral on the right-hand side. If $\varepsilon \geq \varepsilon'$,
\begin{equation}
\int_0^1 \| u^\varepsilon(t) - u^{\varepsilon'}(t) \|_2 \, dt = \int_{\tau - \varepsilon}^{\tau - \varepsilon'} \| v - u(t) \|_2 \, dt, \tag{A.41}
\end{equation}
where $u(t)$ is the nominal control on which both $u^\varepsilon$ and $u^{\varepsilon'}$ are based. Since $u(t) \in B(0, \rho_{\max})$, we obtain
\begin{equation}
\int_0^1 \| u^\varepsilon(t) - u^{\varepsilon'}(t) \|_2 \, dt \leq \sup_{u \in B(0, \rho_{\max})} \| v - u \|_2 \, (\varepsilon - \varepsilon'). \tag{A.42}
\end{equation}
Similarly, if $\varepsilon < \varepsilon'$ we have
\begin{equation}
\int_0^1 \| u^\varepsilon(t) - u^{\varepsilon'}(t) \|_2 \, dt \leq \sup_{u \in B(0, \rho_{\max})} \| v - u \|_2 \, (\varepsilon' - \varepsilon). \tag{A.43}
\end{equation}
Putting these two cases together and substituting into (A.40), we get
\begin{equation}
\forall t \in [0, 1] \quad \| x_1^\varepsilon(t) - x_1^{\varepsilon'}(t) \|_2 \leq L' |\varepsilon - \varepsilon'|, \tag{A.44}
\end{equation}
where $L' \triangleq L \sup_{u \in B(0, \rho_{\max})} \| v - u \|_2 \leq 2 L \rho_{\max} < \infty$.
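The Lipschitz bound of Lemma A.9 can be sanity-checked numerically on a toy system. The sketch below uses the scalar dynamics $\dot{x} = -x + u$ with $x(0) = 1$ (an illustrative example, not the hybrid system of Chapter 3), applies needle perturbations of different widths, and verifies that the terminal-state difference stays below a conservative $L' |\varepsilon - \varepsilon'|$:

```python
import math

# Numerical check of Lemma A.9 for the scalar system dx/dt = -x + u(t),
# x(0) = 1 (an illustrative toy system). The perturbed control equals
# v on (tau - eps, tau] and the nominal control u = 0 elsewhere.

def simulate(eps, tau=0.5, v=1.0, T=1.0, n=100_000):
    """Forward-Euler integration of the perturbed trajectory up to time T."""
    dt = T / n
    x, t = 1.0, 0.0
    for _ in range(n):
        u = v if tau - eps < t <= tau else 0.0
        x += dt * (-x + u)
        t += dt
    return x

# L' <= 2 * L * rho_max; for this example rho_max = 1 and L' = 2 is
# conservative, since the closed-form gap is (e^{-eps'} - e^{-eps}) e^{-(T-tau)}.
L_prime = 2.0
for e1, e2 in [(0.0, 0.1), (0.05, 0.2), (0.1, 0.3)]:
    diff = abs(simulate(e1) - simulate(e2))
    assert diff <= L_prime * abs(e2 - e1)
```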
Lemma A.10. Let $u^\varepsilon$ and $x_1^\varepsilon$ be as in Lemma A.9. Then,
Lemma A.19. Given $\varepsilon \in [0, \tau)$, $u^\varepsilon$ and $(y_1, \ldots, y_T)$, the right derivative of the instantaneous cost function with respect to $\varepsilon$ is given by
\begin{equation}
\forall t \in [i - 1, i] \quad \frac{\partial^+}{\partial \varepsilon} c(x_i^\varepsilon(t), u^\varepsilon(t)) = \frac{\partial}{\partial x_i^\varepsilon} c(x_i^\varepsilon(t), u(t))^\mathsf{T} \Psi_i^\varepsilon(t) \tag{A.142}
\end{equation}
for each $i \in \{2, \ldots, T\}$. For $i = 1$ we have
\begin{equation}
\forall t \in (\tau, 1] \quad \frac{\partial^+}{\partial \varepsilon} c(x_1^\varepsilon(t), u^\varepsilon(t)) = \frac{\partial}{\partial x_1^\varepsilon} c(x_1^\varepsilon(t), u(t))^\mathsf{T} \Psi_1^\varepsilon(t). \tag{A.143}
\end{equation}
Similarly, the right derivative of the terminal cost function with respect to $\varepsilon$ is given by
\begin{equation}
\frac{\partial^+}{\partial \varepsilon} h(x^\varepsilon(T)) = \frac{\partial}{\partial x^\varepsilon} h(x^\varepsilon(T))^\mathsf{T} \Psi^\varepsilon(T). \tag{A.144}
\end{equation}
Proof. To prove the claim for the instantaneous cost, note that $u^\varepsilon(t) = u(t)$ for all $t \in (\tau, T]$ and use the chain rule. The case for the terminal cost also follows from the chain rule.
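The chain-rule identities above can be illustrated with a finite-difference check on a toy linear system, for which the sensitivity $\Psi$ has a closed form. The system, cost, and all constants below are illustrative assumptions, not the thesis's setting:

```python
import math

# Toy scalar system dx/dt = -x + u, x(0) = 1, nominal control u = 0,
# instantaneous cost c(x) = x^2. Needle perturbation: v = 1 on (tau - eps, tau].
# For this linear system the sensitivity is
#   Psi(t) = e^{-(t - tau)} * (f(x(tau), v) - f(x(tau), u(tau))) = e^{-(t - tau)}.

def x_nominal(t):
    return math.exp(-t)

def x_perturbed(t, tau, eps):
    # closed-form perturbed solution, valid for t >= tau
    return math.exp(-t) + (1.0 - math.exp(-eps)) * math.exp(-(t - tau))

def cost(x):
    return x ** 2

tau, t = 0.5, 1.0
psi = math.exp(-(t - tau))                 # sensitivity Psi(t)
chain_rule = 2.0 * x_nominal(t) * psi      # (dc/dx) * Psi, as in (A.142)

# Right derivative of the cost via a one-sided finite difference in eps.
eps = 1e-6
finite_diff = (cost(x_perturbed(t, tau, eps)) - cost(x_nominal(t))) / eps
assert abs(finite_diff - chain_rule) < 1e-5
```

For this example the finite-difference quotient matches $\frac{\partial}{\partial x} c \cdot \Psi$ to first order in $\varepsilon$, which is exactly the content of the chain-rule argument in the proof.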
Proposition A.20 (Bounded Cost Variations). Given $\varepsilon \in [0, \tau)$, $u^\varepsilon$ and $(y_1, \ldots, y_T)$, the right derivative of the instantaneous cost function with respect to $\varepsilon$ has the following uniform bound:
\begin{equation}
\forall t \in [i - 1, i] \quad \left| \frac{\partial^+}{\partial \varepsilon} c(x_i^\varepsilon(t), u^\varepsilon(t)) \right| \leq \sum_{(j_1, \ldots, j_{i-1}) \in \mathcal{L}'_i} \beta_i'^{(j_1, \ldots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \| y_m \|_2^{j_m} \tag{A.145}
\end{equation}
for each $i \in \{2, \ldots, T\}$, where $\mathcal{L}'_i$ is a finite set of sequences of non-negative integers of length $i - 1$, and $\beta_i'^{(j_1, \ldots, j_{i-1})}(x_0)$ is a finite positive constant that depends on $x_0$ and $(j_1, \ldots, j_{i-1})$ but not on $\varepsilon$, $u^\varepsilon$, or $(y_1, \ldots, y_T)$.
For $i = 1$ the bound is given by
\begin{equation}
\forall t \in (\tau, 1] \quad \left| \frac{\partial^+}{\partial \varepsilon} c(x_1^\varepsilon(t), u^\varepsilon(t)) \right| \leq \beta'_1(x_0) \tag{A.146}
\end{equation}
for some finite positive constant $\beta'_1(x_0)$.
Similarly, the right derivative of the terminal cost function with respect to $\varepsilon$ has the following bound:
\begin{equation}
\left| \frac{\partial^+}{\partial \varepsilon} h(x^\varepsilon(T)) \right| \leq \sum_{(j_1, \ldots, j_T) \in \mathcal{L}'_{T+1}} \beta_{T+1}'^{(j_1, \ldots, j_T)}(x_0) \prod_{m=1}^{T} \| y_m \|_2^{j_m} \tag{A.147}
\end{equation}
for some finite set $\mathcal{L}'_{T+1}$ of sequences of non-negative integers and finite positive constants $\beta_{T+1}'^{(j_1, \ldots, j_T)}(x_0)$.
Proof. The proof of this proposition is similar to that of Proposition A.8. Take any $\varepsilon \in [0, \tau)$. For $i \in \{2, \ldots, T\}$, Assumption 3.3 along with Lemma A.19 yields
\begin{align}
\left| \frac{\partial^+}{\partial \varepsilon} c(x_i^\varepsilon(t), u^\varepsilon(t)) \right| &= \left| \frac{\partial}{\partial x_i^\varepsilon} c(x_i^\varepsilon(t), u(t))^\mathsf{T} \Psi_i^\varepsilon(t) \right| \tag{A.148} \\
&\leq \left\| \frac{\partial}{\partial x_i^\varepsilon} c(x_i^\varepsilon(t), u(t)) \right\|_2 \cdot \| \Psi_i^\varepsilon(t) \|_2 \tag{A.149} \\
&\leq \left( K_6 + K_7 \| x_i^\varepsilon(t) \|_2^{L_3} \right) \| \Psi_i^\varepsilon(t) \|_2. \tag{A.150}
\end{align}
By Propositions A.7 and A.17, we have
\begin{align}
\| x_i^\varepsilon(t) \|_2 &\leq \sum_{(j_1, \ldots, j_{i-1}) \in \mathcal{K}_i} \alpha_i^{(j_1, \ldots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \| y_m \|_2^{j_m} \tag{A.151} \\
\| \Psi_i^\varepsilon(t) \|_2 &\leq \sum_{(j_1, \ldots, j_{i-1}) \in \mathcal{L}_i} \beta_i^{(j_1, \ldots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \| y_m \|_2^{j_m} \tag{A.152}
\end{align}
for all $t \in [i - 1, i]$. Substituting these into (A.150) and using the multinomial expansion formula, we conclude that
\begin{equation}
\forall i \in \{2, \ldots, T\} \;\; \forall t \in [i - 1, i] \quad \left| \frac{\partial^+}{\partial \varepsilon} c(x_i^\varepsilon(t), u^\varepsilon(t)) \right| \leq \sum_{(j_1, \ldots, j_{i-1}) \in \mathcal{L}'_i} \beta_i'^{(j_1, \ldots, j_{i-1})}(x_0) \prod_{m=1}^{i-1} \| y_m \|_2^{j_m} \tag{A.153}
\end{equation}
for some finite set $\mathcal{L}'_i$ of sequences of non-negative integers and finite positive constants $\beta_i'^{(j_1, \ldots, j_{i-1})}(x_0)$. Similarly, for $i = 1$ we have, for all $t \in (\tau, 1]$,
\begin{align}
\left| \frac{\partial^+}{\partial \varepsilon} c(x_1^\varepsilon(t), u^\varepsilon(t)) \right| &\leq \left( K_6 + K_7 \| x_1^\varepsilon(t) \|_2^{L_3} \right) \| \Psi_1^\varepsilon(t) \|_2 \tag{A.154} \\
&\leq 2 \left( K_6 + K_7 \alpha_1(x_0)^{L_3} \right) K_1 \rho_{\max} e^{K_1} \tag{A.155} \\
&\triangleq \beta'_1(x_0), \tag{A.156}
\end{align}
by Proposition A.7 and Lemma A.16.
To bound the right derivative of the terminal cost, note that
\begin{align}
\left| \frac{\partial^+}{\partial \varepsilon} h(x^\varepsilon(T)) \right| &= \left| \frac{\partial}{\partial x^\varepsilon} h(x^\varepsilon(T))^\mathsf{T} \Psi^\varepsilon(T) \right| \tag{A.157} \\
&\leq \left( K_6 + K_7 \| x^\varepsilon(T) \|_2^{L_3} \right) \| \Psi^\varepsilon(T) \|_2 \tag{A.158} \\
&= \left( K_6 + K_7 \| g(x_T^\varepsilon(T), y_T) \|_2^{L_3} \right) \| \Psi^\varepsilon(T) \|_2 \tag{A.159} \\
&\leq \left( K_6 + K_7 \| g(x_T^\varepsilon(T), y_T) \|_2^{L_3} \right) \left\| \frac{\partial}{\partial x_T^\varepsilon} g(x_T^\varepsilon(T), y_T) \right\|_2 \cdot \| \Psi_T^\varepsilon(T) \|_2 \tag{A.160}
\end{align}
by Assumption 3.3, Proposition A.15, and Lemma A.19. One can apply Assumption (3.2d) to bound the norms of $g$ and its Jacobian in terms of $\| x_T^\varepsilon(T) \|_2$ and $\| y_T \|_2$. Then, (A.160) becomes a polynomial in $\| x_T^\varepsilon(T) \|_2$ and $\| y_T \|_2$, multiplied by $\| \Psi_T^\varepsilon(T) \|_2$. Finally, using (A.151) and (A.152) with $i = T$ to replace $\| x_T^\varepsilon(T) \|_2$ and $\| \Psi_T^\varepsilon(T) \|_2$, one can verify that
\begin{equation}
\left| \frac{\partial^+}{\partial \varepsilon} h(x^\varepsilon(T)) \right| \leq \sum_{(j_1, \ldots, j_T) \in \mathcal{L}'_{T+1}} \beta_{T+1}'^{(j_1, \ldots, j_T)}(x_0) \prod_{m=1}^{T} \| y_m \|_2^{j_m} \tag{A.161}
\end{equation}
for some finite set $\mathcal{L}'_{T+1}$ of sequences of non-negative integers and finite positive constants $\beta_{T+1}'^{(j_1, \ldots, j_T)}(x_0)$.
Lemma A.21. Let $x^\varepsilon$ be the perturbed state induced by the perturbed control $u^\varepsilon$, and let $(y_1, \ldots, y_T)$ be the given observations. Then, the function $\varepsilon \mapsto c(x^\varepsilon(t), u^\varepsilon(t))$ is continuous with respect to $\varepsilon \in [0, \tau]$ for all $t \in (\tau, T]$.
Proof. Note that for $t \in (\tau, T]$ we have $u^\varepsilon(t) = u(t)$. Thus, $c(x^\varepsilon(t), u^\varepsilon(t)) = c(x^\varepsilon(t), u(t))$. The continuity of $x_1^\varepsilon(t)$ with respect to $\varepsilon \in [0, \tau]$ follows from Lemma A.9. In particular, $x_1^\varepsilon(1)$ is continuous with respect to $\varepsilon$. Next, suppose that $\varepsilon \mapsto x_i^\varepsilon(i)$ is continuous for some $i \in \{1, \ldots, T\}$. Then, by Assumption (3.2b) and Corollary A.5 it follows that $\varepsilon \mapsto x_{i+1}^\varepsilon(t)$ is continuous for all $t \in [i, i + 1]$. Proceeding by mathematical induction, we conclude that $x^\varepsilon(t)$ is continuous with respect to $\varepsilon \in [0, \tau]$ for all $t \in (\tau, T]$. Therefore, $\varepsilon \mapsto c(x^\varepsilon(t), u(t))$ is continuous by Assumption 3.3.
Proposition A.22. Let $u \in \mathcal{U}$ be a control, which yields the nominal state $x$. Let $x^\varepsilon$ be the perturbed state induced by the perturbed control $u^\varepsilon$, and let $(y_1, \ldots, y_T)$ be the given observations. Then, the