The Berkeley Artificial Intelligence Research Blog
http://bair.berkeley.edu/blog/

FaSTrack: Ensuring Safe Real-Time Navigation of Dynamic Systems

<div class="videoWrapper">
<iframe src="https://www.youtube.com/embed/KcJJOI2TYJA" frameborder="0" allowfullscreen=""></iframe>
</div>
<h1 id="the-problem-fast-and-safe-motion-planning">The Problem: Fast and Safe Motion Planning</h1>
<p>Real-time autonomous motion planning and navigation is hard, especially when we
care about safety. This becomes even more difficult for systems with
complicated dynamics, external disturbances (like wind), and <em>a priori</em> unknown
environments. Our goal in this work is to “robustify” existing real-time motion
planners to guarantee safety during navigation of dynamic systems.</p>
<!--more-->
<p>In control theory there are techniques like <a href="http://ieeexplore.ieee.org/abstract/document/1463302/">Hamilton-Jacobi Reachability
Analysis</a> that provide rigorous safety guarantees of system behavior, along
with an optimal controller to reach a given goal (see Fig. 1). However, in
general the computational methods used in HJ Reachability Analysis are only
tractable in decomposable and/or low-dimensional systems; this is due to the
“curse of dimensionality.” In practice, this means we cannot compute safe
trajectories in real time for systems with more than about two state dimensions. Since most
real-world system models like cars, planes, and quadrotors have more than two
dimensions, these methods are usually intractable in real time.</p>
<p>On the other hand, geometric motion planners like rapidly-exploring random trees
(RRT) and model-predictive control (MPC) can plan in real time by using
simplified models of system dynamics and/or a short planning horizon. Although
this allows us to perform real-time motion planning, the resulting trajectories
may be overly simplified, lead to unavoidable collisions, and may even be
dynamically infeasible (see Fig. 1). For example, imagine riding a bike and
following the path on the ground traced by a pedestrian. This path leads you
straight towards a tree and then takes a 90 degree turn away at the last second.
You can’t make such a sharp turn on your bike, and instead you end up crashing
into the tree. Classically, roboticists have mitigated this issue by pretending
obstacles are slightly larger than they really are during planning. This
greatly improves the chances of not crashing, but still doesn’t provide
guarantees and may lead to unanticipated collisions.</p>
<p>So how do we combine the speed of fast planning with the safety guarantee of
slow planning?</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure1.png" width="600" alt="fig1" />
<br />
<i>
Figure 1. On the left we have a high-dimensional vehicle moving through an
obstacle course to a goal. Computing the optimal safe trajectory is a slow and
sometimes intractable task, and replanning is nearly impossible. On the right
we simplify our model of the vehicle (in this case assuming it can move in
straight lines connected at points). This allows us to plan very quickly, but
when we execute the planned trajectory we may find that we cannot actually
follow the path exactly, and end up crashing.
</i>
</p>
<h1 id="the-solution-fastrack">The Solution: FaSTrack</h1>
<p>FaSTrack (Fast and Safe Tracking) is a tool that “robustifies” fast
motion planners like RRT or MPC while maintaining real-time performance.
FaSTrack allows users to implement a fast motion planner with simplified
dynamics while maintaining safety in the form of a <em>precomputed</em> bound on the
maximum possible distance between the planner’s state and the actual autonomous
system’s state at runtime. We call this distance the <em>tracking error bound</em>.
This precomputation also results in an optimal control lookup table which
provides the optimal error-feedback controller for the autonomous system to
pursue the online planner in real time.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure2.png" height="300" alt="fig2" />
<br />
<i>
Figure 2. The idea behind FaSTrack is to plan using the simplified model (blue),
but precompute a tracking error bound that captures all potential deviations of
the trajectory due to model mismatch and environmental disturbances like wind,
and an error-feedback controller to stay within this bound. We can then augment
our obstacles by the tracking error bound, which guarantees that our dynamic
system (red) remains safe. Augmenting obstacles is not a new idea in the
robotics community, but by using our tracking error bound we can take into
account system dynamics and disturbances.
</i>
</p>
<h2 id="offline-precomputation">Offline Precomputation</h2>
<p>We precompute this tracking error bound by viewing the problem as a
pursuit-evasion game between a planner and a tracker. The planner uses a
simplified model of the true autonomous system that is necessary for real time
planning; the tracker uses a more accurate model of the true autonomous system.
We assume that the tracker — the true autonomous system — is always pursuing
the planner. We want to know what the maximum relative distance (i.e. <em>maximum
tracking error</em>) could be in the worst case scenario: when the planner is
actively attempting to evade the tracker. If we can compute this worst-case
distance, we know the maximum tracking error that can occur at run time.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure3.png" width="500" alt="fig3" />
<br />
<i>
Figure 3. Tracking system with complicated model of true system dynamics
tracking a planning system that plans with a very simple model.
</i>
</p>
<p>Because we care about maximum tracking error, we care about maximum relative
distance. So to solve this pursuit-evasion game we must first determine the
relative dynamics between the two systems by fixing the planner at the origin
and determining the dynamics of the tracker relative to the planner. We then
specify a cost function as the distance to this origin, i.e. relative distance
of tracker to the planner, as seen in Fig. 4. The tracker will try to minimize
this cost, and the planner will try to maximize it. While evolving these
optimal trajectories over time, we capture the highest cost that occurs over the
time period. If the tracker can always eventually catch up to the planner, this
cost converges to a fixed cost for all time.</p>
<p>The smallest invariant level set of the converged value function determines
the tracking error bound, as seen in Fig. 5. Moreover, the gradient
of the converged value function can be used to create an optimal error-feedback
control policy for the tracker to pursue the planner. We used <a href="http://www.cs.ubc.ca/~mitchell/ToolboxLS/">Ian Mitchell’s
Level Set Toolbox</a> and Reachability Analysis to solve this differential
game. For a more thorough explanation of the optimization, please see <a href="https://arxiv.org/abs/1703.07373">our
recent paper from the 2017 IEEE Conference on Decision and Control</a>.</p>
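As a toy illustration of this precomputation, the sketch below runs a discrete pursuit-evasion recursion on a grid: a tracker with inertia (velocity state, bounded acceleration) pursues a kinematic planner, and the value function is the worst-case maximum distance over time. All dynamics, grids, and bounds here are our own simplifications; the actual computation in the paper uses the Level Set Toolbox on continuous dynamics.

```python
import numpy as np

# Relative state: e = tracker position - planner position, v = tracker velocity.
# Planner moves d in {-1,0,1} per step; tracker accelerates u in {-1,0,1}, |v| <= 2.
E, VMAX = 10, 2
es = np.arange(-E, E + 1)
vs = np.arange(-VMAX, VMAX + 1)
V = np.abs(es)[:, None] * np.ones((1, len(vs)))   # cost = |e|, initial value fn

def e_idx(e):                       # clip relative position onto the grid
    return np.clip(e + E, 0, 2 * E)

for _ in range(500):
    V_new = V.copy()
    for i, e in enumerate(es):
        for j, v in enumerate(vs):
            best = np.inf
            for u in (-1, 0, 1):                               # tracker minimizes
                jn = int(np.clip(v + u, -VMAX, VMAX)) + VMAX
                worst = max(V[e_idx(e + v - d), jn]
                            for d in (-1, 0, 1))               # planner maximizes
                best = min(best, worst)
            V_new[i, j] = max(abs(e), best)   # capture the highest cost over time
    if np.array_equal(V_new, V):              # value function has converged
        break
    V = V_new

teb = V.min()   # smallest level of the converged value fn: the tracking error bound
```

Because the tracker has inertia while the planner can reverse direction instantly, the converged bound is strictly positive: the planner can always force at least one unit of tracking error in this toy game.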
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure4.gif" height="270" style="margin: 5px;" alt="gif4" />
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure5.gif" height="270" style="margin: 5px;" alt="gif5" />
<br />
<i>
Figures 4 & 5: On the left we show the value function initializing at the cost
function (distance to origin) and evolving according to the differential game.
On the right we show 3D and 2D slices of this value function. Each slice can
be thought of as a “candidate tracking error bound.” Over time, some of these
bounds become infeasible to stay within. The smallest invariant level set of the
converged value function provides us with the tightest tracking error bound that
is feasible.
</i>
</p>
<h2 id="online-real-time-planning">Online Real-Time Planning</h2>
<p>In the online phase, we sense obstacles within a given sensing radius and
imagine expanding these obstacles by the tracking error bound with a Minkowski
sum. Using these padded obstacles, the motion planner decides its next desired
state. Based on that relative state between the tracker and planner, the
optimal control for the tracker (autonomous system) is determined from the
lookup table. The autonomous system executes the optimal control, and the
process repeats until the goal has been reached. This means that the motion
planner can continue to plan quickly, and by simply augmenting obstacles and
using a lookup table for control we can ensure safety!</p>
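A minimal runnable sketch of this loop, with toy 2D point models of our own invention: a greedy straight-line planner stands in for RRT/MPC, and a saturated error-feedback law stands in for the precomputed optimal control lookup table.

```python
import numpy as np

def minkowski_pad(obstacle, teb):
    cx, cy, r = obstacle                  # circular obstacle (center, radius)
    return (cx, cy, r + teb)              # augmented by the tracking error bound

def plan_step(p, goal, obstacles, step=0.1):
    d = goal - p
    cand = p + step * d / (np.linalg.norm(d) + 1e-9)   # greedy move toward goal
    for cx, cy, r in obstacles:
        if np.linalg.norm(cand - np.array([cx, cy])) < r:
            # blocked by a padded obstacle: sidestep perpendicular to the goal line
            cand = p + step * np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-9)
    return cand

def tracker_control(rel, umax=0.15):
    # stand-in for the precomputed optimal error-feedback lookup table
    return -np.clip(rel, -umax, umax)

p = np.zeros(2)                           # planner state
x = np.zeros(2)                           # tracker (autonomous system) state
goal, teb = np.array([2.0, 0.0]), 0.2
obstacles = [minkowski_pad((1.0, 0.0, 0.1), teb)]
for _ in range(200):
    p = plan_step(p, goal, obstacles)     # planner replans against padded obstacles
    x = x + tracker_control(x - p)        # tracker pursues the planner
    assert np.linalg.norm(x - p) <= teb   # never leaves the tracking error bound
```

Note the division of labor: the planner only ever sees padded obstacles, and the tracker only ever sees the relative state, so each component stays cheap enough to run in real time.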
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure6.gif" width="600" alt="gif6" />
<br />
<i>
Figure 6. MATLAB simulation of a 10D near-hover quadrotor model (blue line)
“pursuing” a 3D planning model (green dot) that is using RRT to plan. As new
obstacles are discovered (turning red), the RRT plans a new path towards the
goal. Based on the relative state between the planner and the autonomous system,
the optimal control can be found via look-up table. Even when the RRT planner
makes sudden turns, we are guaranteed to stay within the tracking error bound
(blue box).
</i>
</p>
<h1 id="reducing-conservativeness-through-meta-planning">Reducing Conservativeness through Meta-Planning</h1>
<p>One consequence of formulating the safe tracking problem as a pursuit-evasion
game between the planner and the tracker is that the resulting safe tracking
bound is often rather conservative. That is, the tracker can’t <em>guarantee</em> that
it will be close to the planner if the planner is always allowed to do the
<em>worst possible behavior</em>. One solution is to use multiple planning models, each
with its own tracking error bound, simultaneously at planning time. The
resulting “meta-plan” is composed of trajectory segments computed by each
planner, each labelled with the appropriate optimal controller to track
trajectories generated by that planner. This is illustrated in Fig. 7, where the
large blue error bound corresponds to a planner which is allowed to move very
quickly and the small red bound corresponds to a planner which moves more
slowly.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure7.png" width="500" alt="fig7" />
<br />
<i>
Figure 7. By considering two different planners, each with a different tracking
error bound, our algorithm is able to find a guaranteed safe “meta-plan” that
prefers the less precise but faster-moving blue planner but reverts to the more
precise but slower red planner in the vicinity of obstacles. This leads to
natural, intuitive behavior that optimally trades off planner conservatism with
vehicle maneuvering speed.
</i>
</p>
<h2 id="safe-switching">Safe Switching</h2>
<p>The key to making this work is to ensure that all transitions between planners
are safe. This can get a little complicated, but the main idea is that a
transition between two planners — call them A and B — is safe if we can
guarantee that the invariant set computed for A is contained within that for B.
For many pairs of planners this is true, e.g. switching from the blue bound to
the red bound in Fig. 7. But often it is not. In general, we need to solve a
dynamic game very similar to the original one in FaSTrack, but where we want to
know the set of states that we will never leave and from which we can guarantee
we end up inside B’s invariant set. Usually, the resulting <em>safe switching
bound</em> (SSB) is slightly larger than A’s tracking error bound (TEB), as shown
below.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure8_v2.png" width="500" alt="fig8" />
<br />
<i>
Figure 8. The safe switching bound for a transition between a planner with a
large tracking error bound to one with a small tracking error bound is generally
larger than the large tracking error bound, as shown.
</i>
</p>
<h2 id="efficient-online-meta-planning">Efficient Online Meta-Planning</h2>
<p>To do this efficiently in real time, we use a modified version of the classical
RRT algorithm. Usually, RRTs work by sampling points in state space and
connecting them with line segments to form a tree rooted at the start point. In
our case, we replace the line segments with the actual trajectories generated by
individual planners. In order to find the shortest route to the goal, we favor
planners that can move more quickly, trying them first and only resorting to
slower-moving planners if the faster ones fail.</p>
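The per-edge logic can be sketched as follows (a hypothetical simplification: both planners share the same straight-line connector here, and differ only in their tracking error bounds).

```python
import numpy as np

def straight_line(start, target, n=20):
    return [start + t * (target - start) for t in np.linspace(0, 1, n)]

def collision_free(q, teb, obstacles):
    # a point is safe if it clears every obstacle padded by the planner's TEB
    return all(np.linalg.norm(q - c) > r + teb for c, r in obstacles)

def meta_extend(start, target, planners, obstacles):
    # try fast (large-TEB) planners first; fall back to slower, more precise ones
    for name, teb in sorted(planners, key=lambda pl: -pl[1]):
        traj = straight_line(start, target)
        if all(collision_free(q, teb, obstacles) for q in traj):
            return name
    return None   # no planner can connect these states safely

obstacles = [(np.array([0.5, 0.12]), 0.05)]      # obstacle just off the path
planners = [("fast", 0.10), ("slow", 0.02)]      # (name, tracking error bound)
chosen = meta_extend(np.zeros(2), np.array([1.0, 0.0]), planners, obstacles)
```

In this configuration the fast planner's padded obstacle intersects the connecting segment, so the meta-planner falls back to the slow, small-bound planner, mirroring the behavior in Fig. 7.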
<p>We do have to be careful to ensure safe switching bounds are satisfied, however.
This is especially important in cases where the meta-planner decides to
transition to a more precise, slower-moving planner, as in the example above. In
such cases, we implement a one-step virtual backtracking algorithm in which we
make sure the preceding trajectory segment is collision-free using the switching
controller.</p>
<h1 id="implementation">Implementation</h1>
<p>We implemented both FaSTrack and Meta-Planning in C++ / ROS, using low-level
motion planners from the Open Motion Planning Library (OMPL). Simulated results
are shown below, with (right) and without (left) our optimal controller. As you
can see, simply using a linear feedback (LQR) controller (left) provides no
guarantees about staying inside the tracking error bound.</p>
<p style="text-align:center;">
<img src="http://people.eecs.berkeley.edu/~dfk/lqr_video.gif" height="220" style="margin: 5px;" alt="fig09" />
<img src="http://people.eecs.berkeley.edu/~dfk/opt_video.gif" height="220" style="margin: 5px;" alt="fig10" />
<br />
<i>
Figures 9 & 10. (Left) A standard LQR controller is unable to keep the quadrotor
within the tracking error bound. (Right) The optimal tracking controller keeps
the quadrotor within the tracking bound, even during radical changes in the
planned trajectory.
</i>
</p>
<p>It also works on hardware! We tested on the open-source Crazyflie 2.0 quadrotor
platform. As you can see in Fig. 12, we manage to stay inside the tracking bound
at all times, even when switching planners.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure11.png" height="250" style="margin: 5px;" alt="f11" />
<img src="http://bair.berkeley.edu/blog/assets/fastrack/Figure12.png" height="250" style="margin: 5px;" alt="f12" />
<br />
<i>
Figures 11 & 12. (Left) A Crazyflie 2.0 quadrotor being observed by an OptiTrack
motion capture system. (Right) Position traces from a hardware test of the meta
planning algorithm. As shown, the tracking system stays within the tracking
error bound at all times, even during the planner switch that occurs
approximately 4.5 seconds after the start.
</i>
</p>
<p>This post is based on the following papers:</p>
<ul>
<li>
<p><strong>FaSTrack: a Modular Framework for Fast and Guaranteed Safe Motion Planning</strong><br />
Sylvia Herbert*, Mo Chen*, SooJean Han, Somil Bansal, Jaime F. Fisac, and Claire J. Tomlin <br />
<a href="https://arxiv.org/abs/1703.07373">Paper</a>, <a href="http://sylviaherbert.com/fastrack/">Website</a></p>
</li>
<li>
<p><strong>Planning, Fast and Slow: A Framework for Adaptive Real-Time Safe Trajectory Planning</strong><br />
David Fridovich-Keil*, Sylvia Herbert*, Jaime F. Fisac*, Sampada Deglurkar, and Claire J. Tomlin<br />
<a href="https://arxiv.org/abs/1710.04731">Paper</a>, <a href="https://github.com/HJReachability">Github</a> (code to appear soon)</p>
</li>
</ul>
<p>We would like to thank our coauthors; developing FaSTrack has been a team effort
and we are incredibly fortunate to have a fantastic set of colleagues on this
project.</p>
Tue, 05 Dec 2017 01:00:00 -0800
http://bair.berkeley.edu/blog/2017/12/05/fastrack/

Model-based Reinforcement Learning with Neural Network Dynamics

<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_1a.png" height="240" style="margin: 10px;" alt="fig1a" />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_1b.gif" height="240" style="margin: 10px;" alt="fig1b" />
<br />
<i>
Fig 1. A learned neural network dynamics model enables a hexapod robot to learn
to run and follow desired trajectories, using just 17 minutes of real-world
experience.
</i>
</p>
<p>Enabling robots to act autonomously in the real-world is difficult. <a href="https://www.youtube.com/watch?v=g0TaYhjpOfo">Really,
really difficult</a>. Even with expensive robots and teams of world-class
researchers, robots still have difficulty autonomously navigating and
interacting in complex, unstructured environments.</p>
<p>Why are autonomous robots not out in the world among us? Engineering systems
that can cope with all the complexities of our world is hard. From nonlinear
dynamics and partial observability to unpredictable terrain and sensor
malfunctions, robots are particularly susceptible to Murphy’s law: everything
that can go wrong, will go wrong. Instead of fighting Murphy’s law by coding
each possible scenario that our robots may encounter, we could instead choose to
embrace this possibility for failure, and enable our robots to learn from it.
Learning control strategies from experience is advantageous because, unlike
hand-engineered controllers, learned controllers can adapt and improve with more
data. Therefore, when presented with a scenario in which everything does go
wrong, although the robot will still fail, the learned controller will hopefully
correct its mistake the next time it is presented with a similar scenario. In
order to deal with complexities of tasks in the real world, current
learning-based methods often use deep neural networks, which are powerful but
not data efficient: These trial-and-error based learners will most often still
fail a second time, and a third time, and often thousands to millions of times.
The sample inefficiency of modern deep reinforcement learning methods is one of
the main bottlenecks to leveraging learning-based methods in the real-world.</p>
<p>We have been investigating sample-efficient learning-based approaches with
neural networks for robot control. For complex and contact-rich simulated
robots, as well as real-world robots (Fig. 1), our approach is able to learn
locomotion skills of trajectory-following using only minutes of data collected
from the robot randomly acting in the environment. In this blog post, we’ll
provide an overview of our approach and results. More details can be found in
our research papers listed at the bottom of this post, including <a href="https://arxiv.org/pdf/1708.02596.pdf">this paper</a>
with <a href="https://github.com/nagaban2/nn_dynamics">code here</a>.</p>
<!--more-->
<h2 id="sample-efficiency-model-free-versus-model-based">Sample efficiency: model-free versus model-based</h2>
<p>Learning robotic skills from experience typically falls under the umbrella of
reinforcement learning. Reinforcement learning algorithms can generally be
divided into two categories: model-free methods, which learn a policy or value
function, and model-based methods, which learn a dynamics model. While model-free deep reinforcement
learning algorithms are capable of learning a wide range of robotic skills, they
typically suffer from <a href="http://www.nature.com/nature/journal/v518/n7540/pdf/nature14236.pdf">very</a> <a href="https://people.eecs.berkeley.edu/~pabbeel/papers/2015-ICML-TRPO.pdf">high</a> <a href="https://arxiv.org/pdf/1611.02247.pdf">sample</a> <a href="https://web.eecs.umich.edu/~baveja/Papers/ICML2016.pdf">complexity</a>, often
requiring millions of samples to achieve good performance, and can typically
only learn a single task at a time. Although some prior work has deployed these
model-free algorithms for <a href="https://arxiv.org/pdf/1610.00633.pdf">real-world manipulation tasks</a>, the high sample
complexity and inflexibility of these algorithms has hindered them from being
widely used to learn locomotion skills in the real world.</p>
<p>Model-based reinforcement learning algorithms are generally regarded as being
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.436.44&rep=rep1&type=pdf">more sample efficient</a>. However, to achieve good sample efficiency, these
model-based algorithms have conventionally used either relatively simple
<a href="http://papers.nips.cc/paper/5444-learning-neural-network-policies-with-guided-policy-search-under-unknown-dynamics.pdf">function</a> <a href="http://ieeexplore.ieee.org/document/6907424/">approximators</a>, which fail to generalize well to complex
tasks, or probabilistic dynamics models such as <a href="http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf">Gaussian</a> <a href="http://ieeexplore.ieee.org/document/7010608/">processes</a>,
which generalize well but have difficulty with complex and high-dimensional
domains, such as systems with frictional contacts that induce discontinuous
dynamics. Instead, we use medium-sized neural networks to serve as function
approximators that can achieve excellent sample efficiency, while still being
expressive enough for generalization and application to various complex and
high-dimensional locomotion tasks.</p>
<h2 id="neural-network-dynamics-for-model-based-deep-reinforcement-learning">Neural Network Dynamics for Model-Based Deep Reinforcement Learning</h2>
<p>In our work, we aim to extend the successes that deep neural network models have
seen in other domains into model-based reinforcement learning. Prior efforts to
combine neural networks with model-based RL in recent years have not achieved
the kinds of results that are competitive with simpler models, such as <a href="http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf">Gaussian
processes</a>. For example, <a href="https://arxiv.org/pdf/1603.00748.pdf">Gu et al.</a> observed that even linear models
achieved better performance for synthetic experience generation, while <a href="https://arxiv.org/pdf/1510.09142.pdf">Heess
et al.</a> saw relatively modest gains from including neural network models
into a model-free learning system. Our approach relies on a few crucial
decisions. First, we use the learned neural network model within a model
predictive control framework, in which the system can iteratively replan and
correct its mistakes. Second, we use a relatively short horizon look-ahead so
that we do not have to rely on the model to make very accurate predictions far
into the future. These two relatively simple design decisions enable our method
to perform a wide variety of locomotion tasks that have not previously been
demonstrated with general-purpose model-based reinforcement learning methods
that operate directly on raw state observations.</p>
<p>A diagram of our model-based reinforcement learning approach is shown in Fig. 2.
We maintain a dataset of trajectories that we iteratively add to, and we use
this dataset to train our dynamics model. The dataset is initialized with random
trajectories. We then perform reinforcement learning by alternating between
training a neural network dynamics model using the dataset, and using a model
predictive controller (MPC) with our learned dynamics model to gather additional
trajectories to aggregate onto the dataset. We discuss these two components
below.</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_2.png" width="600" alt="fig2" />
<br />
<i>
Fig 2. Overview of our model-based reinforcement learning algorithm.
</i>
</p>
<h3 id="dynamics-model">Dynamics Model</h3>
<p>We parameterize our learned dynamics function as a deep neural network,
parameterized by some weights that need to be learned. Our dynamics function
takes as input the current state $s_t$ and action $a_t$, and outputs the
predicted state difference $s_{t+1}-s_t$. The dynamics model itself can be
trained in a supervised learning setting, where collected training data comes in
pairs of inputs $(s_t,a_t)$ and corresponding output labels $s_{t+1}-s_t$.</p>
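The supervised setup can be shown concretely with a linear least-squares model standing in for the neural network (the ground-truth system and all numbers below are our own toy example; the paper's model is a neural network, but the input/label structure is identical):

```python
import numpy as np

rng = np.random.default_rng(0)
# ground-truth linear system, a stand-in for the robot being modeled
A_true = np.array([[0.9, 0.1],
                   [0.0, 0.95]])
B_true = np.array([[0.0],
                   [0.1]])

S = rng.normal(size=(500, 2))                 # states s_t
U = rng.normal(size=(500, 1))                 # actions a_t
S_next = S @ A_true.T + U @ B_true.T          # next states s_{t+1}

X = np.hstack([S, U])                         # inputs (s_t, a_t)
Y = S_next - S                                # labels: state differences
W, *_ = np.linalg.lstsq(X, Y, rcond=None)     # fit f(s, a) ~ s_{t+1} - s_t

def predict_next(s, a):
    return s + np.hstack([s, a]) @ W          # model outputs s_t + f(s_t, a_t)
```

Predicting the difference rather than the raw next state keeps the regression target small and well-conditioned when consecutive states are similar.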
<p>Note that the “state” that we refer to above can vary with the agent, and it can
include elements such as center of mass position, center of mass velocity, joint
positions, and other measurable quantities that we choose to include.</p>
<h3 id="controller">Controller</h3>
<p>In order to use the learned dynamics model to accomplish a task, we need to
define a reward function that encodes the task. For example, a standard “x_vel”
reward could encode a task of moving forward. For the task of trajectory
following, we formulate a reward function that incentivizes staying close to the
trajectory as well as making forward progress along the trajectory.</p>
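One possible form of such a reward, written as our own illustration rather than the paper's exact formula: penalize the lateral deviation from the desired path and reward the velocity component along it.

```python
import numpy as np

def traj_reward(pos, vel, path_dir, path_point, w_dist=1.0, w_fwd=1.0):
    """path_dir is a unit vector along the desired path through path_point."""
    offset = pos - path_point
    lateral = offset - np.dot(offset, path_dir) * path_dir   # off-path component
    return -w_dist * np.linalg.norm(lateral) + w_fwd * np.dot(vel, path_dir)

# path along +x: 0.5 lateral penalty, 1.0 forward-progress reward
r = traj_reward(np.array([1.0, 0.5]), np.array([1.0, 0.0]),
                np.array([1.0, 0.0]), np.zeros(2))
```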
<p>Using the learned dynamics model and task reward function, we formulate a
model-based controller. At each time step, the agent plans $H$ steps into the
future by randomly generating $K$ candidate action sequences, using the learned
dynamics model to predict the outcome of those action sequences, and selecting
the sequence corresponding to the highest cumulative reward (Fig. 3). We then
execute only the first action from the action sequence, and then repeat the
planning process at the next time step. This replanning makes the approach
robust to inaccuracies in the learned dynamics model.</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_3.png" width="500" alt="fig3" />
<br />
<i>
Fig 3. Illustration of the process of simulating multiple candidate action
sequences using the learned dynamics model, predicting their outcome, and
selecting the best one according to the reward function.
</i>
</p>
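The planning procedure above is a random-shooting method, sketched below under toy assumptions: `dyn` and `rew` are stand-ins for the learned dynamics model and the task reward, and the horizon, sample count, and action bounds are illustrative.

```python
import numpy as np

# Random-shooting MPC: sample K action sequences of length H, roll each out
# through the dynamics model, score with the reward, then execute only the
# first action of the best sequence before replanning.
def mpc_action(s, dynamics, reward, H=10, K=200, rng=None):
    rng = rng or np.random.default_rng(1)
    actions = rng.uniform(-1.0, 1.0, size=(K, H))
    returns = np.zeros(K)
    for k in range(K):
        sk = s
        for t in range(H):
            sk = dynamics(sk, actions[k, t])
            returns[k] += reward(sk)
    return actions[np.argmax(returns), 0]

# toy 1D double integrator; reward encourages forward velocity ("x_vel" style)
dyn = lambda s, a: np.array([s[0] + 0.1 * s[1], s[1] + 0.1 * a])
rew = lambda s: s[1]
a0 = mpc_action(np.zeros(2), dyn, rew)
```

Because only the first action is executed before replanning, errors in the learned model are corrected at every time step rather than compounding over the whole horizon.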
<h2 id="results">Results</h2>
<p>We first evaluated our approach on a variety of MuJoCo agents, including the
swimmer, half-cheetah, and ant. Fig. 4 shows that using our learned dynamics
model and MPC controller, the agents were able to follow paths defined by a set
of sparse waypoints. Furthermore, our approach used only <em>minutes</em> of random
data to train the learned dynamics model, showing its sample efficiency.</p>
<p>Note that with this method, we trained the model only once, but simply by
changing the reward function, we were able to apply the model at runtime to a
variety of different desired trajectories, without a need for separate
task-specific training.</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_4a.gif" height="140" style="margin: 6px;" alt="fig4a" />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_4b.gif" height="140" style="margin: 6px;" alt="fig4b" />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_4c.gif" height="140" style="margin: 6px;" alt="fig4c" /> <br />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_4d.gif" height="140" style="margin: 6px;" alt="fig4d" />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_4e.gif" height="140" style="margin: 6px;" alt="fig4e" />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_4f.gif" height="140" style="margin: 6px;" alt="fig4f" />
<br />
<i>
Fig 4: Trajectory following results with ant, swimmer, and half-cheetah. The
dynamics model used by each agent in order to perform these various trajectories
was trained just once, using only randomly collected training data.
</i>
</p>
<p>What aspects of our approach were important to achieve good performance? We
first looked at the effect of varying the MPC planning horizon $H$. Fig. 5 shows
that performance suffers if the horizon is too short, possibly due to
unrecoverable greedy behavior. For half-cheetah, performance also suffers if the
horizon is too long, due to inaccuracies in the learned dynamics model. Fig. 6
illustrates our learned dynamics model for a single 100-step prediction, showing
that open-loop predictions for certain state elements eventually diverge from
the ground truth. Therefore, an intermediate planning horizon is best to avoid
greedy behavior while minimizing the detrimental effects of an inaccurate model.</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_5.png" alt="fig5" />
<br />
<i>
Fig 5: Plot of task performance achieved by controllers using different horizon
values for planning. Too low of a horizon is not good, and neither is too high
of a horizon.
</i>
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_6.png" width="600" alt="fig6" />
<br />
<i>
Fig 6: A 100-step forward simulation (open-loop) of the dynamics model, showing
that open-loop predictions for certain state elements eventually diverge from
the ground truth.
</i>
</p>
<p>We also varied the number of initial random trajectories used to train the
dynamics model. Fig. 7 shows that although a higher amount of initial training
data leads to higher initial performance, data aggregation allows even low-data
initialization experiment runs to reach a high final performance level. This
highlights how on-policy data from reinforcement learning can improve sample
efficiency.</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_7.png" alt="fig7" />
<br />
<i>
Fig 7: Plot of task performance achieved by dynamics models that were trained
using differing amounts of initial random data.
</i>
</p>
<p>It is worth noting that the final performance of the model-based controller is
still substantially lower than that of a very good model-free learner (when the
model-free learner is trained with thousands of times more experience). This
suboptimal performance is sometimes referred to as “model bias,” and is a known
issue in model-based RL. To address this issue, we also proposed a hybrid
approach that combines model-based and model-free learning to eliminate the
asymptotic bias at convergence, though at the cost of additional experience.
This hybrid approach, as well as additional analyses, are available in our
paper.</p>
<h2 id="learning-to-run-in-the-real-world">Learning to run in the real world</h2>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_8.png" width="400" alt="fig8" />
<br />
<i>
Fig 8: The VelociRoACH is 10 cm in length, approximately 30 grams in weight, can
move up to 27 body-lengths per second, and uses two motors to control all six
legs.
</i>
</p>
<p>Since our model-based reinforcement learning algorithm can learn locomotion
gaits using orders of magnitude less experience than model-free algorithms, it
is possible to evaluate it directly on a real-world robotic platform. In other
work, we studied how this method can learn entirely from real-world experience,
acquiring locomotion gaits for a millirobot (Fig. 8) completely from scratch.</p>
<p>Millirobots are a promising robotic platform for many applications due to their
small size and low manufacturing costs. However, controlling these millirobots
is difficult due to their underactuation, power constraints, and size. While
hand-engineered controllers can sometimes control these millirobots, they often
have difficulties with dynamic maneuvers and complex terrains. We therefore
leveraged our model-based learning technique from above to enable the
VelociRoACH millirobot to do trajectory following. Fig. 9 shows that our
model-based controller can accurately follow trajectories at high speeds, after
having been trained using only 17 minutes of random data.</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_9a.gif" height="200" style="margin: 10px;" alt="fig9a" />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_9b.gif" height="200" style="margin: 10px;" alt="fig9b" /> <br />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_9c.gif" height="200" style="margin: 10px;" alt="fig9c" />
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/fig_9d.gif" height="200" style="margin: 10px;" alt="fig9d" />
<br />
<i>
Fig 9: The VelociRoACH following various desired trajectories, using our
model-based learning approach.
</i>
</p>
<p>To analyze the model’s generalization capabilities, we gathered data on both
carpet and styrofoam terrain, and we evaluated our approach as shown in Table 1.
As expected, the model-based controller performs best when executed on the same
terrain that it was trained on, indicating that the model incorporates knowledge
of the terrain. However, performance diminishes when the model is trained on
data gathered from both terrains, which likely indicates that more work is
needed to develop algorithms for learning models that are effective across
various task settings. Promisingly, Table 2 shows that performance increases as
more data is used to train the dynamics model, which is an encouraging
indication that our approach will continue to improve over time (unlike
hand-engineered solutions).</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/table_1.png" width="600" alt="table1" />
<br />
<i>
Table 1: Trajectory following costs incurred for models trained with different
types of data and for trajectories executed on different surfaces.
</i>
</p>
<p style="text-align:center;">
<img src="https://people.eecs.berkeley.edu/~nagaban2/misc/bair_blog_figs/table_2.png" width="600" alt="table2" />
<br />
<i>
Table 2: Trajectory following costs incurred during the use of dynamics
models trained with differing amounts of data.
</i>
</p>
<p>We hope that these results show the promise of model-based approaches for
sample-efficient robot learning and encourage future research in this area.</p>
<hr />
<p>We would like to thank Sergey Levine and Ronald Fearing for their feedback.</p>
<p>This post is based on the following papers:</p>
<ul>
<li>
<p><strong>Neural Network Dynamics Models for Control of Under-actuated Legged Millirobots</strong> <br />
A Nagabandi, G Yang, T Asmar, G Kahn, S Levine, R Fearing <br />
<a href="https://arxiv.org/abs/1711.05253">Paper</a></p>
</li>
<li>
<p><strong>Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning</strong> <br />
A Nagabandi, G Kahn, R Fearing, S Levine <br />
<a href="https://arxiv.org/abs/1708.02596">Paper</a>, <a href="https://sites.google.com/view/mbmf">Website</a>, <a href="https://github.com/nagaban2/nn_dynamics">Code</a></p>
</li>
</ul>
Thu, 30 Nov 2017 01:00:00 -0800
http://bair.berkeley.edu/blog/2017/11/30/model-based-rl/
The Emergence of a Fovea while Learning to Attend
<h2 id="why-we-need-attention">Why we need Attention</h2>
<p>What we see through our eyes is only a very small part of the world around us. At any given time our eyes are sampling only a fraction of the surrounding light field. Even within this fraction, most of the resolution is dedicated to the center of gaze which has the highest concentration of <em>ganglion cells</em>. These cells are responsible for conveying a retinal image from our eyes to our brain. Unlike a camera, the spatial distribution of ganglion cells is highly non-uniform. As a result, our brain receives a <em>foveated</em> image:</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/bee.png" width="500" />
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/butterfly.png" width="500" />
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
A foveated image with a center of gaze on the bee (left) and butterfly (right)
(<a href="https://en.wikipedia.org/wiki/Foveated_imaging">source</a>).
</i>
</p>
<p>Despite the fact that these cells cover only a fraction of our visual field, roughly 30% of our cortex is still dedicated to processing the signal that they provide. You can imagine our brain would have to be impractically large to handle the full visual field at high resolution. Suffice it to say, the amount of neural processing dedicated to vision is rather large and it would be beneficial to survival if it were used efficiently.</p>
<p><em>Attention</em> is a fundamental property of many intelligent systems. Since the resources of any physical system are limited, it is important to allocate them in an effective manner. Attention involves the dynamic allocation of information processing resources to best accomplish a specific task. In nature, we find this very apparent in the design of animal visual systems. By moving gaze rapidly within the scene, limited neural resources are effectively spread over the entire visual scene.</p>
<!--more-->
<h2 id="overt-attention">Overt Attention</h2>
<p>In this work, we study <em>overt</em> attention mechanisms which involve the explicit movement of the sensory organ. An example of this form of attention can be seen in the adolescent jumping spider:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/spider.gif" /><br />
<i>
An adolescent jumping spider using overt attention.
</i>
</p>
<p>We can see the spider is attending to different parts of its environment by making careful, deliberate movements of its body. If you peer through its translucent head, you can even see the spider moving its eye stalks, much as humans move their own eyes. These eye movements are called <em>saccades</em>.</p>
<p>In this work, we build a model visual system that must make saccades over a scene in order to find and recognize an object. This model allows us to study the properties of an attentional system by exploring the design parameters that optimize performance. One parameter of interest in visual neuroscience is the <em>retinal sampling lattice</em> which defines the relative positions of the array of ganglion cells in our eyes.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/translate.gif" width="500" />
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/nnmodel.png" width="500" />
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
(Left) Our model retinal sampling lattice attending to different parts of a simple scene. (Right) Our neural network model which controls the window of attention.
</i>
</p>
<h2 id="approximating-evolution-through-gradient-descent">Approximating Evolution through Gradient Descent</h2>
<p>Evolutionary pressure has presumably tuned the retinal sampling lattice in the primate retina to be optimal for visual search tasks faced by the animal. In lieu of simulating evolution, we utilize a more efficient <em>stochastic gradient descent</em> procedure for our in-silico model by constructing a fully differentiable dynamic model of attention.</p>
<p>Most neural networks are composed of learnable feature extractors which transform a fixed input to a more abstract representation such as a category. While the internal features (i.e. weight matrices and kernel filters) are learned during training, the geometry of the input remains fixed. We extend the deep learning framework to create learnable <em>structural features</em>. We learn the geometry of the neural sampling lattice in the retina.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/structure.png" /><br />
<i>
Structural features of one cell in the lattice.
</i>
</p>
<p>The retinal sampling lattice of our model is learned via backpropagation. Similar to the way weights are adjusted in a neural network, we adjust the parameters of the retinal tiling to optimize a loss function. We initialize the retinal sampling lattice to a regular square grid and update the parameterization of this layout using gradient descent.</p>
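One way to make such a lattice differentiable is to have each cell pool the image through a smooth Gaussian kernel whose center and width are the "structural" parameters; because the weights vary smoothly with both, gradients flow back to the lattice geometry. The sketch below is a simplified stand-in for the model's actual parameterization:

```python
import numpy as np

def gaussian_lattice_sample(image, centers, widths):
    """Each lattice cell takes a Gaussian-weighted average of image
    intensities around its (learnable) center; the output is smooth in
    both centers and widths, so both can be trained by backpropagation."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    outputs = []
    for (cy, cx), s in zip(centers, widths):
        weights = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * s ** 2))
        weights /= weights.sum()
        outputs.append((weights * image).sum())
    return np.array(outputs)

# Initialize to a regular 4x4 square grid, as in the post.
grid = np.array([(y, x) for y in np.linspace(4, 28, 4)
                        for x in np.linspace(4, 28, 4)])
widths = np.full(len(grid), 2.0)

# Toy 32x32 scene with a bright patch near the center.
image = np.zeros((32, 32))
image[14:18, 14:18] = 1.0
responses = gaussian_lattice_sample(image, grid, widths)
```

In the actual model this sampling is implemented inside an autodiff framework, so the same gradient descent that trains the network weights also moves and resizes the cells.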
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/learning.png" width="500" /><br />
<i>
Learning structural features from initialization using gradient descent.
</i>
</p>
<p>Over time, this layout will converge to a configuration which is locally optimal to minimize the task loss. In our case, the task is to classify an MNIST digit placed in a larger visual scene. Below we see how the retinal layout changes during training:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/by4.png" /><br />
<i>
Retinal sampling lattice during training at initialization, 1, 10, 100 epochs respectively.
</i>
</p>
<p>Surprisingly, the cells change in a very structured manner, smoothly transforming from a uniform grid to an eccentricity-dependent lattice. We notice a concentration of high-acuity cells appearing near the center of the sampling array. Furthermore, the cells spread their individual centers to create a sampling lattice which covers the full image.</p>
<h2 id="controlling-the-emergence-of-a-fovea">Controlling the Emergence of a Fovea</h2>
<p>Since our model is in-silico, we can endow our model with properties not found in nature to see what other layouts will emerge. For example, we can give our model the ability to zoom in and out of an image by rescaling the entire grid to cover a smaller or larger area:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/rescale.gif" width="300" /><br />
<i>
Retinal sampling lattice which also has the ability to rescale itself.
</i>
</p>
<p>We show the difference in the learned retinal layout below. For comparison, the left image is the retinal layout when our model does not have the ability to zoom while the right image is the layout learned when zooming is possible.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/translate_only.png" width="200" />
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/translate_and_zoom.png" width="200" />
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
(Left) Retinal lattice of a model only able to translate. (Right) Retinal lattice of a model able to translate and zoom.
</i>
</p>
<p>When our attention model is able to zoom, a very different layout emerges. Notice there is much less diversity in the retinal ganglion cells. The cells keep many of the properties they were initialized with.</p>
<p>To get a better idea of the utility of our learned retinal layout, we compared the performance of a retina with a fixed (unlearnable) lattice, a learnable lattice without zoom and a learnable lattice with zoom:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/comparison.png" width="500" /><br />
<i>
Performance on two versions (Dataset 1 and Dataset 2) of the Cluttered MNIST dataset. Dataset 2 contains randomly resized MNIST digits making it more difficult than Dataset 1.
</i>
</p>
<p>Perhaps not surprisingly, the learnable lattices (with or without zoom) significantly outperform a fixed lattice that can only translate. What is interesting is that a learnable lattice with only the ability to translate performs about as well as a model that can also zoom. This is further evidence that zooming and a foveal layout of the retinal lattice could serve the same functional purpose.</p>
<h2 id="interpretability-of-attention">Interpretability of Attention</h2>
<p>Earlier, we described the utility of attention in efficiently utilizing limited resources. Attention also provides insight into how the complex systems we build function internally. When our vision model attends over specific parts of an image during its processing, we get an idea of what the model deems relevant to perform a task. In our case, the model solves the recognition task by learning to place its fovea over the digit indicating its utility in classifying the digit. We also see the model in the bottom row utilizes its ability to zoom for the same purpose.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/fovea/3by4.png" width="600" /><br />
<i>
The attention movements our model takes unrolled in time. Model with fixed lattice (top), learnable lattice (center), learnable lattice with zoom ability (bottom).
</i>
</p>
<h2 id="conclusion">Conclusion</h2>
<p>Often we find loose inspiration from biology to motivate our machine learning models. The work by Hubel and Wiesel <sup id="fnref:3"><a href="#fn:3" class="footnote">1</a></sup> inspired the Neocognitron model <sup id="fnref:4"><a href="#fn:4" class="footnote">2</a></sup> which in turn inspired the Convolutional Neural Network <sup id="fnref:5"><a href="#fn:5" class="footnote">3</a></sup> as we know it today. In this work, we go in the other direction where we try to explain a physical feature we observe in biology using the computational models developed in deep learning<sup id="fnref:2"><a href="#fn:2" class="footnote">4</a></sup>. In the future, these results may lead us to think about new ways of designing the front end of active vision systems, modeled after the foveated sampling lattice of the primate retina. We hope this virtuous cycle of inspiration continues in the future.</p>
<p>If you want to learn more, check out our paper published in ICLR 2017:</p>
<p><a href="https://arxiv.org/abs/1611.09430">Emergence of foveal image sampling from learning to attend in visual scenes</a></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:3">
<p>Hubel, David H., and Torsten N. Wiesel. “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex.” The Journal of physiology 160.1 (1962): 106-154. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Fukushima, Kunihiko, and Sei Miyake. “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition.” Competition and cooperation in neural nets. Springer, Berlin, Heidelberg, 1982. 267-285. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>LeCun, Yann, et al. “Handwritten digit recognition with a back-propagation network.” Advances in neural information processing systems. 1990. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Gregor, Karol, et al. “DRAW: A Recurrent Neural Network For Image Generation.” International Conference on Machine Learning. 2015. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 09 Nov 2017 01:00:00 -0800
http://bair.berkeley.edu/blog/2017/11/09/learn-to-attend-fovea/
DART: Noise Injection for Robust Imitation Learning
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/bed_making_gif.gif" alt="Bed-Making GIF" width="600" /><br />
<i>
Toyota HSR Trained with DART to Make a Bed.
</i>
</p>
<p>In Imitation Learning (IL), also known as Learning from Demonstration (LfD), a
robot learns a control policy from analyzing demonstrations of the policy
performed by an algorithmic or human supervisor. For example, to teach a robot
to make a bed, a human would tele-operate the robot through the task to provide
examples. The robot then learns a control policy, a mapping from images/states to
actions which we hope will generalize to states that were not encountered during
training.</p>
<p>There are two variants of IL. In Off-Policy methods, also known as Behavior
Cloning, the demonstrations are collected independently of the robot’s policy;
however, when the robot encounters novel risky states it may not have learned
corrective actions. This occurs because of a known challenge called “covariate
shift,” where the states encountered during training differ from the states
encountered during testing, reducing robustness. Common approaches to reduce
covariate shift are On-Policy methods, such as DAgger, where the evolving
robot’s policy is executed and the supervisor provides corrective feedback.
However, On-Policy methods can be difficult for human supervisors, potentially
dangerous, and computationally expensive.</p>
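At its core, Behavior Cloning is ordinary supervised learning on (state, action) pairs drawn from the supervisor's own rollouts. A minimal sketch, with a linear least-squares fit standing in for the neural network and a toy linear expert standing in for the supervisor:

```python
import numpy as np

rng = np.random.default_rng(0)

def supervisor_policy(state):
    # Stand-in expert: a fixed linear controller.
    K = np.array([[0.5, -0.2], [0.1, 0.3]])
    return K @ state

# Off-Policy data collection: states come from the supervisor's own
# demonstrations, independent of the robot's (as yet untrained) policy.
states = rng.normal(size=(500, 2))
actions = np.array([supervisor_policy(s) for s in states])

# Fit the robot policy by least squares (a neural network in a real system).
theta, *_ = np.linalg.lstsq(states, actions, rcond=None)

def robot_policy(state):
    return theta.T @ state
```

The covariate-shift problem arises precisely because `states` here are sampled from the supervisor's distribution, not from the states the trained `robot_policy` will actually visit.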
<p>This post presents a robust Off-Policy algorithm called DART and summarizes how
injecting noise into the supervisor’s actions can improve robustness. The
injected noise allows the supervisor to provide corrective examples for the type
of errors the trained robot is likely to make. However, because the optimized
noise is small, it alleviates the difficulties of On-Policy methods. Details on
DART are in a paper that will be presented at <a href="http://www.robot-learning.org/">the 1st Conference on Robot Learning in
November</a>.</p>
<p>We evaluate DART in simulation with an algorithmic supervisor on MuJoCo tasks
(Walker, Humanoid, Hopper, Half-Cheetah) and physical experiments with human
supervisors training a Toyota HSR robot to perform grasping in clutter, where a
robot must search through clutter for a goal object. Finally, we show how
DART can be applied in a complex system that leverages both classical robotics
and learning techniques to teach the first robot to make a bed. For
researchers who want to study and use robust Off-Policy approaches, <strong>we
additionally announce the release of
<a href="https://berkeleyautomation.github.io/DART/">our codebase</a>
on GitHub</strong>.</p>
<!--more-->
<h1 id="imitation-learnings-compounding-errors">Imitation Learning’s Compounding Errors</h1>
<p>In the late 80s, Behavior Cloning was applied to teach cars how to drive, with a
project known as ALVINN (Autonomous Land Vehicle in a Neural Network). In
ALVINN, a neural network was trained on driving demonstrations and learned a
policy that mapped images of the road to the supervisor’s steering angle.
Unfortunately, after learning, the policy was unstable, as indicated in the
following video:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/alvinn.gif" alt="ALVINN." /><br />
<i>
ALVINN Suffering from Covariate Shift.
</i>
</p>
<p>The car would start drifting to the side of the road and not know how to
recover. The reason for the car’s instability was that no data had been
collected on the side of the road. During data collection the supervisor always
drove along the center of the road; if the robot began to drift from the
demonstrations, it would not know how to recover because it had seen no such
examples.</p>
<p>This example, along with many others that researchers have tried, shows that
Imitation Learning cannot be entirely solved with Behavior Cloning. In
traditional Supervised Learning, the training distribution is de-coupled from
the learned model, whereas in Imitation Learning, <em>the robot’s policy affects
what state is queried next</em>. Thus the training and testing distributions are no
longer equivalent, and this mismatch is known as
<strong><a href="http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/node8.html">covariate shift</a></strong>.</p>
<p>To reduce covariate shift, the objective of Imitation Learning had to be
modified. The robot should now be expected to match the supervisor on the states
it is likely to visit. Thus, if ALVINN is likely to drift to the side of the
road, we expect that it will know what to do in those states.</p>
<p>A robot’s policy and a supervisor’s policy can be denoted as $\pi_{\theta}$ and
<script type="math/tex">\pi_{\theta^*}</script>, where $\pi$ is a function mapping state to action and
$\theta$ is a parametrization, like weights in a neural network. We can
measure how close two policies are by what actions they apply at a given state,
which we refer to as the surrogate loss, $l$. A common surrogate loss is the
squared Euclidean distance:</p>
<script type="math/tex; mode=display">l(\pi_{\theta}(x), \pi_{\theta^*}(x)) = \|\pi_{\theta^*}(x)
-\pi_{\theta}(x)\|^2_2.</script>
<p>Finally, we need a distribution over trajectories $p(\xi|\theta)$, which
indicate the trajectories, $\xi$, that are likely under the current policy
$\pi_{\theta}$. Our objective can then be written as follows:</p>
<script type="math/tex; mode=display">\underset{\theta}{\mbox{min}}\; E_{p(\xi|\theta)} \underbrace{\sum^T_{t=1}
l(\pi_{\theta}(x_t), \pi_{\theta^*}(x_t)) }_{J(\theta,\theta^*|\xi)}.</script>
<p>Hence we want to minimize the expected surrogate loss on the distribution of
states induced by the robot’s policy. This objective is challenging to solve
because we don’t know what the robot’s policy is until after data has been
collected, which creates a <em>chicken and egg</em> situation. We will now discuss an
iterative On-Policy approach to overcome this problem.</p>
<h1 id="reducing-shift-with-on-policy-methods">Reducing Shift with On-Policy Methods</h1>
<p>A large body of work from Ross and Bagnell [6,7], has examined the theoretical
consequences of covariate shift. In particular, they proposed the DAgger
algorithm to help correct for it. DAgger can be thought of as an On-Policy
algorithm — which rolls out the current robot policy during learning.</p>
<p>The key idea of DAgger is to collect data from the current robot policy and
update the model on the aggregate dataset. Implementation of DAgger requires
iteratively rolling out the current robot policy, querying a supervisor for
feedback on the states visited by the robot, and then updating the robot on the
aggregate dataset across all iterations.</p>
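The loop just described can be sketched as follows, with a linear least-squares fit standing in for neural network training and a toy linear system standing in for the environment:

```python
import numpy as np

rng = np.random.default_rng(0)

def supervisor(state):
    return -0.5 * state  # stand-in expert: drive the state toward zero

def rollout(policy, T=20):
    """Roll out the current robot policy and record the visited states."""
    states, s = [], rng.normal(size=2)
    for _ in range(T):
        states.append(s)
        s = s + policy(s) + 0.01 * rng.normal(size=2)
    return states

def fit(states, actions):
    """Least-squares policy fit (a neural network in a real system)."""
    X, Y = np.array(states), np.array(actions)
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return lambda s: theta.T @ s

# DAgger: roll out the current robot policy, query the supervisor for
# labels on the visited states, retrain on the aggregate dataset.
dataset_s, dataset_a = [], []
robot = lambda s: np.zeros(2)  # untrained initial policy
for _ in range(5):
    visited = rollout(robot)
    dataset_s += visited
    dataset_a += [supervisor(s) for s in visited]
    robot = fit(dataset_s, dataset_a)
```

Note that labeling happens on states the *robot* visited, which is exactly what corrects covariate shift, and also exactly what makes the supervisor's job (retroactive labeling) difficult in practice.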
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/DAgger.png" alt="DAGGER." width="600" /><br />
<i>
The DAgger Algorithm.
</i>
</p>
<p>Two years ago, we used DAgger to teach a robot to perform grasping in clutter
(shown below), which requires a robot to search through objects via pushing to
reach a desired goal object. Imitation Learning was advantageous in this task
because we didn’t need to explicitly model the collision of multiple non-convex
objects.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/izzy.gif" alt="Mechanical Search 1." width="250" height="250" /><br />
<i>
Planar Grasping in Clutter.
</i>
</p>
<p>Our planar robot had a neural network policy that mapped images of the workspace
to a control signal. We trained it with DAgger on 160 expert demonstrations.
While we were able to teach the robot how to perform the task with a 90% success
rate, we encountered several major hurdles that made it challenging to increase
the complexity of the task.</p>
<h1 id="challenges-with-on-policy-methods">Challenges with On-Policy Methods</h1>
<p>After applying DAgger to teach our robot, we wanted to study and better
understand 3 key limitations related to On-Policy methods in order to scale up
to more challenging tasks.</p>
<h2 id="limitation-1-providing-feedback">Limitation 1: Providing Feedback</h2>
<p>In order to apply feedback to our robot, we had to do so retroactively with a
labeling interface, shown below.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/feedback.gif" alt="Mechanical Search 2." width="250" height="250" /><br />
<i>
Supervisor Providing Retroactive Feedback.
</i>
</p>
<p>A supervisor had to manually move the pink overlay to tell the robot what it
should have done after execution. When we tried to retrain the robot with
different supervisors, we found it was very challenging to provide this feedback
for most people. You can think of a human supervisor as a controller that needs
to constantly adjust their actions to obtain the desired effect. However, with
retroactive feedback the human must simulate what the action would be without
seeing the outcome, which is quite unnatural.</p>
<p>To test this hypothesis, we performed a human study with 10 participants to
compare DAgger against Behavior Cloning, where each participant was asked to
train a robot to perform planar part singulation. We found that Behavior Cloning
out-performed DAgger, suggesting that while DAgger mitigates the shift, in
practice it may add systematic noise to the supervisor’s signal [2].</p>
<h2 id="limitation-2-safety">Limitation 2: Safety</h2>
<p>On-Policy methods have the additional burden of needing to roll-out the current
robot’s policy during execution. While our robot was able to perform the task at
the end of training, for most of learning it wasn’t successful:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/policy_gif.gif" alt="Mechanical Search 3." width="250" height="250" /><br />
<i>
Robot Rolling Out Unsuccessful Policy.
</i>
</p>
<p>In unstructured environments, such as a self-driving car or home robotics, this
can be problematic. Ideally, we would like to collect data with the robot while
maintaining high performance throughout the entire process.</p>
<h2 id="limitation-3-computation">Limitation 3: Computation</h2>
<p>Finally, when building systems either in simulation or the real world, we want
to collect large amounts of data in parallel and update our policy sparingly.
Neural networks can require significant computation time for retraining.
However, On-Policy methods suffer when the policy is not updated frequently
during data collection. Training on a large batch size of new data can cause
significant changes to the current policy, which can push the robot’s
distribution away from the previously collected data and make the aggregate
dataset stale.</p>
<p>Variants of On-Policy methods have been proposed to solve each of these problems
individually. For example, Ho et al. eliminated the retroactive feedback by
proposing GAIL, which uses Reinforcement Learning to reduce covariate shift
[8]. Zhang et al. examined how to detect when the policy is about to deviate to
a risky state and ask the supervisor to take over [4]. Sun et al. have explored
incremental gradient updates to the model instead of a full retrain, which is
computationally cheaper [5].</p>
<p>While these methods can each solve some of these problems, ideally we want a
solution to address all three. Off-Policy algorithms like Behavior Cloning do
not exhibit these problems because they passively sample from the supervisor’s
policy. Thus, we decided instead of extending On-Policy methods it might be more
beneficial to make Off-Policy methods more robust.</p>
<h1 id="off-policy-with-noise-injection">Off-Policy with Noise Injection</h1>
<p>Off-Policy methods, like Behavior Cloning, can in fact have low covariate shift.
If the robot is able to learn the supervisor’s policy perfectly, then it should
visit the same states as the supervisor. In prior work we empirically found in
simulation that with sufficient data and expressive learners, such as deep
neural networks, Behavior Cloning is at parity with DAgger [2].</p>
<p>In real world domains, though, it is unlikely that a robot can perfectly match a
supervisor. Machine Learning algorithms generally have a long tail in terms of
sample complexity, so the amount of data and computation needed to perfectly
match a supervisor may be unreasonable. However, it is likely that we can
achieve small non-zero test error.</p>
<p>Instead of attempting to perfectly learn the supervisor, we propose simulating
small amounts of error in the supervisor’s policy to better mimic the trained
robot. Injecting noise into the supervisor’s policy during teleoperation is one
way to simulate this small test error during data collection. Noise injection
forces the supervisor to provide corrective examples to these small disturbances
as they try to perform the task. Shown below is the intuition of how noise
injection creates a funnel of corrective examples around the supervisor’s
distribution.</p>
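The data-collection step can be sketched as follows. The key detail is that the recorded training label is the supervisor's intended (noiseless) action at each state, while the *executed* action is the perturbed one, so the dataset contains corrective examples for small disturbances. The expert controller here is a toy stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def supervisor(state):
    return -0.5 * state  # stand-in expert controller

def collect_noisy_demo(Sigma, T=50):
    """Execute noise-injected supervisor actions, but label each visited
    state with the supervisor's noiseless action as the training target."""
    states, labels = [], []
    s = rng.normal(size=2)
    for _ in range(T):
        intended = supervisor(s)
        executed = rng.multivariate_normal(intended, Sigma)
        states.append(s)
        labels.append(intended)  # corrective label, not the noisy action
        s = s + executed
    return np.array(states), np.array(labels)

states, labels = collect_noisy_demo(Sigma=0.05 * np.eye(2))
```

The robot is then trained on `(states, labels)` exactly as in Behavior Cloning; only the state distribution has changed.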
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/dart_intuition.png" alt="DART intuition." width="300" />
<br />
<i>
Noise Injection forces the supervisor to provide corrective examples,<br />
so that the robot can learn to recover.
</i>
</p>
<p>Additionally, because we are only injecting small noise levels, we don’t suffer
as many limitations compared to On-Policy methods. A supervisor can normally be
robust to small random disturbances that are concentrated around their current
action. We will now formalize noise injection a bit to help understand its
effect more.</p>
<p>Denote by <script type="math/tex">p(\xi|\pi_{\theta^*},\psi)</script> a distribution over trajectories with
noise injected into the supervisor’s distribution
<script type="math/tex">\pi_{\theta^*}(\mathbf{u}|\mathbf{x},\psi)</script>. The parameter $\psi$ represents
the sufficient statistics that define the noise distribution. For example, if
Gaussian noise is injected parameterized by $\psi$, then
<script type="math/tex">\pi_{\theta^*}(\mathbf{u}|\mathbf{x},\psi) =
\mathcal{N}(\pi_{\theta^*}(\mathbf{x}), \Sigma)</script>. Note that the stochastic
supervisor’s distribution is a slight abuse of notation:
<script type="math/tex">\pi_{\theta^*}(\mathbf{u}|\mathbf{x},\psi)</script> is a distribution over actions,
whereas <script type="math/tex">\pi_{\theta^*}(\mathbf{x})</script> is a deterministic function mapping to a
single action.</p>
<p>Similar to Behavior Cloning, we can sample demonstrations from the
noise-injected supervisor and minimize the expected loss via standard supervised
learning techniques:</p>
<script type="math/tex; mode=display">\theta^R = \underset{\theta}{\mbox{argmin }} E_{p(\xi|\pi_{\theta^*},\psi)}
J(\theta,\theta^* | \xi)</script>
<p>This equation, though, does not explicitly minimize the covariate shift for
arbitrary choices of $\psi$; the $\psi$ needs to be chosen to best simulate the
error of the final robot’s policy, which may be complex for high dimensional
action spaces. One approach to choose $\psi$ is grid-search, but this requires
expensive data collection, which can be prohibitive in the physical world or in
high fidelity simulation.</p>
<p>Instead of grid-search, we can formulate the selection of $\psi$ as a maximum
likelihood problem. The objective is to increase the probability of the
supervisor applying the robot’s control.</p>
<script type="math/tex; mode=display">\underset{\psi}{\mbox{min}} \: E_{p(\xi|\pi_{\theta^R})} -\sum^{T-1}_{t=0} \:
\mbox{log} [\pi_{\theta^*}(\pi_{\theta^R}(\mathbf{x_t})|\mathbf{x_t},\psi)]</script>
<p>This objective states that we want the noise-injected supervisor to try to
match the final robot’s policy. In the paper, we show that this explicitly
minimizes the distance between the supervisor and robot’s distribution. A clear
limitation of this optimization problem though is that it requires knowing the
final robot’s distribution $p(\xi|\pi_{\theta^R})$, which is determined only
after the data is collected. In the next section, we present DART, which
applies an iterative approach to the optimization.</p>
<h2 id="dart-disturbances-for-augmenting-robot-trajectories">DART: Disturbances for Augmenting Robot Trajectories</h2>
<p>The above objective cannot be solved because $p(\xi|\pi_{\theta^R})$ is not
known until after the robot has been trained. We can instead iteratively sample
from the supervisor’s distribution with the current noise parameter, $\psi_k$,
and minimize the negative log-likelihood of the noise-injected supervisor taking
the current robot’s control, $\pi_{\hat{\theta}}$.</p>
<script type="math/tex; mode=display">\hat{\psi}_{k+1} = \underset{\psi}{\mbox{argmin}} \: E_{p(\xi|\pi_{\theta^*},
\psi_k)} -\sum^{T-1}_{t=0}\mbox{log} \:
[\pi_{\theta^*}(\pi_{\hat{\theta}}(\mathbf{x_t})|\mathbf{x_t},\psi)]</script>
<p>The above iterative process can be slow to converge because it is optimizing the
noise with respect to the current robot’s policy. We can obtain a better
estimate by observing that the supervisor should simulate as much expected error
as the final robot policy, <script type="math/tex">E_{p(\xi|\pi_{\theta^R})}
J(\theta^R,\theta^*|\xi)</script>. It is possible that we have some knowledge of this
quantity from previously training on similar domains. In the paper, we show how
to incorporate this knowledge in the form of a prior. For some common noise
distributions, the objective can be solved in closed form, as detailed in the
paper. Thus, the optimization problem determines the shape of the noise injected
and the prior helps determine the magnitude.</p>
<p>Our algorithm, DART, iteratively solves this optimization problem to best set the
noise term. Like On-Policy methods, DART is still an iterative algorithm.
<em>Through the iterative process, DART optimizes $\psi$ to better simulate the
error in the final robot’s policy.</em></p>
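As a concrete illustration, for Gaussian noise the update can be sketched in a few lines of NumPy: maximizing the likelihood that the noise-injected supervisor takes the robot's control reduces to fitting the empirical covariance of the control differences on the supervisor's states. The function and parameter names below are our own, and the optional magnitude prior is a simplification of the closed form in the paper.

```python
import numpy as np

def dart_gaussian_update(sup_states, sup_actions, robot_policy, target_error=None):
    """One DART noise update for Gaussian noise (hedged sketch).

    Fits the covariance of the injected noise to the empirical covariance of
    the differences between the robot's and supervisor's controls, evaluated
    on the supervisor's states.
    """
    diffs = np.array([robot_policy(s) - a
                      for s, a in zip(sup_states, sup_actions)])
    cov = diffs.T @ diffs / len(diffs)       # shape of the optimized noise
    if target_error is not None:             # prior on the final robot's error
        cov *= target_error / max(np.trace(cov), 1e-8)  # sets the magnitude
    return cov
```

At iteration $k$, the supervisor's next demonstrations would then be collected with controls $\pi_{\theta^*}(\mathbf{x}) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \hat{\Sigma}_{k+1})$.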
<h1 id="evaluating-dart">Evaluating DART</h1>
<p>To understand how effectively DART reduces covariate shift and to determine if
it suffers from similar limitations as On-Policy methods, we ran experiments in
four MuJoCo domains, as shown below. The supervisor was a policy trained with TRPO
and the noise we injected was Gaussian.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/ni_a_results-eps-converted-to.png" alt="DART results." /><br />
</p>
<p>To test whether DART suffers when the policy is updated only after large
batches, we updated the model after every $K$ demonstrations for all
experiments. DAgger was updated after every demonstration and DAgger-B after
every $K$. The results show that DART matches the performance of DAgger while
being significantly faster in computation. DAgger-B is similar to DART in
computation time but suffers significantly in performance, suggesting that
DART reduces computation time without sacrificing performance.</p>
<p>We finally compared DART to Behavior Cloning in a human study for the task of
grasping in clutter, shown below. In the task, a Toyota HSR robot was trained to
reach a goal object by pushing objects away with its gripper.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/hsr.gif" alt="DART results." /><br />
<i>
Toyota HSR Trained with DART for Grasping in Clutter.
</i>
</p>
<p>The task is more complex than the one above because the robot now sees images of
the world taken from an eye-in-hand camera. We compared the methods with 4
human subjects and saw that by injecting noise in the controller, we achieved a
62% win rate over Behavior Cloning. DART was able to reduce the shift on the
task with human supervisors.</p>
<h1 id="robotic-bed-making-a-testbed-for-covariate-shift">Robotic Bed Making: A Testbed for Covariate Shift</h1>
<p>To better understand how errors compound in real world robotic systems, we built
a literal test bed. Robotic Bed Making has been a challenging task in robotics
because it requires mobile manipulation of deformable objects and sequential
planning. Imitation Learning is one way to sidestep some of the challenges of
deformable object manipulation because it doesn’t require modeling the bed
sheets.</p>
<p>The goal of our bed making system was to have a robot learn to stretch the
sheets over the bed frame. The task was designed so that the robot must learn
one policy to decide where to grasp the bed sheet and another transition policy
to decide whether the robot should try again or switch to the other bed side.
We trained the bed making policy with 50 demonstrations.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/bed_system.png" alt="DART results, bed-making." /><br />
<i>
Bed Making System.
</i>
</p>
<p>DART was applied to inject Gaussian noise into the grasping policy because we
assumed there would be considerable error in determining where to grasp. The
optimized covariance matrix decided to inject more noise in the horizontal
direction of the bed, because that is where the edge of the sheet varied more
significantly and subsequently the robot had higher error.</p>
<p>In order to test how large covariate shift was in the system, we can take our
trained policy $\pi_{\theta^R}$ and write its performance with the following
decomposition.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
E_{p(\xi |\pi_{\theta^R})} J(\theta^R,\theta^*|\xi) &= \underbrace{E_{p(\xi |\pi_{\theta^R})} \sum^T_{t=1} l(\pi_{\theta^R}(x_t), \pi_{\theta^*}(x_t)) - E_{p(\xi |\pi_{\theta^*}, \psi)} \sum^T_{t=1} l(\pi_{\theta^R}(x_t), \pi_{\theta^*}(x_t))}_{\text{Shift}} \\
&+ \underbrace{E_{p(\xi |\pi_{\theta^*},\psi)} \sum^T_{t=1} l(\pi_{\theta^R}(x_t), \pi_{\theta^*}(x_t)) }_{\text{Loss}},
\end{align} %]]></script>
<p>where the first term on the right-hand side corresponds to the covariate shift.
Intuitively, the covariate shift is the difference between the expected error on
the robot’s distribution and the supervisor’s distribution. When we measured
these quantities on the bed making setup, we observed noticeable covariate shift
in the transition policy trained with Behavior Cloning.</p>
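Given rollouts from both distributions, the shift term of the decomposition is straightforward to estimate; a minimal sketch (function and variable names are ours, and the surrogate loss is an arbitrary placeholder):

```python
import numpy as np

def estimate_shift(robot_rollouts, sup_rollouts, robot_policy, sup_policy,
                   loss=lambda a, b: float(np.sum((a - b) ** 2))):
    """Estimate the covariate-shift term: the surrogate loss on the robot's
    own state distribution minus the loss on the (noise-injected)
    supervisor's distribution. Each rollout is a sequence of states."""
    def avg_loss(rollouts):
        return np.mean([sum(loss(robot_policy(x), sup_policy(x)) for x in xi)
                        for xi in rollouts])
    return avg_loss(robot_rollouts) - avg_loss(sup_rollouts)
```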
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/dart/cs_graph.png" alt="DART results, covariate shift." width="500" /><br />
<i>
Covariate Shift in Bed Making Task.
</i>
</p>
<p>We attribute this covariate shift to the fact that with Behavior Cloning the
robot rarely saw unsuccessful demonstrations; thus the transition policy never
knew what failure was. DART gave a more diverse set of states, which allowed the
policy to have better class balance. DART was able to train a robust policy that
allowed it to perform the bed making task even when novel objects were placed on
the bed, as shown at the beginning of the blog post. When distractor objects are
placed on the bed DART obtained a 97% sheet coverage, whereas Behavior Cloning
achieved only 63%.</p>
<p>These initial results suggest that covariate shift can occur in modern day
systems that use learning components. We will soon release a longer preprint on
the Bed Making Setup for more information.</p>
<p>DART presents a way to correct for shift via the injection of small optimized
noise. Going forward, we are considering more complex noise models that better
capture the temporal structure of the robot’s error.</p>
<p>(For papers and updated information, <a href="http://autolab.berkeley.edu">see UC Berkeley’s AUTOLAB website</a>.)</p>
<h2 id="references">References</h2>
<ol>
<li>
<p>Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, Ken Goldberg. DART:
Noise Injection for Robust Imitation Learning. Proceedings of the 1st Annual
Conference on Robot Learning, PMLR 78:143-156, 2017.</p>
</li>
<li>
<p>M. Laskey, C. Chuck, J. Lee, J. Mahler, S. Krishnan, K. Jamieson, A. Dragan,
and K. Goldberg. Comparing human-centric and robot-centric sampling for robot
deep learning from demonstrations. Robotics and Automation (ICRA), 2017 IEEE
International Conference on, pages 358-365. IEEE, 2017</p>
</li>
<li>
<p>M. Laskey, J. Lee, C. Chuck, D. Gealy, W. Hsieh, F. T. Pokorny, A. D. Dragan,
and K. Goldberg. Robot grasping in clutter: Using a hierarchy of supervisors for
learning from demonstrations. In Automation Science and Engineering (CASE), 2016
IEEE International Conference on, pages 827–834. IEEE, 2016.</p>
</li>
<li>
<p>Zhang, Jiakai, and Kyunghyun Cho. “Query-Efficient Imitation Learning for
End-to-End Simulated Driving.” In AAAI, pp. 2891-2897. 2017.</p>
</li>
<li>
<p>W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell. Deeply
aggrevated: Differentiable imitation learning for sequential prediction.
Proceedings of the 34th International Conference on Machine Learning, PMLR
70:3309-3318, 2017.</p>
</li>
<li>
<p>Ross, Stéphane, Geoffrey J. Gordon, and Drew Bagnell. “A reduction of
imitation learning and structured prediction to no-regret online learning.”
International Conference on Artificial Intelligence and Statistics. 2011.</p>
</li>
<li>
<p>S. Ross and D. Bagnell. Efficient reductions for imitation learning. In
International Conference on Artificial Intelligence and Statistics, pages
661–668, 2010.</p>
</li>
<li>
<p>Ho, Jonathan, and Stefano Ermon. “Generative adversarial imitation learning.”
Advances in Neural Information Processing Systems. 2016.</p>
</li>
</ol>
Thu, 26 Oct 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/10/26/dart/
Learning Long Duration Sequential Task Structure From Demonstrations with Application in Surgical Robotics<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/cutting-gif.gif" height="180" style="margin: 10px;" />
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/binpicking-gif.gif" height="180" style="margin: 10px;" />
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/debridement-gif.gif" height="180" style="margin: 10px;" />
<br />
</p>
<p>Deep imitation learning and deep reinforcement learning have the potential to learn
robot control policies that map high-dimensional sensor inputs to controls.
While these approaches have been very successful at learning short duration tasks, such
as grasping (Pinto and Gupta 2016, Levine et al. 2016) and peg insertion (Levine
et al. 2016), scaling learning to longer time horizons can require a prohibitive
amount of demonstration data—whether acquired from experts or self-supervised.
Long-duration sequential tasks suffer from the classic problem of “temporal
credit assignment”, namely, the difficulty in assigning credit (or blame) to
actions under uncertainty of the time when their consequences are observed
(Sutton 1984). However, long-term behaviors are often composed of short-term
skills that solve decoupled subtasks. Consider designing a controller for
parallel parking where the overall task can be decomposed into three phases:
pulling up, reversing, and adjusting. Similarly, assembly tasks can often be
decomposed into individual steps based on which parts need to be manipulated.
These short-term skills can be parametrized more concisely—as an analogy,
consider locally linear approximations to an overall nonlinear function—and
this reduced parametrization can be substantially easier to learn.</p>
<p>This post summarizes results from three recent papers that propose algorithms
that learn to decompose a longer task into shorter subtasks. We report
experiments in the context of autonomous surgical subtasks and we believe the
results apply to a variety of applications from manufacturing to home robotics.
We present three algorithms: Transition State Clustering (TSC), Sequential
Windowed Inverse Reinforcement Learning (SWIRL), and Deep Discovery of
Continuous Options (DDCO). TSC considers robustly learning important switching
events (significant changes in motion) that occur across all demonstrations.
SWIRL proposes an algorithm that approximates a value function by a sequence of
shorter term quadratic rewards. DDCO is a general framework for imitation
learning with a hierarchical representation of the action space. In retrospect,
all three algorithms are special cases of the same general framework, where the
demonstrator’s behavior is generatively modeled as a sequential composition of
unknown closed-loop policies that switch when reaching parameterized “transition
states”.</p>
<!--more-->
<h1 id="application-to-surgical-robotics">Application to Surgical Robotics</h1>
<p>Robots such as Intuitive Surgical’s da Vinci have facilitated millions of
surgical procedures worldwide using local teleoperation. Automation of surgical
sub-tasks has the potential to reduce surgeon tedium and fatigue, operating
time, and enable supervised tele-surgery over higher-latency networks. Designing
surgical robot controllers is particularly difficult due to a limited field of
view and imprecise actuation.</p>
<p>As a concrete task, pattern cutting is one of the Fundamentals of Laparoscopic
Surgery, a training suite required of surgical residents. In this standard
surgical training task, the surgeon must cut and remove a pattern printed on a
sheet of gauze, and is scored on time and accuracy:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/pattern-cutting-task.png" alt="Figure 1: Pattern Cutting Task, from the Fundamentals of Laparoscopic Surgery." /><br />
<i>
Pattern cutting task from the Fundamentals of Laparoscopic Surgery.
</i>
</p>
<p>In (Murali et al. 2015), we manually coded this task using a hand-crafted Deterministic
Finite Automaton (DFA) on the da Vinci surgical robot. The DFA integrated 10 different
manipulation primitives and two computer vision based checks:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/dfa-pattern-cutting.png" alt="Figure 2: DFA for Pattern Cutting." width="600" /><br />
<i>
Deterministic finite automaton from Murali et al. 2016 to automate pattern
cutting.
</i>
</p>
<p>Designing this DFA required painstaking trial-and-error, and perceptual checks
required constant tuning to account for lighting and registration changes. This
motivated us to consider the extent to which we could learn such structure from
demonstration data. This blog post describes our efforts over the last three
years at learning hierarchical representations from demonstrations. This
research has helped us automate several surgical robotic tasks with minimal
expert design of the DFA, as shown in the three GIFs at the top of the post.</p>
<h1 id="learning-transition-conditions">Learning Transition Conditions</h1>
<p>The first paper, Transition State Clustering (Krishnan et al. 2015), explores
the problem of learning transition conditions from demonstrations, i.e.,
conditions that trigger a switch or a transition between manipulation behaviors
in a task. In many important tasks, while the actual motions may vary and be
noisy, each demonstration contains roughly the same sequence of primitive
motions. This consistent, repeated structure can be exploited to infer global
transition criteria by identifying state-space conditions correlated with
significant changes in motion. By assuming a known sequential order of
primitives, the problem reduces to segmenting each trajectory and corresponding
those segments across trajectories. This involves finding a common set of
segment-to-segment transition events.</p>
<p>We formalized this intuition in an algorithm called Transition State Clustering
(TSC). Let <script type="math/tex">D=\{d_i\}</script> be a set of demonstrations of a robotic task. Each
demonstration of a task $d$ is a discrete-time sequence of $T$ state vectors in
a feature-space $\mathcal{X}$. The feature space is a concatenation of kinematic
features $X$ (e.g., robot position) and sensory features $V$. These were
low-dimensional visual features from the environment calculated by hard-coded
image processing and manual annotation.</p>
<p>A segmentation of a task is defined as a function $\mathcal{S}$ that maps each
trajectory to a non-decreasing sequence of integers in $\{1,2,\ldots,k\}$. This
function tells us more than just the endpoints of segments, since it also labels
each segment according to its sub-task. By contrast, a transition indicator
function $\mathcal{T}$ maps each demonstration $d$ to a sequence of
indicators in $\{0,1\}$:</p>
<script type="math/tex; mode=display">\mathcal{T}: d \mapsto ( a_t )_{t=1,\ldots,|d|}, \quad a_t \in \{0,1\},</script>
<p>such that <script type="math/tex">\mathcal{T}(d)_t</script> indicates whether the demonstration switched from
one sub-task to another after time $t$. For a demonstration $d_i$, let $o_{i,t}$
denote the kinematic state, visual state, and time $(x,v,t)$ at time $t$.
Transition States are the set of state-time tuples where the indicator is 1:</p>
<script type="math/tex; mode=display">\Gamma = \bigcup_{i=1}^N ~\{o_{i,t} \in d_i ~: \mathcal{T}(d_i)_t = 1\}.</script>
<p>In TSC, we model the probability distribution that generates $\Gamma$ as a
Gaussian Mixture Model and identify the mixture components. These components
identify regions of the state space correlated with candidate transitions. We
can take any motion-based model for detecting changes in behavior and generate
candidates. Then, we probabilistically ground these candidate transitions in
state-space and perceptual conditions that are consistent across demonstrations.
Intuitively, this algorithm consists of two steps: first segmentation, and then
clustering the segment end-points.</p>
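A stripped-down version of these two steps might look as follows. For brevity we substitute a fixed velocity-change threshold for the motion-based candidate generation and plain k-means for the DP-GMM (both substitutions are ours, not the paper's method):

```python
import numpy as np

def transition_candidates(traj, threshold=0.5):
    """Step 1 (sketch): flag times where the motion changes significantly.
    Here a candidate is any step whose velocity change exceeds `threshold`;
    TSC instead fits a probabilistic motion model."""
    vel = np.diff(traj, axis=0)
    accel = np.linalg.norm(np.diff(vel, axis=0), axis=1)
    return [t + 1 for t, a in enumerate(accel) if a > threshold]

def cluster_transition_states(states, k, iters=50, seed=0):
    """Step 2 (sketch): group candidate transition states across
    demonstrations, here with plain k-means instead of a DP-GMM."""
    X = np.asarray(states, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Clusters that do not contain candidates from every demonstration would then be pruned, as described below.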
<p>There are a number of important implementation details to make this model work
in practice on real noisy data. Since the kinematic and visual features often
have very different scales and topological properties, we often have to model
them separately during the clustering step. We hierarchically apply a GMM model
by first performing a hard clustering on the kinematic features, and then within
each cluster fitting the probabilistic model over the perceptual features. This
allows us to prune out clusters that are not representative (i.e., do not have
transitions from all demonstrations). Furthermore, hyper-parameter selection is
a known problem in mixture models. Recent results in Bayesian statistics can
mitigate some of these problems by defining a soft prior of the number of
mixtures. The Dirichlet Process (DP) defines a distribution over the parameters
of discrete distributions, in our case, the probabilities of a categorical
distribution, as well as the size of its support $m$ (Kulis 2011). The
hyper-parameters of the DP can be inferred with variational Expectation
Maximization.</p>
<p>In the pattern cutting task, TSC found the following transition conditions:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/pattern-cutting-concept.png" height="175" style="margin: 30px;" alt="Figure 3: Conceptual diagram of pattern cutting." />
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/pattern-cutting.png" height="200" alt="Figure 3: Conceptual diagram of pattern cutting." /><br />
<i>
Surgical pattern cutting task. Left: manually identified transitions. Right:
automatically discovered transition states (a) and transition state clusters
(b).
</i>
</p>
<p>We marked 6 manually identified primitive motions from (Murali et al. 2015): (1)
start, (2) notch, (3) finish 1st cut, (4) cross-over, (5) finish 2nd cut, and
(6) connect the two cuts. TSC automatically identifies 7 segments, which
correspond well to our prior work. It is worth noting that there is one extra
cluster (marked 2’), that does not correspond to a transition in the manual
segmentation.</p>
<p>At 2’, the operator finishes a notch and begins to cut. While at a logical
level, notching and cutting are both penetration actions, they correspond to two
different motion regimes due to the positioning of the end-effector. TSC
separates them into different clusters even though the human annotators
overlooked this important transition.</p>
<h1 id="connection-to-inverse-reinforcement-learning">Connection to Inverse Reinforcement Learning</h1>
<p>We next explored how the transitions learned by TSC can be used to shape rewards
in long horizon tasks. Sequential Windowed Inverse Reinforcement Learning
(Krishnan et al. 2016), models a task as a sequence of quadratic reward
functions</p>
<script type="math/tex; mode=display">\mathbf{R}_{seq} = [R_1, \ldots ,R_k ]</script>
<p>and transition regions</p>
<script type="math/tex; mode=display">G = [ \rho_1, \ldots,\rho_k ]</script>
<p>such that $R_1$ is the reward function until $\rho_1$ is reached, after which
$R_2$ becomes the reward and so on.</p>
<p>We assume that we have access to a supervisor that provides demonstrations which
are optimal w.r.t an unknown reward function $\mathbf{R}^*$ (not necessarily
quadratic), and which reach each $\rho \in G$ (also unknown) in the same order.
SWIRL is an algorithm to recover <script type="math/tex">\mathbf{R}_{seq}</script> and $G$ from the
demonstration trajectories. SWIRL applies to tasks with a discrete or continuous
state-space and a discrete action-space. The state space can represent spatial,
kinematic, or sensory states (e.g., visual features), as long as the
trajectories are smooth and not very high-dimensional. Finally,
$\mathbf{R}_{seq}$ and $G$ can be used in an RL algorithm to find an optimal
policy for a task.</p>
<p>TSC can be interpreted as inferring the subtask transition regions $G$. Once the
transitions are found, SWIRL applies Maximum Entropy Inverse Reinforcement
Learning to find a local quadratic reward function that guides the robot to the
transition condition. Segmentation further simplifies the estimation of dynamics
models, which are required for inference in MaxEnt-IRL, since many complex
systems can be locally approximated linearly in a short time horizon. The goal
of MaxEnt-IRL is to find a reward function such that an optimal policy w.r.t
that reward function is close to the expert demonstration. The agent is modeled
as noisily optimal, where it takes actions from a policy $\pi$:
<script type="math/tex; mode=display">\pi(a \mid s, \theta) \propto \exp\{A_\theta(s,a)\}.</script>
<p>$A_\theta$ is the advantage function (gap between the values of action $a$ and
of the optimal action in state $s$) for the reward parametrized by $\theta$.
The objective is to maximize the log-likelihood that the demonstration
trajectories were generated by $\theta$. In MaxEnt-IRL, this objective can be
estimated reliably in two cases, discrete and linear-Gaussian systems, since it
requires an efficient forward search of the policy given a particular reward
parametrized by $\theta$. Thus, we assume that our demonstrations can be modeled
either discretely or with linear dynamics.</p>
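For the discrete case, the noisily-optimal policy can be computed by soft ("log-sum-exp") value iteration; a small tabular sketch, with shapes and the toy MDP of our own choosing:

```python
import numpy as np

def maxent_policy(P, R, gamma=0.95, iters=300):
    """Soft-optimal policy pi(a|s) proportional to exp(A(s,a)) for a
    discrete MDP. P[a, s, s'] are transition probabilities and R[s, a]
    is the reward under the current theta."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R.T + gamma * (P @ V)                    # Q[a, s]
        m = Q.max(axis=0)
        V = m + np.log(np.exp(Q - m).sum(axis=0))    # soft value backup
    return np.exp(Q - V).T                           # pi[s, a], rows sum to 1
```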
<p>Learning a policy from $\mathbf{R}_{seq}$ and $G$ is nontrivial because solving
$k$ independent problems neglects any shared structure in the value function
during the policy learning phase (e.g., a common failure state). Jointly
learning over all segments introduces a dependence on history, namely, any
policy must complete step $i$ before step $i+1$. Learning a memory-dependent
policy could lead to an exponential overhead of additional states. SWIRL
exploits the fact that TSC is in a sense a Markov process, and shows that the
problem can be posed as a proper MDP in a lifted state-space that includes an
indicator variable of the highest-index transition region in $\{1,\ldots,k\}$ that has
been reached so far.</p>
<p>SWIRL applies a variant of Q-Learning to optimize the policy over the sequential
rewards. The basic change to the algorithm is to augment the state-space with an
indicator vector that indicates the transition regions that have been reached.
Each of the rollouts now records a tuple</p>
<script type="math/tex; mode=display">(s, i \in \{0,\ldots,k-1\}, a, r, s', i' \in \{0,\ldots,k-1\})</script>
<p>that additionally stores this information. The Q function is now defined over
states, actions, and segment index, which also selects the appropriate local
reward function:</p>
<script type="math/tex; mode=display">Q(s,a,i) = R_i(s,a) + \max_{a'} Q(s',a', i')</script>
<p>We also need to define an exploration policy, i.e., a stochastic policy with
which we will collect rollouts. To initialize the Q-Learning, we apply
Behavioral Cloning locally for each of the segments to get a policy $\pi_i$. We
apply an $\epsilon$-greedy version of these policies to collect rollouts.</p>
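On a toy chain MDP (entirely our own construction; actions move left or right), the lifted-state Q-learning can be sketched as:

```python
import numpy as np

def swirl_q_learning(n_states, regions, rewards, episodes=500, gamma=0.95,
                     alpha=0.5, eps=0.1, seed=0):
    """SWIRL-style Q-learning on a 1-D chain (toy sketch). The state is
    lifted to (s, i): i is the current segment index, reward rewards[i] is
    active until region regions[i] is reached, which increments i."""
    rng = np.random.default_rng(seed)
    k = len(regions)
    Q = np.zeros((n_states, k, 2))                   # (state, segment, action)
    for _ in range(episodes):
        s, i = 0, 0
        for _ in range(50):
            a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s, i]))
            s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
            i2 = i + 1 if (i < k - 1 and s2 == regions[i]) else i
            r = rewards[i](s2)                       # local reward of segment i
            Q[s, i, a] += alpha * (r + gamma * Q[s2, i2].max() - Q[s, i, a])
            s, i = s2, i2
            if i == k - 1 and s == regions[-1]:      # final region reached
                break
    return Q
```

With two transition regions and quadratic-style local rewards, the greedy policy learns to pass through the first region before heading to the second.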
<p>We evaluated SWIRL on a deformable sheet tensioning task. A sheet of surgical
gauze is fixtured at the two far corners using a pair of clips. The unclipped
part of the gauze is allowed to rest on soft silicone padding. The robot’s task
is to reach for the unclipped part, grasp it, lift the gauze, and tension the
sheet to be as planar as possible. An open-loop policy, one that does not react
to unexpected changes, typically fails on this task because it requires some
feedback of whether gauze is properly grasped, how the gauze has deformed after
grasping, and visual feedback of whether the gauze is planar. The task is
sequential, as some grasps pick up more or less of the material and the
flattening procedure has to be accordingly modified.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/tensioning-task.png" alt="Figure 4: Deformable Sheet Tensioning Setup." /><br />
<i>
Deformable sheet tensioning setup.
</i>
</p>
<p>We provided 15 demonstrations through a keyboard-based tele-operation interface.
The average length of the demonstrations was 48.4 actions (although we sampled
observations at a higher frequency, about 10 observations for every action).
From these 15 demonstrations, SWIRL identifies four segments: one corresponds
to moving to the correct grasping position, one to making the grasp, one to
lifting the gauze, and one to straightening it. One of the
interesting aspects of this task is that the
segmentation requires multiple features, and segmenting any single signal may
miss an important feature.</p>
<p>Then, we tried to learn a policy from the rewards constructed by SWIRL. We
define a Q-Network with a single-layer Multi-Layer Perceptron with 32 hidden
units and sigmoid activation. For each of the segments, we apply Behavioral
Cloning locally with the same architecture as the Q-network (with an additional
softmax over the output layer) to get an initial policy. We roll out 100 trials
with an $\epsilon=0.1$ greedy version of these segmented policies. The results
are depicted below:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/swirl-tensioning.png" alt="Figure 5: Deformable Sheet Tensioning Demonstration." /><br />
<i>
A representative demonstration of the deformable sheet tensioning task with
relevant features plotted over time. SWIRL identifies 4 segments which
correspond to reaching, grasping, lifting, and tensioning.
</i>
</p>
<p>SWIRL achieves more than 4 times higher reward than ab initio RL, 3 times
higher reward than pure behavioral cloning, and a 56% higher reward than naively
applying behavioral cloning with TSC segments.</p>
<h1 id="hierarchical-representations">Hierarchical Representations</h1>
<p>We are now exploring a generalization of TSC and SWIRL with a new algorithm:
Deep Discovery of Continuous Options (DDCO Krishnan et al. 2017, to be presented
at the 1st Conference on Robot Learning in November).</p>
<p>An option represents a low-level policy that can be invoked by a high-level
policy to perform a certain sub-task. Formally, an option $h$ in an options set
$\mathcal H$ is specified by a control policy $\pi_h(a_t | s_t)$ and a
stochastic termination condition $\psi_h(s_t)\in[0,1]$. The high-level policy
$\eta(h_t | s_t)$ defines the distribution over options given the state. Once an
option $h$ is invoked, physical controls are selected by the option’s policy
$\pi_h$ until it terminates. After each physical control is applied and the next
state $s’$ is reached, the option $h$ terminates with probability $\psi_h(s’)$,
and if it does then the high-level policy selects a new option $h’$ with
distribution $\eta(h’ | s’)$. Thus the interaction of the hierarchical control
policy $\langle\eta,(\pi_h,\psi_h)_{h\in\mathcal H}\rangle$ with the system
induces a stochastic process over the states $s_t$, the options $h_t$, the
controls $a_t$, and the binary termination indicators $b_t$.</p>
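The generative process this hierarchy induces is easy to write down explicitly. Below is a sketch with callable stand-ins for $\eta$, $\pi_h$, $\psi_h$, and the dynamics (all interfaces are hypothetical):

```python
import numpy as np

def rollout_hierarchy(eta, pis, psis, step, s0, horizon, seed=0):
    """Sample (s_t, h_t, a_t, b_t) from the hierarchical policy.
    eta(s, rng) samples an option, pis[h](s, rng) a control,
    psis[h](s) gives the termination probability, step(s, a) the dynamics."""
    rng = np.random.default_rng(seed)
    s, h, b = s0, None, 1
    trace = []
    for _ in range(horizon):
        if b:                                   # option terminated: redraw h
            h = eta(s, rng)
        a = pis[h](s, rng)
        trace.append((s, h, a, b))
        s = step(s, a)
        b = int(rng.random() < psis[h](s))      # terminate w.p. psi_h(s')
    return trace
```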
<p>DDCO is a policy-gradient algorithm that discovers parametrized options by
fitting their parameters to maximize the likelihood of a set of demonstration
trajectories. We denote by $\theta$ the vector of all trainable parameters used
for $\eta$ and for $\pi_h$ and $\psi_h$ of each option $h\in\mathcal H$. For
example, $\theta$ can be the weights and biases of a feed-forward network that
computes these probabilities. We wish to find the $\theta\in\Theta$ that
maximizes the log-likelihood of generating each demonstration trajectory
$\xi=(s_0,a_0,s_1,\ldots,s_T)$. The challenge is that this log-likelihood
depends on the latent variables in the stochastic process, the options and the
termination indicators $\zeta = (b_0,h_0,b_1,h_1,\ldots,h_{T-1})$. DDCO
optimizes this objective with an Expectation-Gradient algorithm:</p>
<script type="math/tex; mode=display">\nabla_\theta L[\theta;\xi] = \mathbb{E}_\theta[\nabla_\theta \log \mathbb{P}_\theta(\zeta,\xi) | \xi],</script>
<p>where $\mathbb{P}_\theta(\zeta,\xi)$ is the joint probability of the latent and
observable variables, given by</p>
<script type="math/tex; mode=display">\mathbb{P}_\theta(\zeta,\xi) = p_0(s_0) \delta_{b_0=1}\eta(h_0 | s_0)
\prod_{t=1}^{T-1} \mathbb{P}_\theta(b_t, h_t | h_{t-1}, s_t) \prod_{t=0}^{T-1}
\pi_{h_t}(a_t | s_t) p(s_{t+1} |s_t, a_t) ,</script>
<p>where in the latent transition <script type="math/tex">\mathbb{P}_\theta(b_t, h_t | h_{t-1}, s_t)</script> we have
with probability $\psi_{h_{t-1}}(s_t)$ that $b_t=1$ and $h_t$ is drawn from
$\eta(\cdot|s_t)$, and otherwise that $b_t=0$ and $h_t$ is unchanged, i.e.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{P}_\theta(b_t {=} 1, h_t | h_{t-1}, s_t) &= \psi_{h_{t-1}}(s_t) \eta(h_t | s_t) \\
\mathbb{P}_\theta(b_t {=} 0, h_t | h_{t-1}, s_t) &= (1 - \psi_{h_{t-1}}(s_t)) \delta_{h_t = h_{t-1}}.
\end{align} %]]></script>
<p>The log-likelihood gradient can be computed in two steps, an E-step where the
marginal posteriors</p>
<script type="math/tex; mode=display">u_t(h) = \mathbb{P}_\theta(h_t {=} h | \xi); \quad v_t(h) = \mathbb{P}_\theta(b_t {=} 1,
h_t {=} h | \xi); \quad w_t(h) = \mathbb{P}_\theta(h_t {=} h, b_{t+1} {=} 0 | \xi)</script>
<p>are computed using a forward-backward algorithm similar to Baum-Welch, and a
G-step:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta L[\theta;\xi] &= \sum_{h\in\mathcal{H}} \Biggl( \sum_{t=0}^{T-1}
\Biggl(v_t(h) \nabla_\theta \log \eta(h | s_t) + u_t(h)\nabla_\theta
\log \pi_h(a_t | s_t)\Biggr) \\
& + \sum_{t=0}^{T-2} \Biggl((u_t(h)-w_t(h)) \nabla_\theta \log
\psi_h(s_{t+1}) + w_t(h) \nabla_\theta \log (1 - \psi_h(s_{t+1}))
\Biggr)\Biggr).
\end{align} %]]></script>
<p>The gradient computed above can then be used in any stochastic gradient descent
algorithm. In our experiments we use Adam and Momentum.</p>
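For intuition, the marginal posterior $u_t(h)$ can be computed with a Baum-Welch-style forward-backward pass in which the termination structure is folded into an HMM-like transition matrix. A tabular sketch (interfaces are our own; $v_t$ and $w_t$ follow analogously):

```python
import numpy as np

def option_posteriors(states, actions, eta, pis, psis):
    """E-step sketch: u_t(h) = P(h_t = h | xi) via forward-backward.
    eta(s) returns the high-level distribution over options, pis[h](a, s)
    the option's action likelihood, psis[h](s) its termination probability
    (all assumed known and tabular here)."""
    T, H = len(actions), len(pis)

    def trans(s):
        # M[h, h']: either terminate and redraw h', or keep the same option
        term = np.array([psis[h](s) for h in range(H)])
        return np.outer(term, eta(s)) + np.diag(1.0 - term)

    def emit(t):
        return np.array([pis[h](actions[t], states[t]) for h in range(H)])

    alpha = np.zeros((T, H))
    alpha[0] = eta(states[0]) * emit(0)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans(states[t])) * emit(t)
    beta = np.ones((T, H))
    for t in range(T - 2, -1, -1):
        beta[t] = trans(states[t + 1]) @ (emit(t + 1) * beta[t + 1])
    u = alpha * beta
    return u / u.sum(axis=1, keepdims=True)
```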
<p>We evaluated DDCO in an Imitation Learning setting with surgical robotic tasks.
In one task, the robot is given a foam bin with a pile of 5–8 needles of three
different types, each 1–3mm in diameter. The robot must extract needles of a
specified type and place them in an “accept” cup, while placing all other
needles in a “reject” cup. The task is successful if the entire foam bin is
cleared into the correct cups. To define the state space for this task, we first
generate binary images from overhead stereo images, and apply a color-based
segmentation to identify the needles (the “image” input). Then, we use a
classifier trained in advance on 40 hand-labeled images to identify and provide
a candidate grasp point, specified by position and direction in image space (the
“grasp” input). Additionally, the 6 DoF robot gripper pose and the open-closed
state of the gripper are observed (the “kin” input). The state space of the
robot is (“image”, “grasp”, “kin”), and the control space is the 6 joint angles
and the gripper angle.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/dvrk-bin-picking.png" alt="Figure 6: Needle Pick and Place Task." /><br />
<i>
Needle pick and place task on the surgical robot.
</i>
</p>
<p>In 10 trials, 7/10 were successful. The main failure mode was unsuccessful
grasping due to picking either no needles or multiple needles. As the piles were
cleared and became sparser, the robot’s grasping policy became somewhat brittle.
The grasp success rate was 66% on 99 attempted grasps. In contrast, we rarely
observed failures at the other aspects of the task, reaching 97% successful
recovery on 34 failed grasps.</p>
<p>The learned options are interpretable on intuitive task boundaries. For each of
the 4 options, we plot how heavily the different inputs are weighted (image,
grasp, or kin) in computing the option’s action. Nonzero values of the ReLU
units are marked in white and indicate input relevance:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/ddco-activations.png" alt="Figure 7: DDCO Options." /><br />
<i>
We plot the average activations of the feature layer for each option,
indicating which inputs (image, gripper angle, or kinematics) are relevant to
the policy and termination.
</i>
</p>
<p>We see that the options are clearly specialized. The first option has a strong
dependence only on the grasp candidate, the second option attends almost
exclusively to the image, while the last two options rely mostly on the
kinematics and grasp features.</p>
<h1 id="conclusion">Conclusion</h1>
<p>To summarize, learning sequential task structure from demonstrations has many
applications in robotics, such as automating surgical sub-tasks, and can be
facilitated by segmenting demonstrations to learn that structure. We see several avenues for
future work: (1) representations that better model rotational geometry and
configuration spaces, (2) hybrid schemes that consider both parametrized
primitives and those derived from analytic formulae, and (3) consideration of
state-space segmentation as well as temporal segmentation.</p>
<hr />
<h2 id="references">References</h2>
<p>(For links to papers, see the homepages of <a href="https://www.ocf.berkeley.edu/~sanjayk/">Sanjay
Krishnan</a> or <a href="http://goldberg.berkeley.edu/pubs/">Ken
Goldberg</a>.)</p>
<p>Sanjay Krishnan*, Roy Fox*, Ion Stoica, Ken Goldberg. DDCO: Discovery of Deep
Continuous Options for Robot Learning from Demonstrations. Conference on Robot
Learning (CoRL). 2017.</p>
<p>Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller,
Florian T. Pokorny, Ken Goldberg. SWIRL: A Sequential Windowed Inverse
Reinforcement Learning Algorithm for Robot Tasks With Delayed Rewards. Workshop
on Algorithmic Foundations of Robotics (WAFR) 2016.</p>
<p>Sanjay Krishnan*, Animesh Garg*, Sachin Patil, Colin Lea, Gregory Hager,
Pieter Abbeel, Ken Goldberg. Transition State Clustering: Unsupervised Surgical
Task Segmentation For Robot Learning. International Symposium on Robotics
Research (ISRR). 2015.</p>
<p>Adithyavairavan Murali*, Siddarth Sen*, Ben Kehoe, Animesh Garg, Seth McFarland,
Sachin Patil, W. Douglas Boyd, Susan Lim, Pieter Abbeel, Ken Goldberg. Learning
by Observation for Surgical Subtasks: Multilateral Cutting of 3D Viscoelastic
and 2D Orthotropic Tissue Phantoms. International Conference on Robotics and
Automation (ICRA). May 2015.</p>
<h2 id="external-references">External References</h2>
<p>Richard Sutton. Temporal credit assignment in reinforcement learning. 1984.</p>
<p>Richard Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A
framework for temporal abstraction in reinforcement learning. Artificial
intelligence. 1999.</p>
<p>Lerrel Pinto, and Abhinav Gupta. Supersizing self-supervision: Learning to grasp
from 50k tries and 700 robot hours. International Conference on Robotics and
Automation (ICRA). 2016.</p>
<p>Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, Deirdre Quillen.
Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and
Large-Scale Data Collection. International Journal of Robotics Research (IJRR). 2017.</p>
<p>Sergey Levine*, Chelsea Finn*, Trevor Darrell, and Pieter Abbeel. End-to-end
training of deep visuomotor policies. Journal of Machine Learning Research
(JMLR). 2016.</p>
Tue, 17 Oct 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/10/17/lfd-surgical-robots/
Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning<p>Deep reinforcement learning (deep RL) has achieved success in many tasks, such as playing video games from raw pixels (Mnih et al., 2015), playing the game of Go (Silver et al., 2016), and simulated robotic locomotion (e.g. Schulman et al., 2015). Standard deep RL algorithms aim to master a single way to solve a given task, typically the first way that seems to work well. Therefore, training is sensitive to randomness in the environment, initialization of the policy, and the algorithm implementation. This phenomenon is illustrated in Figure 1, which shows two policies trained to optimize a reward function that encourages forward motion: while both policies have converged to a high-performing gait, these gaits are substantially different from each other.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_1_walker_two_gaits_v2.gif" alt="Figure 1: Trained simulated walking robots." /><br />
<i>
Figure 1: Trained simulated walking robots.<br />
[credit: John Schulman and Patrick Coady (<a href="https://gym.openai.com/envs/Walker2d-v1/">OpenAI Gym)</a>]
</i>
</p>
<p>Why might finding only a single solution be undesirable? Knowing only one way to act makes agents vulnerable to environmental changes that are common in the real-world. For example, consider a robot (Figure 2) navigating its way to the goal (blue cross) in a simple maze. At training time (Figure 2a), there are two passages that lead to the goal. The agent will likely commit to the solution via the upper passage as it is slightly shorter. However, if we change the environment by blocking the upper passage with a wall (Figure 2b), the solution the agent has found becomes infeasible. Since the agent focused entirely on the upper passage during learning, it has almost no knowledge of the lower passage. Therefore, adapting to the new situation in Figure 2b requires the agent to relearn the entire task from scratch.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_2a_maze_one_path.png" alt="maze_one_path" width="300" /><p class="center">2a</p>
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_2b_maze-two-paths.png" alt="maze-two-paths" width="300" /><p class="center">2b</p>
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
Figure 2: A robot navigating a maze.
</i>
</p>
<!--more-->
<h3 id="maximum-entropy-policies-and-their-energy-forms">Maximum Entropy Policies and Their Energy Forms</h3>
<p>Let us begin with a review of RL: an agent interacts with an environment by iteratively observing the current <em>state</em> ($\mathbf{s}$), taking an <em>action</em> ($\mathbf{a}$), and receiving a <em>reward</em> ($r$). It employs a (stochastic) policy ($\pi$) to select actions, and finds the best policy that maximizes the cumulative reward it collects throughout an episode of length $T$:</p>
<script type="math/tex; mode=display">\pi^* = \arg\!\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^T r_t \right]</script>
<p>We define the Q-function, $Q(\mathbf{s},\mathbf{a})$, as the expected cumulative reward after taking action $\mathbf{a}$ at state $\mathbf{s}$. Consider the robot in Figure 2a again. When the robot is in the initial state, the Q-function may look like the one depicted in Figure 3a (grey curve), with two distinct modes corresponding to the two passages. A conventional RL approach is to specify a unimodal policy distribution, centered at the maximal Q-value and extending to the neighbouring actions to provide noise for exploration (red distribution). Since the exploration is biased towards the upper passage, the agent refines its policy there and ignores the lower passage completely.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_3a_unimodal-policy.png" alt="unimodal-policy" width="300" /><p class="center">3a</p>
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_3b_multimodal_policy.png" alt="multimodal_policy" width="300" /><p class="center">3b</p>
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
Figure 3: A multimodal Q-function.
</i>
</p>
<p>At a high level, an obvious solution is to ensure the agent explores all promising states while prioritizing the more promising ones. One way to formalize this idea is to define the policy directly in terms of exponentiated Q-values (Figure 3b, green distribution):</p>
<script type="math/tex; mode=display">\pi(\mathbf{a}|\mathbf{s}) \propto \exp Q(\mathbf{s}, \mathbf{a})</script>
<p>This density has the form of the Boltzmann distribution, where the Q-function serves as the negative energy, which assigns a non-zero likelihood to all actions. As a consequence, the agent will become aware of all behaviours that lead to solving the task, which can help the agent adapt to changing situations in which some of the solutions might have become infeasible. In fact, we can show that the policy defined through the energy form is an optimal solution for the maximum-entropy RL objective</p>
<script type="math/tex; mode=display">\pi_{\mathrm{MaxEnt}}^* = \arg\!\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^T r_t + \mathcal{H}(\pi(\cdot | \mathbf{s}_t)) \right]</script>
<p>which simply augments the conventional RL objective with the entropy of the policy (Ziebart 2010).</p>
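To make the energy form concrete, here is a minimal sketch, assuming a discretized set of candidate actions, of the policy $\pi(\mathbf{a}|\mathbf{s}) \propto \exp Q(\mathbf{s}, \mathbf{a})$:

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """pi(a|s) proportional to exp(Q(s, a) / T) over discrete candidate actions."""
    logits = q_values / temperature
    logits = logits - logits.max()   # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# A bimodal Q-function like Figure 3: the greedy mode gets the most mass,
# but the second passage keeps a non-zero probability of being explored.
q = np.array([1.1, 0.1, 1.0, 0.2])
pi = boltzmann_policy(q)
```

In the actual continuous action spaces considered here, this distribution cannot be normalized or sampled in closed form, which is exactly the inference challenge soft Q-learning addresses below.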
<p>The idea of learning such <a href="https://en.wikipedia.org/wiki/Principle_of_maximum_entropy">maximum entropy models</a> has its origin in statistical modeling, in which the goal is to find the probability distribution that has the highest entropy while still satisfying the observed statistics. For example, if the distribution is on the Euclidean space and the observed statistics are the mean and the covariance, then the maximum entropy distribution is a Gaussian with the corresponding mean and covariance. In practice, we prefer maximum-entropy models as they assume the least about the unknowns while matching the observed information.</p>
<p>A number of prior works have employed the maximum-entropy principle in the context of reinforcement learning and optimal control. Ziebart (2008) used the maximum entropy principle to resolve ambiguities in inverse reinforcement learning, where several reward functions can explain the observed demonstrations. Several works (Todorov 2008; Toussaint, 2009) have studied the connection between inference and control via the maximum entropy formulation. Todorov (2007, 2009) also showed how the maximum entropy principle can be employed to make MDPs linearly solvable, and Fox et al. (2016) utilized the principle as a means to incorporate prior knowledge into a reinforcement learning policy.</p>
<h2 id="soft-bellman-equation-and-soft-q-learning">Soft Bellman Equation and Soft Q-Learning</h2>
<p>We can obtain the optimal solution of the maximum entropy objective by employing the <em>soft Bellman equation</em></p>
<script type="math/tex; mode=display">Q(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}\left[r_t + \gamma\ \mathrm{softmax}_{\mathbf{a}} Q(\mathbf{s}_{t+1}, \mathbf{a})\right]</script>
<p>where</p>
<script type="math/tex; mode=display">\mathrm{softmax}_{\mathbf{a}} f(\mathbf{a}) := \log \int \exp f(\mathbf{a}) \, d\mathbf{a}</script>
<p>The soft Bellman equation can be shown to hold for the optimal Q-function of the entropy augmented reward function (e.g. Ziebart 2010). Note the similarity to the conventional Bellman equation, which has the hard max of the Q-function over the actions instead of the softmax. Like the hard version, the soft Bellman equation is a contraction, which allows solving for the Q-function using dynamic programming or model-free TD learning in tabular state and action spaces (e.g. Ziebart, 2008; Rawlik, 2012; Fox, 2016).</p>
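The softmax here is the log-sum-exp operator. A small sketch of our own, for a discrete action set, shows how it smooths the hard max used by the conventional backup:

```python
import numpy as np

def soft_max_operator(q_row):
    """softmax_a Q(s, a) = log sum_a exp Q(s, a), computed stably."""
    m = q_row.max()
    return m + np.log(np.exp(q_row - m).sum())

q_row = np.array([1.0, 2.0, 3.0])
v_soft = soft_max_operator(q_row)  # soft value used in the soft Bellman backup
v_hard = q_row.max()               # hard value used in the conventional backup
```

The soft value upper-bounds the hard max by at most $\log |\mathcal{A}|$, and the two coincide as one action's Q-value comes to dominate the rest.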
<p>However, in continuous domains, there are two major challenges. First, exact dynamic programming is infeasible, since the soft Bellman equation needs to hold for every state and action, and the softmax involves integrating over the entire action space. Second, the optimal policy is defined by an intractable energy-based distribution, which is difficult to sample from. To address the first challenge, we can employ expressive neural network function approximators, which can be trained with stochastic gradient descent on sampled states and actions and then generalize effectively to new state-action tuples. To address the second challenge, we can employ approximate inference techniques, such as Markov chain Monte Carlo, which has been explored in prior works for energy-based policies (Heess, 2012). To accelerate inference, we use the amortized Stein variational gradient descent (Wang and Liu, 2016) to train an inference network to generate approximate samples. The resulting algorithm, termed <em>soft Q-learning</em>, combines deep Q-learning and the amortized Stein variational gradient descent.</p>
<h2 id="application-to-reinforcement-learning">Application to Reinforcement Learning</h2>
<p>Now that we can learn maximum entropy policies via soft Q-learning, we might wonder: what are the practical uses of this approach? In the following sections, we illustrate with experiments that soft Q-learning allows for better exploration, enables policy transfer between similar tasks, allows new policies to be easily composed from existing policies, and improves robustness through extensive exploration at training time.</p>
<h3 id="better-exploration">Better Exploration</h3>
<p>Soft Q-learning (SQL) provides us with an implicit exploration strategy by assigning each action a non-zero probability, shaped by the current belief about its value, effectively combining exploration and exploitation in a natural way. To see this, let us consider a two-passage maze (Figure 4) similar to the one discussed in the introduction. The task is to find a way to the goal state, denoted by a blue square. Suppose that the reward is proportional to the distance to the goal. Since the maze is almost symmetric, such a reward results in a bimodal objective, but only one of the modes corresponds to an actual solution to the task. Thus, exploring both passages at training time is crucial to discover which of the two is really best. A unimodal policy can only solve this task if it is lucky enough to commit to the lower passage from the start. On the other hand, a multimodal soft Q-learning policy can solve the task consistently by following both passages randomly until the agent finds the goal (Figure 4).</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_4_ant_maze.gif" alt="A policy trained with soft Q-learning." /><br />
<i>
Figure 4: A policy trained with soft Q-learning can explore both passages during training.
</i>
</p>
<h3 id="fine-tuning-maximum-entropy-policies">Fine-Tuning Maximum Entropy Policies</h3>
<p>The standard practice in RL is to train an agent from scratch for each new task. This can be slow because the agent throws away knowledge acquired from previous tasks. Instead, the agent can transfer skills from similar previous tasks, allowing it to learn new tasks more quickly. One way to transfer skills is to pre-train policies for general purpose tasks, and then use them as templates or initializations for more specific tasks. For example, the skill of walking subsumes the skill of navigating through a maze, and therefore the walking skill can serve as an efficient initialization for learning the navigation skill. To illustrate this idea, we trained a maximum entropy policy by rewarding the agent for walking at a high speed, regardless of the direction. The resulting policy learns to walk, but does not commit to any single direction due to the maximum entropy objective (Figure 5a). Next, we specialized the walking skill to a range of navigation skills, such as the one in Figure 5b. In the new task, the agent only needs to choose which walking behavior will move itself closer to the goal, which is substantially easier than learning the same skill from scratch. A conventional policy would converge to a specific behaviour when trained for the general task. For example, it may only learn to walk in a single direction. Consequently, it cannot directly transfer the walking skill to the maze environment, which requires movement in multiple directions.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_5a_pretrain_softql_small.gif" alt="pretrain_softql_small" width="200" /><p class="center">5a</p>
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_5b_finetune_ushape_2.gif" alt="finetune_ushape_2" width="350" /><p class="center">5b</p>
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
Figure 5: Maximum entropy pretraining allows agents to learn more quickly in new environments. Videos of the same pretrained policy fine-tuned for other target tasks can be found at <a href="https://www.youtube.com/watch?v=7Nm1N6sUoVs&feature=youtu.be">this</a> link.
</i>
</p>
<h3 id="compositionality">Compositionality</h3>
<p>In a similar vein to general-to-specific transfer, we can compose new skills from existing policies—even without any fine-tuning—by intersecting different skills. The idea is simple: take two soft policies, each corresponding to a different set of behaviors, and combine them by adding together their Q-functions. In fact, it is possible to show that the combined policy is approximately optimal for the combined task, obtained by simply adding the reward functions of the constituent tasks, up to a bounded error. Consider a planar manipulator like the one pictured below. The two agents on the left are trained to move the cylindrical object to a target location illustrated with red stripes. Note how the solution spaces of the two tasks overlap: by moving the cylinder to the intersection of the stripes, both tasks can be solved simultaneously. Indeed, the policy on the right, which is obtained by simply summing together the two Q-functions, moves the cylinder to the intersection, without the need to train a policy explicitly for the combined task. Conventional policies do not exhibit the same compositionality property as they can only represent specific, disjoint solutions.</p>
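The composition itself is just addition in Q-space. A toy sketch of the idea, assuming a discretized action set of our own choosing:

```python
import numpy as np

def boltzmann(q):
    """Policy proportional to exp(Q) over discrete candidate actions."""
    z = q - q.max()
    p = np.exp(z)
    return p / p.sum()

# Each task's Q-function prefers a different region of the action set,
# but both assign high value to the overlapping action (index 2).
q_task1 = np.array([2.0, 1.8, 1.7, 0.0, 0.0])
q_task2 = np.array([0.0, 0.0, 1.7, 1.8, 2.0])
pi_combined = boltzmann(q_task1 + q_task2)  # mass concentrates on the overlap
```

Neither constituent policy alone puts most of its mass on the overlap, yet the summed Q-function does, mirroring how the combined manipulator policy moves the cylinder to the intersection of the stripes.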
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_6_composition_small.gif" alt="Combining two skills into a new one" /><br />
<i>
Figure 6: Combining two skills into a new one.
</i>
</p>
<h3 id="robustness">Robustness</h3>
<p>Because the maximum entropy formulation encourages agents to try all possible solutions, the agents learn to explore a large portion of the state space. Thus they learn to act in various situations, and are more robust against perturbations in the environment. To illustrate this, we trained a Sawyer robot to stack Lego blocks together by specifying a target end-effector pose. Figure 7 shows some snapshots during training.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_7_sawyer_training_white_bg.gif" alt="Training to stack Lego blocks with soft Q-learning." /><br />
<i>
Figure 7: Training to stack Lego blocks with soft Q-learning.<br />
[credit: Aurick Zhou]
</i>
</p>
<p>The robot succeeded for the first time after 30 minutes; after an hour, it was able to stack the blocks consistently; and after two hours, the policy had fully converged. The converged policy is also robust to perturbations as shown in the video below, in which the arm is perturbed into configurations that are very different from what it encounters during normal execution, and it is able to successfully recover every time.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_8_sawyer_bully_policy.gif" alt="The trained policy is robust to perturbations" /><br />
<i>
Figure 8: The trained policy is robust to perturbations.
</i>
</p>
<h2 id="related-work">Related Work</h2>
<p>Soft optimality has also been studied in recent papers in the context of learning from multi-step transitions (Nachum et al., 2017) and its connection to policy gradient methods (Schulman et al., 2017). A related concept is discussed by O’Donoghue et al. (2016), who also consider entropy regularization and Boltzmann exploration. Their version of entropy regularization only considers the entropy of the current state, and does not take into account the entropy of future states.</p>
<p>To our knowledge, only a few prior works have demonstrated successful model-free reinforcement learning directly on real-world robots. Gu et al. (2016) showed that NAF could learn door opening tasks, using about 2.5 hours of experience parallelized across two robots. Rusu et al. (2016) used RL to train a robot arm to reach a red square, with pretraining in simulation. Večerík et al. (2017) showed that, if initialized from demonstration, a Sawyer robot could perform a peg-insertion style task with about 30 minutes of experience. It is worth noting that our soft Q-learning results, shown above, used only a single robot for training, and did not use any simulation or demonstrations.</p>
<hr />
<p>We would like to thank Sergey Levine, Pieter Abbeel, and Gregory Kahn for their valuable feedback when preparing this blog post.</p>
<p>This post is based on the following paper: <br />
<strong>Reinforcement Learning with Deep Energy-Based Policies</strong> <br />
Haarnoja T., Tang H., Abbeel P., Levine S. <em>ICML 2017</em>.<br />
<a href="https://arxiv.org/abs/1702.08165">paper</a>, <a href="https://github.com/haarnoja/softqlearning">code</a>, <a href="https://sites.google.com/view/softqlearning/home">videos</a></p>
<h2 id="references">References</h2>
<p><strong>Related concurrent papers</strong></p>
<ul>
<li>Schulman, J., Abbeel, P. and Chen, X. Equivalence Between Policy Gradients and Soft Q-Learning. <em>arXiv preprint arXiv:1704.06440</em>, 2017.</li>
<li>Nachum, O., Norouzi, M., Xu, K. and Schuurmans, D. Bridging the Gap Between Value and Policy Based Reinforcement Learning. <em>NIPS 2017</em>.</li>
</ul>
<p><strong>Papers leveraging the maximum entropy principle</strong></p>
<ul>
<li>Kappen, H. J. Path integrals and symmetry breaking for optimal control theory. <em>Journal of Statistical Mechanics: Theory And Experiment</em>, 2005(11): P11011, 2005.</li>
<li>Todorov, E. Linearly-solvable Markov decision problems. In <em>Advances in Neural Information Processing Systems</em>, pp. 1369–1376. MIT Press, 2007.</li>
<li>Todorov, E. General duality between optimal control and estimation. In IEEE Conf. on Decision and Control, pp. 4286–4292. IEEE, 2008.</li>
<li>Todorov, E. Compositionality of optimal control laws. In <em>Advances in Neural Information Processing Systems</em>, pp. 1856–1864, 2009.</li>
<li>Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.</li>
<li>Toussaint, M. Robot trajectory optimization using approximate inference. In <em>Int. Conf. on Machine Learning</em>, pp. 1049–1056. ACM, 2009.</li>
<li>Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, 2010.</li>
<li>Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. <em>Proceedings of Robotics: Science and Systems VIII</em>, 2012.</li>
<li>Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In <em>Conf. on Uncertainty in Artificial Intelligence</em>, 2016.</li>
</ul>
<p><strong>Model-free RL in the real-world</strong></p>
<ul>
<li>Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In Int. Conf. on Machine Learning, pp. 2829–2838, 2016.</li>
<li>Večerík, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., and Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. <em>arXiv preprint arXiv:1707.08817</em>, 2017.</li>
</ul>
<p><strong>Other references</strong><br />
Heess, N., Silver, D., and Teh, Y.W. Actor-critic reinforcement learning with energy-based policies. In Workshop on Reinforcement Learning, pp. 43. Citeseer, 2012.</p>
<p>Jaynes, E.T. Prior probabilities. IEEE Transactions on systems science and cybernetics, 4(3), pp.227-241, 1968.</p>
<p>Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ICLR 2016.</p>
<p>Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In <em>Advances In Neural Information Processing Systems</em>, pp. 2370–2378, 2016.</p>
<p>Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A, Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. <em>Nature</em>, 518 (7540):529–533, 2015.</p>
<p>Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In <em>International Conference on Machine Learning</em> (pp. 1928-1937), 2016.</p>
<p>O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. <em>arXiv preprint arXiv:1611.01626</em>, 2016.</p>
<p>Rusu, A. A., Vecerik, M., Rothörl, T., Heess, N., Pascanu, R. and Hadsell, R., Sim-to-real robot learning from pixels with progressive nets. <em>arXiv preprint arXiv:1610.04286</em>, 2016.</p>
<p>Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015.</p>
<p>Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S. Mastering the game of Go with deep neural networks and tree search. <em>Nature</em>, 529(7587), 484-489, 2016.</p>
<p>Sutton, R. S. and Barto, A. G. <em>Reinforcement learning: An introduction</em>, volume 1. MIT press Cambridge, 1998.</p>
<p>Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W. and Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. <em>arXiv preprint arXiv:1703.06907</em>, 2017.</p>
<p>Wang, D., and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. <em>arXiv preprint arXiv:1611.01722</em>, 2016.</p>
Fri, 06 Oct 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/
Learning to Optimize with Reinforcement Learning<p><em>Since we posted our paper on “<a href="https://arxiv.org/abs/1606.01885">Learning to Optimize</a>” last year, the area of optimizer learning has received growing attention. In this article, we provide an introduction to this line of work and share our perspective on the opportunities and challenges in this area.</em></p>
<p>Machine learning has enjoyed tremendous success and is being applied to a wide variety of areas, both in AI and beyond. This success can be attributed to the data-driven philosophy that underpins machine learning, which favours automatic discovery of patterns from data over manual design of systems using expert knowledge.</p>
<p>Yet, there is a paradox in the current paradigm: the algorithms that power machine learning are still designed manually. This raises a natural question: can we <em>learn</em> these algorithms instead? This could open up exciting possibilities: we could find new algorithms that perform better than manually designed algorithms, which could in turn improve learning capability.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/teaser.png" alt="The learned optimizer could potentially pick better update steps than traditional optimizers." />
</p>
<!--more-->
<p>Doing so, however, requires overcoming a fundamental obstacle: how do we parameterize the space of algorithms so that it is both (1) expressive, and (2) efficiently searchable? Various ways of representing algorithms trade off these two goals. For example, if the space of algorithms is represented by a small set of known algorithms, it most likely does not contain the best possible algorithm, but does allow for efficient searching via simple enumeration of algorithms in the set. On the other hand, if the space of algorithms is represented by the set of all possible programs, it contains the best possible algorithm, but does not allow for efficient searching, as enumeration would take exponential time.</p>
<p>Continuous optimization algorithms are among the most common types of algorithms used in machine learning. Several popular algorithms exist, including gradient descent, momentum, AdaGrad and Adam. We consider the problem of automatically designing such algorithms. Why do we want to do this? There are two reasons: first, many optimization algorithms are devised under the assumption of convexity and applied to non-convex objective functions; by learning the optimization algorithm under the same setting as it will actually be used in practice, the learned optimization algorithm could hopefully achieve better performance. Second, devising new optimization algorithms manually is usually laborious and can take months or years; learning the optimization algorithm could reduce the amount of manual labour.</p>
<h2 id="-learning-to-optimize"><a name="framework"></a> Learning to Optimize</h2>
<p>In our paper last year (<a href="https://arxiv.org/abs/1606.01885">Li & Malik, 2016</a>), we introduced a framework for learning optimization algorithms, known as “Learning to Optimize”. We note that soon after our paper appeared, (<a href="https://arxiv.org/abs/1606.04474">Andrychowicz et al., 2016</a>) also independently proposed a similar idea.</p>
<p>Consider how existing continuous optimization algorithms generally work. They operate in an iterative fashion and maintain some iterate, which is a point in the domain of the objective function. Initially, the iterate is some random point in the domain; in each iteration, a step vector is computed using some fixed update formula, which is then used to modify the iterate. The update formula is typically some function of the history of gradients of the objective function evaluated at the current and past iterates. For example, in gradient descent, the update formula is some scaled negative gradient; in momentum, the update formula is some scaled exponential moving average of the gradients.</p>
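These two update formulas can be written against a single interface, `step, new_state = update(grad, state)`; the sketch below is illustrative, with names of our own:

```python
def gradient_descent_update(grad, state, lr=0.1):
    """Step is a scaled negative gradient; no state is carried over."""
    return -lr * grad, state

def momentum_update(grad, state, lr=0.1, beta=0.9):
    """Step is a scaled exponential moving average of past gradients."""
    v = beta * state + grad
    return -lr * v, v

# Minimizing f(x) = x^2 (gradient 2x) with gradient descent:
x, s = 5.0, 0.0
for _ in range(50):
    step, s = gradient_descent_update(2 * x, s)
    x += step
```

Any learned optimizer can slot into the same interface: only the function that maps gradient history to a step changes.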
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/alg_structure.png" alt="Optimization algorithms start at a random point and iteratively update it with a step vector computed using a fixed update formula." />
</p>
<p>What changes from algorithm to algorithm is this update formula. So, if we can learn the update formula, we can learn an optimization algorithm. We model the update formula as a neural net. Thus, by learning the weights of the neural net, we can learn an optimization algorithm. Parameterizing the update formula as a neural net has two appealing properties mentioned earlier: first, it is expressive, as neural nets are universal function approximators and can in principle model any update formula with sufficient capacity; second, it allows for efficient search, as neural nets can be trained easily with backpropagation.</p>
<p>In order to learn the optimization algorithm, we need to define a performance metric, which we will refer to as the “meta-loss”, that rewards good optimizers and penalizes bad optimizers. Since a good optimizer converges quickly, a natural meta-loss would be the sum of objective values over all iterations (assuming the goal is to minimize the objective function), or equivalently, the cumulative regret. Intuitively, this corresponds to the area under the curve, which is larger when the optimizer converges slowly and smaller otherwise.</p>
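A minimal sketch of this meta-loss (names ours): given any update rule of the form `step, new_state = update(grad, state)`, sum the objective values along the optimization trajectory:

```python
def meta_loss(update, objective, grad_fn, x0, n_iters=50):
    """Area under the training curve: the sum of objective values over
    all iterations. Fast-converging optimizers accumulate a smaller sum."""
    x, state, total = x0, 0.0, 0.0
    for _ in range(n_iters):
        total += objective(x)
        step, state = update(grad_fn(x), state)
        x += step
    return total

f = lambda x: x ** 2       # objective
grad_f = lambda x: 2 * x   # its gradient
# Two fixed-step optimizers: the one with the larger stable step converges
# faster and therefore incurs a smaller meta-loss.
slow = meta_loss(lambda g, s: (-0.01 * g, s), f, grad_f, x0=5.0)
fast = meta_loss(lambda g, s: (-0.30 * g, s), f, grad_f, x0=5.0)
```

In the learned setting, `update` would be a neural net and this meta-loss would be minimized over its weights.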
<h2 id="learning-to-learn">Learning to Learn</h2>
<p>Consider the special case when the objective functions are loss functions for training other models. Under this setting, optimizer learning can be used for “learning to learn”. For clarity, we will refer to the model that is trained using the optimizer as the “base-model” and prefix common terms with “base-” and “meta-” to disambiguate concepts associated with the base-model and the optimizer respectively.</p>
<p>What do we mean exactly by “learning to learn”? While this term has appeared from time to time in the literature, different authors have used it to refer to different things, and there is no consensus on its precise definition. Often, it is also used interchangeably with the term “meta-learning”.</p>
<p>The term traces its origins to the idea of metacognition (<a href="http://classics.mit.edu/Aristotle/soul.html">Aristotle, 350 BC</a>), which describes the phenomenon that humans not only reason, but also reason about their own process of reasoning. Work on “learning to learn” draws inspiration from this idea and aims to turn it into concrete algorithms. Roughly speaking, “learning to learn” simply means learning <em>something</em> about learning. What is learned at the meta-level differs across methods. We can divide various methods into three broad categories according to the type of meta-knowledge they aim to learn:</p>
<ul>
<li>Learning <em>What</em> to Learn</li>
<li>Learning <em>Which Model</em> to Learn</li>
<li>Learning <em>How</em> to Learn</li>
</ul>
<h3 id="learning-what-to-learn">Learning <em>What</em> to Learn</h3>
<p>These methods aim to learn some particular values of base-model parameters that are useful across a family of related tasks (<a href="https://books.google.com/books?isbn=1461555299">Thrun & Pratt, 2012</a>). The meta-knowledge captures commonalities across the family, so that base-learning on a new task from the family can be done more quickly. Examples include methods for transfer learning, multi-task learning and few-shot learning. Early methods operate by partitioning the parameters of the base-model into two sets: those that are specific to a task and those that are common across tasks. For example, a popular approach for neural net base-models is to share the weights of the lower layers across all tasks, so that they capture the commonalities across tasks. See <a href="/blog/2017/07/18/learning-to-learn/">this post</a> by Chelsea Finn for an overview of the more recent methods in this area.</p>
<h3 id="learning-which-model-to-learn">Learning <em>Which Model</em> to Learn</h3>
<p>These methods aim to learn which base-model is best suited for a task (<a href="https://books.google.com/books?isbn=3540732632">Brazdil et al., 2008</a>). The meta-knowledge captures correlations between different base-models and their performance on different tasks. The challenge lies in parameterizing the space of base-models in a way that is expressive and efficiently searchable, and in parameterizing the space of tasks in a way that allows for generalization to unseen tasks. Different methods make different trade-offs between expressiveness and searchability: <a href="https://link.springer.com/article/10.1023/A:1021713901879">Brazdil et al. (2003)</a> use a database of predefined base-models and exemplar tasks and output the base-model that performed best on the nearest exemplar task. While this space of base-models is easy to search, it cannot contain good base-models that have yet to be discovered. <a href="https://link.springer.com/article/10.1023/B:MACH.0000015880.99707.b2">Schmidhuber (2004)</a> represents each base-model as a general-purpose program. While this space is very expressive, searching in it takes time exponential in the length of the target program. <a href="https://link.springer.com/chapter/10.1007/3-540-44668-0_13">Hochreiter et al. (2001)</a> view an algorithm that trains a base-model as a black-box function that maps a sequence of training examples to a sequence of predictions, and model it as a recurrent neural net. Meta-training then simply reduces to training the recurrent net. Because the base-model is encoded in the recurrent net’s memory state, its capacity is constrained by the memory size. A related area is hyperparameter optimization, which aims for a weaker goal and searches over base-models parameterized by a predefined set of hyperparameters. It needs to generalize across hyperparameter settings (and by extension, base-models), but not across tasks, since multiple trials with different hyperparameter settings on the same task are allowed.</p>
<h3 id="learning-how-to-learn">Learning <em>How</em> to Learn</h3>
<p>While methods in the previous categories aim to learn about the <em>outcome</em> of learning, methods in this category aim to learn about the <em>process</em> of learning. The meta-knowledge captures commonalities in the behaviours of learning algorithms. There are three components under this setting: the base-model, the base-algorithm for training the base-model, and the meta-algorithm that learns the base-algorithm. What is learned is not the base-model itself, but the base-algorithm, which trains the base-model on a task. Because both the base-model and the task are given by the user, the base-algorithm that is learned must work on a range of different base-models and tasks. Since most learning algorithms optimize some objective function, learning the base-algorithm in many cases reduces to learning an optimization algorithm. This problem of learning optimization algorithms was explored in (<a href="https://arxiv.org/abs/1606.01885">Li & Malik, 2016</a>), (<a href="https://arxiv.org/abs/1606.04474">Andrychowicz et al., 2016</a>) and a number of subsequent papers. Closely related to this line of work is (<a href="http://ieeexplore.ieee.org/abstract/document/155621">Bengio et al., 1991</a>), which learns a Hebb-like synaptic learning rule. The learning rule depends on a subset of the dimensions of the current iterate encoding the activities of neighbouring neurons, but does not depend on the objective function and therefore does not have the capability to generalize to different objective functions.</p>
<h2 id="generalization">Generalization</h2>
<p>Learning of any sort requires training on a finite number of examples and generalizing to the broader class from which the examples are drawn. It is therefore instructive to consider what the examples and the class correspond to in our context of learning optimizers for training base-models. Each example is an objective function, which corresponds to the loss function for training a base-model on a task. The task is characterized by a set of examples and target predictions, or in other words, a dataset, that is used to train the base-model. The meta-training set consists of multiple objective functions and the meta-test set consists of different objective functions drawn from the same class. Objective functions can differ in two ways: they can correspond to different base-models, or different tasks. Therefore, generalization in this context means that the learned optimizer works on different base-models and/or different tasks.</p>
<h3 id="why-is-generalization-important">Why is generalization important?</h3>
<p>Suppose for a moment that we didn’t care about generalization. In this case, we would evaluate the optimizer on the same objective functions that are used for training the optimizer. If we used only one objective function, then the best optimizer would be one that simply memorizes the optimum: this optimizer always converges to the optimum in one step regardless of initialization. In our context, the objective function corresponds to the loss for training a particular base-model on a particular task, and so this optimizer essentially memorizes the optimal weights of the base-model. Even if we used many objective functions, the learned optimizer could still try to identify the objective function it is operating on and jump to the memorized optimum as soon as it does.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/memorization.png" alt="At training time, the optimizer can memorize the optimum. At test time, it can jump directly to the optimum." />
</p>
<p>Why is this problematic? Memorizing the optima requires finding them in the first place, and so learning an optimizer takes longer than running a traditional optimizer like gradient descent. So, for the purposes of finding the optima of the objective functions at hand, running a traditional optimizer would be faster. Consequently, it would be pointless to learn the optimizer if we didn’t care about generalization.</p>
<p>Therefore, for the learned optimizer to have any practical utility, it must perform well on new objective functions that are different from those used for training.</p>
<h3 id="what-should-be-the-extent-of-generalization">What should be the extent of generalization?</h3>
<p>If we only aim for generalization to <em>similar</em> base-models on <em>similar</em> tasks, then the learned optimizer could memorize parts of the optimal weights that are common across the base-models and tasks, like the weights of the lower layers in neural nets. This would be essentially the same as learning-<em>what</em>-to-learn formulations like transfer learning.</p>
<p>Unlike learning <em>what</em> to learn, the goal of learning <em>how</em> to learn is to learn not what the optimum is, but how to find it. We must therefore aim for a stronger notion of generalization, namely generalization to similar base-models on dissimilar tasks. An optimizer that can generalize to <em>dissimilar</em> tasks cannot just partially memorize the optimal weights, as the optimal weights for dissimilar tasks are likely completely different. For example, even the lower layer weights of neural nets trained on MNIST (a dataset consisting of black-and-white images of handwritten digits) and CIFAR-10 (a dataset consisting of colour images of common objects in natural scenes) likely have nothing in common.</p>
<p>Should we aim for an even stronger form of generalization, that is, generalization to <em>dissimilar</em> base-models on dissimilar tasks? Since these correspond to objective functions that bear no similarity to objective functions used for training the optimizer, this is essentially asking if the learned optimizer should generalize to objective functions that could be arbitrarily different.</p>
<p>It turns out that this is impossible. Given any optimizer, we consider the trajectory followed by the optimizer on a particular objective function. Because the optimizer only relies on information at the previous iterates, we can modify the objective function at the last iterate to make it arbitrarily bad while maintaining the geometry of the objective function at all previous iterates. Then, on this modified objective function, the optimizer would follow the exact same trajectory as before and end up at a point with a bad objective value. Therefore, any optimizer has objective functions that it performs poorly on and no optimizer can generalize to all possible objective functions.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/impossibility.png" alt="Take any optimizer and run it on some objective function. We can always manipulate the objective function by making the objective value at the last iteration arbitrarily high, while maintaining the geometry at all previous iterations. The same optimizer must perform poorly on this new objective function." />
</p>
<p>If no optimizer is universally good, can we still hope to learn optimizers that are useful? The answer is yes: since we are typically interested in optimizing functions from certain special classes in practice, it is possible to learn optimizers that work well on these classes of interest. The objective functions in a class can share regularities in their geometry, e.g.: they might have in common certain geometric properties like convexity, piecewise linearity, Lipschitz continuity or other unnamed properties. In the context of learning-<em>how</em>-to-learn, each class can correspond to a type of base-model. For example, neural nets with ReLU activation units can be one class, as they are all piecewise linear. Note that when learning the optimizer, there is no need to explicitly characterize the form of geometric regularity, as the optimizer can learn to exploit it automatically when trained on objective functions from the class.</p>
<h2 id="how-to-learn-the-optimizer">How to Learn the Optimizer</h2>
<p>The first approach we tried was to treat the problem of learning optimizers as a standard supervised learning problem: we simply differentiate the meta-loss with respect to the parameters of the update formula and learn these parameters using standard gradient-based optimization. (We weren’t the only ones to have thought of this; (<a href="https://arxiv.org/abs/1606.04474">Andrychowicz et al., 2016</a>) also used a similar approach.)</p>
<p>This seemed like a natural approach, but it did not work: despite our best efforts, we could not get any optimizer trained in this manner to generalize to unseen objective functions, even though they were drawn from the same distribution that generated the objective functions used to train the optimizer. On almost all unseen objective functions, the learned optimizer started off reasonably, but quickly diverged after a while. On the other hand, on the training objective functions, it exhibited no such issues and did quite well. Why is this?</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/sl_performance.png" alt="An optimizer trained using supervised learning initially does reasonably well, but diverges in later iterations." />
</p>
<p>It turns out that optimizer learning is not as simple a learning problem as it appears. Standard supervised learning assumes all training examples are independent and identically distributed (i.i.d.); in our setting, the step vector the optimizer takes at any iteration affects the gradients it sees at all subsequent iterations. Furthermore, <em>how</em> the step vector affects the gradient at the subsequent iteration is not known, since this depends on the local geometry of the objective function, which is unknown at meta-test time. Supervised learning cannot operate in this setting, and must assume that the local geometry of an unseen objective function is the same as the local geometry of training objective functions at all iterations.</p>
<p>Consider what happens when an optimizer trained using supervised learning is used on an unseen objective function. It takes a step, and discovers at the next iteration that the gradient is different from what it expected. It then recalls what it did on the training objective functions when it encountered such a gradient, which could have happened in a completely different region of the space, and takes a step accordingly. To its dismay, it finds out that the gradient at the next iteration is even more different from what it expected. This cycle repeats and the error the optimizer makes becomes bigger and bigger over time, leading to rapid divergence.</p>
<p>This phenomenon is known in the literature as the problem of <em>compounding errors</em>. It is known that the total error of a supervised learner scales quadratically in the number of iterations, rather than linearly as would be the case in the i.i.d. setting (<a href="http://proceedings.mlr.press/v9/ross10a.html">Ross and Bagnell, 2010</a>). In essence, an optimizer trained using supervised learning necessarily overfits to the geometry of the training objective functions. One way to solve this problem is to use reinforcement learning.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/rl_performance.png" alt="An optimizer trained using reinforcement learning does not diverge in later iterations." />
</p>
<h2 id="background-on-reinforcement-learning">Background on Reinforcement Learning</h2>
<p>Consider an environment that maintains a state, which evolves in an unknown fashion based on the action that is taken. We have an agent that interacts with this environment, which sequentially selects actions and receives feedback after each action is taken on how good or bad the new state is. The goal of reinforcement learning is to find a way for the agent to pick actions based on the current state that leads to good states on average.</p>
<p>More precisely, a reinforcement learning problem is characterized by the following components:</p>
<ul>
<li>A state space, which is the set of all possible states,</li>
<li>An action space, which is the set of all possible actions,</li>
<li>A cost function, which measures how bad a state is,</li>
<li>A time horizon, which is the number of time steps,</li>
<li>An initial state probability distribution, which specifies how frequently different states occur at the beginning before any action is taken, and</li>
<li>A state transition probability distribution, which specifies how the state changes (probabilistically) after a particular action is taken.</li>
</ul>
<p>While the learning algorithm is aware of what the first five components are, it does not know the last component, i.e.: how states evolve based on actions that are chosen. At training time, the learning algorithm is allowed to interact with the environment. Specifically, at each time step, it can choose an action to take based on the current state. Then, based on the action that is selected and the current state, the environment samples a new state, which is observed by the learning algorithm at the subsequent time step. The sequence of sampled states and actions is known as a trajectory. This sampling procedure induces a distribution over trajectories, which depends on the initial state and transition probability distributions and on the way an action is selected based on the current state, the latter of which is known as a <em>policy</em>. This policy is often modelled as a neural net that takes in the current state as input and outputs the action. The goal of the learning algorithm is to find a policy such that the expected cumulative cost of states over all time steps is minimized, where the expectation is taken with respect to the distribution over trajectories.</p>
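<p>The interaction loop described above can be sketched as follows. Everything here is a toy stand-in of our own: the policy, transition and cost functions are invented for illustration.</p>

```python
import random

def rollout(policy, transition, cost, init_state, horizon):
    # Sample one trajectory: the agent picks an action with its policy and
    # the environment (whose transition distribution the learner never sees
    # directly) samples the next state; costs accumulate over time steps.
    state, total_cost, trajectory = init_state, 0.0, []
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action))
        state = transition(state, action)
        total_cost += cost(state)
    return trajectory, total_cost

# Toy 1-D environment: the state drifts by the action plus noise; the cost
# of a state is its distance from the origin.
random.seed(0)
policy = lambda s: -0.5 * s                        # move halfway back to 0
transition = lambda s, a: s + a + random.gauss(0.0, 0.01)
traj, c = rollout(policy, transition, abs, init_state=4.0, horizon=10)
print(len(traj))
```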
<h2 id="formulation-as-a-reinforcement-learning-problem">Formulation as a Reinforcement Learning Problem</h2>
<p>Recall the <a href="#framework">learning framework</a> we introduced above, where the goal is to find the update formula that minimizes the meta-loss. Intuitively, we think of the agent as an optimization algorithm and the environment as being characterized by the family of objective functions that we’d like to learn an optimizer for. The state consists of the current iterate and some features along the optimization trajectory so far, which could be some statistic of the history of gradients, iterates and objective values. The action is the step vector that is used to update the iterate.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/rl_formulation.png" alt="The state is the iterate and some statistic of the history of gradients, iterates and objective values. The action is the step vector. Under this formulation, a particular policy corresponds to a particular update formula. The cost is the objective value." />
</p>
<p>Under this formulation, the policy is essentially a procedure that computes the action, which is the step vector, from the state, which depends on the current iterate and the history of gradients, iterates and objective values. In other words, a particular policy represents a particular update formula. Hence, learning the policy is equivalent to learning the update formula, and hence the optimization algorithm. The initial state probability distribution is the joint distribution of the initial iterate, gradient and objective value. The state transition probability distribution characterizes what the next state is likely to be given the current state and action. Since the state contains the gradient and objective value, the state transition probability distribution captures how the gradient and objective value are likely to change for any given step vector. In other words, it encodes the likely local geometries of the objective functions of interest. Crucially, the reinforcement learning algorithm does not have direct access to this state transition probability distribution, and therefore the policy it learns avoids overfitting to the geometry of the training objective functions.</p>
<p>We choose a cost function of a state to be the value of the objective function evaluated at the current iterate. Because reinforcement learning minimizes the cumulative cost over all time steps, it essentially minimizes the sum of objective values over all iterations, which is the same as the meta-loss.</p>
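<p>Putting these pieces together, an optimizer-as-policy rollout might look like the sketch below (our own illustrative code, with an arbitrary gradient-history feature): the cumulative cost it returns is exactly the meta-loss.</p>

```python
def optimizer_rollout(policy, objective, grad_fn, x0, horizon=20):
    # State: current iterate plus a statistic of the gradient history.
    # Action: the step vector. Cost of a state: its objective value,
    # so the cumulative cost over the horizon equals the meta-loss.
    x, grad_avg, cumulative_cost = x0, 0.0, 0.0
    for _ in range(horizon):
        g = grad_fn(x)
        grad_avg = 0.9 * grad_avg + 0.1 * g    # feature of the trajectory
        x += policy((x, g, grad_avg))           # apply the step vector
        cumulative_cost += objective(x)
    return cumulative_cost

# A hand-written policy that reproduces plain gradient descent:
gd_policy = lambda state: -0.1 * state[1]
cost = optimizer_rollout(gd_policy, lambda x: x * x, lambda x: 2.0 * x, x0=3.0)
print(cost > 0.0)
```

<p>A reinforcement learning algorithm would instead parameterize the policy as a neural net and adjust its weights to reduce this cumulative cost.</p>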
<h2 id="results">Results</h2>
<p>We trained an optimization algorithm on the problem of training a neural net on MNIST, and tested it on the problems of training different neural nets on the Toronto Faces Dataset (TFD), CIFAR-10 and CIFAR-100. These datasets bear little similarity to each other: MNIST consists of black-and-white images of handwritten digits, TFD consists of grayscale images of human faces, and CIFAR-10/100 consist of colour images of common objects in natural scenes. It is therefore unlikely that a learned optimization algorithm can get away with memorizing, say, the lower layer weights on MNIST and still do well on TFD and CIFAR-10/100.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/results.png" alt="Our algorithm, which is trained on MNIST, consistently outperforms other optimization algorithms on TFD, CIFAR-10 and CIFAR-100." />
</p>
<p>As shown, the optimization algorithm trained using our approach on MNIST (shown in light red) generalizes to TFD, CIFAR-10 and CIFAR-100 and outperforms other optimization algorithms.</p>
<p>To understand the behaviour of optimization algorithms learned using our approach, we trained an optimization algorithm on two-dimensional logistic regression problems and visualized its trajectory in the space of the parameters. It is worth noting that the behaviours of optimization algorithms in low dimensions and high dimensions may be different, and so the visualizations below may not be indicative of the behaviours of optimization algorithms in high dimensions. However, they provide some useful intuitions about the kinds of behaviour that can be learned.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/traj_visualizations.png" alt="Our algorithm is able to recover after overshooting without oscillating and converge quickly when gradients are small." />
</p>
<p>The plots above show the optimization trajectories followed by various algorithms on two different unseen logistic regression problems. Each arrow represents one iteration of an optimization algorithm. As shown, the algorithm learned using our approach (shown in light red) takes much larger steps compared to other algorithms. In the first example, because the learned algorithm takes large steps, it overshoots after two iterations, but does not oscillate and instead takes smaller steps to recover. In the second example, due to vanishing gradients, traditional optimization algorithms take small steps and therefore converge slowly. On the other hand, the learned algorithm takes much larger steps and converges faster.</p>
<h2 id="papers">Papers</h2>
<p>More details can be found in our papers:</p>
<p><strong>Learning to Optimize</strong><br />
Ke Li, Jitendra Malik<br />
<a href="https://arxiv.org/abs/1606.01885" title="Learning to Optimize"><em>arXiv:1606.01885</em></a>, 2016 and <a href="https://openreview.net/forum?id=ry4Vrt5gl" title="Learning to Optimize"><em>International Conference on Learning Representations (ICLR)</em></a>, 2017</p>
<p><strong>Learning to Optimize Neural Nets</strong><br />
Ke Li, Jitendra Malik<br />
<a href="https://arxiv.org/abs/1703.00441" title="Learning to Optimize Neural Nets"><em>arXiv:1703.00441</em></a>, 2017</p>
<p><em>I’d like to thank Jitendra Malik for his valuable feedback.</em></p>
Tue, 12 Sep 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/
http://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/Learning a Multi-View Stereo Machine<p>Consider looking at a photograph of a chair.
We humans have the remarkable capacity to infer properties of the chair’s 3D shape from this single photograph, even if we have never seen such a chair before.
A more representative example of our experience though is being in the same physical space as the chair and accumulating information from various viewpoints around it to build up our hypothesis of the chair’s 3D shape.
How do we solve this complex 2D to 3D inference task? What kind of cues do we use?<br />
How do we seamlessly integrate information from just a few views to build up a holistic 3D model of the scene?</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/problem_fig.png" alt="Problem Statement" /></p>
<p>A vast body of work in computer vision has been devoted to developing algorithms which leverage various cues from images that enable this task of 3D reconstruction.
They range from monocular <a href="http://www.eruptingmind.com/depth-perception-cues-other-forms-of-perception/">cues</a> such as shading, linear perspective, size constancy etc. to binocular and even multi-view stereopsis.
The dominant paradigm for integrating multiple views has been to leverage stereopsis, i.e. if a point in the 3D world is viewed from multiple viewpoints, its location in 3D can be determined by triangulating its projections in the respective views.
This family of algorithms has led to work on Structure from Motion (SfM) and Multi-view Stereo (MVS) and has been used to produce <a href="https://grail.cs.washington.edu/rome/">city-scale</a> <a href="http://www.di.ens.fr/pmvs/">3D models</a> and enable rich visual experiences such as <a href="http://mashable.com/2017/06/28/apple-maps-flyover/">3D flyover</a> <a href="https://vr.google.com/earth/">maps</a>.
With the advent of deep neural networks and their immense power in modelling visual data, the focus has recently shifted to modelling monocular cues implicitly with a CNN and predicting 3D from a single image as <a href="http://www.cs.nyu.edu/~deigen/dnl/">depth</a>/<a href="http://www.cs.cmu.edu/~xiaolonw/deep3d.html">surface orientation</a> maps or 3D <a href="http://3d-r2n2.stanford.edu/">voxel</a> <a href="https://rohitgirdhar.github.io/GenerativePredictableVoxels/">grids</a>.</p>
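<p>The triangulation principle behind stereopsis is worth making concrete. Below is a minimal linear (DLT) triangulation sketch in NumPy; the camera matrices and the 3D point are toy values we chose for illustration, not from any paper discussed here.</p>

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # Each view contributes two rows of a homogeneous system A X = 0;
    # the 3D point is the (de-homogenized) null vector of A.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two toy cameras separated along the x-axis, observing the point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.0, 0.0, 5.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true))
```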
<p>In our <a href="https://arxiv.org/abs/1708.05375">recent work</a>, we tried to unify these paradigms of single and multi-view 3D reconstruction.
We proposed a novel system called a Learnt Stereo Machine (LSM) that can leverage monocular/semantic cues for single-view 3D reconstruction while also being able to integrate information from multiple viewpoints using stereopsis - all within a single end-to-end learnt deep neural network.</p>
<!--more-->
<h2 id="learnt-stereo-machines">Learnt Stereo Machines</h2>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/Network.png" alt="Learnt Stereo Machine" />
LSMs are designed to solve the task of multi-view stereo. Given a set of images with <em>known camera poses</em>, they produce a 3D model for the underlying scene - specifically either a voxel occupancy grid or a dense point cloud of the scene in the form of a pixel-wise depth map per input view. While designing LSMs, we drew inspiration from classic works on MVS. These methods first <em>extract features</em> from the images for finding correspondences between them. By comparing the features between images, a matching cost volume is formed. These (typically noisy) matching costs are then <em>filtered/regularized</em> by aggregating information across multiple scales and incorporating priors on shape such as local smoothness, piecewise planarity etc. The final filtered cost volume is then decoded into the desired shape representation such as a 3D volume/surface/disparity maps.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/proj_gif.gif" style="width:45%; margin-left:4%; border-right:solid; border-width:1px; border-color:rgba(0,0,0,0.42);" />
<img src="http://bair.berkeley.edu/blog/assets/unified-3d/unproj_gif.gif" style="width:45%; margin-right:4%" /></p>
<p>The key ingredients here are a differentiable feature <strong>projection</strong> and <strong>unprojection</strong> modules which allow LSMs to move between 2D image and 3D world spaces in a geometrically consistent manner. The unprojection operation places features from a 2D image (extracted by a feedforward CNN) into a 3D world grid such that features from multiple such images align in the 3D grid according to epipolar constraints. This simplifies feature matching as now a search along an epipolar line to compute matching costs reduces to just looking up all features which map to a given location in the 3D world grid. This feature matching is modeled using a 3D recurrent unit which performs sequential matching of the unprojected grids while maintaining a running estimate of the matching scores. Once we filter the local matching cost volume using a 3D CNN, we either decode it directly into a 3D voxel occupancy grid for the voxel prediction task or project it back into 2D image space using a differentiable projection operation. The projection operation can be thought of as the inverse of the unprojection operation where we take a 3D feature grid and sample features along viewing rays at equal depth intervals to place them in a 2D feature map. These projected feature maps are then decoded into per view depth maps by a series of convolution operations. As every step in our network is completely differentiable, we can train the system end-to-end with depth maps or voxel grids as supervision!</p>
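<p>To make the unprojection operation concrete, here is a heavily simplified NumPy sketch of our own devising: it uses nearest-neighbour sampling and a single lookup per voxel centre, whereas the actual LSM performs differentiable (bilinear) sampling inside the network.</p>

```python
import numpy as np

def unproject_features(feat_map, P, grid_pts):
    # Lift 2D image features into a 3D world grid: project each voxel
    # centre into the image with camera matrix P and copy the feature at
    # the nearest pixel. Features unprojected from several views then
    # align in the grid consistently with the epipolar geometry.
    H, W, _ = feat_map.shape
    homog = np.hstack([grid_pts, np.ones((len(grid_pts), 1))])
    proj = homog @ P.T                     # (N, 3) homogeneous pixel coords
    px = proj[:, :2] / proj[:, 2:3]
    u = np.clip(np.round(px[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(px[:, 1]).astype(int), 0, H - 1)
    return feat_map[v, u]

# A 4x4 map of 8-dim features, a toy camera, and two voxel centres.
feats = np.random.rand(4, 4, 8)
P = np.hstack([np.diag([2.0, 2.0, 1.0]), np.zeros((3, 1))])
pts = np.array([[0.5, 0.5, 1.0], [1.0, 1.0, 2.0]])
out = unproject_features(feats, P, pts)
print(out.shape)
```

<p>Note how the two voxel centres lie on the same viewing ray and therefore receive the same feature; disambiguating between them is exactly what matching across multiple views accomplishes.</p>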
<p>As LSMs can predict 3D from a variable number of images (even just a single image), they can choose to either rely heavily on multi-view stereopsis cues or single-view semantic cues depending on the instance and number of views at hand. LSMs can produce both coarse full 3D voxel grids as well as dense depth maps thus unifying the two major paradigms in 3D prediction using deep neural networks.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/voxel_results.png" alt="voxel" /></p>
<p>In our report, we showed drastic improvements on voxel based multi-view 3D object reconstruction when compared to the <a href="http://3d-r2n2.stanford.edu/">previous state-of-the-art</a> which integrates multiple views using a recurrent neural network. We also demonstrated out-of-category generalization, i.e. LSMs can reconstruct cars even if they are only trained on images of aeroplanes and chairs. This is only possible due to our geometric treatment of the task.
We also show dense reconstructions from a few views - much fewer than what is required by classical MVS systems.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/depth_results.png" alt="voxel" /></p>
<h2 id="whats-next">What’s Next?</h2>
<p>LSMs are a step towards unifying a number of paradigms in 3D reconstruction - single and multi-view, semantic and geometric reconstruction, coarse and dense predictions. A joint treatment of these problems helps us learn models that are more robust and accurate while also being simpler to deploy than pipelined solutions.</p>
<p>These are exciting times in 3D computer vision. Predicting <a href="http://bair.berkeley.edu/blog/2017/08/23/high-quality-3d-obj-reconstruction/">high resolution geometry</a> with deep networks is now possible. We can even train for 3D prediction <a href="http://bair.berkeley.edu/blog/2017/07/11/confluence-of-geometry-and-learning/">without explicit 3D</a> supervision. We can’t wait to use these techniques/ideas within LSMs. It remains to be seen how lifting images from 2D to 3D and reasoning about them in metric world space would help other downstream tasks such as navigation and grasping but it sure will be an interesting journey! We will release the code for LSMs soon for easy experimentation and reproducibility. Feel free to use it and leave comments!</p>
<hr />
<p>We would like to thank Saurabh Gupta, Shubham Tulsiani and David Fouhey.</p>
<p><strong>This blog post is based on the following report</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/1708.05375"><em>Learning a Multi-view Stereo Machine</em></a><br />
<a href="https://people.eecs.berkeley.edu/~akar/">Abhishek Kar</a>, <a href="https://people.eecs.berkeley.edu/~chaene/">Christian Häne</a>, <a href="https://people.eecs.berkeley.edu/~malik/">Jitendra Malik</a>, NIPS, 2017</li>
</ul>
Tue, 05 Sep 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/09/05/unified-3d/

How to Escape Saddle Points Efficiently

<p><em>This post was initially published on <a href="http://www.offconvex.org/2017/07/19/saddle-efficiency/">Off the Convex Path</a>. It is reposted here with authors’ permission.</em></p>
<p>A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s</a> and <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s</a> blog posts), the critical open problem is one of <strong>efficiency</strong> — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe <a href="https://arxiv.org/abs/1703.00887">our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade</a>, that provides the first provable <em>positive</em> answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!</p>
<!--more-->
<h2 id="perturbing-gradient-descent">Perturbing Gradient Descent</h2>
<p>We are in the realm of classical gradient descent (GD) — given a function $f:\mathbb{R}^d \to \mathbb{R}$ we aim to minimize the function by moving in the direction of the negative gradient:</p>
<script type="math/tex; mode=display">x_{t+1} = x_t - \eta \nabla f(x_t),</script>
<p>where $x_t$ are the iterates and $\eta$ is the step size. GD is well understood theoretically in the case of convex optimization, but the general case of nonconvex optimization has been far less studied. We know that GD converges quickly to the neighborhood of stationary points (points where $\nabla f(x) = 0$) in the nonconvex setting, but these stationary points may be local minima or, unhelpfully, local maxima or saddle points.</p>
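<p>As a concrete reference point, the update rule above can be written in a few lines of Python (a minimal sketch with illustrative names, not code from any particular paper):</p>

```python
def gradient_descent(grad_f, x0, eta=0.1, n_steps=200):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad f(x_t)."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad_f(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Example: f(x) = sum(x_i^2) has gradient 2x and a unique minimum at the origin.
x_min = gradient_descent(lambda x: [2.0 * xi for xi in x], [3.0, -4.0])
```

<p>On this convex example the iterates contract toward the origin; the difficulties discussed below arise only once saddle points enter the picture.</p>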
<p>Clearly GD will never move away from a stationary point if started there (even a local maximum); thus, to provide general guarantees, it is necessary to modify GD slightly to incorporate some degree of randomness. Two simple methods have been studied in the literature:</p>
<ol>
<li>
<p><strong>Intermittent Perturbations</strong>: <a href="http://arxiv.org/abs/1503.02101">Ge, Huang, Jin and Yuan 2015</a> considered adding occasional random perturbations to GD, and were able to provide the first <em>polynomial time</em> guarantee for GD to escape saddle points. (See also <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s post</a> )</p>
</li>
<li>
<p><strong>Random Initialization</strong>: <a href="http://arxiv.org/abs/1602.04915">Lee et al. 2016</a> showed that with only random initialization, GD provably avoids saddle points asymptotically (i.e., as the number of steps goes to infinity). (see also <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s post</a>)</p>
</li>
</ol>
<p>Asymptotic — and even polynomial-time — results are important for the general theory, but they stop short of explaining the success of gradient-based algorithms in practical nonconvex problems. And they fail to provide reassurance that runs of GD can be trusted — that we won’t find ourselves in a situation in which the learning curve flattens out for an indefinite amount of time, with the user having no way of knowing that the asymptotics have not yet kicked in. Lastly, they fail to provide reassurance that GD has the kind of favorable properties in high dimensions that it is known to have for convex problems.</p>
<p>One reasonable approach to this issue is to consider second-order (Hessian-based) algorithms. Although these algorithms are generally (far) more expensive per iteration than GD, and can be more complicated to implement, they do provide the kind of geometric information around saddle points that allows for efficient escape. Accordingly, a reasonable understanding of Hessian-based algorithms has emerged in the literature, and positive efficiency results have been obtained.</p>
<p><strong><em>Is GD also efficient? Or is the Hessian necessary for fast escape of saddle points?</em></strong></p>
<p>A negative answer emerges to the first question if one considers the random initialization strategy discussed above. Indeed, this approach is provably <em>inefficient</em> in general, taking exponential time to escape saddle points in the worst case (see “On the Necessity of Adding Perturbations” section).</p>
<p>Somewhat surprisingly, it turns out that we obtain a rather different — and <em>positive</em> — result if we consider the perturbation strategy. To be able to state this result, let us be clear on the algorithm that we analyze:</p>
<blockquote>
<p><strong>Perturbed gradient descent (PGD)</strong></p>
<ol>
<li><strong>for</strong> $~t = 1, 2, \ldots ~$ <strong>do</strong></li>
<li>$\quad\quad x_{t} \leftarrow x_{t-1} - \eta \nabla f (x_{t-1})$</li>
<li>$\quad\quad$ <strong>if</strong> $~$<em>perturbation condition holds</em>$~$ <strong>then</strong></li>
<li>$\quad\quad\quad\quad x_t \leftarrow x_t + \xi_t$</li>
</ol>
</blockquote>
<p>Here the perturbation $\xi_t$ is sampled uniformly from a ball centered at zero with a suitably small radius, and is added to the iterate when the gradient is suitably small. These particular choices are made for analytic convenience; we do not believe that uniform noise is necessary, nor do we believe it essential that noise be added only when the gradient is small.</p>
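<p>A minimal Python sketch of PGD follows the pseudocode above, using a small gradient norm as the perturbation condition (the hyperparameters and the toy objective are illustrative choices, not the paper's):</p>

```python
import math
import random

def sample_ball(d, r):
    """Uniform sample from a d-dimensional ball of radius r (rejection sampling)."""
    while True:
        z = [random.uniform(-r, r) for _ in range(d)]
        if sum(zi * zi for zi in z) <= r * r:
            return z

def perturbed_gd(grad_f, x0, eta=0.1, n_steps=2000, g_thresh=1e-3, radius=0.01):
    """Sketch of PGD: a GD step each iteration, plus a small uniform
    perturbation whenever the gradient norm falls below a threshold."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad_f(x)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        if math.sqrt(sum(gi * gi for gi in g)) <= g_thresh:
            x = [xi + zi for xi, zi in zip(x, sample_ball(len(x), radius))]
    return x

# f(x, y) = x^4/4 - x^2/2 + y^2/2 has a strict saddle at the origin and minima
# at (+1, 0) and (-1, 0). Plain GD started exactly at the saddle never moves;
# PGD escapes to (a small neighborhood of) one of the minima.
random.seed(0)
x_final = perturbed_gd(lambda v: [v[0] ** 3 - v[0], v[1]], [0.0, 0.0])
```

<p>Because the final iterate may have just received a perturbation, it lands within roughly the perturbation radius of a minimum rather than exactly on it.</p>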
<h2 id="strict-saddle-and-second-order-stationary-points">Strict-Saddle and Second-order Stationary Points</h2>
<p>We define <em>saddle points</em> in this post to include both classical saddle points as well as local maxima. They are stationary points which are locally maximized along <em>at least one direction</em>. Saddle points and local minima can be categorized according to the minimum eigenvalue of Hessian:</p>
<script type="math/tex; mode=display">% <![CDATA[
\lambda_{\min}(\nabla^2 f(x)) \begin{cases}
> 0 \quad\quad \text{local minimum} \\
= 0 \quad\quad \text{local minimum or saddle point} \\
< 0 \quad\quad \text{saddle point}
\end{cases} %]]></script>
<p>We further call the saddle points in the last category, where $\lambda_{\min}(\nabla^2 f(x)) < 0$, <strong>strict saddle points</strong>.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/saddle_eff/strictsaddle.png" class="stretch-center" /></p>
<p>While non-strict saddle points can be flat in the valley, strict saddle points require that there is <em>at least one direction</em> along which the curvature is strictly negative. The presence of such a direction gives a gradient-based algorithm the possibility of escaping the saddle point. In general, distinguishing local minima and non-strict saddle points is <em>NP-hard</em>; therefore, we — and previous authors — focus on escaping <em>strict</em> saddle points.</p>
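<p>For a two-dimensional function, the classification above can be checked directly from the smallest Hessian eigenvalue (a toy sketch; the closed-form eigenvalue formula used here is specific to symmetric 2×2 matrices):</p>

```python
import math

def min_eig_2x2(h):
    """Smallest eigenvalue of a symmetric 2x2 Hessian [[a, b], [b, c]]."""
    (a, b), (_, c) = h
    mean = (a + c) / 2.0
    disc = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - disc

def classify(h, tol=1e-9):
    """Classify a stationary point by the sign of lambda_min(Hessian)."""
    lam = min_eig_2x2(h)
    if lam > tol:
        return "local minimum"
    if lam < -tol:
        return "strict saddle point"
    return "local minimum or non-strict saddle"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin: a strict saddle.
label = classify([[2.0, 0.0], [0.0, -2.0]])  # -> "strict saddle point"
```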
<p>Formally, we make the following two standard assumptions regarding smoothness.</p>
<blockquote>
<p><strong>Assumption 1</strong>: $f$ is $\ell$-gradient-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2, |\nabla f(x_1) - \nabla f(x_2)| \le \ell |x_1 - x_2|$. <br />
$~$<br />
<strong>Assumption 2</strong>: $f$ is $\rho$-Hessian-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2$, $|\nabla^2 f(x_1) - \nabla^2 f(x_2)| \le \rho |x_1 - x_2|$.</p>
</blockquote>
<p>Classical theory studies convergence to a first-order stationary point, $\nabla f(x) = 0$, by bounding the number of iterations needed to find an <strong>$\epsilon$-first-order stationary point</strong>, $|\nabla f(x)| \le \epsilon$. Analogously, we formulate the speed of escape from strict saddle points, and the ensuing convergence to a second-order stationary point, $\nabla f(x) = 0, \lambda_{\min}(\nabla^2 f(x)) \ge 0$, with an $\epsilon$-version of the definition:</p>
<blockquote>
<p><strong>Definition</strong>: A point $x$ is an <strong>$\epsilon$-second-order stationary point</strong> if:<br />
$\quad\quad\quad\quad |\nabla f(x)|\le \epsilon$, and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$.</p>
</blockquote>
<p>In this definition, $\rho$ is the Hessian Lipschitz constant introduced above. This scaling follows the convention of <a href="http://rd.springer.com/article/10.1007%2Fs10107-006-0706-8">Nesterov and Polyak 2006</a>.</p>
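<p>The definition translates directly into a two-part check (the function name and arguments here are illustrative):</p>

```python
import math

def is_eps_second_order_stationary(grad_norm, min_hess_eig, rho, eps):
    """epsilon-second-order stationarity (Nesterov-Polyak scaling):
    ||grad f(x)|| <= eps  and  lambda_min(Hessian) >= -sqrt(rho * eps)."""
    return grad_norm <= eps and min_hess_eig >= -math.sqrt(rho * eps)

# With rho = 1 and eps = 0.01, the curvature threshold is -sqrt(0.01) = -0.1:
ok = is_eps_second_order_stationary(0.005, -0.05, 1.0, 0.01)   # -> True
bad = is_eps_second_order_stationary(0.005, -0.2, 1.0, 0.01)   # -> False
```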
<h3 id="applications">Applications</h3>
<p>In a wide range of practical nonconvex problems it has been proved that <strong>all saddle points are strict</strong> — such problems include, but are not limited to, principal components analysis, canonical correlation analysis,
<a href="http://arxiv.org/abs/1503.02101">orthogonal tensor decomposition</a>,
<a href="http://arxiv.org/abs/1602.06664">phase retrieval</a>,
<a href="http://arxiv.org/abs/1504.06785">dictionary learning</a>,
<!-- matrix factorization, -->
<a href="http://arxiv.org/abs/1605.07221">matrix sensing</a>,
<a href="http://arxiv.org/abs/1605.07272">matrix completion</a>,
and <a href="http://arxiv.org/abs/1704.00708">other nonconvex low-rank problems</a>.</p>
<p>Furthermore, in all of these nonconvex problems, it also turns out that <strong>all local minima are global minima</strong>. Thus, in these cases, any general efficient algorithm for finding $\epsilon$-second-order stationary points immediately becomes an efficient algorithm for solving those nonconvex problems with global guarantees.</p>
<h2 id="escaping-saddle-point-with-negligible-overhead">Escaping Saddle Point with Negligible Overhead</h2>
<p>In the classical case of first-order stationary points, GD is known to have very favorable theoretical properties:</p>
<blockquote>
<p><strong>Theorem (<a href="http://rd.springer.com/book/10.1007%2F978-1-4419-8853-9">Nesterov 1998</a>)</strong>: If Assumption 1 holds, then GD, with $\eta = 1/\ell$, finds an $\epsilon$-<strong>first</strong>-order stationary point in $2\ell (f(x_0) - f^\star)/\epsilon^2$ iterations.</p>
</blockquote>
<p>In this theorem, $x_0$ is the initial point and $f^\star$ is the function value of the global minimum. The theorem says that for any gradient-Lipschitz function, a stationary point can be found by GD in $O(1/\epsilon^2)$ steps, with no explicit dependence on $d$. This is called “dimension-free optimization” in the literature; of course the cost of a gradient computation is $O(d)$, and thus the overall runtime of GD scales as $O(d)$. The linear scaling in $d$ is especially important for modern high-dimensional nonconvex problems such as deep learning.</p>
<p>We now wish to address the corresponding problem for second-order stationary points.
What is the best we can hope for? Can we also achieve</p>
<ol>
<li>A dimension-free number of iterations;</li>
<li>An $O(1/\epsilon^2)$ convergence rate;</li>
<li>The same dependence on $\ell$ and $(f(x_0) - f^\star)$ as in (Nesterov 1998)?</li>
</ol>
<p>Rather surprisingly, the answer is <em>Yes</em> to all three questions (up to small log factors).</p>
<blockquote>
<p><strong>Main Theorem</strong>: If Assumptions 1 and 2 hold, then PGD, with $\eta = O(1/\ell)$, finds an $\epsilon$-<strong>second</strong>-order stationary point in $\tilde{O}(\ell (f(x_0) - f^\star)/\epsilon^2)$ iterations with high probability.</p>
</blockquote>
<p>Here $\tilde{O}(\cdot)$ hides only logarithmic factors; indeed, the dimension dependence in our result is only $\log^4(d)$. The theorem thus asserts that a perturbed form of GD, under an additional Hessian-Lipschitz condition, <strong><em>converges to a second-order-stationary point in almost the same time required for GD to converge to a first-order-stationary point.</em></strong> In this sense, we claim that PGD can escape strict saddle points almost for free.</p>
<p>We turn to a discussion of some of the intuitions underlying these results.</p>
<h3 id="why-do-polylogd-iterations-suffice">Why do polylog(d) iterations suffice?</h3>
<p>Our strict-saddle assumption means that there is only, in the worst case, one direction in $d$ dimensions along which we can escape. A naive search for the descent direction intuitively should take at least $\text{poly}(d)$ iterations, so why should only $\text{polylog}(d)$ suffice?</p>
<p>Consider a simple case in which we assume that the function is quadratic in the neighborhood of the saddle point. That is, let the objective function be $f(x) = x^\top H x$, with a saddle point at zero and constant Hessian $H = \text{diag}(-1, 1, \cdots, 1)$. In this case, only the first direction is an escape direction (with negative eigenvalue $-1$).</p>
<p>It is straightforward to work out the general form of the iterates in this case:</p>
<script type="math/tex; mode=display">x_t = x_{t-1} - \eta \nabla f(x_{t-1}) = (I - \eta H)x_{t-1} = (I - \eta H)^t x_0.</script>
<p>Assume that we start at the saddle point at zero, then add a perturbation so that $x_0$ is sampled uniformly from a ball $\mathcal{B}_0(1)$ centered at zero with radius one.
The decrease in the function value can be expressed as:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = x_t^\top H x_t = x_0^\top (I - \eta H)^t H (I - \eta H)^t x_0.</script>
<p>Set the step size to be $1/2$, let $\lambda_i$ denote the $i$-th eigenvalue of the Hessian $H$ and let $\alpha_i = e_i^\top x_0$ denote the component in the $i$th direction of the initial point $x_0$. We have $\sum_{i=1}^d \alpha_i^2 = | x_0|^2 = 1$, thus:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = \sum_{i=1}^d \lambda_i (1-\eta\lambda_i)^{2t} \alpha_i^2 \le -1.5^{2t} \alpha_1^2 + 0.5^{2t}.</script>
<p>A simple probability argument shows that sampling uniformly in $\mathcal{B}_0(1)$ will, with high probability, yield a squared component of at least $\Omega(1/d)$ in the first direction; that is, $\alpha^2_1 = \Omega(1/d)$. Substituting $\alpha_1$ into the above equation, we see that it takes at most $O(\log d)$ steps for the function value to decrease by a constant amount.</p>
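<p>This calculation is easy to verify numerically. The sketch below runs GD on the quadratic example, sampling the initial perturbation on the unit sphere rather than uniformly in the ball for simplicity (names, the step size, and the escape criterion are illustrative):</p>

```python
import math
import random

def escape_steps(d, eta=0.5, seed=0, max_steps=10000):
    """GD on f(x) = x^T H x with H = diag(-1, 1, ..., 1): start from a random
    point on the unit sphere around the saddle at zero, and count the steps
    until the function value has dropped by a constant (here, below -1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(xi * xi for xi in x))
    x = [xi / norm for xi in x]
    lam = [-1.0] + [1.0] * (d - 1)
    for t in range(1, max_steps + 1):
        x = [(1.0 - eta * li) * xi for li, xi in zip(lam, x)]  # x <- (I - eta*H) x
        if sum(li * xi * xi for li, xi in zip(lam, x)) < -1.0:
            return t
    return None

# The escape time grows roughly like log d, not like poly(d):
steps_low = escape_steps(10)
steps_high = escape_steps(1000)
```

<p>Even in $d = 1000$ dimensions, escape takes only a handful of steps, consistent with the $O(\log d)$ analysis above.</p>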
<h3 id="pancake-shape-stuck-region-for-general-hessian">Pancake-shape stuck region for general Hessian</h3>
<p>We can conclude that for the case of a constant Hessian, only when the perturbation $x_0$ lands in the set $\{x | ~ |e_1^\top x|^2 \le O(1/d)\}$ $\cap \mathcal{B}_0 (1)$, can we take a very long time to escape the saddle point. We call this set the <strong>stuck region</strong>; in this case it is a flat disk. In general, when the Hessian is no longer constant, the stuck region becomes a non-flat pancake, depicted as a green object in the left graph. In general this region will not have an analytic expression.</p>
<p>Earlier attempts to analyze the dynamics around saddle points tried to approximate the stuck region by a flat set. This results in a requirement of an extremely small step size and a correspondingly very large runtime complexity. Our sharp rate depends on a key observation — <em>although we don’t know the shape of the stuck region, we know it is very thin</em>.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/saddle_eff/flow.png" class="stretch-center" /></p>
<p>In order to characterize the “thinness” of this pancake, we studied pairs of hypothetical perturbation points $w, u$ separated by $O(1/\sqrt{d})$ along an escaping direction. We claim that if we run GD starting at $w$ and $u$, at least one of the resulting trajectories will escape the saddle point very quickly. This implies that the thickness of the stuck region can be at most $O(1/\sqrt{d})$, so a random perturbation has very little chance to land in the stuck region.</p>
<h2 id="on-the-necessity-of-adding-perturbations">On the Necessity of Adding Perturbations</h2>
<p>We have discussed two possible ways to modify the standard gradient descent algorithm, the first by adding intermittent perturbations, and the second by relying on random initialization. Although the latter exhibits asymptotic convergence, it does not yield efficient convergence in general; in recent <a href="http://arxiv.org/abs/1705.10412">joint work with Simon Du, Jason Lee, Barnabas Poczos, and Aarti Singh</a>, we have shown that even with fairly natural random initialization schemes and non-pathological functions, <strong>GD with only random initialization can be significantly slowed by saddle points, taking exponential time to escape. The behavior of PGD is strikingly different — it can generically escape saddle points in polynomial time.</strong></p>
<p>To establish this result, we considered random initializations from a very general class including Gaussians and uniform distributions over the hypercube, and we constructed a smooth objective function that satisfies both Assumptions 1 and 2. This function is constructed such that, even with random initialization, with high probability both GD and PGD have to travel sequentially in the vicinity of $d$ strict saddle points before reaching a local minimum. All strict saddle points have only one direction of escape. (See the left graph for the case of $d=2$).</p>
<p><img src="http://bair.berkeley.edu/blog/assets/saddle_eff/necesperturbation.png" class="stretch-center" /></p>
<p>When GD travels in the vicinity of a sequence of saddle points, it can get closer and closer to the later saddle points, and thereby take longer and longer to escape. Indeed, the time to escape the $i$th saddle point scales as $e^{i}$. On the other hand, PGD is always able to escape any saddle point in a small number of steps independent of the history. This phenomenon is confirmed by our experiments; see, for example, an experiment with $d=10$ in the right graph.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we have shown that a perturbed form of gradient descent can converge to a second-order-stationary point at almost the same rate as standard gradient descent converges to a first-order-stationary point. This implies that Hessian information is not necessary to escape saddle points efficiently, and helps to explain why basic gradient-based algorithms such as GD (and SGD) work surprisingly well in the nonconvex setting. This new line of sharp convergence results can be directly applied to nonconvex problems such as matrix sensing/completion to establish efficient global convergence rates.</p>
<p>There are of course still many open problems in general nonconvex optimization. To name a few: will adding momentum improve the convergence rate to a second-order stationary point? What type of local minima are tractable and are there useful structural assumptions that we can impose on local minima so as to avoid local minima efficiently? We are making slow but steady progress on nonconvex optimization, and there is the hope that at some point we will transition from “black art” to “science”.</p>
Thu, 31 Aug 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/

High Quality 3D Object Reconstruction from a Single Color Image

<p>Digitally reconstructing 3D geometry from images is a core problem in computer vision. There are various applications, such as movie productions, content generation for video games, virtual and augmented reality, 3D printing and many more. The task discussed in this blog post is reconstructing high quality 3D geometry from a single color image of an object as shown in the figure below.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/hsp/image_0.png" width="600" />
</p>
<p>Humans have the ability to effortlessly reason about the shapes of objects and scenes even if we only see a single image. Note that the binocular arrangement of our eyes allows us to perceive depth, but it is not required to understand 3D geometry. Even if we only see a photograph of an object we have a good understanding of its shape. Moreover, we are also able to reason about the unseen parts of objects such as the back, which is an important ability for grasping objects. The question which immediately arises is how are humans able to reason about geometry from a single image? And in terms of artificial intelligence: how can we teach machines this ability?</p>
<!--more-->
<h1 id="shape-spaces">Shape Spaces</h1>
<p>The basic principle used to reconstruct geometry from ambiguous input is the fact that shapes are not arbitrary: some shapes are likely, and some very unlikely. In general, surfaces tend to be smooth; in man-made environments they are often piece-wise planar. For objects, high-level rules apply. For example, airplanes very commonly have a fuselage, two main wings (one attached on each side), and a vertical stabilizer at the back. Humans are able to acquire this knowledge by observing the world with their eyes and interacting with it using their hands. In computer vision, the fact that shapes are not arbitrary allows us to describe all possible shapes of one or more object classes as a low dimensional shape space, which is acquired from large collections of example shapes.</p>
<h2 id="voxel-prediction-using-cnns">Voxel Prediction Using CNNs</h2>
<p>One of the most recent lines of work for 3D reconstruction [<a href="https://arxiv.org/abs/1604.00449">Choy et al. ECCV 2016</a>, <a href="https://arxiv.org/abs/1603.08637">Girdhar et al. ECCV 2016</a>] utilizes convolutional neural networks (CNNs) to predict the shape of objects as a 3D occupancy volume. The 3D output volume is subdivided into volume elements, called voxels, and each voxel is assigned to either occupied or free space, i.e. the interior or exterior of the object, respectively. The input is commonly given as a single color image which depicts the object, and the CNN predicts an occupancy volume using an up-convolutional decoder architecture. The network is trained end-to-end and supervised with known ground truth occupancy volumes which are acquired from synthetic CAD model datasets. Using this 3D representation and CNNs, models which are able to fit a variety of object classes can be learned.</p>
<h1 id="hierarchical-surface-prediction">Hierarchical Surface Prediction</h1>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_1.png" class="stretch-center" /></p>
<p>The main shortcoming of predicting occupancy volumes using a CNN is that the output space is three dimensional and hence grows cubically with resolution. This prevents the works mentioned above from predicting high quality geometry and restricts them to coarse resolution voxel grids, e.g. 32<sup>3</sup> (c.f. figure above). In our work we argue that this is an unnecessary restriction, given that surfaces are actually only two dimensional. We exploit the two dimensional nature of surfaces by hierarchically predicting fine resolution voxels only where a surface is expected, judging from the low resolution prediction. The basic idea is closely related to octree representations, which are often used in multi-view stereo and depth map fusion to represent high resolution geometry.</p>
<h2 id="method">Method</h2>
<p>The basic 3D prediction pipeline takes a color image as input, which is first encoded into a low dimensional representation using a convolutional encoder. This low dimensional representation is then decoded into a 3D occupancy volume. The main idea of our method, called hierarchical surface prediction (HSP), is to start decoding by predicting low resolution voxels. However, in contrast to the standard approach, where each voxel would be classified as either free or occupied space, we use three classes: free space, occupied space, and boundary. This allows us to analyze the outputs at low resolution and predict at higher resolution only those parts of the volume where there is evidence that they contain the surface. By iterating the refinement procedure we hierarchically predict high resolution voxel grids (see figure below). For more details about the method we refer the reader to our tech report [<a href="https://arxiv.org/abs/1704.00710">Häne et al. arXiv 2017</a>].</p>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_2.png" class="stretch-center" /></p>
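<p>Schematically, one refinement level of this procedure looks like the following toy 2D sketch (with a stand-in function for the learned decoder; this is an illustration of the idea, not the actual implementation):</p>

```python
FREE, OCCUPIED, BOUNDARY = 0, 1, 2

def refine(coarse, predict_children):
    """One refinement level of hierarchical surface prediction, in 2D for
    brevity. FREE and OCCUPIED cells are simply copied down to all of their
    children; only BOUNDARY cells are re-predicted at the finer resolution
    by `predict_children` (a stand-in for the learned decoder)."""
    fine = {}
    for (i, j), label in coarse.items():
        children = [(2 * i, 2 * j), (2 * i + 1, 2 * j),
                    (2 * i, 2 * j + 1), (2 * i + 1, 2 * j + 1)]
        if label == BOUNDARY:
            for child, child_label in zip(children, predict_children((i, j))):
                fine[child] = child_label
        else:
            for child in children:
                fine[child] = label
    return fine

# Toy example: one boundary cell is refined, one free cell is just copied.
coarse = {(0, 0): BOUNDARY, (1, 0): FREE}
fine = refine(coarse, lambda cell: [FREE, OCCUPIED, BOUNDARY, OCCUPIED])
```

<p>Only cells that may contain the surface are re-predicted, which is what keeps the effective output size closer to the (two dimensional) surface area than to the full volume.</p>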
<h2 id="experiments">Experiments</h2>
<p>Our experiments are mainly conducted on the synthetic <a href="https://shapenet.org/">ShapeNet</a> dataset [<a href="https://arxiv.org/abs/1512.03012">Chang et al. arXiv 2015</a>]. The main task we studied is predicting high resolution geometry from a single color image. We compare our method to two baselines which we call low resolution hard (LR hard) and low resolution soft (LR soft). These baselines predict at the same coarse resolution of 32<sup>3</sup> but differ in how the training data is generated. The LR hard baseline uses binary assignments for the voxels. All voxels are labeled as occupied if at least one of the corresponding high resolution voxels is occupied. The LR soft baseline uses fractional assignments reflecting the percentage of occupied voxels in the corresponding high resolution voxels. Our method, HSP predicts at a resolution of 256<sup>3</sup>. The results in the figures below show the benefits in terms of surface quality and completeness of the high resolution prediction compared to the low resolution baselines. Quantitative results and more experiments can be found in our tech report.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_3.png" class="stretch-center" /></p>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_4.png" class="stretch-center" /></p>
<p>I would like to thank Shubham Tulsiani and Jitendra Malik for their valuable feedback.</p>
<p><strong>This blog post is based on the tech report:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/1704.00710"><em>Hierarchical Surface Prediction for 3D Object Reconstruction</em></a><br />
C. Häne, S. Tulsiani, J. Malik, arXiv 2017</li>
</ul>
Wed, 23 Aug 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/08/23/high-quality-3d-obj-reconstruction/