The Berkeley Artificial Intelligence Research Blog (The BAIR Blog)
http://bair.berkeley.edu/blog/
Learning Long Duration Sequential Task Structure From Demonstrations with Application in Surgical Robotics
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/cutting-gif.gif" height="180" style="margin: 10px;" />
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/binpicking-gif.gif" height="180" style="margin: 10px;" />
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/debridement-gif.gif" height="180" style="margin: 10px;" />
<br />
</p>
<p>Deep imitation learning and deep reinforcement learning have the potential to learn
robot control policies that map high-dimensional sensor inputs to controls.
While these approaches have been very successful at learning short duration tasks, such
as grasping (Pinto and Gupta 2016, Levine et al. 2016) and peg insertion (Levine
et al. 2016), scaling learning to longer time horizons can require a prohibitive
amount of demonstration data—whether acquired from experts or self-supervised.
Long-duration sequential tasks suffer from the classic problem of “temporal
credit assignment”, namely, the difficulty in assigning credit (or blame) to
actions under uncertainty of the time when their consequences are observed
(Sutton 1984). However, long-term behaviors are often composed of short-term
skills that solve decoupled subtasks. Consider designing a controller for
parallel parking where the overall task can be decomposed into three phases:
pulling up, reversing, and adjusting. Similarly, assembly tasks can often be
decomposed into individual steps based on which parts need to be manipulated.
These short-term skills can be parametrized more concisely—as an analogy,
consider locally linear approximations to an overall nonlinear function—and
this reduced parametrization can be substantially easier to learn.</p>
<p>This post summarizes results from three recent papers that propose algorithms
that learn to decompose a longer task into shorter subtasks. We report
experiments in the context of autonomous surgical subtasks and we believe the
results apply to a variety of applications from manufacturing to home robotics.
We present three algorithms: Transition State Clustering (TSC), Sequential
Windowed Inverse Reinforcement Learning (SWIRL), and Deep Discovery of
Continuous Options (DDCO). TSC considers robustly learning important switching
events (significant changes in motion) that occur across all demonstrations.
SWIRL proposes an algorithm that approximates a value function by a sequence of
shorter term quadratic rewards. DDCO is a general framework for imitation
learning with a hierarchical representation of the action space. In retrospect,
all three algorithms are special cases of the same general framework, where the
demonstrator’s behavior is generatively modeled as a sequential composition of
unknown closed-loop policies that switch when reaching parameterized “transition
states”.</p>
<!--more-->
<h1 id="application-to-surgical-robotics">Application to Surgical Robotics</h1>
<p>Robots such as Intuitive Surgical’s da Vinci have facilitated millions of
surgical procedures worldwide using local teleoperation. Automation of surgical
sub-tasks has the potential to reduce surgeon tedium and fatigue, shorten
operating time, and enable supervised tele-surgery over higher-latency networks. Designing
surgical robot controllers is particularly difficult due to a limited field of
view and imprecise actuation.</p>
<p>As a concrete task, pattern cutting is one of the Fundamentals of Laparoscopic
Surgery, a training suite required of surgical residents. In this standard
surgical training task, the surgeon must cut and remove a pattern printed on a
sheet of gauze, and is scored on time and accuracy:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/pattern-cutting-task.png" alt="Figure 1: Pattern Cutting Task, from the Fundamentals of Laparoscopic Surgery." /><br />
<i>
Pattern cutting task from the Fundamentals of Laparoscopic Surgery.
</i>
</p>
<p>In (Murali et al. 2015), we manually coded this task using a hand-crafted
Deterministic Finite Automaton (DFA) on the da Vinci surgical robot. The DFA
integrated 10 different manipulation primitives and two computer-vision-based
checks:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/dfa-pattern-cutting.png" alt="Figure 2: DFA for Pattern Cutting." width="600" /><br />
<i>
Deterministic finite automaton from Murali et al. 2015 to automate pattern
cutting.
</i>
</p>
<p>Designing this DFA required painstaking trial-and-error, and perceptual checks
required constant tuning to account for lighting and registration changes. This
motivated us to consider the extent to which we could learn such structure from
demonstration data. This blog post describes our efforts over the last three
years at learning hierarchical representations from demonstrations. This
research has helped us automate several surgical robotic tasks with minimal
expert design of the DFA, as shown in the three GIFs at the top of the post.</p>
<h1 id="learning-transition-conditions">Learning Transition Conditions</h1>
<p>The first paper, Transition State Clustering (Krishnan et al. 2015), explores
the problem of learning transition conditions from demonstrations, i.e.,
conditions that trigger a switch or a transition between manipulation behaviors
in a task. In many important tasks, while the actual motions may vary and be
noisy, each demonstration contains roughly the same sequence of primitive
motions. This consistent, repeated structure can be exploited to infer global
transition criteria by identifying state-space conditions correlated with
significant changes in motion. By assuming a known sequential order of
primitives, the problem reduces to segmenting each trajectory and corresponding
those segments across trajectories. This involves finding a common set of
segment-to-segment transition events.</p>
<p>We formalized this intuition in an algorithm called Transition State Clustering
(TSC). Let <script type="math/tex">D=\{d_i\}</script> be a set of demonstrations of a robotic task. Each
demonstration $d$ is a discrete-time sequence of $T$ state vectors in
a feature-space $\mathcal{X}$. The feature space is a concatenation of kinematic
features $X$ (e.g., robot position) and sensory features $V$. These were
low-dimensional visual features from the environment calculated by hard-coded
image processing and manual annotation.</p>
<p>A segmentation of a task is defined as a function $\mathcal{S}$ that maps each
trajectory to a non-decreasing sequence of integers in $\{1,2,\ldots,k\}$. This
function tells us more than just the endpoints of segments, since it also labels
each segment according to its sub-task. By contrast, a transition indicator
function $\mathcal{T}$ maps each demonstration $d$ to a sequence of
indicators in $\{0,1\}$:</p>
<script type="math/tex; mode=display">\mathcal{T}: d \mapsto ( a_t )_{1,\ldots,|d|}, \quad a_t \in \{0,1\},</script>
<p>such that <script type="math/tex">\mathcal{T}(d)_t</script> indicates whether the demonstration switched from
one sub-task to another after time $t$. For a demonstration $d_i$, let $o_{i,t}$
denote the kinematic state, visual state, and time $(x,v,t)$ at time $t$.
Transition States are the set of state-time tuples where the indicator is 1:</p>
<script type="math/tex; mode=display">\Gamma = \bigcup_{i=1}^N ~\{o_{i,t} \in d_i ~: \mathcal{T}(d_i)_t = 1\}.</script>
<p>In TSC, we model the probability distribution that generates $\Gamma$ as a
Gaussian Mixture Model and identify the mixture components. These components
identify regions of the state space correlated with candidate transitions. We
can take any motion-based model for detecting changes in behavior and generate
candidates. Then, we probabilistically ground these candidate transitions in
state-space and perceptual conditions that are consistent across demonstrations.
Intuitively, this algorithm consists of two steps: first segmentation, and then
clustering the segment end-points.</p>
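<p>As a toy sketch of the candidate-generation step (the finite-difference change detector and threshold below are illustrative stand-ins, not the motion model used in the paper), transitions can be proposed wherever the velocity changes sharply:</p>

```python
import numpy as np

def candidate_transitions(demos, threshold=1.0):
    """Flag time steps where the motion changes sharply.

    demos: list of (T_i, d) arrays of kinematic states.
    Returns a list of (state, t) tuples -- candidate transition states.
    Hypothetical stand-in for a motion-based change-point detector.
    """
    candidates = []
    for traj in demos:
        vel = np.diff(traj, axis=0)             # finite-difference velocities
        accel = np.diff(vel, axis=0)            # change in velocity
        for t, a in enumerate(accel, start=1):
            if np.linalg.norm(a) > threshold:   # significant change in motion
                candidates.append((traj[t], t))
    return candidates

# Toy demonstration: straight motion, then an abrupt turn at t = 5.
traj = np.vstack([np.linspace([0, 0], [5, 0], 6),
                  np.linspace([5, 1], [5, 5], 5)])
cands = candidate_transitions([traj], threshold=0.5)
```

The single candidate this produces (at the turn) is what the clustering step would then ground in state-space and perceptual conditions across demonstrations.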
<p>There are a number of important implementation details to make this model work
in practice on real noisy data. Since the kinematic and visual features often
have very different scales and topological properties, we often have to model
them separately during the clustering step. We hierarchically apply a GMM model
by first performing a hard clustering on the kinematic features, and then within
each cluster fitting the probabilistic model over the perceptual features. This
allows us to prune out clusters that are not representative (i.e., do not have
transitions from all demonstrations). Furthermore, hyper-parameter selection is
a known problem in mixture models. Recent results in Bayesian statistics can
mitigate some of these problems by defining a soft prior of the number of
mixtures. The Dirichlet Process (DP) defines a distribution over the parameters
of discrete distributions, in our case, the probabilities of a categorical
distribution, as well as the size of its support $m$ (Kulis 2011). The
hyper-parameters of the DP can be inferred with variational Expectation
Maximization.</p>
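<p>A minimal sketch of a hard-clustering variant of this idea, in the spirit of the DP-means algorithm of (Kulis 2011): rather than fixing the number of mixture components, a new cluster is opened whenever a transition state lies farther than a penalty $\lambda$ from every existing center. The data and $\lambda$ below are illustrative:</p>

```python
import numpy as np

def dp_means(points, lam, n_iter=20):
    """DP-means hard clustering (Kulis 2011): the cluster count is not
    fixed; a new cluster opens when a point's squared distance to every
    existing center exceeds lam. (Sketch: assumes clusters stay non-empty.)"""
    centers = [points[0]]
    assign = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(points):
            d = [np.linalg.norm(x - c) ** 2 for c in centers]
            if min(d) > lam:                  # open a new cluster at this point
                centers.append(x.copy())
                assign[i] = len(centers) - 1
            else:
                assign[i] = int(np.argmin(d))
        centers = [points[assign == k].mean(axis=0)
                   for k in range(len(centers))]
    return np.array(centers), assign

# Transition states generated by two distinct switching events plus noise.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal([0, 0], 0.1, (20, 2)),
                 rng.normal([5, 5], 0.1, (20, 2))])
centers, assign = dp_means(pts, lam=4.0)
```

On this toy data the algorithm recovers two transition state clusters without being told the number in advance, which is the role the DP prior plays in the full model.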
<p>In the pattern cutting task, TSC found the following transition conditions:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/pattern-cutting-concept.png" height="175" style="margin: 30px;" alt="Figure 3: Conceptual diagram of pattern cutting." />
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/pattern-cutting.png" height="200" alt="Figure 3: Conceptual diagram of pattern cutting." /><br />
<i>
Surgical pattern cutting task. Left: manually identified transitions. Right:
automatically discovered transition states (a) and transition state clusters
(b).
</i>
</p>
<p>We marked 6 manually identified primitive motions from (Murali et al. 2015): (1)
start, (2) notch, (3) finish 1st cut, (4) cross-over, (5) finish 2nd cut, and
(6) connect the two cuts. TSC automatically identifies 7 segments, which
correspond well to our prior work. It is worth noting that there is one extra
cluster (marked 2’), which does not correspond to a transition in the manual
segmentation.</p>
<p>At 2’, the operator finishes a notch and begins to cut. While at a logical
level, notching and cutting are both penetration actions, they correspond to two
different motion regimes due to the positioning of the end-effector. TSC
separates them into different clusters even though the human annotators
overlooked this important transition.</p>
<h1 id="connection-to-inverse-reinforcement-learning">Connection to Inverse Reinforcement Learning</h1>
<p>We next explored how the transitions learned by TSC can be used to shape rewards
in long horizon tasks. Sequential Windowed Inverse Reinforcement Learning
(SWIRL; Krishnan et al. 2016) models a task as a sequence of quadratic reward
functions
<script type="math/tex; mode=display">\mathbf{R}_{seq} = [R_1, \ldots ,R_k ]</script>
<p>and transition regions</p>
<script type="math/tex; mode=display">G = [ \rho_1, \ldots,\rho_k ]</script>
<p>such that $R_1$ is the reward function until $\rho_1$ is reached, after which
$R_2$ becomes the reward and so on.</p>
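<p>As an illustrative sketch (the transition-region centers below are made up, not learned), each local reward can be as simple as a negative quadratic penalty on the distance to the next transition region:</p>

```python
import numpy as np

def make_quadratic_rewards(segment_goals):
    """Build R_seq = [R_1, ..., R_k] with R_i(s) = -||s - mu_i||^2,
    a quadratic reward guiding the state toward transition region i.
    (mu=mu default binds each center eagerly inside the lambda.)"""
    return [lambda s, mu=mu: -float(np.sum((s - mu) ** 2))
            for mu in segment_goals]

# Hypothetical centers of two transition regions rho_1, rho_2.
goals = [np.array([1.0, 0.0]), np.array([1.0, 1.0])]
R_seq = make_quadratic_rewards(goals)
s = np.array([0.0, 0.0])
r1 = R_seq[0](s)   # -1.0: squared distance to the first transition region
r2 = R_seq[1](s)   # -2.0
```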
<p>We assume that we have access to a supervisor that provides demonstrations which
are optimal w.r.t. an unknown reward function $\mathbf{R}^*$ (not necessarily
quadratic), and which reach each $\rho \in G$ (also unknown) in the same order.
SWIRL is an algorithm to recover <script type="math/tex">\mathbf{R}_{seq}</script> and $G$ from the
demonstration trajectories. SWIRL applies to tasks with a discrete or continuous
state-space and a discrete action-space. The state space can represent spatial,
kinematic, or sensory states (e.g., visual features), as long as the
trajectories are smooth and not very high-dimensional. Finally,
$\mathbf{R}_{seq}$ and $G$ can be used in an RL algorithm to find an optimal
policy for a task.</p>
<p>TSC can be interpreted as inferring the subtask transition regions $G$. Once the
transitions are found, SWIRL applies Maximum Entropy Inverse Reinforcement
Learning to find a local quadratic reward function that guides the robot to the
transition condition. Segmentation further simplifies the estimation of dynamics
models, which are required for inference in MaxEnt-IRL, since many complex
systems can be locally approximated as linear over a short time horizon. The goal
of MaxEnt-IRL is to find a reward function such that an optimal policy w.r.t.
that reward function is close to the expert demonstration. The agent is modeled
as noisily optimal, taking actions from a policy $\pi$:
<script type="math/tex; mode=display">\pi(a \mid s, \theta) \propto \exp\{A_\theta(s,a)\}.</script>
<p>$A_\theta$ is the advantage function (gap between the values of action $a$ and
of the optimal action in state $s$) for the reward parametrized by $\theta$.
The objective is to maximize the log-likelihood that the demonstration
trajectories were generated by $\theta$. In MaxEnt-IRL, this objective can be
estimated reliably in two cases, discrete and linear-gaussian systems, since it
requires an efficient forward search of the policy given a particular reward
parametrized by $\theta$. Thus, we assume that our demonstrations can be modeled
either discretely or with linear dynamics.</p>
<p>Learning a policy from $\mathbf{R}_{seq}$ and $G$ is nontrivial because solving
$k$ independent problems neglects any shared structure in the value function
during the policy learning phase (e.g., a common failure state). Jointly
learning over all segments introduces a dependence on history, namely, any
policy must complete step $i$ before step $i+1$. Learning a memory-dependent
policy could lead to an exponential overhead of additional states. SWIRL
exploits the fact that TSC is in a sense a Markov process, and shows that the
problem can be posed as a proper MDP in a lifted state-space that includes an
indicator variable for the highest-index transition region in $\{1,\ldots,k\}$ that
has been reached so far.</p>
<p>SWIRL applies a variant of Q-Learning to optimize the policy over the sequential
rewards. The basic change to the algorithm is to augment the state-space with an
indicator recording which transition regions have been reached.
Each of the rollouts now records a tuple</p>
<script type="math/tex; mode=display">(s, i \in \{0,\ldots,k-1\}, a, r, s', i' \in \{0,\ldots,k-1\})</script>
<p>that additionally stores this information. The Q function is now defined over
states, actions, and the segment index, which also selects the appropriate local
reward function:</p>
<script type="math/tex; mode=display">Q(s,a,i) = R_i(s,a) + \max_{a'} Q(s',a', i')</script>
<p>We also need to define an exploration policy, i.e., a stochastic policy with
which we will collect rollouts. To initialize the Q-Learning, we apply
Behavioral Cloning locally for each of the segments to get a policy $\pi_i$. We
apply an $\epsilon$-greedy version of these policies to collect rollouts.</p>
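<p>A minimal tabular sketch of this lifted-state Q-learning on a toy chain MDP (the environment, rewards, and hyper-parameters here are illustrative, not the surgical task): the state is augmented with the index of the last transition region reached, and that index selects the active local reward:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 5, 2                       # chain states 0..4, two segments
goals = [2, 4]                    # transition regions rho_1, rho_2 (positions)
Q = np.zeros((N, K, 2))           # lifted Q(s, i, a); actions: 0=left, 1=right
alpha, gamma, eps = 0.5, 0.95, 0.2

for episode in range(500):
    s, i = 0, 0                   # start of chain, no region reached yet
    for _ in range(50):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s, i]))
        s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
        i2 = i + 1 if s2 == goals[i] else i   # crossing rho_{i+1} lifts the index
        r = -abs(s2 - goals[i])               # local reward R_i pulls toward rho_{i+1}
        done = i2 == K                        # all transition regions reached
        target = r if done else r + gamma * np.max(Q[s2, i2])
        Q[s, i, a] += alpha * (target - Q[s, i, a])
        s, i = s2, i2
        if done:
            break

policy = np.argmax(Q, axis=2)     # greedy lifted policy
```

On this chain the learned greedy policy moves right in both segments, chaining the two local rewards into one long-horizon behavior.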
<p>We evaluated SWIRL on a deformable sheet tensioning task. A sheet of surgical
gauze is fixtured at the two far corners using a pair of clips. The unclipped
part of the gauze is allowed to rest on soft silicone padding. The robot’s task
is to reach for the unclipped part, grasp it, lift the gauze, and tension the
sheet to be as planar as possible. An open-loop policy, one that does not react
to unexpected changes, typically fails on this task because it requires some
feedback of whether gauze is properly grasped, how the gauze has deformed after
grasping, and visual feedback of whether the gauze is planar. The task is
sequential, as some grasps pick up more or less of the material and the
flattening procedure has to be accordingly modified.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/tensioning-task.png" alt="Figure 4: Deformable Sheet Tensioning Setup." /><br />
<i>
Deformable sheet tensioning setup.
</i>
</p>
<p>We provided 15 demonstrations through a keyboard-based tele-operation interface.
The average length of the demonstrations was 48.4 actions (although we sampled
observations at a higher frequency, about 10 observations for every action).
From these 15 demonstrations, SWIRL identifies four segments. One of the
segments corresponds to moving to the correct grasping position, one to
making the grasp, one to lifting the gauze, and one to
straightening the gauze. One of the interesting aspects of this task is that the
segmentation requires multiple features, and segmenting any single signal may
miss an important feature.</p>
<p>Then, we tried to learn a policy from the rewards constructed by SWIRL. We
define a Q-Network with a single-layer Multi-Layer Perceptron with 32 hidden
units and sigmoid activation. For each of the segments, we apply Behavioral
Cloning locally with the same architecture as the Q-network (with an additional
softmax over the output layer) to get an initial policy. We roll out 100 trials
with an $\epsilon=0.1$ greedy version of these segmented policies. The results
are depicted below:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/swirl-tensioning.png" alt="Figure 5: Deformable Sheet Tensioning Demonstration." /><br />
<i>
A representative demonstration of the deformable sheet tensioning task with
relevant features plotted over time. SWIRL identifies 4 segments which
correspond to reaching, grasping, lifting, and tensioning.
</i>
</p>
<p>SWIRL achieves more than 4 times the reward of ab initio RL, 3 times the
reward of pure behavioral cloning, and a 56% higher reward than naively
applying behavioral cloning with TSC segments.</p>
<h1 id="hierarchical-representations">Hierarchical Representations</h1>
<p>We are now exploring a generalization of TSC and SWIRL with a new algorithm:
Deep Discovery of Continuous Options (DDCO; Krishnan et al. 2017, to be presented
at the 1st Conference on Robot Learning in November).</p>
<p>An option represents a low-level policy that can be invoked by a high-level
policy to perform a certain sub-task. Formally, an option $h$ in an options set
$\mathcal H$ is specified by a control policy $\pi_h(a_t | s_t)$ and a
stochastic termination condition $\psi_h(s_t)\in[0,1]$. The high-level policy
$\eta(h_t | s_t)$ defines the distribution over options given the state. Once an
option $h$ is invoked, physical controls are selected by the option’s policy
$\pi_h$ until it terminates. After each physical control is applied and the next
state $s'$ is reached, the option $h$ terminates with probability $\psi_h(s')$,
and if it does then the high-level policy selects a new option $h'$ with
distribution $\eta(h' | s')$. Thus the interaction of the hierarchical control
policy $\langle\eta,(\pi_h,\psi_h)_{h\in\mathcal H}\rangle$ with the system
induces a stochastic process over the states $s_t$, the options $h_t$, the
controls $a_t$, and the binary termination indicators $b_t$.</p>
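<p>The generative process described above can be simulated directly. In this sketch the two options, their termination conditions, and the one-dimensional dynamics are toy placeholders, not learned models:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hierarchy over a 1-D state (all numbers illustrative).
H = [0, 1]
eta = lambda s: [0.9, 0.1] if s < 5 else [0.1, 0.9]   # high-level policy eta(h|s)
pi = {0: lambda s: 1.0, 1: lambda s: -1.0}            # option control policies
psi = {0: lambda s: 0.8 if s >= 5 else 0.05,          # termination probabilities
       1: lambda s: 0.8 if s <= 0 else 0.05}

def rollout(s, T=30):
    """Simulate the induced stochastic process over (s_t, h_t, a_t, b_t)."""
    h, trace = None, []
    for t in range(T):
        b = 1 if h is None else int(rng.random() < psi[h](s))
        if b:                                 # option terminated:
            h = rng.choice(H, p=eta(s))       # high-level policy picks a new one
        a = pi[h](s)                          # low-level control from pi_h
        trace.append((s, h, a, b))
        s = s + a                             # trivial dynamics s' = s + a
    return trace

trace = rollout(0.0)
```

Running this produces alternating stretches of "move up" and "move down" behavior, with switches occurring exactly where the active option's termination probability spikes.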
<p>DDCO is a policy-gradient algorithm that discovers parametrized options by
fitting their parameters to maximize the likelihood of a set of demonstration
trajectories. We denote by $\theta$ the vector of all trainable parameters used
for $\eta$ and for $\pi_h$ and $\psi_h$ of each option $h\in\mathcal H$. For
example, $\theta$ can be the weights and biases of a feed-forward network that
computes these probabilities. We wish to find the $\theta\in\Theta$ that
maximizes the log-likelihood of generating each demonstration trajectory
$\xi=(s_0,a_0,s_1,\ldots,s_T)$. The challenge is that this log-likelihood
depends on the latent variables in the stochastic process, the options and the
termination indicators $\zeta = (b_0,h_0,b_1,h_1,\ldots,h_{T-1})$. DDCO
optimizes this objective with an Expectation-Gradient algorithm:</p>
<script type="math/tex; mode=display">\nabla_\theta L[\theta;\xi] = \mathbb{E}_\theta[\nabla_\theta \log \mathbb{P}_\theta(\zeta,\xi) | \xi],</script>
<p>where $\mathbb{P}_\theta(\zeta,\xi)$ is the joint probability of the latent and
observable variables, given by</p>
<script type="math/tex; mode=display">\mathbb{P}_\theta(\zeta,\xi) = p_0(s_0) \delta_{b_0=1}\eta(h_0 | s_0)
\prod_{t=1}^{T-1} \mathbb{P}_\theta(b_t, h_t | h_{t-1}, s_t) \prod_{t=0}^{T-1}
\pi_{h_t}(a_t | s_t) p(s_{t+1} |s_t, a_t) ,</script>
<p>where in the latent transition <script type="math/tex">\mathbb{P}_\theta(b_t, h_t | h_{t-1}, s_t)</script> we have
with probability $\psi_{h_{t-1}}(s_t)$ that $b_t=1$ and $h_t$ is drawn from
$\eta(\cdot|s_t)$, and otherwise that $b_t=0$ and $h_t$ is unchanged, i.e.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{P}_\theta(b_t {=} 1, h_t | h_{t-1}, s_t) &= \psi_{h_{t-1}}(s_t) \eta(h_t | s_t) \\
\mathbb{P}_\theta(b_t {=} 0, h_t | h_{t-1}, s_t) &= (1 - \psi_{h_{t-1}}(s_t)) \delta_{h_t = h_{t-1}}.
\end{align} %]]></script>
<p>The log-likelihood gradient can be computed in two steps, an E-step where the
marginal posteriors</p>
<script type="math/tex; mode=display">u_t(h) = \mathbb{P}_\theta(h_t {=} h | \xi); \quad v_t(h) = \mathbb{P}_\theta(b_t {=} 1,
h_t {=} h | \xi); \quad w_t(h) = \mathbb{P}_\theta(h_t {=} h, b_{t+1} {=} 0 | \xi)</script>
<p>are computed using a forward-backward algorithm similar to Baum-Welch, and a
G-step:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_\theta L[\theta;\xi] &= \sum_{h\in\mathcal{H}} \Biggl( \sum_{t=0}^{T-1}
\Biggl(v_t(h) \nabla_\theta \log \eta(h | s_t) + u_t(h)\nabla_\theta
\log \pi_h(a_t | s_t)\Biggr) \\
& + \sum_{t=0}^{T-2} \Biggl((u_t(h)-w_t(h)) \nabla_\theta \log
\psi_h(s_{t+1}) + w_t(h) \nabla_\theta \log (1 - \psi_h(s_{t+1}))
\Biggr)\Biggr).
\end{align} %]]></script>
<p>The gradient computed above can then be used in any stochastic gradient descent
algorithm. In our experiments we use Adam and Momentum.</p>
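<p>For concreteness, the E-step’s forward-backward pass for a single trajectory can be sketched as follows; the tabular arrays <code>eta_p</code>, <code>pi_p</code>, and <code>psi_p</code> are hypothetical placeholders for the probabilities the networks would compute:</p>

```python
import numpy as np

def option_marginals(states, actions, eta_p, pi_p, psi_p):
    """Forward-backward over latent options (Baum-Welch style).

    states, actions: observed trajectory (both length T).
    eta_p[s, h]: high-level policy; pi_p[h, s, a]: option policies;
    psi_p[h, s]: termination probabilities. All tabular placeholders.
    Returns u[t, h] = P(h_t = h | trajectory).
    """
    T, H = len(actions), eta_p.shape[1]

    def trans(s):
        # Latent transition P(h_t | h_{t-1}, s_t): terminate-and-switch
        # with prob psi, otherwise keep the same option.
        switch = psi_p[:, s][:, None] * eta_p[s][None, :]
        stay = (1 - psi_p[:, s])[:, None] * np.eye(H)
        return switch + stay

    alpha = np.zeros((T, H))
    alpha[0] = eta_p[states[0]] * pi_p[:, states[0], actions[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans(states[t])) * pi_p[:, states[t], actions[t]]
    beta = np.ones((T, H))
    for t in range(T - 2, -1, -1):
        emit = pi_p[:, states[t + 1], actions[t + 1]]
        beta[t] = trans(states[t + 1]) @ (emit * beta[t + 1])
    u = alpha * beta
    return u / u.sum(axis=1, keepdims=True)

# Two options that each prefer a different action; the trajectory switches
# from action 0 to action 1, so the posterior should switch options too.
eta_p = np.full((1, 2), 0.5)
pi_p = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])   # pi_p[h, s, a]
psi_p = np.full((2, 1), 0.1)
states = [0] * 6
actions = [0, 0, 0, 1, 1, 1]
u = option_marginals(states, actions, eta_p, pi_p, psi_p)
```

The posterior starts concentrated on the first option and shifts to the second when the actions change, which is exactly the credit assignment the G-step's gradient weights encode.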
<p>We evaluated DDCO in an Imitation Learning setting with surgical robotic tasks.
In one task, the robot is given a foam bin with a pile of 5–8 needles of three
different types, each 1–3mm in diameter. The robot must extract needles of a
specified type and place them in an “accept” cup, while placing all other
needles in a “reject” cup. The task is successful if the entire foam bin is
cleared into the correct cups. To define the state space for this task, we first
generate binary images from overhead stereo images, and apply a color-based
segmentation to identify the needles (the “image” input). Then, we use a
classifier trained in advance on 40 hand-labeled images to identify and provide
a candidate grasp point, specified by position and direction in image space (the
“grasp” input). Additionally, the 6 DoF robot gripper pose and the open-closed
state of the gripper are observed (the “kin” input). The state space of the
robot is (“image”, “grasp”, “kin”), and the control space is the 6 joint angles
and the gripper angle.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/dvrk-bin-picking.png" alt="Figure 6: Needle Pick and Place Task." /><br />
<i>
Needle pick and place task on the surgical robot.
</i>
</p>
<p>In 10 trials, 7/10 were successful. The main failure mode was unsuccessful
grasping due to picking either no needles or multiple needles. As the piles were
cleared and became sparser, the robot’s grasping policy became somewhat brittle.
The grasp success rate was 66% on 99 attempted grasps. In contrast, we rarely
observed failures at the other aspects of the task, reaching 97% successful
recovery on 34 failed grasps.</p>
<p>The learned options are interpretable on intuitive task boundaries. For each of
the 4 options, we plot how heavily the different inputs are weighted (image,
grasp, or kin) in computing the option’s action. Nonzero values of the ReLU
units are marked in white and indicate input relevance:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/surgical_robots/ddco-activations.png" alt="Figure 7: DDCO Options." /><br />
<i>
We plot the average activations of the feature layer for each option,
indicating which inputs (image, grasp, or kinematics) are relevant to
the policy and termination.
</i>
</p>
<p>We see that the options are clearly specialized. The first option has a strong
dependence only on the grasp candidate, the second option attends almost
exclusively to the image, while the last two options rely mostly on the
kinematics and grasp features.</p>
<h1 id="conclusion">Conclusion</h1>
<p>To summarize, learning sequential task structure from demonstrations has many
applications in robotics, such as automating surgical sub-tasks, and can be
facilitated by segmenting demonstrations into sub-tasks. We see several avenues for
future work: (1) representations that better model rotational geometry and
configuration spaces, (2) hybrid schemes that consider both parametrized
primitives and those derived from analytic formulae, and (3) consideration of
state-space segmentation as well as temporal segmentation.</p>
<hr />
<h2 id="references">References</h2>
<p>(For links to papers, see the homepages of <a href="https://www.ocf.berkeley.edu/~sanjayk/">Sanjay
Krishnan</a> or <a href="http://goldberg.berkeley.edu/pubs/">Ken
Goldberg</a>.)</p>
<p>Sanjay Krishnan*, Roy Fox*, Ion Stoica, Ken Goldberg. DDCO: Discovery of Deep
Continuous Options for Robot Learning from Demonstrations. Conference on Robot
Learning (CoRL). 2017.</p>
<p>Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller,
Florian T. Pokorny, Ken Goldberg. SWIRL: A Sequential Windowed Inverse
Reinforcement Learning Algorithm for Robot Tasks With Delayed Rewards. Workshop
on Algorithmic Foundations of Robotics (WAFR) 2016.</p>
<p>Sanjay Krishnan*, Animesh Garg*, Sachin Patil, Colin Lea, Gregory Hager,
Pieter Abbeel, Ken Goldberg. Transition State Clustering: Unsupervised Surgical
Task Segmentation For Robot Learning. International Symposium on Robotics
Research (ISRR). 2015.</p>
<p>Adithyavairavan Murali*, Siddarth Sen*, Ben Kehoe, Animesh Garg, Seth McFarland,
Sachin Patil, W. Douglas Boyd, Susan Lim, Pieter Abbeel, Ken Goldberg. Learning
by Observation for Surgical Subtasks: Multilateral Cutting of 3D Viscoelastic
and 2D Orthotropic Tissue Phantoms. International Conference on Robotics and
Automation (ICRA). May 2015.</p>
<h2 id="external-references">External References</h2>
<p>Richard Sutton. Temporal credit assignment in reinforcement learning. 1984.</p>
<p>Richard Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A
framework for temporal abstraction in reinforcement learning. Artificial
intelligence. 1999.</p>
<p>Lerrel Pinto, and Abhinav Gupta. Supersizing self-supervision: Learning to grasp
from 50k tries and 700 robot hours. International Conference on Robotics and
Automation (ICRA). 2016.</p>
<p>Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, Deirdre Quillen.
Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and
Large-Scale Data Collection (International Journal of Robotics Research). 2017.</p>
<p>Sergey Levine*, Chelsea Finn*, Trevor Darrell, and Pieter Abbeel. End-to-end
training of deep visuomotor policies. Journal of Machine Learning Research
(JMLR). 2016.</p>
Tue, 17 Oct 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/10/17/lfd-surgical-robots/
Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning
<p>Deep reinforcement learning (deep RL) has achieved success in many tasks, such as playing video games from raw pixels (Mnih et al., 2015), playing the game of Go (Silver et al., 2016), and simulated robotic locomotion (e.g. Schulman et al., 2015). Standard deep RL algorithms aim to master a single way to solve a given task, typically the first way that seems to work well. Therefore, training is sensitive to randomness in the environment, initialization of the policy, and the algorithm implementation. This phenomenon is illustrated in Figure 1, which shows two policies trained to optimize a reward function that encourages forward motion: while both policies have converged to a high-performing gait, these gaits are substantially different from each other.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_1_walker_two_gaits_v2.gif" alt="Figure 1: Trained simulated walking robots." /><br />
<i>
Figure 1: Trained simulated walking robots.<br />
[credit: John Schulman and Patrick Coady (<a href="https://gym.openai.com/envs/Walker2d-v1/">OpenAI Gym)</a>]
</i>
</p>
<!--more-->
<p>Why might finding only a single solution be undesirable? Knowing only one way to act makes agents vulnerable to environmental changes that are common in the real-world. For example, consider a robot (Figure 2) navigating its way to the goal (blue cross) in a simple maze. At training time (Figure 2a), there are two passages that lead to the goal. The agent will likely commit to the solution via the upper passage as it is slightly shorter. However, if we change the environment by blocking the upper passage with a wall (Figure 2b), the solution the agent has found becomes infeasible. Since the agent focused entirely on the upper passage during learning, it has almost no knowledge of the lower passage. Therefore, adapting to the new situation in Figure 2b requires the agent to relearn the entire task from scratch.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_2a_maze_one_path.png" alt="maze_one_path" width="300" /><p class="center">2a</p>
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_2b_maze-two-paths.png" alt="maze-two-paths" width="300" /><p class="center">2b</p>
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
Figure 2: A robot navigating a maze.
</i>
</p>
<h3 id="maximum-entropy-policies-and-their-energy-forms">Maximum Entropy Policies and Their Energy Forms</h3>
<p>Let us begin with a review of RL: an agent interacts with an environment by iteratively observing the current <em>state</em> ($\mathbf{s}$), taking an <em>action</em> ($\mathbf{a}$), and receiving a <em>reward</em> ($r$). It employs a (stochastic) policy ($\pi$) to select actions, and finds the best policy that maximizes the cumulative reward it collects throughout an episode of length $T$:</p>
<script type="math/tex; mode=display">\pi^* = \arg\!\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^T r_t \right]</script>
<p>We define the Q-function, $Q(\mathbf{s},\mathbf{a})$, as the expected cumulative reward after taking action $\mathbf{a}$ in state $\mathbf{s}$. Consider the robot in Figure 2a again. When the robot is in the initial state, the Q-function may look like the one depicted in Figure 3a (grey curve), with two distinct modes corresponding to the two passages. A conventional RL approach is to specify a unimodal policy distribution, centered at the maximal Q-value and extending to the neighbouring actions to provide noise for exploration (red distribution). Since the exploration is biased towards the upper passage, the agent refines its policy there and ignores the lower passage completely.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_3a_unimodal-policy.png" alt="unimodal-policy" width="300" /><p class="center">3a</p>
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_3b_multimodal_policy.png" alt="multimodal_policy" width="300" /><p class="center">3b</p>
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
Figure 3: A multimodal Q-function.
</i>
</p>
<p>An obvious solution, at the high level, is to ensure the agent explores all promising states while prioritizing the more promising ones. One way to formalize this idea is to define the policy directly in terms of exponentiated Q-values (Figure 3b, green distribution):</p>
<script type="math/tex; mode=display">\pi(\mathbf{a}|\mathbf{s}) \propto \exp Q(\mathbf{s}, \mathbf{a})</script>
<p>This density has the form of the Boltzmann distribution, with the Q-function serving as the negative energy, and it assigns non-zero likelihood to all actions. As a consequence, the agent will become aware of all behaviours that lead to solving the task, which can help it adapt to changing situations in which some of the solutions have become infeasible. In fact, we can show that the policy defined through this energy form is an optimal solution for the maximum-entropy RL objective</p>
<script type="math/tex; mode=display">\pi_{\mathrm{MaxEnt}}^* = \arg\!\max_{\pi} \mathbb{E}_{\pi}\left[ \sum_{t=0}^T r_t + \mathcal{H}(\pi(\cdot | \mathbf{s}_t)) \right]</script>
<p>which simply augments the conventional RL objective with the entropy of the policy (Ziebart 2010).</p>
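<p>As a toy illustration (not code from the paper), the difference between a unimodal policy and an energy-based one can be seen by discretizing a one-dimensional action space and normalizing the exponentiated Q-values directly; with a bimodal Q-function, both modes retain probability mass:</p>

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Normalize exp(Q / T) into a probability distribution over discrete actions."""
    logits = q_values / temperature
    logits -= logits.max()               # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# A made-up bimodal Q-function over 101 discretized actions in [-1, 1],
# mimicking the two passages of Figure 3.
actions = np.linspace(-1.0, 1.0, 101)
q = np.exp(-(actions - 0.6) ** 2 / 0.02) + 0.8 * np.exp(-(actions + 0.6) ** 2 / 0.02)

pi = boltzmann_policy(q, temperature=0.2)

# Both modes keep non-negligible probability mass, so both passages get explored.
mass_upper = pi[actions > 0].sum()
mass_lower = pi[actions < 0].sum()
```

<p>The Q-function and the temperature here are invented for illustration; in truly continuous action spaces the normalization becomes an intractable integral, which is the difficulty addressed below.</p>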
<p>The idea of learning such <a href="https://en.wikipedia.org/wiki/Principle_of_maximum_entropy">maximum entropy models</a> has its origin in statistical modeling, in which the goal is to find the probability distribution that has the highest entropy while still satisfying the observed statistics. For example, if the distribution is on the Euclidean space and the observed statistics are the mean and the covariance, then the maximum entropy distribution is a Gaussian with the corresponding mean and covariance. In practice, we prefer maximum-entropy models as they assume the least about the unknowns while matching the observed information.</p>
<p>A number of prior works have employed the maximum-entropy principle in the context of reinforcement learning and optimal control. Ziebart (2008) used the maximum entropy principle to resolve ambiguities in inverse reinforcement learning, where several reward functions can explain the observed demonstrations. Several works (Todorov, 2008; Toussaint, 2009) have studied the connection between inference and control via the maximum entropy formulation. Todorov (2007, 2009) also showed how the maximum entropy principle can be employed to make MDPs linearly solvable, and Fox et al. (2016) utilized the principle as a means to incorporate prior knowledge into a reinforcement learning policy.</p>
<h2 id="soft-bellman-equation-and-soft-q-learning">Soft Bellman Equation and Soft Q-Learning</h2>
<p>We can obtain the optimal solution of the maximum entropy objective by employing the <em>soft Bellman equation</em></p>
<script type="math/tex; mode=display">Q(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}\left[r_t + \gamma\ \mathrm{softmax}_{\mathbf{a}} Q(\mathbf{s}_{t+1}, \mathbf{a})\right]</script>
<p>where</p>
<script type="math/tex; mode=display">\mathrm{softmax}_{\mathbf{a}} f(\mathbf{a}) := \log \int \exp f(\mathbf{a}) \, d\mathbf{a}</script>
<p>The soft Bellman equation can be shown to hold for the optimal Q-function of the entropy-augmented reward function (e.g. Ziebart 2010). Note the similarity to the conventional Bellman equation, which has the hard maximum of the Q-function over the actions in place of the softmax. Like the hard version, the soft Bellman equation is a contraction, which allows solving for the Q-function using dynamic programming or model-free TD learning in tabular state and action spaces (e.g. Ziebart, 2008; Rawlik, 2012; Fox, 2016).</p>
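<p>In tabular settings, the backup above can be iterated directly. Here is a minimal sketch (ours, not the paper's code) on a hand-made two-state MDP, where the softmax reduces to a log-sum-exp over the discrete actions:</p>

```python
import numpy as np

def soft_value_iteration(rewards, transitions, gamma=0.9, iters=200):
    """Tabular soft Q-iteration: Q <- r + gamma * E[logsumexp_a Q(s', a)].

    rewards:     (S, A) array of immediate rewards
    transitions: (S, A, S) array of transition probabilities
    """
    S, A = rewards.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        # soft state value: V(s) = log sum_a exp Q(s, a), computed stably
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).squeeze(1)
        Q = rewards + gamma * transitions @ V
    return Q

# A tiny 2-state, 2-action MDP: action 1 in state 0 leads to the rewarding state 1.
rewards = np.array([[0.0, 0.0], [1.0, 1.0]])
transitions = np.zeros((2, 2, 2))
transitions[0, 0, 0] = 1.0   # action 0: stay in state 0
transitions[0, 1, 1] = 1.0   # action 1: move to state 1
transitions[1, :, 1] = 1.0   # state 1 is absorbing

Q = soft_value_iteration(rewards, transitions)
```

<p>Because the soft backup is a contraction, the iteration converges, and the action leading to the rewarding state ends up with the higher soft Q-value.</p>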
<p>However, in continuous domains, there are two major challenges. First, exact dynamic programming is infeasible, since the soft Bellman equation needs to hold for every state and action, and the softmax involves integrating over the entire action space. Second, the optimal policy is defined by an intractable energy-based distribution, which is difficult to sample from. To address the first challenge, we can employ expressive neural network function approximators, which can be trained with stochastic gradient descent on sampled states and actions and then generalize effectively to new state-action tuples. To address the second challenge, we can employ approximate inference techniques, such as Markov chain Monte Carlo, which has been explored in prior works for energy-based policies (Heess, 2012). To accelerate inference, we use amortized Stein variational gradient descent (Wang and Liu, 2016) to train an inference network to generate approximate samples. The resulting algorithm, termed <em>soft Q-learning</em>, combines deep Q-learning with amortized Stein variational gradient descent.</p>
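<p>To make the integral in the softmax concrete, here is a simple Monte Carlo sketch (a generic illustration; the full algorithm instead trains an amortized SVGD sampler, which is omitted here) that estimates the soft value over a bounded continuous action space using a uniform proposal:</p>

```python
import numpy as np

def soft_value_estimate(q_fn, low, high, n_samples=10_000, rng=None):
    """Monte Carlo estimate of softmax_a Q = log integral of exp Q(a) over [low, high].

    With a uniform proposal of density 1/volume,
    log integral exp Q da = log E_uniform[exp Q] + log volume.
    """
    rng = rng or np.random.default_rng(0)
    low, high = np.atleast_1d(low), np.atleast_1d(high)
    a = rng.uniform(low, high, size=(n_samples, low.size))
    q = q_fn(a)
    m = q.max()                                  # stabilize the log-mean-exp
    log_mean = m + np.log(np.mean(np.exp(q - m)))
    volume = np.prod(high - low)
    return log_mean + np.log(volume)

# Sanity check against a case with a known answer: Q(a) = -a^2/2 on a wide
# interval, so the integral of exp Q is roughly sqrt(2*pi).
v = soft_value_estimate(lambda a: -0.5 * (a[:, 0] ** 2), low=-8.0, high=8.0)
```

<p>The estimate lands close to $\tfrac{1}{2}\log 2\pi \approx 0.92$; in high-dimensional action spaces a uniform proposal becomes wasteful, which is one motivation for the learned sampler.</p>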
<h2 id="application-to-reinforcement-learning">Application to Reinforcement Learning</h2>
<p>Now that we can learn maximum entropy policies via soft Q-learning, we might wonder: what are the practical uses of this approach? In the following sections, we illustrate with experiments that soft Q-learning allows for better exploration, enables policy transfer between similar tasks, allows new policies to be easily composed from existing policies, and improves robustness through extensive exploration at training time.</p>
<h3 id="better-exploration">Better Exploration</h3>
<p>Soft Q-learning (SQL) provides us with an implicit exploration strategy by assigning each action a non-zero probability, shaped by the current belief about its value, effectively combining exploration and exploitation in a natural way. To see this, let us consider a two-passage maze (Figure 4) similar to the one discussed in the introduction. The task is to find a way to the goal state, denoted by a blue square. Suppose that the reward is proportional to the negative distance to the goal. Since the maze is almost symmetric, such a reward results in a bimodal objective, but only one of the modes corresponds to an actual solution to the task. Thus, exploring both passages at training time is crucial to discover which of the two actually reaches the goal. A unimodal policy can only solve this task if it is lucky enough to commit to the lower passage from the start. On the other hand, a multimodal soft Q-learning policy can solve the task consistently by following both passages randomly until the agent finds the goal (Figure 4).</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_4_ant_maze.gif" alt="A policy trained with soft Q-learning." /><br />
<i>
Figure 4: A policy trained with soft Q-learning can explore both passages during training.
</i>
</p>
<h3 id="fine-tuning-maximum-entropy-policies">Fine-Tuning Maximum Entropy Policies</h3>
<p>The standard practice in RL is to train an agent from scratch for each new task. This can be slow because the agent throws away knowledge acquired from previous tasks. Instead, the agent can transfer skills from similar previous tasks, allowing it to learn new tasks more quickly. One way to transfer skills is to pre-train policies for general purpose tasks, and then use them as templates or initializations for more specific tasks. For example, the skill of walking subsumes the skill of navigating through a maze, and therefore the walking skill can serve as an efficient initialization for learning the navigation skill. To illustrate this idea, we trained a maximum entropy policy by rewarding the agent for walking at a high speed, regardless of the direction. The resulting policy learns to walk, but does not commit to any single direction due to the maximum entropy objective (Figure 5a). Next, we specialized the walking skill to a range of navigation skills, such as the one in Figure 5b. In the new task, the agent only needs to choose which walking behaviour will move it closer to the goal, which is substantially easier than learning the same skill from scratch. A conventional policy would converge to a specific behaviour when trained for the general task. For example, it may only learn to walk in a single direction. Consequently, it cannot directly transfer the walking skill to the maze environment, which requires movement in multiple directions.</p>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_5a_pretrain_softql_small.gif" alt="pretrain_softql_small" width="200" /><p class="center">5a</p>
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_5b_finetune_ushape_2.gif" alt="finetune_ushape_2" width="350" /><p class="center">5b</p>
</td>
</tr>
</table>
<p style="text-align:center;">
<i>
Figure 5: Maximum entropy pretraining allows agents to learn more quickly in new environments. Videos of the same pretrained policy fine-tuned for other target tasks can be found at <a href="https://www.youtube.com/watch?v=7Nm1N6sUoVs&feature=youtu.be">this</a> link.
</i>
</p>
<h3 id="compositionality">Compositionality</h3>
<p>In a similar vein to general-to-specific transfer, we can compose new skills from existing policies—even without any fine-tuning—by intersecting different skills. The idea is simple: take two soft policies, each corresponding to a different set of behaviors, and combine them by adding together their Q-functions. In fact, it is possible to show that the combined policy is approximately optimal for the combined task, obtained by simply adding the reward functions of the constituent tasks, up to a bounded error. Consider a planar manipulator such as the one pictured below. The two agents on the left are trained to move the cylindrical object to a target location illustrated with red stripes. Note how the solution spaces of the two tasks overlap: by moving the cylinder to the intersection of the stripes, both tasks can be solved simultaneously. Indeed, the policy on the right, obtained by simply summing the two Q-functions, moves the cylinder to the intersection, without the need to train a policy explicitly for the combined task. Conventional policies do not exhibit the same compositionality property, as they can only represent specific, disjoint solutions.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_6_composition_small.gif" alt="Combining two skills into a new one" /><br />
<i>
Figure 6: Combining two skills into a new one.
</i>
</p>
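<p>The composition rule itself is easy to sketch. In the hypothetical discretized example below, each task's Q-function is high at either of two target positions and the two tasks share one target; summing the Q-functions yields a Boltzmann policy concentrated on the shared solution:</p>

```python
import numpy as np

def boltzmann(q):
    """Turn Q-values over discrete options into a Boltzmann distribution."""
    z = np.exp(q - q.max())
    return z / z.sum()

# Discretize a 1-D "object position". Each (hypothetical) task is satisfied at
# either of two target positions; the tasks share the target at x = 0.25.
x = np.linspace(-1.0, 1.0, 201)
bump = lambda c: np.exp(-(x - c) ** 2 / 0.01)
q_task1 = 5.0 * np.maximum(bump(-0.5), bump(0.25))
q_task2 = 5.0 * np.maximum(bump(0.25), bump(0.75))

# Adding the Q-functions concentrates the policy on the shared solution.
pi_combined = boltzmann(q_task1 + q_task2)
best = x[np.argmax(q_task1 + q_task2)]
```

<p>A unimodal policy for each task would have committed to one target apiece, and adding such policies would not in general recover a solution that satisfies both tasks.</p>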
<h3 id="robustness">Robustness</h3>
<p>Because the maximum entropy formulation encourages agents to try all possible solutions, the agents learn to explore a large portion of the state space. Thus they learn to act in various situations, and are more robust against perturbations in the environment. To illustrate this, we trained a Sawyer robot to stack Lego blocks together by specifying a target end-effector pose. Figure 7 shows some snapshots during training.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_7_sawyer_training_white_bg.gif" alt="Training to stack Lego blocks with soft Q-learning." /><br />
<i>
Figure 7: Training to stack Lego blocks with soft Q-learning.<br />
[credit: Aurick Zhou]
</i>
</p>
<p>The robot succeeded for the first time after 30 minutes; after an hour, it was able to stack the blocks consistently; and after two hours, the policy had fully converged. The converged policy is also robust to perturbations as shown in the video below, in which the arm is perturbed into configurations that are very different from what it encounters during normal execution, and it is able to successfully recover every time.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/softq/figure_8_sawyer_bully_policy.gif" alt="The trained policy is robust to perturbations" /><br />
<i>
Figure 8: The trained policy is robust to perturbations.
</i>
</p>
<h2 id="related-work">Related Work</h2>
<p>Soft optimality has also been studied in recent papers in the context of learning from multi-step transitions (Nachum et al., 2017) and its connection to policy gradient methods (Schulman et al., 2017). A related concept is discussed by O’Donoghue et al. (2016), who also consider entropy regularization and Boltzmann exploration. However, their version of entropy regularization accounts only for the entropy of the policy at the current state, and does not take into account the entropy at future states.</p>
<p>To our knowledge, only a few prior works have demonstrated successful model-free reinforcement learning directly on real-world robots. Gu et al. (2016) showed that NAF (normalized advantage functions) could learn door opening tasks, using about 2.5 hours of experience parallelized across two robots. Rusu et al. (2016) used RL to train a robot arm to reach a red square, with pretraining in simulation. Večerı́k et al. (2017) showed that, if initialized from demonstrations, a Sawyer robot could perform a peg-insertion-style task with about 30 minutes of experience. It is worth noting that our soft Q-learning results, shown above, used only a single robot for training, and did not use any simulation or demonstrations.</p>
<hr />
<p>We would like to thank Sergey Levine, Pieter Abbeel, and Gregory Kahn for their valuable feedback when preparing this blog post.</p>
<p>This post is based on the following paper: <br />
<strong>Reinforcement Learning with Deep Energy-Based Policies</strong> <br />
Haarnoja T., Tang H., Abbeel P., Levine S. <em>ICML 2017</em>.<br />
<a href="https://arxiv.org/abs/1702.08165">paper</a>, <a href="https://github.com/haarnoja/softqlearning">code</a>, <a href="https://sites.google.com/view/softqlearning/home">videos</a></p>
<h2 id="references">References</h2>
<p><strong>Related concurrent papers</strong></p>
<ul>
<li>Schulman, J., Abbeel, P. and Chen, X. Equivalence Between Policy Gradients and Soft Q-Learning. <em>arXiv preprint arXiv:1704.06440</em>, 2017.</li>
<li>Nachum, O., Norouzi, M., Xu, K. and Schuurmans, D. Bridging the Gap Between Value and Policy Based Reinforcement Learning. <em>NIPS 2017</em>.</li>
</ul>
<p><strong>Papers leveraging the maximum entropy principle</strong></p>
<ul>
<li>Kappen, H. J. Path integrals and symmetry breaking for optimal control theory. <em>Journal of Statistical Mechanics: Theory And Experiment</em>, 2005(11): P11011, 2005.</li>
<li>Todorov, E. Linearly-solvable Markov decision problems. In <em>Advances in Neural Information Processing Systems</em>, pp. 1369–1376. MIT Press, 2007.</li>
<li>Todorov, E. General duality between optimal control and estimation. In IEEE Conf. on Decision and Control, pp. 4286–4292. IEEE, 2008.</li>
<li>Todorov, E. (2009). Compositionality of optimal control laws. In <em>Advances in Neural Information Processing Systems</em> (pp. 1856-1864).</li>
<li>Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.</li>
<li>Toussaint, M. Robot trajectory optimization using approximate inference. In <em>Int. Conf. on Machine Learning</em>, pp. 1049–1056. ACM, 2009.</li>
<li>Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, 2010.</li>
<li>Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. <em>Proceedings of Robotics: Science and Systems VIII</em>, 2012.</li>
<li>Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. In <em>Conf. on Uncertainty in Artificial Intelligence</em>, 2016.</li>
</ul>
<p><strong>Model-free RL in the real-world</strong></p>
<ul>
<li>Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In Int. Conf. on Machine Learning, pp. 2829–2838, 2016.</li>
<li>M. Večerı́k, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, “Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,” <em>arXiv preprint arXiv:1707.08817</em>, 2017.</li>
</ul>
<p><strong>Other references</strong></p>
<ul>
<li>Heess, N., Silver, D., and Teh, Y. W. Actor-critic reinforcement learning with energy-based policies. In <em>Workshop on Reinforcement Learning</em>, pp. 43. Citeseer, 2012.</li>
<li>Jaynes, E. T. Prior probabilities. <em>IEEE Transactions on Systems Science and Cybernetics</em>, 4(3), pp. 227–241, 1968.</li>
<li>Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. <em>ICLR</em>, 2016.</li>
<li>Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In <em>Advances in Neural Information Processing Systems</em>, pp. 2370–2378, 2016.</li>
<li>Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. <em>Nature</em>, 518(7540):529–533, 2015.</li>
<li>Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In <em>Int. Conf. on Machine Learning</em>, pp. 1928–1937, 2016.</li>
<li>O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. <em>arXiv preprint arXiv:1611.01626</em>, 2016.</li>
<li>Rusu, A. A., Vecerik, M., Rothörl, T., Heess, N., Pascanu, R., and Hadsell, R. Sim-to-real robot learning from pixels with progressive nets. <em>arXiv preprint arXiv:1610.04286</em>, 2016.</li>
<li>Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In <em>Int. Conf. on Machine Learning</em>, 2015.</li>
<li>Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., and Dieleman, S. Mastering the game of Go with deep neural networks and tree search. <em>Nature</em>, 529(7587):484–489, 2016.</li>
<li>Sutton, R. S. and Barto, A. G. <em>Reinforcement Learning: An Introduction</em>, volume 1. MIT Press, Cambridge, 1998.</li>
<li>Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. <em>arXiv preprint arXiv:1703.06907</em>, 2017.</li>
<li>Wang, D. and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. <em>arXiv preprint arXiv:1611.01722</em>, 2016.</li>
</ul>
Fri, 06 Oct 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/
Learning to Optimize with Reinforcement Learning<p><em>Since we posted our paper on “<a href="https://arxiv.org/abs/1606.01885">Learning to Optimize</a>” last year, the area of optimizer learning has received growing attention. In this article, we provide an introduction to this line of work and share our perspective on the opportunities and challenges in this area.</em></p>
<p>Machine learning has enjoyed tremendous success and is being applied to a wide variety of areas, both in AI and beyond. This success can be attributed to the data-driven philosophy that underpins machine learning, which favours automatic discovery of patterns from data over manual design of systems using expert knowledge.</p>
<p>Yet, there is a paradox in the current paradigm: the algorithms that power machine learning are still designed manually. This raises a natural question: can we <em>learn</em> these algorithms instead? This could open up exciting possibilities: we could find new algorithms that perform better than manually designed algorithms, which could in turn improve learning capability.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/teaser.png" alt="The learned optimizer could potentially pick better update steps than traditional optimizers." />
</p>
<!--more-->
<p>Doing so, however, requires overcoming a fundamental obstacle: how do we parameterize the space of algorithms so that it is both (1) expressive, and (2) efficiently searchable? Various ways of representing algorithms trade off these two goals. For example, if the space of algorithms is represented by a small set of known algorithms, it most likely does not contain the best possible algorithm, but does allow for efficient searching via simple enumeration of algorithms in the set. On the other hand, if the space of algorithms is represented by the set of all possible programs, it contains the best possible algorithm, but does not allow for efficient searching, as enumeration would take exponential time.</p>
<p>One of the most common types of algorithms used in machine learning is continuous optimization algorithms. Several popular algorithms exist, including gradient descent, momentum, AdaGrad and Adam. We consider the problem of automatically designing such algorithms. Why do we want to do this? There are two reasons: first, many optimization algorithms are devised under the assumption of convexity but applied to non-convex objective functions; by learning the optimization algorithm under the same setting in which it will actually be used in practice, the learned algorithm could hopefully achieve better performance. Second, devising new optimization algorithms manually is usually laborious and can take months or years; learning the optimization algorithm could reduce the amount of manual labour.</p>
<h2 id="-learning-to-optimize"><a name="framework"></a> Learning to Optimize</h2>
<p>In our paper last year (<a href="https://arxiv.org/abs/1606.01885">Li & Malik, 2016</a>), we introduced a framework for learning optimization algorithms, known as “Learning to Optimize”. We note that soon after our paper appeared, (<a href="https://arxiv.org/abs/1606.04474">Andrychowicz et al., 2016</a>) also independently proposed a similar idea.</p>
<p>Consider how existing continuous optimization algorithms generally work. They operate in an iterative fashion and maintain some iterate, which is a point in the domain of the objective function. Initially, the iterate is some random point in the domain; in each iteration, a step vector is computed using some fixed update formula, which is then used to modify the iterate. The update formula is typically some function of the history of gradients of the objective function evaluated at the current and past iterates. For example, in gradient descent, the update formula is some scaled negative gradient; in momentum, the update formula is some scaled exponential moving average of the gradients.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/alg_structure.png" alt="Optimization algorithms start at a random point and iteratively update it with a step vector computed using a fixed update formula." />
</p>
<p>What changes from algorithm to algorithm is this update formula. So, if we can learn the update formula, we can learn an optimization algorithm. We model the update formula as a neural net. Thus, by learning the weights of the neural net, we can learn an optimization algorithm. Parameterizing the update formula as a neural net has two appealing properties mentioned earlier: first, it is expressive, as neural nets are universal function approximators and can in principle model any update formula with sufficient capacity; second, it allows for efficient search, as neural nets can be trained easily with backpropagation.</p>
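<p>To make the abstraction concrete, the sketch below (our illustration; the network shape is hypothetical and differs from the architectures used in the papers) writes gradient descent, momentum, and a neural-net update rule as interchangeable functions of the gradient history:</p>

```python
import numpy as np

def gd_update(grads, lr=0.1):
    """Gradient descent: the step is a scaled negative of the latest gradient."""
    return -lr * grads[-1]

def momentum_update(grads, lr=0.1, beta=0.9):
    """Momentum: the step is a scaled exponential moving average of the gradients."""
    avg = np.zeros_like(grads[0])
    for g in grads:
        avg = beta * avg + (1 - beta) * g
    return -lr * avg

def learned_update(grads, params, history=3):
    """A learned rule: a small MLP maps recent gradients (per coordinate) to a step."""
    W1, b1, W2, b2 = params
    recent = list(grads[-history:])
    recent = [np.zeros_like(grads[0])] * (history - len(recent)) + recent
    feat = np.stack(recent, axis=-1)      # (dim, history) coordinate-wise features
    h = np.tanh(feat @ W1 + b1)           # (dim, hidden)
    return (h @ W2 + b2).squeeze(-1)      # (dim,) step vector

# Unroll gradient descent on f(x) = ||x||^2; it converges toward the optimum.
x = np.array([1.0, -2.0])
grads = []
for _ in range(50):
    grads.append(2.0 * x)
    x = x + gd_update(grads)

# The learned rule with random weights produces a step of the right shape;
# meta-training would tune these weights (not shown).
rng = np.random.default_rng(0)
params = (0.1 * rng.normal(size=(3, 8)), np.zeros(8),
          0.1 * rng.normal(size=(8, 1)), np.zeros(1))
step = learned_update(grads, params)
```

<p>Swapping one update function for another changes the optimizer while leaving the surrounding iteration untouched, which is exactly what makes the update formula a natural object to learn.</p>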
<p>In order to learn the optimization algorithm, we need to define a performance metric, which we will refer to as the “meta-loss”, that rewards good optimizers and penalizes bad optimizers. Since a good optimizer converges quickly, a natural meta-loss would be the sum of objective values over all iterations (assuming the goal is to minimize the objective function), or equivalently, the cumulative regret. Intuitively, this corresponds to the area under the curve, which is larger when the optimizer converges slowly and smaller otherwise.</p>
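<p>The meta-loss can be computed by unrolling a candidate optimizer on a sample objective and summing the values it visits. In this toy sketch (ours, with hand-set learning rates rather than learned updates), a faster optimizer indeed obtains a smaller meta-loss:</p>

```python
import numpy as np

def meta_loss(update_fn, objective, grad, x0, n_steps=50):
    """Unroll an optimizer and sum the objective values it visits (area under the curve)."""
    x = np.array(x0, dtype=float)
    total = 0.0
    for _ in range(n_steps):
        total += objective(x)
        x = x + update_fn(grad(x))
    return total

objective = lambda x: float(x @ x)   # f(x) = ||x||^2, minimized at the origin
grad = lambda x: 2.0 * x

# A faster optimizer traces a smaller area under the objective curve.
fast = meta_loss(lambda g: -0.25 * g, objective, grad, x0=[1.0, -2.0])
slow = meta_loss(lambda g: -0.01 * g, objective, grad, x0=[1.0, -2.0])
```

<p>Training an optimizer then amounts to minimizing this quantity, averaged over a distribution of objective functions, with respect to the parameters of the update rule.</p>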
<h2 id="learning-to-learn">Learning to Learn</h2>
<p>Consider the special case when the objective functions are loss functions for training other models. Under this setting, optimizer learning can be used for “learning to learn”. For clarity, we will refer to the model that is trained using the optimizer as the “base-model” and prefix common terms with “base-” and “meta-” to disambiguate concepts associated with the base-model and the optimizer respectively.</p>
<p>What do we mean exactly by “learning to learn”? While this term has appeared from time to time in the literature, different authors have used it to refer to different things, and there is no consensus on its precise definition. Often, it is also used interchangeably with the term “meta-learning”.</p>
<p>The term traces its origins to the idea of metacognition (<a href="http://classics.mit.edu/Aristotle/soul.html">Aristotle, 350 BC</a>), which describes the phenomenon that humans not only reason, but also reason about their own process of reasoning. Work on “learning to learn” draws inspiration from this idea and aims to turn it into concrete algorithms. Roughly speaking, “learning to learn” simply means learning <em>something</em> about learning. What is learned at the meta-level differs across methods. We can divide various methods into three broad categories according to the type of meta-knowledge they aim to learn:</p>
<ul>
<li>Learning <em>What</em> to Learn</li>
<li>Learning <em>Which Model</em> to Learn</li>
<li>Learning <em>How</em> to Learn</li>
</ul>
<h3 id="learning-what-to-learn">Learning <em>What</em> to Learn</h3>
<p>These methods aim to learn some particular values of base-model parameters that are useful across a family of related tasks (<a href="https://books.google.com/books?isbn=1461555299">Thrun & Pratt, 2012</a>). The meta-knowledge captures commonalities across the family, so that base-learning on a new task from the family can be done more quickly. Examples include methods for transfer learning, multi-task learning and few-shot learning. Early methods operate by partitioning the parameters of the base-model into two sets: those that are specific to a task and those that are common across tasks. For example, a popular approach for neural net base-models is to share the weights of the lower layers across all tasks, so that they capture the commonalities across tasks. See <a href="/blog/2017/07/18/learning-to-learn/">this post</a> by Chelsea Finn for an overview of the more recent methods in this area.</p>
<h3 id="learning-which-model-to-learn">Learning <em>Which Model</em> to Learn</h3>
<p>These methods aim to learn which base-model is best suited for a task (<a href="https://books.google.com/books?isbn=3540732632">Brazdil et al., 2008</a>). The meta-knowledge captures correlations between different base-models and their performance on different tasks. The challenge lies in parameterizing the space of base-models in a way that is expressive and efficiently searchable, and in parameterizing the space of tasks that allows for generalization to unseen tasks. Different methods make different trade-offs between expressiveness and searchability: (<a href="https://link.springer.com/article/10.1023/A:1021713901879">Brazdil et al., 2003</a>) uses a database of predefined base-models and exemplar tasks and outputs the base-model that performed the best on the nearest exemplar task. While this space of base-models is searchable, it does not contain good but yet-to-be-discovered base-models. (<a href="https://link.springer.com/article/10.1023/B:MACH.0000015880.99707.b2">Schmidhuber, 2004</a>) represents each base-model as a general-purpose program. While this space is very expressive, searching in this space takes exponential time in the length of the target program. (<a href="https://link.springer.com/chapter/10.1007/3-540-44668-0_13">Hochreiter et al., 2001</a>) views an algorithm that trains a base-model as a black box function that maps a sequence of training examples to a sequence of predictions and models it as a recurrent neural net. Meta-training then simply reduces to training the recurrent net. Because the base-model is encoded in the recurrent net’s memory state, its capacity is constrained by the memory size. A related area is hyperparameter optimization, which aims for a weaker goal and searches over base-models parameterized by a predefined set of hyperparameters. 
It needs to generalize across hyperparameter settings (and by extension, base-models), but not across tasks, since multiple trials with different hyperparameter settings on the same task are allowed.</p>
<h3 id="learning-how-to-learn">Learning <em>How</em> to Learn</h3>
<p>While methods in the previous categories aim to learn about the <em>outcome</em> of learning, methods in this category aim to learn about the <em>process</em> of learning. The meta-knowledge captures commonalities in the behaviours of learning algorithms. There are three components under this setting: the base-model, the base-algorithm for training the base-model, and the meta-algorithm that learns the base-algorithm. What is learned is not the base-model itself, but the base-algorithm, which trains the base-model on a task. Because both the base-model and the task are given by the user, the base-algorithm that is learned must work on a range of different base-models and tasks. Since most learning algorithms optimize some objective function, learning the base-algorithm in many cases reduces to learning an optimization algorithm. This problem of learning optimization algorithms was explored in (<a href="https://arxiv.org/abs/1606.01885">Li & Malik, 2016</a>), (<a href="https://arxiv.org/abs/1606.04474">Andrychowicz et al., 2016</a>) and a number of subsequent papers. Closely related to this line of work is (<a href="http://ieeexplore.ieee.org/abstract/document/155621">Bengio et al., 1991</a>), which learns a Hebb-like synaptic learning rule. The learning rule depends on a subset of the dimensions of the current iterate encoding the activities of neighbouring neurons, but does not depend on the objective function and therefore does not have the capability to generalize to different objective functions.</p>
<h2 id="generalization">Generalization</h2>
<p>Learning of any sort requires training on a finite number of examples and generalizing to the broader class from which the examples are drawn. It is therefore instructive to consider what the examples and the class correspond to in our context of learning optimizers for training base-models. Each example is an objective function, which corresponds to the loss function for training a base-model on a task. The task is characterized by a set of examples and target predictions, or in other words, a dataset, that is used to train the base-model. The meta-training set consists of multiple objective functions and the meta-test set consists of different objective functions drawn from the same class. Objective functions can differ in two ways: they can correspond to different base-models, or different tasks. Therefore, generalization in this context means that the learned optimizer works on different base-models and/or different tasks.</p>
<h3 id="why-is-generalization-important">Why is generalization important?</h3>
<p>Suppose for a moment that we didn’t care about generalization. In this case, we would evaluate the optimizer on the same objective functions that are used for training the optimizer. If we used only one objective function, then the best optimizer would be one that simply memorizes the optimum: this optimizer always converges to the optimum in one step regardless of initialization. In our context, the objective function corresponds to the loss for training a particular base-model on a particular task, and so this optimizer essentially memorizes the optimal weights of the base-model. Even if we used many objective functions, the learned optimizer could still try to identify the objective function it is operating on and jump to the memorized optimum as soon as it does.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/memorization.png" alt="At training time, the optimizer can memorize the optimum. At test time, it can jump directly to the optimum." />
</p>
<p>Why is this problematic? Memorizing the optima requires finding them in the first place, and so learning an optimizer takes longer than running a traditional optimizer like gradient descent. So, for the purposes of finding the optima of the objective functions at hand, running a traditional optimizer would be faster. Consequently, it would be pointless to learn the optimizer if we didn’t care about generalization.</p>
<p>Therefore, for the learned optimizer to have any practical utility, it must perform well on new objective functions that are different from those used for training.</p>
<h3 id="what-should-be-the-extent-of-generalization">What should be the extent of generalization?</h3>
<p>If we only aim for generalization to <em>similar</em> base-models on <em>similar</em> tasks, then the learned optimizer could memorize parts of the optimal weights that are common across the base-models and tasks, like the weights of the lower layers in neural nets. This would be essentially the same as learning-<em>what</em>-to-learn formulations like transfer learning.</p>
<p>Unlike learning <em>what</em> to learn, the goal of learning <em>how</em> to learn is to learn not what the optimum is, but how to find it. We must therefore aim for a stronger notion of generalization, namely generalization to similar base-models on dissimilar tasks. An optimizer that can generalize to <em>dissimilar</em> tasks cannot just partially memorize the optimal weights, as the optimal weights for dissimilar tasks are likely completely different. For example, even the lower layer weights of neural nets trained on MNIST (a dataset consisting of black-and-white images of handwritten digits) and CIFAR-10 (a dataset consisting of colour images of common objects in natural scenes) are unlikely to have anything in common.</p>
<p>Should we aim for an even stronger form of generalization, that is, generalization to <em>dissimilar</em> base-models on dissimilar tasks? Since these correspond to objective functions that bear no similarity to objective functions used for training the optimizer, this is essentially asking if the learned optimizer should generalize to objective functions that could be arbitrarily different.</p>
<p>It turns out that this is impossible. Given any optimizer, we consider the trajectory followed by the optimizer on a particular objective function. Because the optimizer only relies on information at the previous iterates, we can modify the objective function at the last iterate to make it arbitrarily bad while maintaining the geometry of the objective function at all previous iterates. Then, on this modified objective function, the optimizer would follow the exact same trajectory as before and end up at a point with a bad objective value. Therefore, any optimizer has objective functions that it performs poorly on and no optimizer can generalize to all possible objective functions.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/impossibility.png" alt="Take any optimizer and run it on some objective function. We can always manipulate the objective function by making the objective value at the last iteration arbitrarily high, while maintaining the geometry at all previous iterations. The same optimizer must perform poorly on this new objective function." />
</p>
<p>If no optimizer is universally good, can we still hope to learn optimizers that are useful? The answer is yes: since we are typically interested in optimizing functions from certain special classes in practice, it is possible to learn optimizers that work well on these classes of interest. The objective functions in a class can share regularities in their geometry, e.g., they might share certain geometric properties like convexity, piecewise linearity, Lipschitz continuity or other unnamed properties. In the context of learning-<em>how</em>-to-learn, each class can correspond to a type of base-model. For example, neural nets with ReLU activation units can be one class, as they are all piecewise linear. Note that when learning the optimizer, there is no need to explicitly characterize the form of geometric regularity, as the optimizer can learn to exploit it automatically when trained on objective functions from the class.</p>
<h2 id="how-to-learn-the-optimizer">How to Learn the Optimizer</h2>
<p>The first approach we tried was to treat the problem of learning optimizers as a standard supervised learning problem: we simply differentiate the meta-loss with respect to the parameters of the update formula and learn these parameters using standard gradient-based optimization. (We weren’t the only ones to have thought of this; (<a href="https://arxiv.org/abs/1606.04474">Andrychowicz et al., 2016</a>) also used a similar approach.)</p>
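<p>To make the supervised approach concrete, here is a minimal sketch (not the architecture from either paper; all specifics below are our own toy choices). The update formula has a single learnable parameter, a log step size, and we descend on the meta-loss with respect to it. For brevity we estimate the meta-gradient by central finite differences rather than backpropagating through the unrolled trajectory.</p>

```python
import numpy as np

# Hypothetical sketch: the "update formula" is x = x - exp(theta) * grad,
# with one learnable parameter theta (a log step size). The meta-loss is the
# sum of objective values along an unrolled trajectory; its gradient w.r.t.
# theta is estimated by central finite differences as a stand-in for
# backpropagation through the unroll.

def meta_loss(theta, f, grad_f, x0, T=20):
    x, total = x0, 0.0
    for _ in range(T):
        x = x - np.exp(theta) * grad_f(x)  # apply the parametrized update formula
        total += f(x)                      # accumulate the objective value
    return total

def train_optimizer(f, grad_f, x0, theta=0.0, meta_steps=50, lr=0.05, eps=1e-4):
    for _ in range(meta_steps):
        g = (meta_loss(theta + eps, f, grad_f, x0)
             - meta_loss(theta - eps, f, grad_f, x0)) / (2 * eps)
        theta -= lr * g                    # gradient step on the meta-loss
    return theta

# Toy training objective: a simple quadratic f(x) = 0.5 * ||x||^2.
f = lambda x: 0.5 * float(x @ x)
grad_f = lambda x: x
theta = train_optimizer(f, grad_f, x0=np.ones(5), theta=-2.0)
print(np.exp(theta))  # the learned step size
```

On this quadratic, the learned step size moves towards 1, the value that jumps straight to the optimum, which foreshadows the memorization issue discussed above.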
<p>This seemed like a natural approach, but it did not work: despite our best efforts, we could not get any optimizer trained in this manner to generalize to unseen objective functions, even though they were drawn from the same distribution that generated the objective functions used to train the optimizer. On almost all unseen objective functions, the learned optimizer started off reasonably, but quickly diverged after a while. On the other hand, on the training objective functions, it exhibited no such issues and did quite well. Why is this?</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/sl_performance.png" alt="An optimizer trained using supervised learning initially does reasonably well, but diverges in later iterations." />
</p>
<p>It turns out that optimizer learning is not as simple a learning problem as it appears. Standard supervised learning assumes all training examples are independent and identically distributed (i.i.d.); in our setting, the step vector the optimizer takes at any iteration affects the gradients it sees at all subsequent iterations. Furthermore, <em>how</em> the step vector affects the gradient at the subsequent iteration is not known, since this depends on the local geometry of the objective function, which is unknown at meta-test time. Supervised learning cannot operate in this setting; an optimizer trained this way must effectively assume that the local geometry of an unseen objective function is the same as the local geometry of the training objective functions at all iterations.</p>
<p>Consider what happens when an optimizer trained using supervised learning is used on an unseen objective function. It takes a step, and discovers at the next iteration that the gradient is different from what it expected. It then recalls what it did on the training objective functions when it encountered such a gradient, which could have happened in a completely different region of the space, and takes a step accordingly. To its dismay, it finds out that the gradient at the next iteration is even more different from what it expected. This cycle repeats and the error the optimizer makes becomes bigger and bigger over time, leading to rapid divergence.</p>
<p>This phenomenon is known in the literature as the problem of <em>compounding errors</em>. It is known that the total error of a supervised learner scales quadratically in the number of iterations, rather than linearly as would be the case in the i.i.d. setting (<a href="http://proceedings.mlr.press/v9/ross10a.html">Ross and Bagnell, 2010</a>). In essence, an optimizer trained using supervised learning necessarily overfits to the geometry of the training objective functions. One way to solve this problem is to use reinforcement learning.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/rl_performance.png" alt="An optimizer trained using reinforcement learning does not diverge in later iterations." />
</p>
<h2 id="background-on-reinforcement-learning">Background on Reinforcement Learning</h2>
<p>Consider an environment that maintains a state, which evolves in an unknown fashion based on the action that is taken. We have an agent that interacts with this environment, which sequentially selects actions and receives feedback after each action is taken on how good or bad the new state is. The goal of reinforcement learning is to find a way for the agent to pick actions based on the current state that leads to good states on average.</p>
<p>More precisely, a reinforcement learning problem is characterized by the following components:</p>
<ul>
<li>A state space, which is the set of all possible states,</li>
<li>An action space, which is the set of all possible actions,</li>
<li>A cost function, which measures how bad a state is,</li>
<li>A time horizon, which is the number of time steps,</li>
<li>An initial state probability distribution, which specifies how frequently different states occur at the beginning before any action is taken, and</li>
<li>A state transition probability distribution, which specifies how the state changes (probabilistically) after a particular action is taken.</li>
</ul>
<p>While the learning algorithm is aware of what the first five components are, it does not know the last component, i.e., how states evolve based on actions that are chosen. At training time, the learning algorithm is allowed to interact with the environment. Specifically, at each time step, it can choose an action to take based on the current state. Then, based on the action that is selected and the current state, the environment samples a new state, which is observed by the learning algorithm at the subsequent time step. The sequence of sampled states and actions is known as a trajectory. This sampling procedure induces a distribution over trajectories, which depends on the initial state and transition probability distributions and the way the action is selected based on the current state, the latter of which is known as a <em>policy</em>. This policy is often modelled as a neural net that takes in the current state as input and outputs the action. The goal of the learning algorithm is to find a policy such that the expected cumulative cost of states over all time steps is minimized, where the expectation is taken with respect to the distribution over trajectories.</p>
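<p>The six components above can be made concrete on a toy problem. The sketch below is entirely illustrative (the two-state MDP, its costs and its transition probabilities are made up): it evaluates a fixed policy by Monte Carlo sampling, estimating the expected cumulative cost that the learning algorithm seeks to minimize.</p>

```python
import numpy as np

# Toy illustration of the six components on a 2-state, 2-action MDP.
rng = np.random.default_rng(0)
n_states, n_actions, horizon = 2, 2, 10   # state space, action space, time horizon
cost = np.array([1.0, 0.0])               # cost function: state 0 is bad, state 1 is good
p0 = np.array([0.5, 0.5])                 # initial state probability distribution
# State transition distribution: P[s, a] is the distribution over next states
# after taking action a in state s (unknown to the learning algorithm).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])

def expected_cost(policy, n_rollouts=2000):
    """Monte Carlo estimate of the expected cumulative cost of a deterministic policy."""
    total = 0.0
    for _ in range(n_rollouts):
        s = rng.choice(n_states, p=p0)            # sample the initial state
        for _ in range(horizon):
            total += cost[s]
            s = rng.choice(n_states, p=P[s, policy[s]])  # environment samples next state
    return total / n_rollouts

# A policy that always picks action 1, which tends to move towards the cheap state.
print(expected_cost(np.array([1, 1])))
```

Comparing this against the always-action-0 policy shows a much lower expected cumulative cost, which is exactly the quantity a reinforcement learning algorithm optimizes over policies.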
<h2 id="formulation-as-a-reinforcement-learning-problem">Formulation as a Reinforcement Learning Problem</h2>
<p>Recall the <a href="#framework">learning framework</a> we introduced above, where the goal is to find the update formula that minimizes the meta-loss. Intuitively, we think of the agent as an optimization algorithm and the environment as being characterized by the family of objective functions that we’d like to learn an optimizer for. The state consists of the current iterate and some features along the optimization trajectory so far, which could be some statistic of the history of gradients, iterates and objective values. The action is the step vector that is used to update the iterate.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/rl_formulation.png" alt="The state is the iterate and some statistic of the history of gradients, iterates and objective values. The action is the step vector. Under this formulation, a particular policy corresponds to a particular update formula. The cost is the objective value." />
</p>
<p>Under this formulation, the policy is essentially a procedure that computes the action, which is the step vector, from the state, which depends on the current iterate and the history of gradients, iterates and objective values. In other words, a particular policy represents a particular update formula. Hence, learning the policy is equivalent to learning the update formula, and hence the optimization algorithm. The initial state probability distribution is the joint distribution of the initial iterate, gradient and objective value. The state transition probability distribution characterizes what the next state is likely to be given the current state and action. Since the state contains the gradient and objective value, the state transition probability distribution captures how the gradient and objective value are likely to change for any given step vector. In other words, it encodes the likely local geometries of the objective functions of interest. Crucially, the reinforcement learning algorithm does not have direct access to this state transition probability distribution, and therefore the policy it learns avoids overfitting to the geometry of the training objective functions.</p>
<p>We choose a cost function of a state to be the value of the objective function evaluated at the current iterate. Because reinforcement learning minimizes the cumulative cost over all time steps, it essentially minimizes the sum of objective values over all iterations, which is the same as the meta-loss.</p>
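<p>Putting the pieces together, a rollout under this formulation might look like the sketch below (a toy rendition, not the paper's implementation). The state carries the current iterate, the current gradient and a running average of past gradients; the policy maps the state to a step vector; and the accumulated cost is exactly the meta-loss. A hand-written momentum-like policy stands in for the neural net policy that would be learned.</p>

```python
import numpy as np

# Illustrative rollout: state = (iterate, gradient, statistic of past gradients),
# action = step vector, cost = objective value, so the cumulative cost over the
# trajectory equals the meta-loss. All specifics here are our own toy choices.

def rollout(policy, f, grad_f, x0, T=30):
    x = x0
    state = {"x": x, "grad": grad_f(x), "avg_grad": np.zeros_like(x)}
    cumulative_cost = 0.0
    for _ in range(T):
        step = policy(state)                       # action: the step vector
        x = x + step                               # update the iterate
        g = grad_f(x)
        state = {"x": x, "grad": g,
                 "avg_grad": 0.9 * state["avg_grad"] + 0.1 * g}
        cumulative_cost += f(x)                    # cost: the objective value
    return cumulative_cost                         # this is the meta-loss

# A fixed, hand-written policy resembling gradient descent with momentum;
# a learned optimizer would replace this with a neural net.
def momentum_policy(state, lr=0.2):
    return -lr * (state["grad"] + state["avg_grad"])

f = lambda x: float(x @ x)
grad_f = lambda x: 2 * x
print(rollout(momentum_policy, f, grad_f, np.ones(3)))  # cumulative cost (meta-loss)
```

A reinforcement learning algorithm would adjust the policy's parameters to drive this cumulative cost down, without ever needing direct access to the state transition distribution, i.e. the local geometry of the objective functions.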
<h2 id="results">Results</h2>
<p>We trained an optimization algorithm on the problem of training a neural net on MNIST, and tested it on the problems of training different neural nets on the Toronto Faces Dataset (TFD), CIFAR-10 and CIFAR-100. These datasets bear little similarity to each other: MNIST consists of black-and-white images of handwritten digits, TFD consists of grayscale images of human faces, and CIFAR-10/100 consists of colour images of common objects in natural scenes. It is therefore unlikely that a learned optimization algorithm can get away with memorizing, say, the lower layer weights, on MNIST and still do well on TFD and CIFAR-10/100.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/results.png" alt="Our algorithm, which is trained on MNIST, consistently outperforms other optimization algorithms on TFD, CIFAR-10 and CIFAR-100." />
</p>
<p>As shown, the optimization algorithm trained using our approach on MNIST (shown in light red) generalizes to TFD, CIFAR-10 and CIFAR-100 and outperforms other optimization algorithms.</p>
<p>To understand the behaviour of optimization algorithms learned using our approach, we trained an optimization algorithm on two-dimensional logistic regression problems and visualized its trajectory in the space of the parameters. It is worth noting that the behaviours of optimization algorithms in low dimensions and high dimensions may be different, and so the visualizations below may not be indicative of the behaviours of optimization algorithms in high dimensions. However, they provide some useful intuitions about the kinds of behaviour that can be learned.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/lto/traj_visualizations.png" alt="Our algorithm is able to recover after overshooting without oscillating and converge quickly when gradients are small." />
</p>
<p>The plots above show the optimization trajectories followed by various algorithms on two different unseen logistic regression problems. Each arrow represents one iteration of an optimization algorithm. As shown, the algorithm learned using our approach (shown in light red) takes much larger steps compared to other algorithms. In the first example, because the learned algorithm takes large steps, it overshoots after two iterations, but does not oscillate and instead takes smaller steps to recover. In the second example, due to vanishing gradients, traditional optimization algorithms take small steps and therefore converge slowly. On the other hand, the learned algorithm takes much larger steps and converges faster.</p>
<h2 id="papers">Papers</h2>
<p>More details can be found in our papers:</p>
<p><strong>Learning to Optimize</strong><br />
Ke Li, Jitendra Malik<br />
<a href="https://arxiv.org/abs/1606.01885" title="Learning to Optimize"><em>arXiv:1606.01885</em></a>, 2016 and <a href="https://openreview.net/forum?id=ry4Vrt5gl" title="Learning to Optimize"><em>International Conference on Learning Representations (ICLR)</em></a>, 2017</p>
<p><strong>Learning to Optimize Neural Nets</strong><br />
Ke Li, Jitendra Malik<br />
<a href="https://arxiv.org/abs/1703.00441" title="Learning to Optimize Neural Nets"><em>arXiv:1703.00441</em></a>, 2017</p>
<p><em>I’d like to thank Jitendra Malik for his valuable feedback.</em></p>
Tue, 12 Sep 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/09/12/learning-to-optimize-with-rl/
Learning a Multi-View Stereo Machine<p>Consider looking at a photograph of a chair.
We humans have the remarkable capacity of inferring properties about the 3D shape of the chair from this single photograph even if we might not have seen such a chair ever before.
A more representative example of our experience though is being in the same physical space as the chair and accumulating information from various viewpoints around it to build up our hypothesis of the chair’s 3D shape.
How do we solve this complex 2D to 3D inference task? What kind of cues do we use?<br />
How do we seamlessly integrate information from just a few views to build up a holistic 3D model of the scene?</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/problem_fig.png" alt="Problem Statement" /></p>
<p>A vast body of work in computer vision has been devoted to developing algorithms which leverage various cues from images that enable this task of 3D reconstruction.
They range from monocular <a href="http://www.eruptingmind.com/depth-perception-cues-other-forms-of-perception/">cues</a> such as shading, linear perspective, size constancy etc. to binocular and even multi-view stereopsis.
The dominant paradigm for integrating multiple views has been to leverage stereopsis, i.e. if a point in the 3D world is viewed from multiple viewpoints, its location in 3D can be determined by triangulating its projections in the respective views.
This family of algorithms has led to work on Structure from Motion (SfM) and Multi-view Stereo (MVS) and have been used to produce <a href="https://grail.cs.washington.edu/rome/">city-scale</a> <a href="http://www.di.ens.fr/pmvs/">3D models</a> and enable rich visual experiences such as <a href="http://mashable.com/2017/06/28/apple-maps-flyover/">3D flyover</a> <a href="https://vr.google.com/earth/">maps</a>.
With the advent of deep neural networks and their immense power in modelling visual data, the focus has recently shifted to modelling monocular cues implicitly with a CNN and predicting 3D from a single image as <a href="http://www.cs.nyu.edu/~deigen/dnl/">depth</a>/<a href="http://www.cs.cmu.edu/~xiaolonw/deep3d.html">surface orientation</a> maps or 3D <a href="http://3d-r2n2.stanford.edu/">voxel</a> <a href="https://rohitgirdhar.github.io/GenerativePredictableVoxels/">grids</a>.</p>
<p>In our <a href="https://arxiv.org/abs/1708.05375">recent work</a>, we tried to unify these paradigms of single and multi-view 3D reconstruction.
We proposed a novel system called a Learnt Stereo Machine (LSM) that can leverage monocular/semantic cues for single-view 3D reconstruction while also being able to integrate information from multiple viewpoints using stereopsis - all within a single end-to-end learnt deep neural network.</p>
<!--more-->
<h2 id="learnt-stereo-machines">Learnt Stereo Machines</h2>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/Network.png" alt="Learnt Stereo Machine" />
LSMs are designed to solve the task of multi-view stereo. Given a set of images with <em>known camera poses</em>, they produce a 3D model for the underlying scene - specifically either a voxel occupancy grid or a dense point cloud of the scene in the form of a pixel-wise depth map per input view. While designing LSMs, we drew inspiration from classic works on MVS. These methods first <em>extract features</em> from the images for finding correspondences between them. By comparing the features between images, a matching cost volume is formed. These (typically noisy) matching costs are then <em>filtered/regularized</em> by aggregating information across multiple scales and incorporating priors on shape such as local smoothness, piecewise planarity etc. The final filtered cost volume is then decoded into the desired shape representation such as a 3D volume/surface/disparity maps.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/proj_gif.gif" style="width:45%; margin-left:4%; border-right:solid; border-width:1px; border-color:rgba(0,0,0,0.42);" />
<img src="http://bair.berkeley.edu/blog/assets/unified-3d/unproj_gif.gif" style="width:45%; margin-right:4%" /></p>
<p>The key ingredients here are a differentiable feature <strong>projection</strong> and <strong>unprojection</strong> modules which allow LSMs to move between 2D image and 3D world spaces in a geometrically consistent manner. The unprojection operation places features from a 2D image (extracted by a feedforward CNN) into a 3D world grid such that features from multiple such images align in the 3D grid according to epipolar constraints. This simplifies feature matching as now a search along an epipolar line to compute matching costs reduces to just looking up all features which map to a given location in the 3D world grid. This feature matching is modeled using a 3D recurrent unit which performs sequential matching of the unprojected grids while maintaining a running estimate of the matching scores. Once we filter the local matching cost volume using a 3D CNN, we either decode it directly into a 3D voxel occupancy grid for the voxel prediction task or project it back into 2D image space using a differentiable projection operation. The projection operation can be thought of as the inverse of the unprojection operation where we take a 3D feature grid and sample features along viewing rays at equal depth intervals to place them in a 2D feature map. These projected feature maps are then decoded into per view depth maps by a series of convolution operations. As every step in our network is completely differentiable, we can train the system end-to-end with depth maps or voxel grids as supervision!</p>
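<p>To illustrate the unprojection operation, here is a heavily simplified sketch (our own toy version: a single feature channel and nearest-neighbour lookup instead of the differentiable bilinear sampling an end-to-end network would need). Each voxel centre is projected through a pinhole camera and inherits the feature at the pixel it lands on, which is how features from multiple views come to align in the 3D grid according to epipolar constraints.</p>

```python
import numpy as np

# Minimal unprojection sketch: place features from a 2D map into a 3D grid by
# projecting each voxel centre through a pinhole camera K [R | t] and sampling
# the feature at the resulting pixel (nearest neighbour, for simplicity).

def unproject(feat_2d, K, R, t, grid_pts):
    """feat_2d: (H, W) feature map; grid_pts: (N, 3) voxel centres in world space."""
    cam = grid_pts @ R.T + t                 # world coordinates to camera coordinates
    pix = cam @ K.T                          # camera coordinates to homogeneous pixels
    uv = pix[:, :2] / pix[:, 2:3]            # perspective divide
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, feat_2d.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, feat_2d.shape[0] - 1)
    valid = cam[:, 2] > 0                    # keep only voxels in front of the camera
    return np.where(valid, feat_2d[v, u], 0.0)

# Toy usage: a 4x4 feature map, identity pose, camera looking down +z.
K = np.array([[2.0, 0, 2.0], [0, 2.0, 2.0], [0, 0, 1.0]])
R, t = np.eye(3), np.zeros(3)
grid = np.array([[0.0, 0.0, 1.0], [0.5, 0.0, 1.0]])
feats = np.arange(16.0).reshape(4, 4)
print(unproject(feats, K, R, t, grid))       # features gathered at the two voxels
```

The projection operation runs the same geometry in reverse: sample the 3D feature grid along each viewing ray at fixed depth intervals to rebuild a 2D feature map.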
<p>As LSMs can predict 3D from a variable number of images (even just a single image), they can choose to either rely heavily on multi-view stereopsis cues or single-view semantic cues depending on the instance and number of views at hand. LSMs can produce both coarse full 3D voxel grids as well as dense depth maps thus unifying the two major paradigms in 3D prediction using deep neural networks.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/voxel_results.png" alt="voxel" /></p>
<p>In our report, we showed drastic improvements on voxel based multi-view 3D object reconstruction when compared to the <a href="http://3d-r2n2.stanford.edu/">previous state-of-the-art</a> which integrates multiple views using a recurrent neural network. We also demonstrated out-of-category generalization, i.e. LSMs can reconstruct cars even if they are only trained on images of aeroplanes and chairs. This is only possible due to our geometric treatment of the task.
We also show dense reconstructions from a few views - far fewer than classical MVS systems require.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/unified-3d/depth_results.png" alt="depth" /></p>
<h2 id="whats-next">What’s Next?</h2>
<p>LSMs are a step towards unifying a number of paradigms in 3D reconstruction - single and multi-view, semantic and geometric reconstruction, coarse and dense predictions. A joint treatment of these problems helps us learn models that are more robust and accurate while also being simpler to deploy than pipelined solutions.</p>
<p>These are exciting times in 3D computer vision. Predicting <a href="http://bair.berkeley.edu/blog/2017/08/23/high-quality-3d-obj-reconstruction/">high resolution geometry</a> with deep networks is now possible. We can even train for 3D prediction <a href="http://bair.berkeley.edu/blog/2017/07/11/confluence-of-geometry-and-learning/">without explicit 3D</a> supervision. We can’t wait to use these techniques/ideas within LSMs. It remains to be seen how lifting images from 2D to 3D and reasoning about them in metric world space would help other downstream tasks such as navigation and grasping but it sure will be an interesting journey! We will release the code for LSMs soon for easy experimentation and reproducibility. Feel free to use it and leave comments!</p>
<hr />
<p>We would like to thank Saurabh Gupta, Shubham Tulsiani and David Fouhey.</p>
<p><strong>This blog post is based on the following report</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/1708.05375"><em>Learning a Multi-view Stereo Machine</em></a><br />
<a href="https://people.eecs.berkeley.edu/~akar/">Abhishek Kar</a>, <a href="https://people.eecs.berkeley.edu/~chaene/">Christian Häne</a>, <a href="https://people.eecs.berkeley.edu/~malik/">Jitendra Malik</a>, NIPS, 2017</li>
</ul>
Tue, 05 Sep 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/09/05/unified-3d/
How to Escape Saddle Points Efficiently<p><em>This post was initially published on <a href="http://www.offconvex.org/2017/07/19/saddle-efficiency/">Off the Convex Path</a>. It is reposted here with authors’ permission.</em></p>
<p>A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s</a> and <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s</a> blog posts), the critical open problem is one of <strong>efficiency</strong> — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe <a href="https://arxiv.org/abs/1703.00887">our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade</a>, that provides the first provable <em>positive</em> answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!</p>
<!--more-->
<h2 id="perturbing-gradient-descent">Perturbing Gradient Descent</h2>
<p>We are in the realm of classical gradient descent (GD) — given a function $f:\mathbb{R}^d \to \mathbb{R}$ we aim to minimize the function by moving in the direction of the negative gradient:</p>
<script type="math/tex; mode=display">x_{t+1} = x_t - \eta \nabla f(x_t),</script>
<p>where $x_t$ are the iterates and $\eta$ is the step size. GD is well understood theoretically in the case of convex optimization, but the general case of nonconvex optimization has been far less studied. We know that GD converges quickly to the neighborhood of stationary points (points where $\nabla f(x) = 0$) in the nonconvex setting, but these stationary points may be local minima or, unhelpfully, local maxima or saddle points.</p>
<p>Clearly GD will never move away from a stationary point if started there (even a local maximum); thus, to provide general guarantees, it is necessary to modify GD slightly to incorporate some degree of randomness. Two simple methods have been studied in the literature:</p>
<ol>
<li>
<p><strong>Intermittent Perturbations</strong>: <a href="http://arxiv.org/abs/1503.02101">Ge, Huang, Jin and Yuan 2015</a> considered adding occasional random perturbations to GD, and were able to provide the first <em>polynomial time</em> guarantee for GD to escape saddle points. (See also <a href="http://www.offconvex.org/2016/03/22/saddlepoints/">Rong Ge’s post</a> )</p>
</li>
<li>
<p><strong>Random Initialization</strong>: <a href="http://arxiv.org/abs/1602.04915">Lee et al. 2016</a> showed that with only random initialization, GD provably avoids saddle points asymptotically (i.e., as the number of steps goes to infinity). (see also <a href="http://www.offconvex.org/2016/03/24/saddles-again/">Ben Recht’s post</a>)</p>
</li>
</ol>
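<p>A small numerical illustration of why some randomness is needed (a toy demo, not the analysis from the paper): on $f(x, y) = x^2 - y^2$, plain GD initialized exactly at the saddle at the origin never moves, while an arbitrarily small random perturbation lets the iterate escape along the negative-curvature direction.</p>

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle (a stationary point) at the origin.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])

p = np.zeros(2)                # start exactly at the saddle
for _ in range(100):
    p = p - 0.1 * grad(p)      # the gradient is exactly zero, so p never moves
print(p)                       # still [0, 0]

# With a tiny random perturbation, the iterate drifts away along the
# negative-curvature (y) direction and escapes.
rng = np.random.default_rng(0)
q = np.zeros(2) + 1e-6 * rng.standard_normal(2)
for _ in range(100):
    q = q - 0.1 * grad(q)
print(np.abs(q[1]) > 1.0)      # prints True
```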
<p>Asymptotic — and even polynomial-time — results are important for the general theory, but they stop short of explaining the success of gradient-based algorithms in practical nonconvex problems. And they fail to provide reassurance that runs of GD can be trusted — that we won’t find ourselves in a situation in which the learning curve flattens out for an indefinite amount of time, with the user having no way of knowing that the asymptotics have not yet kicked in. Lastly, they fail to provide reassurance that GD has the kind of favorable properties in high dimensions that it is known to have for convex problems.</p>
<p>One reasonable approach to this issue is to consider second-order (Hessian-based) algorithms. Although these algorithms are generally (far) more expensive per iteration than GD, and can be more complicated to implement, they do provide the kind of geometric information around saddle points that allows for efficient escape. Accordingly, a reasonable understanding of Hessian-based algorithms has emerged in the literature, and positive efficiency results have been obtained.</p>
<p><strong><em>Is GD also efficient? Or is the Hessian necessary for fast escape of saddle points?</em></strong></p>
<p>For the first question, a negative result emerges if one considers the random initialization strategy discussed above. Indeed, this approach is provably <em>inefficient</em> in general, taking exponential time to escape saddle points in the worst case (see the “On the Necessity of Adding Perturbations” section).</p>
<p>Somewhat surprisingly, it turns out that we obtain a rather different — and <em>positive</em> — result if we consider the perturbation strategy. To be able to state this result, let us be clear on the algorithm that we analyze:</p>
<blockquote>
<p><strong>Perturbed gradient descent (PGD)</strong></p>
<ol>
<li><strong>for</strong> $~t = 1, 2, \ldots ~$ <strong>do</strong></li>
<li>$\quad\quad x_{t} \leftarrow x_{t-1} - \eta \nabla f (x_{t-1})$</li>
<li>$\quad\quad$ <strong>if</strong> $~$<em>perturbation condition holds</em>$~$ <strong>then</strong></li>
<li>$\quad\quad\quad\quad x_t \leftarrow x_t + \xi_t$</li>
</ol>
</blockquote>
<p>Here the perturbation $\xi_t$ is sampled uniformly from a ball centered at zero with a suitably small radius, and is added to the iterate when the gradient is suitably small. These particular choices are made for analytic convenience; we do not believe that uniform noise is necessary, nor do we believe it essential that noise be added only when the gradient is small.</p>
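<p>A minimal sketch of PGD on a toy function (our own choices of function, step size, perturbation radius and perturbation condition; for brevity the noise is uniform in a small box rather than a ball): $f(x, y) = x^2 - y^2 + \frac{1}{2} y^4$ has a strict saddle at the origin and minima at $(0, \pm 1)$, and PGD started at the saddle reaches one of the minima.</p>

```python
import numpy as np

# Literal sketch of the PGD pseudocode on f(x, y) = x^2 - y^2 + 0.5*y^4,
# which has a strict saddle at the origin and minima at (0, +/-1).
rng = np.random.default_rng(1)
grad = lambda p: np.array([2 * p[0], -2 * p[1] + 2 * p[1] ** 3])

p, eta, radius = np.zeros(2), 0.05, 1e-3
for _ in range(2000):
    p = p - eta * grad(p)                        # gradient step
    if np.linalg.norm(grad(p)) < 1e-4:           # perturbation condition: small gradient
        p = p + radius * rng.uniform(-1, 1, 2)   # add a small random perturbation
print(p)   # ends up close to one of the minima (0, +/-1)
```

Started exactly at the saddle, plain GD would stay put forever; the occasional perturbations are what push the iterate onto the negative-curvature direction, after which GD carries it to a minimum.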
<h2 id="strict-saddle-and-second-order-stationary-points">Strict-Saddle and Second-order Stationary Points</h2>
<p>We define <em>saddle points</em> in this post to include both classical saddle points as well as local maxima. They are stationary points which are locally maximized along <em>at least one direction</em>. Saddle points and local minima can be categorized according to the minimum eigenvalue of the Hessian:</p>
<script type="math/tex; mode=display">% <![CDATA[
\lambda_{\min}(\nabla^2 f(x)) \begin{cases}
> 0 \quad\quad \text{local minimum} \\
= 0 \quad\quad \text{local minimum or saddle point} \\
< 0 \quad\quad \text{saddle point}
\end{cases} %]]></script>
<p>We further call the saddle points in the last category, where $\lambda_{\min}(\nabla^2 f(x)) < 0$, <strong>strict saddle points</strong>.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/saddle_eff/strictsaddle.png" class="stretch-center" /></p>
<p>While non-strict saddle points can be flat in the valley, strict saddle points require that there is <em>at least one direction</em> along which the curvature is strictly negative. The presence of such a direction gives a gradient-based algorithm the possibility of escaping the saddle point. In general, distinguishing local minima and non-strict saddle points is <em>NP-hard</em>; therefore, we — and previous authors — focus on escaping <em>strict</em> saddle points.</p>
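<p>The classification above is easy to check numerically. The sketch below (illustrative only) builds a finite-difference Hessian and classifies a stationary point by its minimum eigenvalue, with a small tolerance standing in for exact zero; an analytic Hessian would do just as well.</p>

```python
import numpy as np

# Classify a stationary point by the minimum eigenvalue of a central
# finite-difference Hessian.

def hessian(f, x, eps=1e-5):
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(x + eps*e_i + eps*e_j) - f(x + eps*e_i - eps*e_j)
                       - f(x - eps*e_i + eps*e_j) + f(x - eps*e_i - eps*e_j)) / (4 * eps**2)
    return H

def classify(f, x, tol=1e-3):
    lam_min = np.linalg.eigvalsh(hessian(f, x)).min()
    if lam_min > tol:
        return "local minimum"
    if lam_min < -tol:
        return "strict saddle point"
    return "local minimum or (non-strict) saddle point"

f = lambda p: p[0]**2 - p[1]**2        # the origin is a strict saddle
print(classify(f, np.zeros(2)))        # prints: strict saddle point
```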
<p>Formally, we make the following two standard assumptions regarding smoothness.</p>
<blockquote>
<p><strong>Assumption 1</strong>: $f$ is $\ell$-gradient-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2, \|\nabla f(x_1) - \nabla f(x_2)\| \le \ell \|x_1 - x_2\|$. <br />
$~$<br />
<strong>Assumption 2</strong>: $f$ is $\rho$-Hessian-Lipschitz, i.e. <br />
$\quad\quad\quad\quad \forall x_1, x_2$, $\|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho \|x_1 - x_2\|$.</p>
</blockquote>
<p>Classical theory studies convergence to a first-order stationary point, $\nabla f(x) = 0$, by bounding the number of iterations needed to find an <strong>$\epsilon$-first-order stationary point</strong>, $\|\nabla f(x)\| \le \epsilon$. Analogously, we formulate the speed of escape from strict saddle points, and the ensuing convergence to a second-order stationary point, $\nabla f(x) = 0, \lambda_{\min}(\nabla^2 f(x)) \ge 0$, with an $\epsilon$-version of the definition:</p>
<blockquote>
<p><strong>Definition</strong>: A point $x$ is an <strong>$\epsilon$-second-order stationary point</strong> if:<br />
$\quad\quad\quad\quad \|\nabla f(x)\|\le \epsilon$, and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$.</p>
</blockquote>
<p>In this definition, $\rho$ is the Hessian Lipschitz constant introduced above. This scaling follows the convention of <a href="http://rd.springer.com/article/10.1007%2Fs10107-006-0706-8">Nesterov and Polyak 2006</a>.</p>
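<p>Transcribed directly, the definition is a pair of checks on the gradient and the Hessian spectrum. The sketch below is purely illustrative: in high dimensions one cannot afford to form the full Hessian, which is exactly why a gradient-only method such as PGD is interesting.</p>

```python
import numpy as np

def is_eps_second_order_stationary(g, hessian, eps, rho):
    """Check the definition above: ||grad f(x)|| <= eps and
    lambda_min(Hessian at x) >= -sqrt(rho * eps)."""
    lam_min = np.linalg.eigvalsh(hessian)[0]   # eigenvalues in ascending order
    return bool(np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps))
```
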
<h3 id="applications">Applications</h3>
<p>In a wide range of practical nonconvex problems it has been proved that <strong>all saddle points are strict</strong> — such problems include, but are not limited to, principal components analysis, canonical correlation analysis,
<a href="http://arxiv.org/abs/1503.02101">orthogonal tensor decomposition</a>,
<a href="http://arxiv.org/abs/1602.06664">phase retrieval</a>,
<a href="http://arxiv.org/abs/1504.06785">dictionary learning</a>,
<!-- matrix factorization, -->
<a href="http://arxiv.org/abs/1605.07221">matrix sensing</a>,
<a href="http://arxiv.org/abs/1605.07272">matrix completion</a>,
and <a href="http://arxiv.org/abs/1704.00708">other nonconvex low-rank problems</a>.</p>
<p>Furthermore, in all of these nonconvex problems, it also turns out that <strong>all local minima are global minima</strong>. Thus, in these cases, any general efficient algorithm for finding $\epsilon$-second-order stationary points immediately becomes an efficient algorithm for solving those nonconvex problems with global guarantees.</p>
<h2 id="escaping-saddle-point-with-negligible-overhead">Escaping Saddle Point with Negligible Overhead</h2>
<p>In the classical case of first-order stationary points, GD is known to have very favorable theoretical properties:</p>
<blockquote>
<p><strong>Theorem (<a href="http://rd.springer.com/book/10.1007%2F978-1-4419-8853-9">Nesterov 1998</a>)</strong>: If Assumption 1 holds, then GD, with $\eta = 1/\ell$, finds an $\epsilon$-<strong>first</strong>-order stationary point in $2\ell (f(x_0) - f^\star)/\epsilon^2$ iterations.</p>
</blockquote>
<p>In this theorem, $x_0$ is the initial point and $f^\star$ is the function value of the global minimum. The theorem says that for any gradient-Lipschitz function, a stationary point can be found by GD in $O(1/\epsilon^2)$ steps, with no explicit dependence on $d$. This is called “dimension-free optimization” in the literature; of course the cost of a gradient computation is $O(d)$, and thus the overall runtime of GD scales as $O(d)$. The linear scaling in $d$ is especially important for modern high-dimensional nonconvex problems such as deep learning.</p>
<p>We now wish to address the corresponding problem for second-order stationary points.
What is the best we can hope for? Can we also achieve</p>
<ol>
<li>A dimension-free number of iterations;</li>
<li>An $O(1/\epsilon^2)$ convergence rate;</li>
<li>The same dependence on $\ell$ and $(f(x_0) - f^\star)$ as in (Nesterov 1998)?</li>
</ol>
<p>Rather surprisingly, the answer is <em>Yes</em> to all three questions (up to small log factors).</p>
<blockquote>
<p><strong>Main Theorem</strong>: If Assumptions 1 and 2 hold, then PGD, with $\eta = O(1/\ell)$, finds an $\epsilon$-<strong>second</strong>-order stationary point in $\tilde{O}(\ell (f(x_0) - f^\star)/\epsilon^2)$ iterations with high probability.</p>
</blockquote>
<p>Here $\tilde{O}(\cdot)$ hides only logarithmic factors; indeed, the dimension dependence in our result is only $\log^4(d)$. The theorem thus asserts that a perturbed form of GD, under an additional Hessian-Lipschitz condition, <strong><em>converges to a second-order-stationary point in almost the same time required for GD to converge to a first-order-stationary point.</em></strong> In this sense, we claim that PGD can escape strict saddle points almost for free.</p>
<p>We turn to a discussion of some of the intuitions underlying these results.</p>
<h3 id="why-do-polylogd-iterations-suffice">Why do polylog(d) iterations suffice?</h3>
<p>Our strict-saddle assumption means that, in the worst case, there is only one direction among $d$ dimensions along which we can escape. A naive search for the descent direction should intuitively take at least $\text{poly}(d)$ iterations, so why does $\text{polylog}(d)$ suffice?</p>
<p>Consider a simple case in which we assume that the function is quadratic in the neighborhood of the saddle point. That is, let the objective function be $f(x) = \frac{1}{2}x^\top H x$, with a saddle point at zero and constant Hessian $H = \text{diag}(-1, 1, \cdots, 1)$. In this case, only the first direction is an escape direction (with negative eigenvalue $-1$).</p>
<p>It is straightforward to work out the general form of the iterates in this case:</p>
<script type="math/tex; mode=display">x_t = x_{t-1} - \eta \nabla f(x_{t-1}) = (I - \eta H)x_{t-1} = (I - \eta H)^t x_0.</script>
<p>Assume that we start at the saddle point at zero, then add a perturbation so that $x_0$ is sampled uniformly from a ball $\mathcal{B}_0(1)$ centered at zero with radius one.
The decrease in the function value can be expressed as:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = \frac{1}{2}x_t^\top H x_t = \frac{1}{2}x_0^\top (I - \eta H)^t H (I - \eta H)^t x_0.</script>
<p>Set the step size $\eta$ to be $1/2$, let $\lambda_i$ denote the $i$-th eigenvalue of the Hessian $H$ and let $\alpha_i = e_i^\top x_0$ denote the component of the initial point $x_0$ in the $i$-th direction. We have $\sum_{i=1}^d \alpha_i^2 = \|x_0\|^2 = 1$, thus:</p>
<script type="math/tex; mode=display">f(x_t) - f(0) = \frac{1}{2}\sum_{i=1}^d \lambda_i (1-\eta\lambda_i)^{2t} \alpha_i^2 \le -\frac{1}{2}\cdot 1.5^{2t} \alpha_1^2 + \frac{1}{2}\cdot 0.5^{2t}.</script>
<p>A simple probability argument shows that sampling uniformly in $\mathcal{B}_0(1)$ will, with high probability, yield a squared component in the first direction of at least $\Omega(1/d)$; that is, $\alpha^2_1 = \Omega(1/d)$. Substituting this into the above equation, we see that it takes at most $O(\log d)$ steps for the function value to decrease by a constant amount.</p>
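<p>This $O(\log d)$ behavior is easy to check numerically. The sketch below takes $f(x) = \frac{1}{2}x^\top H x$, so that $\nabla f(x) = Hx$ and a gradient step reproduces the iterate $(I - \eta H)x$, starts from a uniform perturbation in the unit ball, and counts the steps until the function value has dropped by a constant; the escape threshold and iteration cap are arbitrary choices.</p>

```python
import numpy as np

def steps_to_escape(d, rng, eta=0.5, max_iters=10_000):
    """GD on the quadratic saddle, taking f(x) = x'Hx/2 with
    H = diag(-1, 1, ..., 1) so that a gradient step is x <- (I - eta*H)x.
    Starts from a uniform perturbation in the unit ball and returns the
    number of steps until f has decreased by a constant (1/2 here)."""
    h = np.ones(d)
    h[0] = -1.0                                      # diagonal of H
    x = rng.normal(size=d)
    x *= rng.uniform() ** (1 / d) / np.linalg.norm(x)    # uniform in unit ball
    for t in range(1, max_iters + 1):
        x = x - eta * h * x                          # x_t = (I - eta*H) x_{t-1}
        if 0.5 * h @ (x * x) <= -0.5:                # f(x_t) - f(0) <= -1/2
            return t
    return None
```

<p>Across $d = 10$ and $d = 10^5$ the escape time changes by only a handful of steps, consistent with the logarithmic bound.</p>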
<h3 id="pancake-shape-stuck-region-for-general-hessian">Pancake-shape stuck region for general Hessian</h3>
<p>We can conclude that in the case of a constant Hessian, only when the perturbation $x_0$ lands in the set $\{x | ~ |e_1^\top x|^2 \le O(1/d)\}$ $\cap \mathcal{B}_0 (1)$ does it take a very long time to escape the saddle point. We call this set the <strong>stuck region</strong>; in this case it is a flat disk. In general, when the Hessian is no longer constant, the stuck region becomes a non-flat pancake, depicted as the green object in the left graph below. In general this region does not have an analytic expression.</p>
<p>Earlier attempts to analyze the dynamics around saddle points tried to approximate the stuck region by a flat set. This results in a requirement for an extremely small step size, and a correspondingly very large runtime complexity. Our sharp rate depends on a key observation — <em>although we don’t know the shape of the stuck region, we know it is very thin</em>.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/saddle_eff/flow.png" class="stretch-center" /></p>
<p>In order to characterize the “thinness” of this pancake, we studied pairs of hypothetical perturbation points $w, u$ separated by $O(1/\sqrt{d})$ along an escaping direction. We claim that if we run GD starting at $w$ and $u$, at least one of the resulting trajectories will escape the saddle point very quickly. This implies that the thickness of the stuck region can be at most $O(1/\sqrt{d})$, so a random perturbation has very little chance to land in the stuck region.</p>
<h2 id="on-the-necessity-of-adding-perturbations">On the Necessity of Adding Perturbations</h2>
<p>We have discussed two possible ways to modify the standard gradient descent algorithm, the first by adding intermittent perturbations, and the second by relying on random initialization. Although the latter exhibits asymptotic convergence, it does not yield efficient convergence in general; in recent <a href="http://arxiv.org/abs/1705.10412">joint work with Simon Du, Jason Lee, Barnabas Poczos, and Aarti Singh</a>, we have shown that even with fairly natural random initialization schemes and non-pathological functions, <strong>GD with only random initialization can be significantly slowed by saddle points, taking exponential time to escape. The behavior of PGD is strikingly different — it can generically escape saddle points in polynomial time.</strong></p>
<p>To establish this result, we considered random initializations from a very general class including Gaussians and uniform distributions over the hypercube, and we constructed a smooth objective function that satisfies both Assumptions 1 and 2. This function is constructed such that, even with random initialization, with high probability both GD and PGD have to travel sequentially in the vicinity of $d$ strict saddle points before reaching a local minimum. All strict saddle points have only one direction of escape. (See the left graph for the case of $d=2$).</p>
<p><img src="http://bair.berkeley.edu/blog/assets/saddle_eff/necesperturbation.png" class="stretch-center" /></p>
<p>When GD travels in the vicinity of a sequence of saddle points, it can get closer and closer to the later saddle points, and thereby take longer and longer to escape. Indeed, the time to escape the $i$th saddle point scales as $e^{i}$. On the other hand, PGD is always able to escape any saddle point in a small number of steps independent of the history. This phenomenon is confirmed by our experiments; see, for example, an experiment with $d=10$ in the right graph.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post, we have shown that a perturbed form of gradient descent can converge to a second-order-stationary point at almost the same rate as standard gradient descent converges to a first-order-stationary point. This implies that Hessian information is not necessary to escape saddle points efficiently, and helps to explain why basic gradient-based algorithms such as GD (and SGD) work surprisingly well in the nonconvex setting. This new line of sharp convergence results can be directly applied to nonconvex problems such as matrix sensing/completion to establish efficient global convergence rates.</p>
<p>There are of course still many open problems in general nonconvex optimization. To name a few: will adding momentum improve the convergence rate to a second-order stationary point? What types of local minima are tractable, and are there useful structural assumptions we can impose so as to avoid poor local minima efficiently? We are making slow but steady progress on nonconvex optimization, and there is hope that at some point we will transition from “black art” to “science”.</p>
Thu, 31 Aug 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/High Quality 3D Object Reconstruction from a Single Color Image<p>Digitally reconstructing 3D geometry from images is a core problem in computer vision. There are various applications, such as movie productions, content generation for video games, virtual and augmented reality, 3D printing and many more. The task discussed in this blog post is reconstructing high quality 3D geometry from a single color image of an object as shown in the figure below.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/hsp/image_0.png" width="600" />
</p>
<p>Humans have the ability to effortlessly reason about the shapes of objects and scenes even if we only see a single image. Note that the binocular arrangement of our eyes allows us to perceive depth, but it is not required to understand 3D geometry. Even if we only see a photograph of an object we have a good understanding of its shape. Moreover, we are also able to reason about the unseen parts of objects such as the back, which is an important ability for grasping objects. The question which immediately arises is: how are humans able to reason about geometry from a single image? And, in terms of artificial intelligence: how can we teach machines this ability?</p>
<!--more-->
<h1 id="shape-spaces">Shape Spaces</h1>
<p>The basic principle used to reconstruct geometry from ambiguous input is the fact that shapes are not arbitrary, and hence some shapes are likely, and some very unlikely. In general, surfaces tend to be smooth. In man-made environments they are often piece-wise planar. For objects, high-level rules apply. For example, airplanes very commonly have a fuselage with two main wings attached on each side and a vertical stabilizer at the back. Humans are able to acquire this knowledge by observing the world with their eyes and interacting with the world using their hands. In computer vision, the fact that shapes are not arbitrary allows us to describe all possible shapes of an object class or multiple object classes as a low dimensional shape space, which is acquired from large collections of example shapes.</p>
<h2 id="voxel-prediction-using-cnns">Voxel Prediction Using CNNs</h2>
<p>One of the most recent lines of work for 3D reconstruction [<a href="https://arxiv.org/abs/1604.00449">Choy et al. ECCV 2016</a>, <a href="https://arxiv.org/abs/1603.08637">Girdhar et al. ECCV 2016</a>] utilizes convolutional neural networks (CNNs) to predict the shape of objects as a 3D occupancy volume. The 3D output volume is subdivided into volume elements, called voxels, and each voxel is assigned to be either occupied or free space, i.e. the interior or exterior of the object, respectively. The input is commonly given as a single color image which depicts the object, and the CNN predicts an occupancy volume using an up-convolutional decoder architecture. The network is trained end-to-end and supervised with known ground truth occupancy volumes which are acquired from synthetic CAD model datasets. Using this 3D representation and CNNs, models which are able to fit a variety of object classes can be learned.</p>
<h1 id="hierarchical-surface-prediction">Hierarchical Surface Prediction</h1>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_1.png" class="stretch-center" /></p>
<p>The main shortcoming with predicting occupancy volumes using a CNN is that the output space is three dimensional and hence has cubic growth with respect to increased resolution. This problem prevents the works mentioned above from predicting high quality geometry and restricts them to coarse resolution voxel grids, e.g. 32<sup>3</sup> (cf. figure above). In our work we argue that this is an unnecessary restriction given that surfaces are actually only two dimensional. We exploit the two dimensional nature of surfaces by hierarchically predicting fine resolution voxels only where a surface is expected judging from the low resolution prediction. The basic idea is closely related to octree representations which are often used in multi-view stereo and depth map fusion to represent high resolution geometry.</p>
<h2 id="method">Method</h2>
<p>The basic 3D prediction pipeline takes a color image as input which first gets encoded into a low dimensional representation using a convolutional encoder. This low dimensional representation then gets decoded into a 3D occupancy volume. The main idea of our method, called hierarchical surface prediction (HSP), is to start decoding by predicting low resolution voxels. However, in contrast to the standard approach where each voxel would get classified into either free or occupied space, we use three classes: free space, occupied space, and boundary. This allows us to analyze the output at low resolution and predict at higher resolution only those parts of the volume where there is evidence that they contain the surface. By iterating the refinement procedure we hierarchically predict high resolution voxel grids (see figure below). For more details about the method we refer the reader to our tech report [<a href="https://arxiv.org/abs/1704.00710">Häne et al. arXiv 2017</a>].</p>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_2.png" class="stretch-center" /></p>
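<p>The coarse-to-fine idea can be illustrated with a 2D toy. This is a sketch only: squares stand in for voxel blocks, an analytic disk stands in for the ground-truth shape, and the three labels are computed from that ground truth here, whereas HSP predicts them with a CNN.</p>

```python
def occupied(x, y):
    """Hypothetical ground-truth shape: a disk of radius 0.5."""
    return x * x + y * y <= 0.25

def refine(cells, size):
    """Keep only boundary cells (mixed corner occupancy) and subdivide
    each into four children of half the size."""
    children = []
    for x, y in cells:
        corners = [occupied(x + dx, y + dy)
                   for dx in (0.0, size) for dy in (0.0, size)]
        if any(corners) and not all(corners):        # boundary cell
            h = size / 2
            children += [(x, y), (x + h, y), (x, y + h), (x + h, y + h)]
    return children, size / 2

# Start from a coarse 4x4 grid over [-1, 1]^2 and refine twice: the cells
# tracking the surface grow far slower than the full 16x16 grid would.
cells = [(-1 + i * 0.5, -1 + j * 0.5) for i in range(4) for j in range(4)]
size = 0.5
for _ in range(2):
    cells, size = refine(cells, size)
```

<p>After two refinements only a thin band of cells around the disk's boundary remains active, which is the two-dimensional analogue of predicting fine voxels only near the surface.</p>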
<h2 id="experiments">Experiments</h2>
<p>Our experiments are mainly conducted on the synthetic <a href="https://shapenet.org/">ShapeNet</a> dataset [<a href="https://arxiv.org/abs/1512.03012">Chang et al. arXiv 2015</a>]. The main task we studied is predicting high resolution geometry from a single color image. We compare our method to two baselines which we call low resolution hard (LR hard) and low resolution soft (LR soft). These baselines predict at the same coarse resolution of 32<sup>3</sup> but differ in how the training data is generated. The LR hard baseline uses binary assignments for the voxels: a voxel is labeled as occupied if at least one of the corresponding high resolution voxels is occupied. The LR soft baseline uses fractional assignments reflecting the percentage of occupied voxels in the corresponding high resolution voxels. Our method, HSP, predicts at a resolution of 256<sup>3</sup>. The results in the figures below show the benefits in terms of surface quality and completeness of the high resolution prediction compared to the low resolution baselines. Quantitative results and more experiments can be found in our tech report.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_3.png" class="stretch-center" /></p>
<p><img src="http://bair.berkeley.edu/blog/assets/hsp/image_4.png" class="stretch-center" /></p>
<p>I would like to thank Shubham Tulsiani and Jitendra Malik for their valuable feedback.</p>
<p><strong>This blog post is based on the tech report:</strong></p>
<ul>
<li>Hierarchical Surface Prediction for 3D Object Reconstruction, C. Häne, S. Tulsiani, J. Malik, arXiv 2017</li>
</ul>
Wed, 23 Aug 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/08/23/high-quality-3d-obj-reconstruction/
http://bair.berkeley.edu/blog/2017/08/23/high-quality-3d-obj-reconstruction/Cooperatively Learning Human Values<h2 id="be-careful-what-you-reward">Be careful what you reward</h2>
<p>“Be careful what you wish for!” – we’ve all heard it! The story of King Midas
is there to warn us of what might happen when we’re not. Midas, a king who loves
gold, runs into a satyr and wishes that everything he touches would turn to gold.
Initially, this is fun and he walks around turning items to gold. But his
happiness is short lived. Midas realizes the downsides of his wish when he hugs
his daughter and she turns into a golden statue.</p>
<p><img src="http://bair.berkeley.edu/blog/assets/coop_irl/midas.png" alt="midas" width="240" hspace="30" align="right" /></p>
<p>We, humans, have a notoriously difficult time specifying what we actually want,
and the AI systems we build suffer from it. With AI, this warning actually
becomes “Be careful what you <em>reward</em>!”. When we design and deploy an AI agent
for some application, we need to specify what we want it to do, and this
typically takes the form of a <em>reward function</em>: a function that tells the agent
which state and action combinations are good. A car reaching its destination is
good, and a car crashing into another car is not so good.</p>
<p>AI research has made a lot of progress on algorithms for generating AI behavior
that performs well according to the <em>stated</em> reward function, from classifiers
that correctly label images with what’s in them, to cars that are starting to
drive on their own. But, as the example of King Midas teaches us, it’s not the
stated reward function that matters: what we really need are algorithms for
generating AI behavior that performs well according to the designer or user’s
<em>intended</em> reward function.</p>
<p>Our recent work on <a href="https://arxiv.org/abs/1606.03137"><strong>Cooperative
Inverse Reinforcement Learning</strong></a> formalizes and investigates optimal
solutions to this <em>value alignment problem</em> — the joint problem of eliciting
and optimizing a user’s intended objective.</p>
<!--more-->
<h2 id="faulty-incentives-in-ai-systems">Faulty incentives in AI systems</h2>
<p>OpenAI gave a recent example of the difference between
<a href="https://blog.openai.com/faulty-reward-functions/">stated vs. intended reward functions</a>.
The system designers were working on reinforcement learning for racing games.
They decided to reward the system for obtaining points; this seems reasonable as
we expect policies that win races to get a lot of points. Unfortunately, this led
to quite suboptimal behavior in several environments:</p>
<p style="text-align:center;">
<iframe src="//gifs.com/embed/fault-reward-functions-Y6zOjO" frameborder="0" scrolling="no" width="478px" height="360px" style="-webkit-backface-visibility:
hidden;-webkit-transform: scale(1);"></iframe>
</p>
<p>This video demonstrates a racing strategy that pursues points and nothing else,
failing to actually <em>win</em> the race. This is clearly distinct from the <em>desired</em>
behavior, yet the designers did get exactly the behavior they asked for.</p>
<p>For a less light-hearted example of value misalignment, we can look back to late
June 2015. Google had just released an image classifier feature that leveraged
some of the recent advances in image classification. Unfortunately for one user,
the system decided to
<a href="https://www.theverge.com/2015/7/1/8880363/google-apologizes-photos-app-tags-two-black-people-gorillas">classify his African-American friend as a gorilla</a>.</p>
<blockquote class="twitter-tweet tw-align-center" data-lang="en"><p lang="en" dir="ltr">Google Photos, y'all fucked up. My friend's not a gorilla.
<a href="http://t.co/SMkMCsNVX4">pic.twitter.com/SMkMCsNVX4</a></p>— Oluwafemi J Alciné (@jackyalcine)
<a href="https://twitter.com/jackyalcine/status/615329515909156865">June 29, 2015</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This didn’t happen because someone was ill-intentioned. It happened because of a
misalignment between the objective given to the system and the underlying
objective the company had in building their classifier. The reward function (or
loss function) in classification is defined on pairs: (predicted label, true
label). The standard reward function in classification research gives a reward
of 0 for a correct classification (i.e., the predicted and true labels match)
and a reward of -1 otherwise. This implies that all misclassifications are
<em>equally</em> bad — but that’s not right, especially when it comes to
misclassifying people.</p>
<p>According to the incentives it was given, the learning algorithm was willing to
trade a reduction in the chance of, say, misclassifying a bicycle as a toaster
for an equivalent increase in the chance of misclassifying a person as an
animal. This is not a trade that a system designer would <em>knowingly</em> make.</p>
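<p>The point is easy to see with a cost-sensitive loss. In the sketch below (labels and costs are invented for illustration), the standard 0-1 loss treats every confusion identically, while a cost matrix lets the designer declare that misclassifying a person is far worse than confusing two household objects:</p>

```python
import numpy as np

labels = ["bicycle", "toaster", "person"]

# cost[i][j] = cost of predicting labels[j] when the truth is labels[i].
# Under the standard 0-1 loss every off-diagonal entry would be 1.
cost = np.array([
    [0.0,   1.0,   1.0],   # true bicycle
    [1.0,   0.0,   1.0],   # true toaster
    [100.0, 100.0, 0.0],   # true person: confusions penalized heavily
])

def expected_cost(true_idx, predicted_probs):
    """Expected cost of a (possibly soft) prediction for a given true label."""
    return float(cost[true_idx] @ np.asarray(predicted_probs))
```

<p>A learner trained against this matrix would no longer trade a rarer bicycle-to-toaster error for a more frequent person-to-animal one; the asymmetry in the designer's intent is made explicit.</p>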
<h2 id="the-value-alignment-problem">The Value Alignment Problem</h2>
<p>We can attribute the failures above to the mistaken assumption that the reward
function communicated to the learning system is the true reward function that
the system designer cares about. But in reality, there is often a mismatch, and
this mismatch eventually leads to undesired behavior.</p>
<p>As AI systems are deployed further into the world, the potential consequences of
this undesired behavior grow. For example, we must be quite sure that the
optimization behind the control policy of, e.g., a self-driving car is making
the right tradeoffs. However, ensuring this is hard: there are lots of ways to
drive incorrectly. Enumerating and evaluating them is challenging, to say the
least.</p>
<p>The <strong>value alignment problem</strong> is the problem of aligning AI objectives to
ours. The reason this is so challenging is precisely because it is not always
easy for us to describe what we want, even to other people. We should expect the
same will be true when we communicate goals to AI. And yet, this is not
reflected in the models we use to build AI algorithms. We typically assume, as
in the examples above, that the objective is known and observable.</p>
<h3 id="inverse-reinforcement-learning">Inverse Reinforcement Learning</h3>
<p>One area of research we can look to for inspiration is
<a href="http://ai.stanford.edu/~ang/papers/icml00-irl.pdf"><strong>inverse reinforcement learning.</strong></a>
In artificial intelligence research (e.g., reinforcement learning) we primarily
focus on computing optimal (or even OK) behaviors. That is, given a reward
function we compute an optimal policy. In inverse reinforcement learning, we do
the opposite. We observe optimal behavior and try to compute the reward
function that the agent is optimizing. This suggests a rough strategy for value
alignment: the robot observes human behavior, learns the human reward function
with inverse reinforcement learning, and behaves according to that function.</p>
<p>This strategy suffers from three flaws. The first is fairly simple: the robot
needs to know that it is optimizing reward <em>for</em> the human; if a robot learns
that a person wants coffee it should get coffee for the person, as opposed to
obtaining coffee for itself. The second challenge is harder to account for:
people are strategic. If you know that a robot is watching you to learn what you
want, that will change your behavior. You may exaggerate steps of the task, or
demonstrate common mistakes or pitfalls. These types of cooperative
teaching behaviors are simply not modelled by inverse reinforcement learning.
Finally, inverse reinforcement learning is a pure inference problem, but in
value alignment the robot has to <em>jointly</em> learn its goal and take steps to
accomplish it. This means the robot has to account for an
exploration-exploitation tradeoff during learning. Inverse reinforcement
learning does not provide any guidance on how to balance these competing
concerns.</p>
<h3 id="cooperative-inverse-reinforcement-learning">Cooperative Inverse Reinforcement Learning</h3>
<p>Our recent work within the
<a href="http://humancompatible.ai">Center for Human-compatible AI</a>
introduced a formalism for the value alignment problem that accounts for these
discrepancies called
<a href="https://arxiv.org/abs/1606.03137">Cooperative Inverse Reinforcement Learning (CIRL)</a>.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/coop_irl/cirl.png" width="600" alt="cirl" />
</p>
<p>CIRL formalizes value alignment as a game with two players: a human player,
whom we’ll call Alice, and a robot player, which we’ll call Rob. Instead of Rob
optimizing some given reward function, the two are cooperating to accomplish a
shared goal, say making coffee. Importantly, only Alice knows this goal. Thus,
Rob’s task is to learn the goal (e.g., by communicating with or observing Alice)
and take steps to help accomplish it. A solution to this game is a cooperation
strategy that describes how Alice and Rob should act and respond to each other.
Rob will interpret what Alice does to get a better understanding of the goal,
and even act to get clarification. Alice, in turn, will act in a way that makes
it easy for Rob to help.</p>
<p>We can see that there is a close connection to inverse reinforcement learning.
Alice is acting optimally according to some reward function and, in the course
of helping her, Rob will learn the reward function Alice is optimizing. The
crucial difference is that Alice knows Rob is trying to help and this means that
the optimal cooperation strategy will include teaching behaviors for Alice and
determine the best way for Rob to manage the exploration-exploitation tradeoff.</p>
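<p>The inference side of this connection can be sketched with a toy Bayesian model. This is an illustration, not the CIRL solution concept (a CIRL solution is a joint strategy for both players); here we only show Rob's belief update under a hypothetical Boltzmann-rational model of Alice, with invented goals, actions, and values:</p>

```python
import numpy as np

goals = ["coffee", "tea"]
# Q[goal][action]: hypothetical value of each of Alice's actions under each goal.
Q = {"coffee": {"grind beans": 1.0, "boil water": 0.5},
     "tea":    {"grind beans": 0.0, "boil water": 1.0}}

def update_belief(belief, action, beta=5.0):
    """Bayes update under a Boltzmann-rational model of Alice:
    P(action | goal) is proportional to exp(beta * Q[goal][action])."""
    likelihood = np.array([
        np.exp(beta * Q[g][action]) / sum(np.exp(beta * Q[g][a]) for a in Q[g])
        for g in goals])
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])                  # uniform prior over goals
belief = update_belief(belief, "grind beans")  # Rob watches Alice grind beans
```

<p>A single observed action shifts Rob's posterior sharply toward the coffee goal. In full CIRL, Alice would additionally choose actions knowing they will be interpreted this way, which is what makes teaching behaviors part of the optimal strategy.</p>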
<h2 id="whats-next">What’s Next?</h2>
<p>With CIRL, we advocate that robots should have uncertainty about what the right
reward function is. In two upcoming publications, to be presented at
<a href="http://ijcai-17.org">IJCAI 2017</a>,
we investigate the impact of this reward uncertainty on optimal behavior.
<a href="https://arxiv.org/abs/1611.08219">“The Off-Switch Game”</a>
analyzes robots’ incentives to accept human oversight or intervention.
We model this with a CIRL game where Alice can switch Rob off, but Rob can
disable the off switch. We find that Rob’s uncertainty about Alice’s goal is a
crucial component of the incentive to listen to her.</p>
<p>However, as the story of King Midas illustrates, we humans are not always
perfect at giving orders. There may be situations where we want Rob to do what
Alice means, not what she says. In
<a href="https://arxiv.org/abs/1705.09990">“Should Robots be Obedient?”</a>,
we analyze the tradeoff between Rob’s obedience level (the rate at which it
follows Alice’s orders) and the value it can generate for Alice. We show that,
at least in theory, Rob can be more valuable if it can disobey Alice, but also
analyze how this performance degrades if Rob’s model of the world is
incorrect.</p>
<p>In studying the value alignment problem, we hope to lay the groundwork for
algorithms that can reliably determine and pursue our desired objectives. In the
long run, we expect this to lead to
<a href="https://www.ted.com/talks/stuart_russell_how_ai_might_make_us_better_people/transcript?language=en">safer designs for artificial intelligence</a>.
The key idea in our approach is that we must account for uncertainty about the
true reward signal, rather than taking the reward as given. Our work shows that
this leads to AI systems that are more willing to accept human oversight and
generate more value for human users. Our work also gives us a tool to analyze
potential pitfalls in preference learning and investigate the impacts of model
misspecification. Going further, we plan to explore efficient algorithms for
computing solutions to CIRL games, as well as consider extensions to
the value alignment problem that account for multiple people, each with their
own goals and preferences.</p>
Thu, 17 Aug 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/08/17/cooperatively-learning-human-values/
http://bair.berkeley.edu/blog/2017/08/17/cooperatively-learning-human-values/Captioning Novel Objects in Images<p>Given an image, humans can easily infer the salient entities in it, and describe the scene effectively, such as where objects are located (in a forest or in a kitchen?), what attributes an object has (brown or white?), and, importantly, how objects interact with other objects in a scene (running in a field, or being held by a person, etc.). The task of visual description aims to develop visual systems that generate contextual descriptions about objects in images. Visual description is challenging because it requires recognizing not only objects (bear), but other visual elements, such as actions (standing) and attributes (brown), and constructing a fluent sentence describing how objects, actions, and attributes are related in an image (such as the brown bear is standing on a rock in the forest).</p>
<h2 id="current-state-of-visual-description">Current State of Visual Description</h2>
<table class="col-2">
<tr>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/novel_image_captioning/bear.png" alt="Brown bear in forest" />
</td>
<td style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/novel_image_captioning/anteater.png" alt="Anteater in forest" />
</td>
</tr>
<tr>
<td><p>
<a href="http://jeffdonahue.com/lrcn/">LRCN</a> [Donahue et al. ‘15]: A brown bear standing on top of a lush green field. <br />
<a href="http://captionbot.ai">MS CaptionBot</a> [Tran et al. ‘16]: A large brown bear walking through a forest.
</p></td>
<td><p>
<a href="http://jeffdonahue.com/lrcn/">LRCN</a> [Donahue et al. ‘15]: A black <span style="color:red;">bear</span> is standing in the grass. <br />
<a href="http://captionbot.ai">MS CaptionBot</a> [Tran et al. ‘16]: A <span style="color:red;">bear</span> that is eating some grass.
</p></td>
</tr>
</table>
<p style="text-align:center;"><i>
Descriptions generated by existing captioners on two images. On the left is an image of an object (bear) that is present in training data. On the right is an object (anteater) that the model hasn't seen in training.
</i></p>
<p>Current visual description or image captioning models work quite well, but they can only describe objects seen in existing image-captioning training datasets, and they require a large number of training examples to generate good captions. To learn how to describe an object like “jackal” or “anteater” in context, most description models require many examples of jackal or anteater images with corresponding descriptions. However, current visual description datasets, like <a href="http://mscoco.org">MSCOCO</a>, do not include descriptions of all objects. In contrast, recent object recognition models based on Convolutional Neural Networks (CNNs) can recognize hundreds of categories of objects. While object recognition models can recognize jackals and anteaters, description models cannot compose sentences to describe these animals correctly in context. In our work, we overcome this problem by building visual description systems that can describe new objects without paired images and sentences about these objects.</p>
<h2 id="the-task-describing-novel-objects">The Task: Describing Novel Objects</h2>
<p>Here we define our task more formally. Given a dataset of paired images and descriptions (paired image-sentence data, e.g. <a href="http://mscoco.org">MSCOCO</a>), as well as images with object labels but no descriptions (unpaired image data, such as <a href="http://www.image-net.org/">ImageNet</a>), we wish to learn how to describe objects unseen in the paired image-sentence data. To do this we must build a model that can recognize different visual constituents (e.g., jackal, brown, standing, and field) and compose these in novel ways to form a coherent description. Below we describe the core components of our description model.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/novel_image_captioning/image_0.png" alt="The novel visual description task" />
</p>
<p>We aim to describe diverse objects which do not have training images with captions.</p>
<!--more-->
<h3 id="using-external-sources-of-data">Using External Sources of Data</h3>
<p>In order to generate captions about diverse categories of objects outside the image-caption training data, we take advantage of external data sources. Specifically, we use ImageNet images with object labels as the unpaired image data source and sentences from unannotated text corpora such as Wikipedia as our text data source. These are used to train our visual recognition CNN and language model respectively.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/novel_image_captioning/image_1.png" alt="Train effectively on external resources" /><br />
<i>
Train effectively on external resources
</i>
</p>
<h3 id="capture-semantic-similarity">Capture semantic similarity</h3>
<p>We want to be able to describe unseen objects (e.g. from ImageNet) that are similar to objects seen in the paired image-sentence training data. We use dense word embeddings to achieve this. Word embeddings are dense vector representations of words in which words with similar meanings lie close together in the embedding space.</p>
<p>In our previous work, “Deep Compositional Captioning (DCC)” [1], we first train a caption model on the MSCOCO paired image-caption dataset. Then, to describe a novel object (such as an okapi), we use word embeddings to identify the most similar object among those in the MSCOCO dataset (in this case, zebra). We then transfer (copy) the parameters the model learned for the seen object to the unseen object (i.e. copy the network weights corresponding to zebra to those corresponding to okapi).</p>
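<p>As a concrete illustration, the transfer step can be sketched in a few lines. This is a minimal sketch, not the DCC implementation: the embedding values below are made up, and real models use full GloVe vectors and transfer several weight matrices rather than a single output-layer row.</p>

```python
import numpy as np

# Toy word embeddings (made-up values); real systems use e.g. 300-d GloVe vectors.
embeddings = {
    "zebra": np.array([0.9, 0.1, 0.3]),
    "bear":  np.array([0.1, 0.8, 0.2]),
    "okapi": np.array([0.8, 0.2, 0.4]),  # novel object, unseen in captions
}

def closest_seen_word(novel, seen):
    """Seen word whose embedding is most cosine-similar to the novel word's."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(seen, key=lambda w: cos(embeddings[novel], embeddings[w]))

vocab = ["zebra", "bear", "okapi"]
W_out = np.random.randn(len(vocab), 5)   # caption model's output-layer weights

# DCC-style transfer: copy the weights of the most similar seen word
# into the row for the unseen word.
donor = closest_seen_word("okapi", seen=["zebra", "bear"])
W_out[vocab.index("okapi")] = W_out[vocab.index(donor)]
```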
<h3 id="novel-object-captioning">Novel Object Captioning</h3>
<p>While the DCC model is able to describe several unseen object categories, copying parameters from one object to another can create sentences with grammatical artifacts. E.g. for the object ‘racket’ the model copies weights from ‘tennis’, which results in sentences such as “A man playing racket on court”. In our more recent work [2], we instead incorporate the embeddings directly within the language model: specifically, we use <a href="https://nlp.stanford.edu/projects/glove/">GloVe embeddings</a> in the input and output of our language model. This implicitly enables the model to capture semantic similarity when describing unseen objects, yielding sentences such as “A tennis player swinging a racket at a ball”. Additionally, incorporating the embeddings directly within the network makes our model end-to-end trainable.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/novel_image_captioning/image_2.png" alt="Dense word embeddings to capture image similarity" /><br />
<i>
Incorporate dense word embeddings in the language model to capture semantic similarity.
</i>
</p>
<h3 id="caption-model-and-forgetting-in-neural-networks">Caption model and forgetting in neural networks</h3>
<p>We combine the outputs of the visual network and the language model into a caption model. Like existing caption models, ours is pre-trained on ImageNet. However, we observed that even though the model is pre-trained on ImageNet, when it is trained/tuned on the COCO image-caption dataset it tends to forget what it has seen before. The problem of forgetting in neural networks has also been observed by <a href="https://arxiv.org/abs/1312.6211">researchers at Montreal</a> as well as <a href="https://arxiv.org/abs/1612.00796">Google DeepMind</a>, amongst others. In our work, we resolve this problem of forgetting using a joint training strategy.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/novel_image_captioning/image_3.png" alt="Joint training to overcome forgetting" /><br />
<i>
Share parameters and train jointly on different data/tasks to overcome "forgetting"
</i>
</p>
<p>Specifically, our network has three components: a visual recognition network, a caption model, and a language model. All three components share parameters and are jointly trained. During training, each batch of inputs contains some images with labels, a different set of images and captions, and some plain sentences. These three inputs train the different components of the network. Since the parameters are shared between the three components, the network is jointly trained to recognize objects in images, caption images and generate sentences. This joint training helps the network overcome the problem of forgetting, and enables the model to generate descriptions for many novel object categories.</p>
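<p>The joint training loop can be sketched schematically. This is a minimal stand-in, not our actual architecture: real batches contain images and sentences rather than random vectors, and the shared parameters are deep network layers rather than one small matrix.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters shared by the visual recognition network, caption model,
# and language model, plus one small task-specific "head" each.
W_shared = 0.1 * rng.normal(size=(4, 4))
heads = {t: 0.1 * rng.normal(size=4) for t in ("recognize", "caption", "language")}

# One synthetic example per data source (stand-ins for labeled images,
# image-caption pairs, and plain sentences).
data = {t: (rng.normal(size=4), 1.0) for t in heads}

def loss_and_grads(task):
    x, y = data[task]
    h = W_shared @ x                    # shared representation
    err = heads[task] @ h - y           # task-specific prediction error
    return 0.5 * err**2, err * np.outer(heads[task], x), err * h

lr, history = 0.05, []
for step in range(300):
    total = 0.0
    for task in heads:                  # every batch updates all three tasks
        loss, g_shared, g_head = loss_and_grads(task)
        W_shared -= lr * g_shared
        heads[task] -= lr * g_head
        total += loss
    history.append(total)
```

<p>Because every step touches all three objectives, the shared parameters cannot drift toward one task at the expense of the others, which is the mechanism that counters forgetting.</p>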
<h2 id="whats-next">What’s Next?</h2>
<p>One of the most common errors in our model comes from not recognizing objects, and one way to mitigate this is to use better visual features. Another common error comes from generating sentences which are not fluent (A cat and a cat on a bed) or may not appeal to “common sense” (e.g. ‘A woman is playing gymnastics’ is not particularly correct since one doesn’t “play” gymnastics). It would be interesting to develop solutions that can overcome these issues.</p>
<p>While in this work we propose joint training as a strategy to overcome the problem of forgetting, it might not always be possible to train on many different tasks and datasets. A different way to approach the problem would be to build a model that can learn to compose descriptions based on visual information and object labels. Such a model should also be able to integrate objects on the fly: currently we pre-train our model on a select set of objects, and we should also think about how to incrementally train the model on new data about new concepts. Solving some of these problems would help develop better and more robust visual description models.</p>
<p>[<a href="https://vsubhashini.github.io/noc_examples.html">Links to more examples</a>]</p>
<p>[<a href="http://vsubhashini.github.io/noc.html#code">Links to trained models and code</a>]</p>
<h3 id="examples">Examples</h3>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/novel_image_captioning/image_4.png" alt="Examples" />
</p>
<hr />
<p><strong>This blog post is based on the following research papers:</strong></p>
<p>[1] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In CVPR, 2016.</p>
<p>[2] S. Venugopalan, L. A. Hendricks, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Captioning images with diverse objects. In CVPR, 2017.</p>
Tue, 08 Aug 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/08/08/novel-object-captioning/

Minibatch Metropolis-Hastings

<p>Over the last few years we have experienced an enormous data deluge, which has
played a key role in the surge of interest in AI. A partial list of some large
datasets:</p>
<ul>
<li><a href="http://www.image-net.org/">ImageNet</a>, with over 14 million images for classification and object detection.</li>
<li><a href="https://grouplens.org/datasets/movielens/">Movielens</a>, with 20 million user ratings of movies for collaborative filtering.</li>
<li><a href="https://github.com/udacity/self-driving-car">Udacity’s</a> car dataset (at least 223GB) for training self-driving cars.</li>
<li><a href="https://techcrunch.com/2016/01/14/yahoo-releases-its-biggest-ever-machine-learning-dataset-to-the-research-community/">Yahoo’s</a> 13.5 TB dataset of user-news interaction for studying human behavior.</li>
</ul>
<p><a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">Stochastic Gradient Descent</a> (SGD) has been the engine fueling the
development of large-scale models for these datasets. SGD is remarkably
well-suited to large datasets: it estimates the gradient of the loss function on
a full dataset using only a fixed-sized minibatch, and updates a model many
times with each pass over the dataset.</p>
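<p>For readers who want the mechanics spelled out, here is a minimal SGD sketch on a synthetic least-squares problem; the dataset, sizes, and learning rate are all arbitrary choices for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression dataset: N samples, d features.
N, d, b = 10_000, 5, 64               # dataset size, dimension, minibatch size
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(d)
lr = 0.1
for step in range(500):
    idx = rng.integers(0, N, size=b)  # draw a fixed-size minibatch
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / b    # unbiased estimate of the full gradient
    w -= lr * grad

mse = np.mean((X @ w - y) ** 2)       # ends near the noise floor (about 0.01)
```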
<p>But SGD has limitations. When we construct a model, we use a loss function
$L_\theta(x)$ with dataset $x$ and model parameters $\theta$ and attempt to
minimize the loss by gradient descent on $\theta$. This shortcut approach makes
optimization easy, but is vulnerable to a variety of problems including
over-fitting, excessively sensitive coefficient values, and possibly slow
convergence. A more robust approach is to treat the inference problem for
$\theta$ as a full-blown posterior inference, deriving a joint distribution
$p(x,\theta)$ from the loss function, and computing the posterior $p(\theta|x)$.
This is the Bayesian modeling approach, and specifically the Bayesian Neural
Network approach when applied to deep models. This recent <a href="http://bayesiandeeplearning.org/slides/nips16bayesdeep.pdf">tutorial by Zoubin
Ghahramani</a> discusses some of the advantages of this approach.</p>
<p>The model posterior $p(\theta|x)$ for most problems is intractable (no closed
form). There are two main classes of methods in machine learning for working around
intractable posteriors: <a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">Variational Bayesian methods</a> and <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov Chain Monte Carlo</a>
(MCMC). In variational methods, the posterior is approximated with a simpler
distribution (e.g. a normal distribution) and its distance to the true posterior
is minimized. In MCMC methods, the posterior is approximated as a sequence of
correlated samples (points or particle densities). Variational Bayes methods
have been widely used but often introduce significant error — see <a href="https://arxiv.org/abs/1603.02644">this recent
comparison with Gibbs Sampling</a>, also <a href="https://arxiv.org/abs/1312.6114">Figure 3 from the Variational
Autoencoder (VAE) paper</a>. Variational methods are also more computationally
expensive than direct parameter SGD (it’s a small constant factor, but a small
constant times 1-10 days can be quite important).</p>
<p>MCMC methods have no such bias. You can think of MCMC particles as rather like
quantum-mechanical particles: you only observe individual instances, but they
follow an arbitrarily-complex joint distribution. By taking multiple samples you
can infer useful statistics, apply regularizing terms, etc. But MCMC methods
have one over-riding problem with respect to large datasets: other than the
important class of conjugate models which admit Gibbs sampling, there has been
no efficient way to do the Metropolis-Hastings tests required by general MCMC
methods on minibatches of data (we will define/review MH tests in a moment). In
response, researchers had to design models to make inference tractable, e.g.
<a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">Restricted Boltzmann Machines</a> (RBMs) use a layered, undirected design to
make Gibbs sampling possible. In a recent breakthrough, <a href="https://arxiv.org/abs/1312.6114">VAEs</a> use
variational methods to support more general posterior distributions in
probabilistic auto-encoders. But with VAEs, like other variational models, one
has to live with the fact that the model is a best-fit approximation, with
(usually) no quantification of how close the approximation is. Although they
typically offer better accuracy, MCMC methods have been sidelined recently in
auto-encoder applications, lacking an efficient scalable MH test.</p>
<!--more-->
<p>A bridge between SGD and Bayesian modeling has been forged recently by papers on
<a href="https://www.ics.uci.edu/~welling/publications/papers/stoclangevin_v6.pdf">Stochastic Gradient Langevin Dynamics</a> (SGLD) and <a href="https://arxiv.org/abs/1402.4102">Stochastic Gradient
Hamiltonian Monte Carlo</a> (SGHMC). These methods involve minor variations to
typical SGD updates which generate samples from a probability distribution which
is approximately the Bayesian model posterior $p(\theta|x)$. These approaches
turn SGD into an MCMC method, and as such require Metropolis-Hastings (MH) tests
for accurate results, the topic of this blog post.</p>
<p>Because of these developments, interest has warmed recently in scalable MCMC and
in particular in doing the MH tests required by general MCMC models on large
datasets. Normally an MH test requires a scan of the full dataset and is applied
each time one wants a posterior sample. Clearly for large datasets, it’s intractable
to do this. Two papers from ICML 2014, <a href="https://arxiv.org/abs/1304.5299">Korattikara et al.</a> and <a href="http://proceedings.mlr.press/v32/bardenet14.html">Bardenet et
al.</a>, attempt to reduce the cost of MH tests. They both use concentration
bounds, and both achieve constant-factor improvements relative to a full dataset
scan. <a href="http://www.jmlr.org/papers/v18/15-205.html">Other recent work</a> improves performance but makes even stronger
assumptions about the model which limits applicability, especially for deep
networks. None of these approaches come close to matching the performance of
SGD, i.e. generating a posterior sample from small constant-size batches of
data.</p>
<p>In this post we describe a new approach to MH testing which moves the cost of MH
testing from $O(N)$ to $O(1)$ relative to dataset size. It avoids the need for
global statistics and does not use tail bounds (which lead to long-tailed
distributions for the amount of data required for a test). Instead we use a
novel correction distribution to directly “morph” the distribution of a noisy
minibatch estimator into a smooth MH test distribution. Our method is a true
“black-box” method which provides estimates on the accuracy of each MH test
using only data from a small expected size minibatch. It can even be applied to
unbounded data streams. It can be “piggy-backed” on existing SGD implementations
to provide full posterior samples (via SGLD or SGHMC) for almost the same cost
as SGD samples. Thus full Bayesian neural network modeling is now possible for
about the same cost as SGD optimization. Our approach is also a potential
substitute for variational methods and VAEs, providing unbiased posterior
samples at lower cost.</p>
<p>To explain the approach, we review the role of MH tests in MCMC models.</p>
<h1 id="markov-chain-monte-carlo-review">Markov Chain Monte Carlo Review</h1>
<h2 id="markov-chains">Markov Chains</h2>
<p>MCMC methods are designed to sample from a target distribution which is
difficult to compute. To generate samples, they utilize Markov Chains, which
consist of nodes representing states of the system and probability distributions
for transitioning from one state to another.</p>
<p>A key concept is the <em>Markovian assumption</em>, which states that the probability
of being in a state at time $t+1$ can be inferred entirely based on the current
state at time $t$. Mathematically, letting $\theta_t$ represent the current
state of the Markov chain at time $t$, we have $p(\theta_{t+1} | \theta_t,
\ldots, \theta_0) = p(\theta_{t+1} | \theta_t)$. By using these probability
distributions, we can generate a <em>chain of samples</em> $(\theta_i)_{i=1}^T$ for
some large $T$.</p>
<p>Since the probability of being in state $\theta_{t+1}$ directly depends on
$\theta_t$, the samples are <em>correlated</em>. Rather surprisingly, it can be shown
that, under mild assumptions, in the limit of many samples the distribution of
the chain’s samples approximates the target distribution.</p>
<p>A full review of MCMC methods is beyond the scope of this post, but a good
reference is the <a href="http://www.mcmchandbook.net/">Handbook of Markov Chain Monte Carlo (2011)</a>. Standard
machine learning textbooks such as <a href="https://mitpress.mit.edu/books/probabilistic-graphical-models">Koller & Friedman (2009)</a> and <a href="https://mitpress.mit.edu/books/machine-learning-0">Murphy
(2012)</a> also cover MCMC methods.</p>
<h2 id="metropolis-hastings">Metropolis-Hastings</h2>
<p>One of the most general and powerful MCMC methods is
<a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm"><strong>Metropolis-Hastings</strong></a>. This uses a test to <em>filter</em> samples. To define
it properly, let $p(\theta)$ be the <em>target distribution</em> we want to
approximate. In general, it’s intractable to sample directly from it.
Metropolis-Hastings uses a simpler <em>proposal distribution</em> $q(\theta’ | \theta)$
to generate samples. Here, $\theta$ represents our <em>current</em> sample in the
chain, and $\theta’$ represents the proposed sample. For simple cases, it’s
common to use a Gaussian proposal centered at $\theta$.</p>
<p>If we were to just use a Gaussian to generate samples in our chain, there’s no
way we could approximate our target $p$, since the samples would form a random
walk. The MH test cleverly resolves this by <em>filtering</em> samples with the
following test. Draw a uniform random variable $u \in [0,1]$ and determine
whether the following is true:</p>
<script type="math/tex; mode=display">% <![CDATA[
u \;{\overset{?}{<}}\; \min\left\{\frac{p(\theta')q(\theta | \theta')}{p(\theta)q(\theta' | \theta)}, 1\right\} %]]></script>
<p>If true, we <em>accept</em> $\theta’$. Otherwise, we <em>reject and reuse</em> the old sample
$\theta$. Notice that</p>
<ul>
<li>It doesn’t require knowledge of a normalizing constant (independent of
$\theta$ and $\theta’$), because that cancels out in the
$p(\theta’)/p(\theta)$ ratio. This is great, because normalizing constants are
arguably the biggest reason why distributions become intractable.</li>
<li>The higher the value of $p(\theta’)$, the more likely we are to accept.</li>
</ul>
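<p>Written out as code, the test is only a few lines. The sketch below is a toy instance with arbitrary choices: a standard normal target (known only up to a constant, as in practice) and a symmetric Gaussian random-walk proposal, for which the $q$ terms cancel.</p>

```python
import math, random

random.seed(0)

def log_p(theta):
    """Unnormalized log target density: a standard normal."""
    return -0.5 * theta * theta

def mh_chain(n_samples, step=1.0):
    theta, chain = 0.0, []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)        # symmetric q(θ'|θ)
        accept_prob = math.exp(min(0.0, log_p(proposal) - log_p(theta)))
        if random.random() < accept_prob:                 # the MH test
            theta = proposal                              # accept θ'
        chain.append(theta)                               # on reject, reuse old θ
    return chain

samples = mh_chain(20_000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

<p>With enough samples, the chain’s mean and variance match the target’s (0 and 1 here), even though each step only ever evaluates unnormalized densities.</p>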
<p>To get more intuition on how the test works, we’ve created the following figure
from <a href="https://github.com/DanielTakeshi/MCMC_and_Dynamics/blob/master/standard_mcmc/Quick_MH_Test_Example.ipynb">this Jupyter Notebook</a>, showing the progression of samples to
approximate a target posterior. This example is derived from <a href="https://www.ics.uci.edu/~welling/publications/papers/stoclangevin_v6.pdf">Welling & Teh
(2011)</a>.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/mh_test/mhtest_example_progression.png" alt="jupyter_notebook" /><br />
<i>
A quick example of the MH test in action on a mixture of Gaussians example. The
parameter is $\theta \in \mathbb{R}^2$ with the x and y axes representing
$\theta_1$ and $\theta_2$, respectively. The target posterior has contours shown
in the fourth plot; the probability mass is concentrated in the diagonal between
points $(0,1)$ and $(1,-1)$. (This posterior depends on sampled Gaussians.) The
plots show the progression of the MH test after 50, 500, and 5000 samples in our
MCMC chain. After 5000 samples, it's clear that our samples are concentrated in
the regions with higher posterior probability.
</i>
</p>
<h1 id="reducing-metropolis-hastings-data-usage">Reducing Metropolis-Hastings Data Usage</h1>
<p>What happens when we consider the Bayesian posterior inference case with large
datasets? (Perhaps we’re interested in the same example in the figure above,
except that the posterior is based on more data points.) Then our goal is to
sample to approximate the distribution $p(\theta | x_1, \ldots, x_N)$ for large
$N$. By Bayes’ rule, this is $\frac{p_0(\theta)p(x_1, \ldots, x_N | \theta)
}{p(x_1,\ldots,x_N)}$ where $p_0$ is the prior. We additionally assume that the
$x_i$ are conditionally independent given $\theta$. The MH test therefore
becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
u \;{\overset{?}{<}}\; \min\left\{\frac{p_0(\theta')\prod_{i=1}^Np(x_i|\theta') q(\theta |
\theta')}{p_0(\theta) \prod_{i=1}^Np(x_i|\theta) q(\theta' | \theta)}, 1\right\} %]]></script>
<p>Or, after taking logarithms and rearranging (while ignoring the minimum
operator, which technically isn’t needed here), we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\log\left(u\frac{q(\theta'|\theta)p_0(\theta)}{q(\theta|\theta')p_0(\theta')}\right)
\;{\overset{?}{<}}\;
\sum_{i=1}^N \log\frac{p(x_i|\theta')}{p(x_i|\theta)} %]]></script>
<p>The problem now is apparent: it’s expensive to compute all the $p(x_i |
\theta’)$ terms, and this has to be done <em>every time we sample</em> since it depends
on $\theta’$.</p>
<p>The naive way to deal with this is to apply the same test, but with a minibatch
of $b$ elements:</p>
<script type="math/tex; mode=display">% <![CDATA[
\log\left(u\frac{q(\theta'|\theta)p_0(\theta)}{q(\theta|\theta')p_0(\theta')}\right)
\;{\overset{?}{<}}\;
\frac{N}{b} \sum_{i=1}^b \log\frac{p(x_i^*|\theta')}{p(x_i^*|\theta)} %]]></script>
<p>Unfortunately, this won’t sample from the correct target distribution; see
Section 6.1 in <a href="http://www.jmlr.org/papers/v18/15-205.html">Bardenet et al. (2017)</a> for details.</p>
<p>A better strategy is to start with the same batch of $b$ points, but then gauge
the <em>confidence</em> of the batch test relative to using the full data. If, after
seeing $b$ points, we already know that our proposed sample $\theta’$ is
significantly worse than our current sample $\theta$, then we should reject
right away. If $\theta’$ is significantly better, we should accept. If it’s
ambiguous, then we increase the size of our test batch, perhaps to $2b$
elements, and then measure the test’s confidence. Lather, rinse, repeat. As
mentioned earlier, <a href="https://arxiv.org/abs/1304.5299">Korattikara et al. (2014)</a> and <a href="http://proceedings.mlr.press/v32/bardenet14.html">Bardenet et al.
(2014)</a> developed algorithms following this framework.</p>
<p>A weakness of the above approach is that it’s doing repeated testing and one
must reduce the allowable test error each time one increments the test batch
size. Unfortunately, there is also a significant probability that the approaches
above will grow the test batch all the way to the full dataset, and they offer
at most constant factor speedups over testing the full dataset.</p>
<h1 id="minibatch-metropolis-hastings-our-contribution">Minibatch Metropolis-Hastings: Our Contribution</h1>
<h2 id="change-the-acceptance-function">Change the Acceptance Function</h2>
<p>To set up our test, we first define the log transition probability ratio
$\Delta$:</p>
<script type="math/tex; mode=display">\Delta(\theta,\theta') = \log \frac{p_0(\theta')\prod_{i=1}^N p(x_i |
\theta')q(\theta | \theta')}{p_0(\theta)\prod_{i=1}^N p(x_i | \theta)q(\theta' |
\theta)}</script>
<p>This log ratio factors into a sum of per-sample terms, so when we approximate
its value by computing on a minibatch we get an unbiased estimator of its
full-data value plus some noise (which is asymptotically normal by the Central
Limit Theorem).</p>
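<p>A quick numerical check of this claim, using a made-up unit-variance Gaussian likelihood and arbitrary parameter values: averaging many minibatch estimates of the form $(N/b)\sum_{i=1}^b$ recovers the full-data sum.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

N, b = 1000, 100
x = rng.normal(size=N)                    # synthetic data
theta, theta_new = 0.0, 0.1

# Per-sample terms log p(x_i|θ') - log p(x_i|θ) for a unit-variance Gaussian.
terms = -0.5 * ((x - theta_new) ** 2 - (x - theta) ** 2)
full_sum = terms.sum()

# Each minibatch estimate is (N/b) times a sum over b random samples;
# its average over many draws matches the full-data value (no bias, only noise).
estimates = [(N / b) * terms[rng.integers(0, N, size=b)].sum()
             for _ in range(5000)]
mean_estimate = float(np.mean(estimates))
```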
<p>The first step for applying our MH test is to use a different acceptance
function. Expressed in terms of $\Delta$, the classical MH accepts a transition
with probability given by the blue curve.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/mh_test/different_tests.png" alt="different_tests" width="600" /><br />
<i>
Functions $f$ and $g$ can serve as acceptance tests for Metropolis-Hastings.
Given current sample $\theta$ and proposed sample $\theta'$, the vertical axis
represents the probability of accepting $\theta'$.
</i>
</p>
<p>Instead of using the classical test, we’ll use the sigmoid function. It might
not be apparent why this is allowed, but there’s some elegant theory that
explains why using this alternative function <em>as the acceptance test for MH</em>
still results in the correct semantics of MCMC. That is, under the same mild
assumptions, the distribution of samples $(\theta_i)_{i=1}^T$ approaches the
target distribution.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/mh_test/equivalent_test.png" alt="equivalent_test" width="500" /><br />
<i>
The density of the standard logistic random variable, denoted $X_{\rm log}$
along with the equivalent MH test expression ($X_{\rm log}+\Delta > 0$) with the
sigmoid acceptance function.
</i>
</p>
<p>Our acceptance test is now the sigmoid function. Note that the sigmoid function
is the <em>cumulative distribution function</em> of a (standard) <a href="https://en.wikipedia.org/wiki/Logistic_distribution">Logistic random
variable</a>; the figure above plots the density. One can show that the MH test
under the sigmoid acceptance function reduces to determining whether $X_{\rm
log} + \Delta > 0$ for a sampled $X_{\rm log}$ value.</p>
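<p>This equivalence is easy to verify by simulation. The sketch below (with an arbitrary value of $\Delta$) samples a standard logistic variable by inverse-CDF sampling and checks that the event $X_{\rm log} + \Delta > 0$ occurs with probability $\mathrm{sigmoid}(\Delta)$:</p>

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_logistic():
    """Standard logistic random variable via the inverse CDF."""
    u = random.random()
    while u == 0.0:          # guard against log(0)
        u = random.random()
    return math.log(u) - math.log1p(-u)

delta = 0.7                  # some fixed value of the log transition ratio
n = 200_000
empirical = sum(sample_logistic() + delta > 0 for _ in range(n)) / n
# empirical is close to sigmoid(0.7), about 0.668
```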
<h2 id="new-mh-test">New MH Test</h2>
<p>This is nice, but we don’t want to compute $\Delta$ because it depends on all
$p(x_i | \theta’)$ terms. When we estimate $\Delta$ using a minibatch, we
introduce an additive error which is approximately normal, $X_{\rm normal}$. The
key observation in our work is that the distribution of the minibatch estimate
of $\Delta$ (approximately Gaussian) is already very close to the desired test
distribution $X_{\rm log}$, as shown below.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/mh_test/gaussian_logistic_cdf.png" alt="gaussian_logistic_cdf" width="400" /><br />
<i>
A plot of the logistic CDF in red (as we had earlier) along with a normal CDF
curve, colored in lime, which corresponds to a standard deviation of 1.7.
</i>
</p>
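<p>The visual closeness in the figure can be quantified with a short check: over a wide grid, the standard logistic CDF and a zero-mean normal CDF with standard deviation 1.7 never differ by more than about 0.01.</p>

```python
import math

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x, sigma=1.7):
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

# Maximum gap between the two CDFs on a dense grid over [-8, 8].
xs = [i / 100.0 for i in range(-800, 801)]
max_gap = max(abs(logistic_cdf(x) - normal_cdf(x)) for x in xs)
# max_gap is roughly 0.01, which is why a small correction suffices
```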
<p>Rather than resorting to tail bounds as in prior work, we directly bridge these
two distributions using an additive correction variable $X_{\rm correction}$:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/mh_test/our_test_visual.png" alt="test_visual" /><br />
<i>
A diagram of our minibatch MH test. On the right we have the full data test that
we want, but we can't use it since $\Delta$ is intractable. Instead, we have
$\Delta + X_{\rm normal}$ (from the left side) and must add a correction $X_{\rm
correction}$.
</i>
</p>
<p>We want to make the LHS and RHS distributions equal, so we add in a correction
$X_{\rm correction}$ which is a symmetric random variable centered at zero.
Adding independent random variables gives a random variable whose distribution
is the convolution of the summands’ distributions. So finding the correction
distribution involves “deconvolution” of a logistic and normal distribution.
It’s not always possible to do this, and several conditions must be met (e.g.
the tails of the normal distribution must be weaker than the logistic) but
luckily for us they are. <a href="https://arxiv.org/abs/1610.06848">In our paper</a> to appear at UAI 2017 we show that
the correction distribution can be approximated by tabulation to essentially
single-precision floating-point accuracy.</p>
<p>In our paper, we also prove theoretical results bounding the error of our test,
and present experimental results showing that our method results in accurate
posterior estimation for a Gaussian Mixture Model, and that it is also highly
sample-efficient in Logistic Regression for classification of MNIST digits.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/mh_test/gaussian_mixture_histogram_results_v8.png" alt="paper_results" /><br />
<i>
Histograms showing the batch sizes used for Metropolis-Hastings for the three
algorithms benchmarked in our paper. The posterior is similar to the earlier
example from the Jupyter Notebook, except generated with one million data
points. Left is our result, the other two are from <a href="https://arxiv.org/abs/1304.5299">Korattikara et al. (2014)</a>, and <a href="http://proceedings.mlr.press/v32/bardenet14.html">Bardenet et al.
(2014)</a>, respectively. Our algorithm uses an average of just 172 data points
each iteration. Note the log-log scale of the histograms.
</i>
</p>
<p>We hope our test is useful to other researchers who are looking to use MCMC
methods in large datasets. We’ve also <a href="https://github.com/BIDData/BIDMach/blob/master/src/main/scala/BIDMach/updaters/MHTest.scala">implemented an open-source version of the
test</a> as part of the <a href="https://github.com/BIDData/BIDMach">BIDMach machine learning library</a> developed at UC
Berkeley.</p>
<hr />
<p>I thank co-authors Xinlei Pan, Haoyu Chen, and especially, John “The Edge” Canny
for their help on this project.</p>
<ul>
<li><a href="https://arxiv.org/abs/1610.06848">An Efficient Minibatch Acceptance Test for Metropolis-Hastings</a>.<br />
Daniel Seita, Xinlei Pan, Haoyu Chen, John Canny.<br />
<em>Uncertainty in Artificial Intelligence</em>, 2017.</li>
</ul>
Wed, 02 Aug 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/08/02/minibatch-metropolis-hastings/

Learning to Learn

<p>A key aspect of intelligence is versatility – the capability of doing many
different things. Current AI systems excel at mastering a single skill, such as
Go, Jeopardy, or even helicopter aerobatics. But, when you instead ask an AI
system to do a variety of seemingly simple problems, it will struggle. A
champion Jeopardy program cannot hold a conversation, and an expert helicopter
controller for aerobatics cannot navigate in new, simple situations such as
locating, navigating to, and hovering over a fire to put it out. In contrast, a
human can act and adapt intelligently to a wide variety of new, unseen
situations. How can we enable our artificial agents to acquire such versatility?</p>
<p>There are several techniques being developed to solve these sorts of problems
and I’ll survey them in this post, as well as discuss a recent technique from
our lab, called <a href="http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/#model-agnostic-meta-learning-maml">model-agnostic
meta-learning</a>. (You can check out the <a href="https://arxiv.org/abs/1703.03400">research paper here</a>, and the code
for the <a href="https://github.com/cbfinn/maml">underlying technique here</a>.)</p>
<p>Current AI systems can master a complex skill from scratch, using an
understandably large amount of time and experience. But if we want our agents to
be able to acquire many skills and adapt to many environments, we cannot afford
to train each skill in each setting from scratch. Instead, we need our agents to
learn how to learn new tasks faster by reusing previous experience, rather than
considering each new task in isolation. This approach of learning to learn, or
meta-learning, is a key stepping stone towards versatile agents that can
continually learn a wide variety of tasks throughout their lifetimes.</p>
<h3 id="so-what-is-learning-to-learn-and-what-has-it-been-used-for">So, what is learning to learn, and what has it been used for?</h3>
<!--more-->
<p>Early approaches to meta-learning date back to the late 1980s and early 1990s,
including <a href="http://people.idsia.ch/~juergen/diploma.html">Jürgen Schmidhuber’s thesis</a> and <a href="http://bengio.abracadoudou.com/publications/pdf/bengio_1991_ijcnn.pdf">work by Yoshua and Samy
Bengio</a>. Recently, meta-learning has become a hot topic, with a flurry of
papers, most commonly using the technique for <a href="https://arxiv.org/abs/1502.03492">hyperparameter</a> and
<a href="https://arxiv.org/abs/1703.00441">neural</a> <a href="https://arxiv.org/abs/1703.04813">network</a> <a href="http://www.cantab.net/users/yutian.chen/Publications/ChenEtAl_ICML17_L2L.pdf">optimization</a>, finding <a href="https://arxiv.org/abs/1611.01578">good</a> <a href="https://arxiv.org/abs/1611.02167">network</a>
<a href="https://arxiv.org/abs/1704.08792">architectures</a>, <a href="https://arxiv.org/abs/1606.04080">few</a>-<a href="https://openreview.net/forum?id=rJY0-Kcll">shot</a> <a href="https://arxiv.org/abs/1703.03400">image</a> <a href="https://arxiv.org/abs/1606.02819">recognition</a>, and
<a href="https://arxiv.org/abs/1611.02779">fast</a> <a href="https://arxiv.org/abs/1611.05763">reinforcement</a> <a href="https://arxiv.org/abs/1703.03400">learning</a>.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/maml/banner.jpg" alt="maml" /><br />
<i>Various recent meta-learning approaches.</i>
</p>
<h3 id="few-shot-learning">Few-Shot Learning</h3>
<p><img src="http://bair.berkeley.edu/blog/assets/maml/segway.jpg" alt="maml" width="160" hspace="30" align="right" />
In 2015, <a href="https://www.cs.cmu.edu/~rsalakhu/papers/LakeEtAl2015Science.pdf">Brendan Lake et al.</a> published a paper that challenged modern machine
learning methods to be able to learn new concepts from one or a few instances of
that concept. As an example, Lake suggested that humans can learn to identify
“novel two-wheel vehicles” from a single picture (e.g. as shown on the right),
whereas machines cannot generalize a concept from just a single image. (Humans
can also draw a character in a new alphabet after seeing just one example).
Along with the paper, Lake included a dataset of handwritten characters,
<a href="https://github.com/brendenlake/omniglot">Omniglot</a>, the “transpose” of <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a>, with 1623 character classes,
each with 20 examples. Two deep learning models quickly followed with papers at
ICML 2016 that used <a href="http://proceedings.mlr.press/v48/santoro16.pdf">memory-augmented neural networks</a> and <a href="https://arxiv.org/abs/1603.05106">sequential
generative models</a>, showing that it is possible for deep models to learn to
learn from a few examples, though not yet at the level of humans.</p>
<h1 id="how-recent-meta-learning-approaches-work">How Recent Meta-learning Approaches Work</h1>
<p>Meta-learning systems are trained by being exposed to a large number of tasks
and are then tested in their ability to learn new tasks; an example of a task
might be classifying a new image within 5 possible classes, given one example of
each class, or learning to efficiently navigate a new maze with only one
traversal through the maze. This differs from many standard machine learning
techniques, which involve training on a single task and testing on held-out
examples from that task.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/maml/meta_example.png" alt="maml" width="600" /><br />
<i>Example meta-learning set-up for few-shot image classification, figure
adapted from <a href="https://openreview.net/forum?id=rJY0-Kcll">Ravi & Larochelle ‘17</a>.</i>
</p>
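<p>Concretely, each “task” in few-shot classification is an episode built by sampling N classes and K examples of each. Below is a minimal Python sketch of episode construction; the dataset layout (a mapping from class label to a list of examples) and the function name are illustrative, not taken from any particular codebase.</p>

```python
import random

def sample_few_shot_task(examples_by_class, n_way, k_shot, k_query):
    """Sample one N-way, K-shot episode from a dict {class: [examples]}."""
    classes = random.sample(sorted(examples_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        shots = random.sample(examples_by_class[cls], k_shot + k_query)
        support += [(x, label) for x in shots[:k_shot]]   # used for adaptation
        query += [(x, label) for x in shots[k_shot:]]     # used for evaluation
    return support, query

# A 5-way, 1-shot episode over a toy dataset of integer "images".
data = {c: list(range(c * 10, c * 10 + 20)) for c in range(10)}
support, query = sample_few_shot_task(data, n_way=5, k_shot=1, k_query=3)
# len(support) == 5 (one shot per class); len(query) == 5 * 3 == 15
```

<p>Meta-training repeatedly draws such episodes from the meta-training classes; meta-testing draws them from held-out classes the model has never seen.</p>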
<p>During meta-learning, the model is trained to learn tasks in the meta-training
set. There are two optimizations at play – the learner, which learns new tasks,
and the meta-learner, which trains the learner. Methods for meta-learning
have typically fallen into one of three categories: recurrent models, metric
learning, and learning optimizers.</p>
<p><strong>Recurrent Models</strong></p>
<p>These approaches train a recurrent model, e.g. an <a href="http://www.bioinf.jku.at/publications/older/2604.pdf">LSTM</a>, to take in the
dataset sequentially and then process new inputs from the task. In an image
classification setting, this might involve passing in the set of (image, label)
pairs of a dataset sequentially, followed by new examples which must be
classified.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/maml/recurrent_models.png" alt="maml" width="600" /><br />
<i>Recurrent model approach for inputs $\mathbf{x}_t$ and corresponding labels
$y_t$, figure from <a href="http://proceedings.mlr.press/v48/santoro16.pdf">Santoro et al. '16</a>.</i>
</p>
<p>The meta-learner uses gradient descent, whereas the learner simply rolls out the
recurrent network. This approach is one of the most general approaches and has
been used for <a href="http://proceedings.mlr.press/v48/santoro16.pdf">few-shot classification and regression</a>, <a href="https://arxiv.org/abs/1707.03141">and</a>
<a href="https://arxiv.org/abs/1611.02779">meta-reinforcement</a> <a href="https://arxiv.org/abs/1611.05763">learning</a>. However, due to its flexibility, this
approach tends to be less (meta-)efficient than other methods, because the
learner network needs to come up with its learning strategy from scratch.</p>
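<p>To make the input format concrete: in <a href="http://proceedings.mlr.press/v48/santoro16.pdf">Santoro et al. ‘16</a>, the label for each example is fed at the <em>next</em> time step, so the network must predict the class of $\mathbf{x}_t$ before seeing $y_t$. A small numpy sketch of building such an episode sequence (the feature and class sizes here are arbitrary placeholders):</p>

```python
import numpy as np

def build_episode_sequence(images, labels, n_classes):
    """Pair each input with the *previous* step's one-hot label, so a
    recurrent learner must predict y_t before that label appears."""
    one_hot = np.eye(n_classes)[labels]                       # (T, n_classes)
    shifted = np.vstack([np.zeros(n_classes), one_hot[:-1]])  # offset by one
    return np.hstack([images, shifted])                       # (T, d + n_classes)

T, d, n_classes = 6, 4, 3
images = np.random.randn(T, d)
labels = np.array([0, 2, 1, 0, 1, 2])
seq = build_episode_sequence(images, labels, n_classes)
# seq has shape (6, 7); an RNN reads it one row at a time and is trained to
# output label y_t at step t.
```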
<p><strong>Metric Learning</strong></p>
<p>This approach involves learning a metric space in which learning is particularly
efficient. This approach has mostly been used for few-shot classification.
Intuitively, if our goal is to learn from a small number of example images, then
a simple approach is to compare the image that you are trying to classify with
the example images that you have. But, as you might imagine, comparing images in
pixel space won’t work well. Instead, you can train a <a href="https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf">Siamese network</a> or
perform comparisons in a <a href="https://arxiv.org/abs/1606.04080">learned metric space</a>. Like the previous approach,
meta-learning is performed using gradient descent (or your favorite neural
network optimizer), whereas the learner corresponds to a comparison scheme, e.g.
nearest neighbors, in the meta-learned metric space. These approaches work
<a href="https://arxiv.org/abs/1703.05175">quite</a> <a href="https://arxiv.org/abs/1703.00767">well</a> for few-shot classification, though they have yet to be
demonstrated in other meta-learning domains such as regression or reinforcement
learning.</p>
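<p>The learner in these methods amounts to nearest-neighbor matching in the learned embedding space. A toy numpy sketch of that comparison scheme, where a fixed linear map stands in for the trained embedding network (the real methods learn the embedding end-to-end):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
embed_W = rng.standard_normal((8, 4))  # stand-in for a learned embedding net

def embed(x):
    return x @ embed_W                 # maps raw inputs into the metric space

def classify(query, support_x, support_y):
    """Label a query by its nearest support example in embedding space."""
    dists = np.linalg.norm(embed(support_x) - embed(query), axis=1)
    return support_y[np.argmin(dists)]

# A 2-way, 1-shot episode: one support example per class.
support_x = rng.standard_normal((2, 8))
support_y = np.array([0, 1])
query = support_x[1] + 0.01 * rng.standard_normal(8)  # near the class-1 shot
pred = classify(query, support_x, support_y)          # nearest neighbor: 1
```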
<p><strong>Learning Optimizers</strong></p>
<p>The final approach is to <a href="http://snowedin.net/tmp/Hochreiter2001.pdf">learn an optimizer</a>. In this method, there is one
network (the meta-learner) which learns to update another network (the learner)
so that the learner effectively learns the task. This approach has been
extensively studied for <a href="https://arxiv.org/abs/1606.01885">better</a> <a href="https://arxiv.org/abs/1606.04474">neural</a> <a href="https://arxiv.org/abs/1703.00441">network</a>
<a href="https://arxiv.org/abs/1703.04813">optimization</a>. The meta-learner is typically a recurrent network so that it
can remember how it previously updated the learner model. The meta-learner can
be trained with reinforcement learning or supervised learning. <a href="https://openreview.net/forum?id=rJY0-Kcll">Ravi &
Larochelle</a> recently demonstrated this approach’s merit for few-shot image
classification, presenting the view that the learner model is an optimization
process that should be learned.</p>
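<p>Structurally, a learned optimizer replaces the hand-designed update rule with a parameterized function of the gradient (and of internal state such as momentum). The numpy sketch below fixes the meta-parameters to plausible values just to show the interface; in the actual methods these parameters belong to a recurrent meta-learner and are themselves trained across many tasks.</p>

```python
import numpy as np

# Meta-learner: a two-parameter update rule u = -a*grad - b*momentum.
# These meta-parameters are fixed here; a real method would meta-train them.
meta_params = {"a": 0.1, "b": 0.05}

def learned_update(grad, state, meta):
    state = 0.9 * state + grad                  # momentum accumulator
    return -meta["a"] * grad - meta["b"] * state, state

# Learner: minimize a toy quadratic f(theta) = 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])
theta, state = np.zeros(2), np.zeros(2)
losses = []
for _ in range(50):
    grad = theta - target                       # analytic gradient of f
    step, state = learned_update(grad, state, meta_params)
    theta = theta + step
    losses.append(0.5 * np.sum((theta - target) ** 2))
# The learner's loss shrinks under the (fixed) learned update rule.
```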
<h1 id="learning-initializations-as-meta-learning">Learning Initializations as Meta-Learning</h1>
<p>Arguably, the biggest success story of transfer learning has been initializing
vision network weights <a href="http://proceedings.mlr.press/v32/donahue14.pdf">using ImageNet pre-training</a>. In particular, when
approaching any new vision task, the well-known paradigm is to first collect
labeled data for the task, acquire a network pre-trained on ImageNet
classification, and then fine-tune the network on the collected data using
gradient descent. Using this approach, neural networks can more effectively
learn new image-based tasks from modestly-sized datasets. However, pre-training
only goes so far. Because the last layers of the network still need to be
heavily adapted to the new task, datasets that are too small, as in the few-shot
setting, will still cause severe overfitting. Furthermore, we unfortunately
don’t have an analogous pre-training scheme for non-vision domains such as
speech, language, and control.<sup id="fnref:pre_training"><a href="#fn:pre_training" class="footnote">1</a></sup> Is there something to learn from
the remarkable success of ImageNet fine-tuning?</p>
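<p>The fine-tuning recipe itself is mechanically simple: keep the pre-trained trunk (frozen, or gently updated) and fit a new head on the small dataset. A schematic numpy sketch, where a frozen random feature map stands in for an ImageNet-trained network and only a new logistic head is trained:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def xent(p, y):  # binary cross-entropy
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

# "Pre-trained" trunk: a frozen feature extractor (stand-in for a network
# whose ImageNet weights we reuse unchanged).
W_trunk = rng.standard_normal((10, 16)) / np.sqrt(10)
def features(x):
    return np.maximum(x @ W_trunk, 0.0)

# Small labeled dataset for the new task.
X = rng.standard_normal((40, 10))
y = (X[:, 0] > 0).astype(float)

# Fine-tune: gradient descent on a fresh logistic head over frozen features.
F = features(X)
w, b = np.zeros(16), 0.0
def predict():
    return 1.0 / (1.0 + np.exp(-(F @ w + b)))

loss_before = xent(predict(), y)
for _ in range(500):
    grad = (predict() - y) / len(X)
    w -= 0.1 * F.T @ grad
    b -= 0.1 * grad.sum()
loss_after = xent(predict(), y)   # lower than loss_before: the head adapted
```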
<h2 id="model-agnostic-meta-learning-maml">Model-Agnostic Meta-Learning (MAML)</h2>
<p>What if we directly optimized for an initial representation that can be
effectively fine-tuned from a small number of examples? This is exactly the idea
behind our recently-proposed algorithm, model-agnostic meta-learning (MAML).
Like other meta-learning methods, MAML trains over a wide range of tasks. It
trains for a representation that can be quickly adapted to a new task, via a few
gradient steps. The meta-learner seeks to find an initialization that is not
only useful for adapting to various problems, but also can be adapted quickly
(in a small number of steps) and efficiently (using only a few examples). Below
is a visualization – suppose we are seeking to find a set of parameters
$\theta$ that are highly adaptable. During the course of meta-learning (the bold
line), MAML optimizes for a set of parameters such that when a gradient step is
taken with respect to a particular task $i$ (the gray lines), the parameters are
close to the optimal parameters $\theta_i^*$ for task $i$.</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/maml/maml.png" alt="maml" width="400" /><br />
<i>Diagram of the MAML approach.</i>
</p>
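<p>The two-level optimization is easiest to see on a toy family of tasks. Suppose task $i$ has loss $L_i(\theta) = \frac{1}{2}(\theta - \theta_i^*)^2$ with a task-specific optimum $\theta_i^*$; then both the inner adaptation step and the meta-gradient through it are analytic. The numpy sketch below mirrors the structure of MAML on this toy problem (it is not the paper’s implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.5, 0.1                        # inner / outer step sizes
task_optima = rng.uniform(-3, 3, size=200)    # theta*_i for each training task

theta = 0.0                                   # the meta-learned initialization
for _ in range(300):
    opt = task_optima[rng.integers(len(task_optima))]
    # Inner (learner) step: one gradient step on L_i from the initialization.
    adapted = theta - alpha * (theta - opt)
    # Outer (meta-learner) step: differentiate L_i(adapted) through the
    # inner step; here dL/dtheta = (adapted - opt) * (1 - alpha).
    theta -= beta * (adapted - opt) * (1 - alpha)

# From the meta-learned theta, one inner step on a brand-new task halves the
# distance to that task's optimum; theta also ends near the center of the
# task distribution, so it is a good starting point for all of the tasks.
new_opt = 2.0
adapted = theta - alpha * (theta - new_opt)
```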
<p>This approach is quite simple, and has a number of advantages. It doesn’t make
any assumptions on the form of the model. It is quite efficient – there are no
additional parameters introduced for meta-learning and the learner’s strategy
uses a known optimization process (gradient descent), rather than having to come
up with one from scratch. Lastly, it can be easily applied to a number of
domains, including classification, regression, and reinforcement learning.</p>
<p>Despite the simplicity of the approach, we were surprised to find that the
method was able to substantially outperform a number of existing approaches on
popular few-shot image classification benchmarks, Omniglot and
MiniImageNet<sup id="fnref:mini_image"><a href="#fn:mini_image" class="footnote">2</a></sup>, including existing approaches that were much more
complex or domain specific. Beyond classification, we also tried to learn how
to adapt a simulated robot’s behavior to different goals, akin to the motivation
at the top of this blog post – versatility. To do so, we combined MAML with
policy gradient methods for reinforcement learning. MAML discovered a policy
which let a simulated robot adapt its locomotion direction and speed in a single
gradient update. See videos below:</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/maml/cheetah_direc.gif" alt="maml" /><br />
<i>MAML on HalfCheetah.</i>
</p>
<p style="text-align:center;">
<img src="http://bair.berkeley.edu/blog/assets/maml/ant_maml.gif" alt="maml" /><br />
<i>MAML on Ant.</i>
</p>
<p>The generality of the method — it can be combined with any model smooth
enough for gradient-based optimization — makes MAML applicable to a wide
range of domains and learning objectives beyond those explored in the paper.</p>
<p>We hope that MAML’s simple approach for effectively teaching agents to adapt to
a variety of scenarios will bring us one step closer to developing versatile
agents that can learn a variety of skills in real-world settings.</p>
<hr />
<p><em>I would like to thank Sergey Levine and Pieter Abbeel for their valuable
feedback.</em></p>
<p><strong>The last part of this post was based on the following research paper</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/1703.03400">Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks</a>. <br />
C. Finn, P. Abbeel, S. Levine. In ICML, 2017. (<a href="https://arxiv.org/pdf/1703.03400.pdf">pdf</a>, <a href="https://github.com/cbfinn/maml">code</a>)</li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:pre_training">
<p>Though, researchers have developed domain-agnostic
initialization schemes to encourage <a href="http://proceedings.mlr.press/v9/glorot10a.html">well</a>-<a href="https://arxiv.org/abs/1602.07868">conditioned</a>
<a href="https://arxiv.org/abs/1312.6120">gradients</a> and using <a href="https://arxiv.org/abs/1511.06856">data-dependent</a> <a href="https://arxiv.org/abs/1511.06422">normalization</a>. <a href="#fnref:pre_training" class="reversefootnote">↩</a></p>
</li>
<li id="fn:mini_image">
<p>Introduced by Vinyals et al. ‘16 and Ravi & Larochelle ‘17, the
MiniImageNet benchmark is the same as Omniglot but uses real RGB images from
a subset of the ImageNet dataset. <a href="#fnref:mini_image" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Tue, 18 Jul 2017 02:00:00 -0700
http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/
http://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/