The Berkeley Artificial Intelligence Research Blog (BAIR Blog)
http://bair.berkeley.edu/blog/
Sequence Modeling Solutions for Reinforcement Learning Problems
<!-- twitter -->
<meta name="twitter:title" content="Sequence Modeling Solutions for RL Problems" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/trajectory_transformer/humanoid_padded.png" />
<meta name="keywords" content="trajectory, transformer, reinforcement, learning, RL" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Michael Janner" />
<title>Sequence Modeling Solutions for Reinforcement Learning Problems</title>
<!-- begin section I: introduction -->
<p style="text-align:center; margin-top:-40px;">
<br />
<video width="100%" autoplay="" playsinline="" muted="">
<source src="https://bair.berkeley.edu/static/blog/trajectory_transformer/rollout_transformer_compressed.mp4" type="video/mp4" />
</video>
<video width="100%" autoplay="" playsinline="" muted="">
<source src="https://bair.berkeley.edu/static/blog/trajectory_transformer/rollout_single_compressed.mp4" type="video/mp4" />
</video>
<p width="80%" style="text-align:center; margin-left:10%; margin-right:10%; padding-bottom: -10px;">
<i style="font-size: 0.9em;">
Long-horizon predictions of (top) the <b><span style="color:#D62728;">Trajectory Transformer</span></b> compared to those of (bottom) a <b><span style="color:#D62728;">single-step</span></b> dynamics model.
</i>
</p>
<br />
<p>
Modern <a href="https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html">machine</a> <a href="https://arxiv.org/abs/1807.03748">learning</a> <a href="https://www.nature.com/articles/s41586-021-03819-2">success</a> <a href="https://arxiv.org/abs/2002.05709">stories</a> often have one thing in common: they use methods that scale gracefully with ever-increasing amounts of data.
This is particularly clear from recent advances in sequence modeling, where simply increasing the size of a stable architecture and its training set leads to <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">qualitatively</a> <a href="https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">different</a> <a href="https://arxiv.org/abs/2005.14165">capabilities</a>.<sup id="fnref:anderson"><a href="#fn:anderson" class="footnote"><font size="-2">1</font></a></sup>
</p>
<p>
Meanwhile, the situation in reinforcement learning has proven more complicated.
While it has been possible to apply reinforcement learning algorithms to <a href="https://journals.sagepub.com/doi/full/10.1177/0278364917710318">large</a>-<a href="https://www.science.org/doi/10.1126/science.aar6404">scale</a> <a href="https://arxiv.org/abs/1912.06680">problems</a>, generally there has been much more friction in doing so.
In this post, we explore whether we can alleviate these difficulties by tackling the reinforcement learning problem with the toolbox of sequence modeling.
The end result is a generative model of trajectories that looks like a <a href="https://arxiv.org/abs/1706.03762">large language model</a> and a planning algorithm that looks like <a href="https://kilthub.cmu.edu/articles/journal_contribution/Speech_understanding_systems_summary_of_results_of_the_five-year_research_effort_at_Carnegie-Mellon_University_/6609821/1">beam search</a>.
Code for the approach can be found <a href="https://github.com/JannerM/trajectory-transformer">here</a>.
</p>
<!--more-->
<h3 id="models">The Trajectory Transformer</h3>
<p>
The standard framing of reinforcement learning focuses on decomposing a complicated long-horizon problem into smaller, more tractable subproblems, leading to dynamic programming methods like $Q$-learning and an emphasis on Markovian dynamics models.
However, we can also view reinforcement learning as analogous to a sequence generation problem, with the goal being to produce a sequence of actions that, when enacted in an environment, will yield a sequence of high rewards.
</p>
<p>
Taking this view to its logical conclusion, we begin by modeling the trajectory data provided to reinforcement learning algorithms with a Transformer architecture, the current tool of choice for natural language modeling.
We treat these trajectories as unstructured sequences of discretized states, actions, and rewards, and train the Transformer architecture using the standard cross-entropy loss.
Modeling all trajectory data with a single high-capacity model and scalable training objective, as opposed to separate procedures for dynamics models, policies, and $Q$-functions, allows for a more streamlined approach that removes much of the usual complexity.
</p>
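<p>
As a concrete illustration of this tokenization, here is a minimal sketch. The paper's two Trajectory Transformer variants differ in how they discretize (uniform versus quantile binning); the uniform scheme below, along with its bounds, bin count, and helper names, is illustrative rather than the exact implementation:
</p>

```python
import numpy as np

def discretize(x, low, high, n_bins=100):
    """Map each continuous dimension of x to an integer token in [0, n_bins)."""
    fraction = (np.asarray(x, dtype=float) - low) / (high - low)
    return np.clip((fraction * n_bins).astype(int), 0, n_bins - 1)

def trajectory_to_tokens(states, actions, rewards, bounds, n_bins=100):
    """Flatten a trajectory into one token sequence:
    (s_1^1, ..., s_1^N, a_1^1, ..., a_1^M, r_1, s_2^1, ...),
    ready for standard autoregressive cross-entropy training."""
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(s, *bounds["s"], n_bins))
        tokens.extend(discretize(a, *bounds["a"], n_bins))
        tokens.extend(discretize([r], *bounds["r"], n_bins))
    return np.array(tokens)
```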
<center>
<img width="80%" style="padding-top: 20px; padding-bottom: 20px" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/architecture.png" />
<br />
<p width="80%" style="text-align:center; margin-left:10%; margin-right:10%; padding-bottom: 10px;">
<i style="font-size: 0.9em;">
We model the distribution over $N$-dimensional states $\mathbf{s}_t$, $M$-dimensional actions $\mathbf{a}_t$, and scalar rewards $r_t$ using a Transformer architecture.
</i>
</p>
</center>
<!-- begin section II: models -->
<h3 id="models">Transformers as dynamics models</h3>
<p>
In many model-based reinforcement learning methods, compounding prediction errors cause long-horizon rollouts to be too unreliable to use for control, necessitating either <a href="https://arxiv.org/abs/1909.11652">short-horizon planning</a> or <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.6005&rep=rep1&type=pdf">Dyna-style</a> combinations of <a href="https://arxiv.org/abs/1906.08253">truncated model predictions and value functions</a>.
In comparison, we find that the Trajectory Transformer is a substantially more accurate long-horizon predictor than conventional single-step dynamics models.
</p>
<center>
<table>
<tr>
<th width="45%" style="border-top: 0px;">
<img src="https://bair.berkeley.edu/static/blog/trajectory_transformer/error_blog.png" width="100%" />
</th>
<th width="55%" style="border-top: 0px;">
<b style="font-size: 0.8em;">Transformer</b>
<br />
<img src="https://bair.berkeley.edu/static/blog/trajectory_transformer/outlines_transformer.png" width="100%" />
<b style="font-size: 0.8em;">Single-step</b>
<br />
<img src="https://bair.berkeley.edu/static/blog/trajectory_transformer/outlines_single_step.png" width="100%" />
<br />
</th>
</tr>
</table>
<div style="width: 90%;">
<p style="text-align:center;">
<i style="font-size: 0.9em;">Whereas the single-step model suffers from compounding errors that make its long-horizon predictions physically implausible, the Trajectory Transformer's predictions remain visually indistinguishable from <a href="https://people.eecs.berkeley.edu/~janner/trajectory-transformer/blog/outlines_reference.png">rollouts in the reference environment</a>.</i>
</p>
</div>
</center>
<br clear="left" />
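<p>
The compounding-error failure mode of single-step models is easy to reproduce in a toy setting: a one-step model with a small multiplicative bias, iterated on its own outputs, drifts from the true trajectory at a rate that grows with the horizon. The scalar dynamics and bias values below are illustrative, not taken from the paper:
</p>

```python
def rollout_errors(horizon, dynamics_gain=1.1, model_bias=1.05):
    """Iterate a slightly biased one-step model on its own predictions
    and track how far it drifts from the true (scalar) dynamics."""
    true_x, pred_x = 1.0, 1.0
    errors = []
    for _ in range(horizon):
        true_x = dynamics_gain * true_x               # true dynamics
        pred_x = dynamics_gain * model_bias * pred_x  # biased model, fed its own output
        errors.append(abs(pred_x - true_x))
    return errors
```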
<p>
This result is exciting because planning with learned models is notoriously finicky, with neural network dynamics models often being too inaccurate to benefit from more sophisticated planning routines.
A higher quality predictive model such as the Trajectory Transformer opens the door for importing effective trajectory optimizers that previously would have only served to <a href="https://arxiv.org/abs/1802.10592">exploit the learned model</a>.
</p>
<p>
We can also inspect the Trajectory Transformer as if it were a standard language model.
A common strategy in machine translation, for example, is to <a href="https://nlp.seas.harvard.edu/2018/04/03/attention.html">visualize the intermediate attention weights</a> as a proxy for token dependencies.
The same visualization applied here reveals two salient patterns:
</p>
<center>
<img width="30%" style="padding-top: 10px; padding-right: 60px;" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/markov.png" />
<img width="30%" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/striated.png" />
<br />
<p width="80%" style="text-align:center; margin-left:10%; margin-right:10%; padding-top: 20px; padding-bottom: 10px;">
<i style="font-size: 0.9em;">
Attention patterns of the Trajectory Transformer, showing (left) a discovered <b><span style="color:#D62728;">Markovian strategy</span></b> and (right) an approach with <b><span style="color:#D62728;">action smoothing</span></b>.
</i>
</p>
</center>
<p>
In the first, state and action predictions depend primarily on the immediately preceding transition, resembling a learned Markov property.
In the second, state dimension predictions depend most strongly on the corresponding dimensions of all previous states, and action dimensions depend primarily on all prior actions.
While the second dependency violates the usual intuition of actions being a function of the prior state in behavior-cloned policies, this is reminiscent of the action smoothing used in some <a href="https://arxiv.org/abs/1909.11652">trajectory optimization algorithms</a> to enforce slowly varying control sequences.
</p>
<!-- begin section II: planning -->
<h3 id="planning">Beam search as trajectory optimizer</h3>
<p>
The simplest model-predictive control routine is composed of three steps: <b><span style="color:#D62728;">(1)</span></b> using a model to search for a sequence of actions that lead to a desired outcome; <b><span style="color:#D62728;">(2)</span></b> enacting the first<sup id="fnref:mpc"><a href="#fn:mpc" class="footnote"><font size="-2">2</font></a></sup> of these actions in the actual environment; and <b><span style="color:#D62728;">(3)</span></b> estimating the new state of the environment to begin step (1) again.
Once a model has been chosen (or trained), most of the important design decisions lie in the first step of that loop, with differences in action search strategies leading to a wide array of trajectory optimization algorithms.
</p>
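<p>
The loop itself is only a few lines. The sketch below assumes a gym-style environment interface and an arbitrary <code>plan</code> routine; both are hypothetical placeholders for whatever model and search strategy fill step (1):
</p>

```python
def model_predictive_control(env, plan, horizon, max_steps):
    """Generic MPC loop: search with the model, enact the first action, replan."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        actions = plan(state, horizon)                     # (1) search for an action sequence
        state, reward, done, info = env.step(actions[0])   # (2) enact only the first action
        total_reward += reward                             # (3) observe the new state and repeat
        if done:
            break
    return total_reward
```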
<p>
Continuing with the theme of pulling from the sequence modeling toolkit to tackle reinforcement learning problems, we ask whether the go-to technique for decoding neural language models can also serve as an effective trajectory optimizer.
This technique, known as <a href="https://kilthub.cmu.edu/articles/journal_contribution/Speech_understanding_systems_summary_of_results_of_the_five-year_research_effort_at_Carnegie-Mellon_University_/6609821/1">beam search</a>, is a pruned breadth-first search algorithm that has found remarkably consistent use since the earliest days of computational linguistics.
We explore variations of beam search and instantiate its use as a model-based planner in three different settings:
</p>
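<p>
In token space, beam search reduces to a short loop; the sketch below is minimal and illustrative (the actual planner also samples candidates and treats state, action, and reward tokens differently):
</p>

```python
import numpy as np

def beam_search(next_log_probs, vocab_size, horizon, beam_width):
    """Pruned breadth-first decoding: expand every beam by every token,
    then keep only the `beam_width` highest-scoring partial sequences.
    `next_log_probs(seq)` returns log-probabilities over the next token."""
    beams = [([], 0.0)]
    for _ in range(horizon):
        candidates = []
        for seq, score in beams:
            log_probs = next_log_probs(seq)
            for token in range(vocab_size):
                candidates.append((seq + [token], score + log_probs[token]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]
```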
<div>
<ol>
<li><p>
<b><span style="color:#D62728;">Imitation:</span></b> If we use beam search without modification, we sample trajectories that are probable under the distribution of the training data. Enacting the first action in the generated plans gives us a long-horizon model-based variant of imitation learning.
</p></li>
<li><p>
<b><span style="color:#D62728;">Goal-conditioned RL:</span></b> Conditioning the Transformer on <i>future</i> desired context alongside previous states, actions, and rewards yields a goal-reaching method. This works by recontextualizing past data as optimal for some task, in the same spirit as <a href="https://arxiv.org/abs/1707.01495">hindsight relabeling</a>.
</p>
<center>
<img width="28%" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/0.png" />
<img width="28%" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/1.png" />
<img width="28%" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/2.png" />
<br />
<img width="2%" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/rolloutblack-1.png" />
Start
<img width="2%" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/rolloutblue-1.png" />
Goal
</center>
<p width="90%" style="text-align:center; padding-top: 20px; padding-bottom: 10px;">
<i style="font-size: 0.9em;">
Paths taken by the goal-conditioned beam-search planner in a four-rooms environment.
</i>
</p>
</li>
<li><p>
<b><span style="color:#D62728;">Offline RL:</span></b> If we replace transitions' log probabilities with their rewards (their <a href="https://arxiv.org/abs/1805.00909">log probability of optimality</a>), we can use the same beam search framework to optimize for reward-maximizing behavior.
We find that this simple combination of a trajectory-level sequence model and beam search decoding performs on par with the best prior offline reinforcement learning algorithms <i>without</i> the usual ingredients of standard offline reinforcement learning algorithms: <a href="https://arxiv.org/abs/1911.11361">behavior policy regularization</a> or explicit <a href="https://arxiv.org/abs/2006.04779">pessimism</a> in the case of model-free algorithms, or <a href="https://arxiv.org/abs/2005.05951">ensembles</a> or other <a href="https://arxiv.org/abs/2005.13239">epistemic uncertainty estimators</a> in the case of model-based algorithms. All of these roles are fulfilled by the same Transformer model and come for free from maximum likelihood training and beam-search decoding.
</p></li>
</ol>
</div>
<center>
<img width="80%" style="padding-top: 0px;" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/d4rl.png" />
<br />
<img width="80%" src="https://bair.berkeley.edu/static/blog/trajectory_transformer/bar.png" />
<br />
<p style="text-align:center; margin-left:10%; margin-right:10%; padding-top: 20px; padding-bottom: 10px;">
<i style="font-size: 0.9em;">
Performance on the locomotion environments in the <a href="https://arxiv.org/abs/2004.07219">D4RL offline benchmark suite.</a> We compare two variants of the Trajectory Transformer (TT) — differing in how they discretize continuous inputs — with model-based, value-based, and recently proposed sequence-modeling algorithms.
</i>
</p>
</center>
<br />
<!-- begin section II: outlook -->
<h3 id="model">What does this mean for reinforcement learning?</h3>
<p>
The Trajectory Transformer is something of an exercise in minimalism.
Despite lacking most of the common ingredients of a reinforcement learning algorithm, it performs on par with approaches that have been the result of much collective effort and tuning.
Taken together with the concurrent <a href="https://arxiv.org/abs/2106.01345">Decision Transformer</a>, this result highlights that scalable architectures and stable training objectives can sidestep some of the difficulties of reinforcement learning in practice.
</p>
<p>
However, the simplicity of the proposed approach gives it predictable weaknesses.
Because the Transformer is trained with a maximum likelihood objective, it is more dependent on the training distribution than a conventional dynamic programming algorithm.
Though there is value in studying the most streamlined approaches that can tackle reinforcement learning problems, it is possible that the most effective instantiation of this framework will come from combinations of the sequence modeling and reinforcement learning toolboxes.
</p>
<p>
We can get a preview of how this would work with a fairly straightforward combination: plan using the Trajectory Transformer as before, but use a $Q$-function trained via dynamic programming as a search heuristic to guide the beam search planning procedure.
We would expect this to be important in sparse-reward, long-horizon tasks, since these pose particularly difficult search problems.
To instantiate this idea, we use the $Q$-function from the <a href="https://arxiv.org/abs/2110.06169">implicit $Q$-learning</a> (IQL) algorithm and leave the Trajectory Transformer otherwise unmodified.
We denote the combination <b>TT</b>$_{\color{#999999}{(+Q)}}$:
</p>
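<p>
One way such guidance can enter the planner is as a terminal-value bonus when scoring candidate plans. The function below is a hypothetical sketch of that shape, not the exact implementation:
</p>

```python
def q_guided_scores(candidates, q_fn):
    """Score candidate rollouts by cumulative predicted reward plus a
    terminal value estimate Q(s_T, a_T) from a separately trained Q-function.
    Each candidate is a (states, actions, rewards) tuple."""
    scores = []
    for states, actions, rewards in candidates:
        terminal_value = q_fn(states[-1], actions[-1])
        scores.append(sum(rewards) + terminal_value)
    return scores
```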
<center>
<img width="80%" style="padding-top: 20px;" src="https://people.eecs.berkeley.edu/~janner/trajectory-transformer/blog/antmaze.png" />
<p style="text-align:center; margin-left:10%; margin-right:10%; padding-top: 20px; padding-bottom: 10px;">
<i style="font-size: 0.9em;">
Guiding the Trajectory Transformer's plans with a $Q$-function trained via dynamic programming (TT$_{\color{#999999}{(+Q)}}$) is a straightforward way of improving empirical performance compared to model-free (CQL, IQL) and return-conditioning (DT) approaches.
We evaluate this effect in the sparse-reward, long-horizon <a href="https://arxiv.org/abs/2004.07219">AntMaze goal-reaching tasks</a>.
</i>
</p>
</center>
<br />
<p>
Because the planning procedure only uses the $Q$-function as a way to filter promising sequences, it is not as prone to local inaccuracies in value predictions as policy-extraction-based methods like <a href="https://arxiv.org/abs/2006.04779">CQL</a> and <a href="https://arxiv.org/abs/2110.06169">IQL</a>.
However, it still benefits from the temporal compositionality of dynamic programming and planning, and so it outperforms return-conditioning approaches that rely more on complete demonstrations.
</p>
<p>
Planning with a terminal value function is a time-tested strategy, so $Q$-guided beam search is arguably the simplest way of combining sequence modeling with conventional reinforcement learning.
This result is encouraging not because it is new algorithmically, but because it demonstrates the empirical benefits even straightforward combinations can bring.
It is possible that designing a sequence model from the ground up for this purpose, so as to retain the scalability of Transformers while incorporating the principles of dynamic programming, would be an even more effective way of leveraging the strengths of each toolkit.
</p>
<hr />
<p>
This post is based on the following paper:
</p>
<ul>
<li>
<a href="https://arxiv.org/abs/2106.02039"><strong>Offline Reinforcement Learning as One Big Sequence Modeling Problem</strong></a>
<br />
<a href="http://michaeljanner.com/">Michael Janner</a>, <a href="https://scholar.google.com/citations?user=qlwwdfEAAAAJ&hl=en">Qiyang Li</a>, and <a href="https://people.eecs.berkeley.edu/~svlevine/">Sergey Levine</a>
<br />
<em>Neural Information Processing Systems (NeurIPS), 2021.</em>
<br />
<a href="https://github.com/JannerM/trajectory-transformer">Open-source code</a>
</li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:anderson">
<p>
Though qualitative advances in capabilities from scale alone might seem surprising, physicists have long known that <a href="https://cse-robotics.engr.tamu.edu/dshell/cs689/papers/anderson72more_is_different.pdf">more is different</a>.
<a href="#fnref:anderson" class="reversefootnote">↩</a>
</p>
</li>
<li id="fn:mpc">
<p>
You could also enact multiple actions from the sequence, or act according to a closed-loop controller until there has been enough time to generate a new plan.
<a href="#fnref:mpc" class="reversefootnote">↩</a>
</p>
</li>
</ol>
</div>
<hr />
Fri, 19 Nov 2021 01:00:00 -0800
http://bair.berkeley.edu/blog/2021/11/19/trajectory-transformer/
<title>Which Mutual Information Representation Learning Objectives are Sufficient for Control?</title>
<!-- twitter -->
<meta name="twitter:title" content="Which Mutual Information Representation Learning Objectives are Sufficient for Control?" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/overview.png" />
<meta name="keywords" content="reinforcement learning, representation learning" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Kate Rakelly" />
<p>Processing raw sensory inputs is crucial for applying deep RL algorithms to real-world problems.
For example, autonomous vehicles must make decisions about how to drive safely given information flowing from cameras, radar, and microphones about the conditions of the road, traffic signals, and other cars and pedestrians.
However, direct “end-to-end” RL that maps sensor data to actions (Figure 1, left) can be very difficult because the inputs are high-dimensional, noisy, and contain redundant information.
Instead, the challenge is often broken down into two problems (Figure 1, right): (1) extract a representation of the sensory inputs that retains only the relevant information, and (2) perform RL with these representations of the inputs as the system state.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/image1.png" width="70%" />
<br />
<i><b>Figure 1. </b>Representation learning can extract compact representations of states for RL.</i>
</p>
<p>A wide variety of algorithms have been proposed to learn lossy state representations in an unsupervised fashion (see this recent <a href="https://icml.cc/virtual/2021/tutorial/10843">tutorial</a> for an overview).
Recently, contrastive learning methods have proven effective on RL benchmarks such as Atari and DMControl (<a href="https://arxiv.org/abs/1807.03748">Oord et al. 2018</a>, <a href="https://arxiv.org/abs/2009.08319">Stooke et al. 2020</a>, <a href="https://arxiv.org/abs/2106.04799">Schwarzer et al. 2021</a>), as well as for real-world robotic learning (<a href="https://arxiv.org/abs/2012.07975">Zhan et al.</a>).
While we could ask which objectives are better in which circumstances, there is an even more basic question at hand: are the representations learned via these methods guaranteed to be sufficient for control?
In other words, do they suffice to learn the optimal policy, or might they discard some important information, making it impossible to solve the control problem?
For example, in the self-driving car scenario, if the representation discards the state of stoplights, the vehicle would be unable to drive safely.
Surprisingly, we find that some widely used objectives are not sufficient, and in fact do discard information that may be needed for downstream tasks.</p>
<!--more-->
<h2 id="defining-the-sufficiency-of-a-state-representation">Defining the Sufficiency of a State Representation</h2>
<p>As introduced above, a state representation is a function of the raw sensory inputs that discards irrelevant and redundant information.
Formally, we define a state representation $\phi_Z$ as a stochastic mapping from the original state space $\mathcal{S}$ (the raw inputs from all the car’s sensors) to a representation space $\mathcal{Z}$: $p(Z | S=s)$.
In our analysis, we assume that the original state $\mathcal{S}$ is Markovian, so each state representation is a function of only the current state.
We depict the representation learning problem as a graphical model in Figure 2.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/graphical_model.png" width="30%" />
<br />
<i><b>Figure 2. </b>The representation learning problem in RL as a graphical model.</i>
</p>
<p>We will say that a representation is sufficient if it is guaranteed that an RL algorithm using that representation can learn the optimal policy.
We make use of a result from <a href="http://rbr.cs.umass.edu/aimath06/proceedings/P21.pdf">Li et al. 2006</a>, which proves that if a state representation is capable of representing the optimal $Q$-function, then $Q$-learning run with that representation as input is guaranteed to converge to the same solution as in the original MDP (if you’re interested, see Theorem 4 in that paper).
So to test if a representation is sufficient, we can check if it is able to represent the optimal $Q$-function.
Since we assume we don’t have access to a task reward during representation learning, to call a representation sufficient we require that it can represent the optimal $Q$-functions for all possible reward functions in the given MDP.</p>
<h2 id="analyzing-representations-learned-via-mi-maximization">Analyzing Representations learned via MI Maximization</h2>
<p>Now that we’ve established how we will evaluate representations, let’s turn to the methods of learning them.
As mentioned above, we aim to study the popular class of contrastive learning methods.
These methods can largely be understood as maximizing a mutual information (MI) objective involving states and actions.
To simplify the analysis, we analyze representation learning in isolation from the other aspects of RL by assuming the existence of an offline dataset on which to perform representation learning.
This paradigm of offline representation learning followed by online RL is becoming increasingly popular, particularly in applications such as robotics where collecting data is onerous (<a href="https://arxiv.org/abs/2012.07975">Zhan et al. 2020</a>, <a href="https://arxiv.org/abs/1911.12247">Kipf et al. 2020</a>).
Our question is therefore whether the objective is sufficient on its own, not as an auxiliary objective for RL.
We assume the dataset has full support on the state space, which can be guaranteed by an epsilon-greedy exploration policy, for example.
An objective may have more than one maximizing representation, so we call a representation learning <em>objective</em> sufficient if <em>all</em> the representations that maximize that objective are sufficient.
We will analyze three representative objectives from the literature in terms of sufficiency.</p>
<h3 id="representations-learned-by-maximizing-forward-information">Representations Learned by Maximizing “Forward Information”</h3>
<p>We begin with an objective that seems likely to retain a great deal of state information in the representation.
It is closely related to learning a forward dynamics model in latent representation space, and to methods proposed in prior works (<a href="https://arxiv.org/abs/1810.01257">Nachum et al. 2018</a>, <a href="https://arxiv.org/abs/2003.01086">Shu et al. 2020</a>, <a href="https://arxiv.org/abs/2007.05929">Schwarzer et al. 2021</a>): $J_{fwd} = I(Z_{t+1}; Z_t, A_t)$.
Intuitively, this objective seeks a representation in which the current state and action are maximally informative of the representation of the next state.
Therefore, everything predictable in the original state $\mathcal{S}$ should be preserved in $\mathcal{Z}$, since this would maximize the MI.
Formalizing this intuition, we are able to prove that all representations learned via this objective are guaranteed to be sufficient (see the proof of Proposition 1 in the paper).</p>
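<p>
In practice, mutual information objectives like $J_{fwd}$ are estimated with contrastive bounds. Below is a minimal numpy sketch of the InfoNCE loss, where each row pairs a context $(z_t, a_t)$ with its true next representation (on the diagonal) and uses the rest of the batch as negatives; the critic producing the scores is assumed to be learned elsewhere:
</p>

```python
import numpy as np

def info_nce_loss(scores):
    """InfoNCE cross-entropy: scores[i, j] is a critic's score for pairing
    context i with candidate next-representation j; positives sit on the
    diagonal. Minimizing this maximizes a lower bound on the mutual information."""
    log_normalizers = np.log(np.exp(scores).sum(axis=1))
    return float(np.mean(log_normalizers - np.diag(scores)))
```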
<p>While reassuring that $J_{fwd}$ is sufficient, it’s worth noting that any state information that is temporally correlated will be retained in representations learned via this objective, no matter how irrelevant to the task.
For example, in the driving scenario, objects in the agent’s field of vision that are not on the road or sidewalk would all be represented, even though they are irrelevant to driving.
Is there another objective that can learn sufficient but <em>lossier</em> representations?</p>
<h3 id="representations-learned-by-maximizing-inverse-information">Representations Learned by Maximizing “Inverse Information”</h3>
<p>Next, we consider what we term an “inverse information” objective: $J_{inv} = I(Z_{t+k}; A_t | Z_t)$.
One way to maximize this objective is by learning an inverse dynamics model – predicting the action given the current and next state – and many prior works have employed a version of this objective (<a href="https://arxiv.org/abs/1606.07419">Agrawal et al. 2016</a>, <a href="https://arxiv.org/abs/1611.07507">Gregor et al. 2016</a>, <a href="https://arxiv.org/abs/1804.10689">Zhang et al. 2018</a> to name a few).
Intuitively, this objective is appealing because it preserves all the state information that the agent can influence with its actions.
It therefore may seem like a good candidate for a sufficient objective that discards more information than $J_{fwd}$.
However, we can actually construct a realistic scenario in which a representation that maximizes this objective is not sufficient.</p>
<p>For example, consider the MDP shown on the left side of Figure 4 in which an autonomous vehicle is approaching a traffic light.
The agent has two actions available, stop or go.
The reward for following traffic rules depends on the color of the stoplight, and is denoted by a red X (low reward) and green check mark (high reward).
On the right side of the figure, we show a state representation in which the color of the stoplight is not represented in the two states on the left; they are aliased and represented as a single state.
This representation is not sufficient, since from the aliased state it is not clear whether the agent should “stop” or “go” to receive the reward.
However, $J_{inv}$ is maximized because the action taken is still exactly predictable given each pair of states.
In other words, the agent has no control over the stoplight, so representing it does not increase MI.
Since $J_{inv}$ is maximized by this insufficient representation, we can conclude that the objective is not sufficient.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/inv_counterexample.png" width="70%" />
<br />
<i><b>Figure 4. </b>Counterexample proving the insufficiency of $J_{inv}$.</i>
</p>
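<p>
The counterexample can be checked numerically: under the aliased representation, the action is still exactly recoverable from each $(z_t, z_{t+1})$ pair, so $H(A \mid Z_t, Z_{t+1}) = 0$ and the conditional MI in $J_{inv}$ is at its maximum. A small sketch with hypothetical transition tuples for the aliased stoplight state:
</p>

```python
from collections import defaultdict
import math

# Hypothetical transitions under the aliased representation: both
# "approach-red" and "approach-green" map to the single code "approach".
transitions = [
    ("approach", "stop", "stopped"),
    ("approach", "go",   "moving"),
]

def conditional_entropy_of_action(transitions):
    """H(A | Z_t, Z_{t+1}) over uniformly weighted (z, a, z') transitions."""
    by_pair = defaultdict(list)
    for z, a, z_next in transitions:
        by_pair[(z, z_next)].append(a)
    h = 0.0
    for actions in by_pair.values():
        p_pair = len(actions) / len(transitions)
        for a in set(actions):
            p = actions.count(a) / len(actions)
            h -= p_pair * p * math.log2(p)
    return h
```

Since the entropy is zero, representing the stoplight's color would not increase $J_{inv}$, even though the reward depends on it.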
<p>Since the reward depends on the stoplight, perhaps we can remedy the issue by additionally requiring the representation to be capable of predicting the immediate reward at each state.
However, this is still not enough to guarantee sufficiency: the representation on the right side of Figure 4 is still a counterexample, since the aliased states have the same reward.
The crux of the problem is that representing the action that connects two states is not enough to be able to choose the best action.
Still, while $J_{inv}$ is insufficient in the general case, it would be revealing to characterize the set of MDPs for which $J_{inv}$ can be proven to be sufficient.
We see this as an interesting future direction.</p>
<h3 id="representations-learned-by-maximizing-state-information">Representations Learned by Maximizing “State Information”</h3>
<p>The final objective we consider resembles $J_{fwd}$ but omits the action: $J_{state} = I(Z_t; Z_{t+1})$ (see <a href="https://arxiv.org/abs/1807.03748">Oord et al. 2018</a>, <a href="https://arxiv.org/abs/1906.08226">Anand et al. 2019</a>, <a href="https://arxiv.org/abs/2009.08319">Stooke et al. 2020</a>).
Does omitting the action from the MI objective impact its sufficiency?
It turns out the answer is yes.
The intuition is that maximizing this objective can yield insufficient representations that alias states whose transition distributions differ only with respect to the action.
For example, consider a scenario of a car navigating to a city, depicted below in Figure 5.
There are four states from which the car can take actions “turn right” or “turn left.”
The optimal policy first takes a left turn, then a right turn, or vice versa.
Now consider the state representation shown on the right that aliases $s_2$ and $s_3$ into a single state we’ll call $z$.
If we assume the policy distribution is uniform over left and right turns (a reasonable scenario for a driving dataset collected with an exploration policy), then this representation maximizes $J_{state}$.
However, it can’t represent the optimal policy because the agent doesn’t know whether to go right or left from $z$.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/state_counterexample.png" width="60%" />
<br />
<i><b>Figure 5. </b>Counterexample proving the insufficiency of $J_{state}$.</i>
</p>
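The same point can be checked numerically with a small 4-state version of the driving example (our own construction, not the paper's code): under the uniform policy, $s_2$ and $s_3$ have identical next-state distributions, so merging them into $z$ preserves $I(Z_t; Z_{t+1})$ even though $z$ no longer determines which turn reaches the city.

```python
import math
import random
from collections import Counter

random.seed(1)

# Hypothetical transitions: from s2 a right turn reaches the city, from s3 a
# left turn does; the wrong turn leads to a dead end.
T = {
    ("s1", "left"): "s2", ("s1", "right"): "s3",
    ("s2", "right"): "city", ("s2", "left"): "dead_end",
    ("s3", "left"): "city", ("s3", "right"): "dead_end",
}

def mutual_info(pairs):
    """Empirical I(Z_t; Z_{t+1}) in bits from a list of (z, z') pairs."""
    n = len(pairs)
    pz = Counter(z for z, _ in pairs)
    pzp = Counter(zp for _, zp in pairs)
    joint = Counter(pairs)
    return sum(
        c / n * math.log2((c / n) / ((pz[z] / n) * (pzp[zp] / n)))
        for (z, zp), c in joint.items()
    )

# Sample transitions under a uniform policy over turns.
pairs = []
for _ in range(40000):
    s = random.choice(["s1", "s2", "s3"])
    pairs.append((s, T[s, random.choice(["left", "right"])]))

alias = lambda s: "z" if s in ("s2", "s3") else s
mi_full = mutual_info(pairs)
mi_alias = mutual_info([(alias(z), alias(zp)) for z, zp in pairs])
```

Up to sampling noise, `mi_full` and `mi_alias` agree: the aliased representation maximizes $J_{state}$ while being insufficient for control.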
<h2 id="can-sufficiency-matter-in-deep-rl">Can Sufficiency Matter in Deep RL?</h2>
<p>To understand whether the sufficiency of state representations can matter in practice, we perform simple proof-of-concept experiments with deep RL agents and image observations. To separate representation learning from RL, we first optimize each representation learning objective on a dataset of offline data (similar to the protocol in <a href="https://arxiv.org/abs/2009.08319">Stooke et al. 2020</a>). We collect the fixed datasets using a random policy, which is sufficient to cover the state space in our environments. We then freeze the weights of the state encoder learned in the first phase and train RL agents with the representation as state input (see Figure 6).</p>
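The two-phase protocol can be sketched schematically (a minimal stand-in with made-up names, not the actual training code): phase one fits the encoder on offline data, and phase two freezes it so that RL gradients update only the agent on top of the fixed representation.

```python
class Encoder:
    """Stand-in for a convolutional state encoder."""
    def __init__(self):
        self.weights = [0.0]
        self.frozen = False

    def update(self, grad):
        # Gradient step; ignored once the encoder is frozen.
        if not self.frozen:
            self.weights = [w - 0.1 * g for w, g in zip(self.weights, grad)]

def pretrain(encoder, offline_grads):
    # Phase 1: optimize the representation objective on offline data.
    for grad in offline_grads:
        encoder.update(grad)

def train_rl(encoder, rl_grads):
    # Phase 2: freeze the encoder; RL updates no longer reach it.
    encoder.frozen = True
    for grad in rl_grads:
        encoder.update(grad)
```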
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/experiment_setup.png" width="60%" />
<br />
<i><b>Figure 6. </b>Experimental setup for evaluating learned representations.</i>
</p>
<p>We experiment with a simple video game MDP that has a similar characteristic to the self-driving car example described earlier. In this game called <em>catcher</em>, from the <a href="https://pygame.org">PyGame suite</a>, the agent controls a paddle that it can move back and forth to catch fruit that falls from the top of the screen (see Figure 7). A positive reward is given when the fruit is caught and a negative reward when the fruit is not caught. The episode terminates after one piece of fruit falls. Analogous to the self-driving example, the agent does not control the position of the fruit, and so a representation that maximizes $J_{inv}$ might discard that information. However, representing the fruit is crucial to obtaining reward, since the agent must move the paddle underneath the fruit to catch it. We learn representations with $J_{inv}$ and $J_{fwd}$, optimizing $J_{fwd}$ with noise contrastive estimation (<a href="https://arxiv.org/abs/1804.10689">NCE</a>), and $J_{inv}$ by training an inverse model via maximum likelihood. (For brevity, we omit experiments with $J_{state}$ in this post – please see the paper!) To select the most compressed representation from among those that maximize each objective, we apply an information bottleneck of the form $\min I(Z; S)$. We also compare to running RL from scratch with the image inputs, which we call “end-to-end.” For the RL algorithm, we use <a href="https://arxiv.org/abs/1801.01290">Soft Actor-Critic</a>.</p>
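As a rough illustration of the contrastive estimator, here is a minimal InfoNCE-style loss in pure Python (our own sketch, not the paper's implementation; in practice the score is a learned critic over deep encodings of $(z_t, a_t)$ and $z_{t+1}$, and the loss bounds the MI from below).

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def info_nce_loss(queries, positives):
    """queries[i] encodes (z_t, a_t); positives[i] encodes the z_{t+1} that
    actually followed. The other batch elements serve as negatives."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) for p in positives]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # cross-entropy toward the true pair
    return loss / len(queries)
```

Minimizing this pushes each transition's score above its in-batch negatives; with completely uninformative encodings, the loss sits at $\log$(batch size).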
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/catcher.gif" width="27%" />
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/catcher_plot.png" width="32%" />
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/catcher_table_legend.png" width="33%" />
<br />
<i><b>Figure 7. </b>(left) Depiction of the catcher game. (middle) Performance of RL agents trained with different state representations. (right) Accuracy of reconstructing ground truth state elements from learned representations.</i>
</p>
<p>We observe in Figure 7 (middle) that indeed the representation trained to maximize $J_{inv}$ results in RL agents that converge slower and to a lower asymptotic expected return. To better understand what information the representation contains, we then attempt to learn a neural network decoder from the learned representation to the position of the falling fruit. We report the mean error achieved by each representation in Figure 7 (right). The representation learned by $J_{inv}$ incurs a high error, indicating that the fruit is not precisely captured by the representation, while the representation learned by $J_{fwd}$ incurs low error.</p>
<h3 id="increasing-observation-complexity-with-visual-distractors">Increasing observation complexity with visual distractors</h3>
<p>To make the representation learning problem more challenging, we repeat this experiment with visual distractors added to the agent’s observations. We randomly generate images of 10 circles of different colors and replace the background of the game with these images (see Figure 8, left, for example observations). As in the previous experiment, we plot the performance of an RL agent trained with the frozen representation as input (Figure 8, middle), as well as the error of decoding true state elements from the representation (Figure 8, right). The difference in performance between sufficient ($J_{fwd}$) and insufficient ($J_{inv}$) objectives is even more pronounced in this setting than in the plain background setting. With more information present in the observation in the form of the distractors, insufficient objectives that do not optimize for representing all the required state information may be “distracted” by representing the background objects instead, resulting in low performance. In this more challenging case, end-to-end RL from images fails to make any progress on the task, demonstrating the difficulty of end-to-end RL.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/distractor_observation.png" width="32%" />
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/catcher_distractor_plot.png" width="30%" />
<img src="https://bair.berkeley.edu/static/blog/mi_sufficiency_analysis/catcher_distractor_table_legend.png" width="32%" />
<br />
<i><b>Figure 8. </b>(left) Example agent observations with distractors. (middle) Performance of RL agents trained with different state representations. (right) Accuracy of reconstructing ground truth state elements from state representations.</i>
</p>
<h2 id="conclusion">Conclusion</h2>
<p>These results highlight an important open problem: how can we design representation learning objectives that yield representations that are both as lossy as possible and still sufficient for the tasks at hand?
Without further assumptions on the MDP structure or knowledge of the reward function, is it possible to design an objective that yields sufficient representations that are lossier than those learned by $J_{fwd}$?
Can we characterize the set of MDPs for which insufficient objectives $J_{inv}$ and $J_{state}$ would be sufficient?
Further, extending the proposed framework to partially observed problems would be more reflective of realistic applications. In this setting, analyzing generative models such as VAEs in terms of sufficiency is an interesting problem. Prior work has shown that maximizing the ELBO alone cannot control the content of the learned representation (e.g., <a href="https://arxiv.org/abs/1711.00464">Alemi et al. 2018</a>). We conjecture that the zero-distortion maximizer of the ELBO would be sufficient, while other solutions need not be. Overall, we hope that our proposed framework can drive research in designing better algorithms for unsupervised representation learning for RL.</p>
<hr />
<p><i>This post is based on the paper <a href="https://arxiv.org/abs/2106.07278">Which Mutual Information Representation Learning Objectives are Sufficient for Control?</a>, to be presented at NeurIPS 2021. Thank you to Sergey Levine and Abhishek Gupta for their valuable feedback on this blog post.</i></p>
Fri, 19 Nov 2021 01:00:00 -0800
http://bair.berkeley.edu/blog/2021/11/19/mi-sufficiency-analysis/
http://bair.berkeley.edu/blog/2021/11/19/mi-sufficiency-analysis/
Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets
<!-- twitter -->
<meta name="twitter:title" content="Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/bridge_data/header_blog_post.png" />
<meta name="keywords" content="large-scale robot learning, transfer learning" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Frederik Ebert, Yanlai Yang" />
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/montage_small%20copy.gif" width="100%" />
<br />
<i>Fig. 1: The BRIDGE dataset contains 7200 demonstrations of 71 kitchen-themed manipulation tasks in 10 domains. Note that any GIF compression artifacts in this animation are not present in the dataset itself.</i>
</p>
<p>When we apply robot learning methods to real-world systems, we must usually collect new datasets for every task, every robot, and every environment. This is not only costly and time-consuming, but it also limits the size of the datasets that we can use, and this, in turn, limits generalization: if we train a robot to clean one plate in one kitchen, it is unlikely to succeed at cleaning any plate in any kitchen. In other fields, such as computer vision (e.g., <a href="https://www.image-net.org/">ImageNet</a>) and natural language processing (e.g., <a href="https://arxiv.org/abs/1810.04805">BERT</a>), the standard approach to generalization is to utilize large, diverse datasets, which are collected once and then reused repeatedly. Since the dataset is reused for many models, tasks, and domains, the up-front cost of collecting such large reusable datasets is worth the benefits. Thus, to obtain truly generalizable robotic behaviors, we may need large and diverse datasets, and the only way to make this practical is to reuse data across many different tasks, environments, and labs (e.g., different background and lighting conditions).</p>
<!--more-->
<p>Each end-user of such a dataset might want their robot to learn a different task, which would be situated in a different domain (e.g., a different laboratory, home, etc.). Therefore, any reusable dataset would need to cover a sufficient variety of tasks and environments to allow the learning algorithm to extract generalizable, reusable features. To this end, we collected a dataset of 7200 demonstrations spanning 71 different kitchen-themed tasks in 10 different environments (see the illustration in Figure 1). We refer to this dataset as the BRIDGE dataset (Broad Robot Interaction Dataset for boosting GEneralization).</p>
<p>To study how this dataset can be reused for multiple problems, we take a simple multi-task imitation learning approach to train vision-based control policies on our diverse multi-task, multi-domain dataset. Our experiments show that by reusing the BRIDGE dataset, we can enable a robot in a new scene or environment (which was not seen in the bridge data) to more effectively generalize when learning a new task (which was also not seen in the bridge data), as well as to transfer tasks from the bridge data to the target domain. Since we use a low-cost robotic arm, the setup can readily be reproduced by other researchers who can use our bridge dataset to boost the performance of their own robot policies.</p>
<p>With the proposed dataset and multi-task, multi-domain learning approach, we have shown one potential avenue for making diverse datasets reusable in robotics, opening up this area for more sophisticated techniques as well as providing the confidence that scaling up this approach could lead to even greater generalization benefits.</p>
<h1 id="bridge-dataset-specifics">BRIDGE Dataset Specifics</h1>
<p>Compared to existing datasets, including <a href="https://arxiv.org/abs/1802.01557">DAML</a>, <a href="https://www.ri.cmu.edu/publications/multiple-interactions-made-easy-mime-large-scale-demonstrations-data-for-imitation/">MIME</a>, <a href="https://arxiv.org/abs/1910.11215">Robonet</a>, <a href="https://arxiv.org/abs/1811.02790">RoboTurk</a>, and <a href="https://dhiraj100892.github.io/Visual-Imitation-Made-Easy/">Visual Imitation Made Easy</a>, which mainly focus on a single scene or environment, our dataset features multiple domains and a large number of diverse, semantically meaningful tasks with expert trajectories, making it well suited for imitation learning and transfer learning on new domains.</p>
<p>The environments in the bridge dataset are mostly kitchen and sink playsets for children, since they are comparatively robust and low-cost, while still providing settings that resemble typical household scenes. The dataset was collected with 3-5 concurrent viewpoints to provide a form of data augmentation and study generalization to new viewpoints. Each task has between 50 and 300 demonstrations. To prevent algorithms from overfitting to certain positions, during data collection, we randomize the kitchen position, the camera positions, and the positions of distractor objects every 5-25 trajectories.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/System_Overview.resized.png" width="80%" />
<br />
<i>Fig 2: Demonstration data collection setup using VR Headset.</i>
</p>
<p>We collect our dataset with the 6-dof WidowX250s robot due to its accessibility and affordability, though we welcome contributions of data with different robots. The total cost of the setup is less than US$3600 (excluding the computer). To collect demonstrations, we use an Oculus Quest headset, where we put the headset on a table (as illustrated in Figure 2) next to the robot and track the user’s handset while applying the user’s motions to the robot end-effector via inverse kinematics. This gives the user an intuitive method for controlling the arm in 6 degrees of freedom.</p>
<p>Instructions for how users can reproduce our setup and collect data in new environments can be found on the <a href="https://sites.google.com/view/bridgedata">project website</a>.</p>
<h1 id="transfer-with-multi-task-imitation-learning">Transfer with Multi-Task Imitation Learning</h1>
<p>While a variety of transfer learning methods have been proposed in the literature for combining datasets from distinct domains, we find that a simple joint training approach is effective for deriving considerable benefit from bridge data. We combine the bridge dataset with user-provided demonstrations in the target domain. Since the sizes of these datasets are significantly different, we rebalance the datasets (for more details see the paper). Imitation learning then proceeds normally, simply training the policy with supervised learning on the combined dataset.</p>
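A minimal sketch of a rebalanced joint-training sampler (the 10% target-domain mixing probability is our own illustrative choice; see the paper for the actual rebalancing scheme):

```python
import random

random.seed(0)

def sample_batch(bridge_data, target_data, batch_size, p_target=0.1):
    """Draw each batch element from the small target-domain dataset with a
    fixed probability, rather than proportionally to dataset sizes."""
    batch = []
    for _ in range(batch_size):
        source = target_data if random.random() < p_target else bridge_data
        batch.append(random.choice(source))
    return batch
```

Without rebalancing, 50 target-domain demos would make up under 1% of each batch next to 7200 bridge demos; fixed-probability mixing keeps the target task represented in every update.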
<h1 id="boosting-generalization-via-bridge-datasets">Boosting Generalization via Bridge Datasets</h1>
<p>We consider three types of generalization in our experiments:</p>
<h1 id="transfer-with-matching-behaviors">Transfer with matching behaviors</h1>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/matching_behaviors.resized.png" width="80%" />
<br />
<i>Figure 4: Scenario 1, Transfer with matching behaviors: Here, the user collects a small number of demonstrations in the target domain for a task that is also present in the bridge data.</i>
</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/matching_behavior_results.png" width="80%" />
<br />
<i>Figure 5: Experiment results for transfer with matching behaviors. Jointly training with the bridge data greatly improves generalization performance.</i>
</p>
<p>In this scenario (depicted in Figure 4), the user collects some small amount of data in their target domain for tasks that are also present in the bridge data (e.g., around 50 demos per task) and uses the bridge data to boost the performance and generalization of these tasks. This scenario is the most conventional and resembles domain adaptation in computer vision, but it is also the most limiting since it requires the desired tasks to be present in the bridge data and the user to collect additional data of the same task.</p>
<p>Figure 5 shows results for the transfer learning with matching behaviors scenario. For comparison, we include the performance of the policy when trained only on the target domain data, without bridge data (Target Domain Only), a baseline that uses only the bridge data without any target domain data (Direct Transfer), as well as a baseline that trains a single-task policy on data in the target domain only (Single Task). As can be seen in the results, jointly training with the bridge data leads to significant gains in performance (66% success averaged over tasks) compared to the direct transfer (14% success), target domain only (28% success), and the single task (18% success) baseline. This is not surprising since this scenario directly augments the training set with additional data of the same tasks, but it still provides a validation of the value of including bridge data in training.</p>
<h1 id="zero-shot-transfer-with-target-support">Zero-shot transfer with target support</h1>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/transfer_with_target_support.resized.png" width="80%" />
<br />
<i>Figure 6: Scenario 2, Zero-shot transfer with target support: After collecting data for a small number of tasks (10 in our case) in the target domain, the user is able to transfer other tasks from the bridge dataset to the target domain.</i>
</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/transfer_results.png" width="80%" />
<br />
<i>Figure 7: Experiment results for zero-shot transfer with target support: Joint bridge-target imitation, which is trained with bridge data and data from 10 target domain tasks, allows transferring tasks to the target domain with significantly higher success rates (blue) than directly transferring tasks (without any target domain data), called direct transfer (orange).</i>
</p>
<p>In this scenario (depicted in Figure 6), the user utilizes data from a few tasks in their target domain to “import” other tasks that are present in the bridge data without additionally collecting new demonstrations for them in the target domain. For example, the bridge data contains the tasks of putting a sweet potato into a pot or a pan, the user provides data in their domain for putting brushes in pans, and the robot is then able to both put brushes as well as put sweet potatoes in pans. This scenario increases the repertoires of skills that are available in the user’s target environment simply by including the bridge data, thus eliminating the need to recollect data for every task in every target environment.</p>
<p>Figure 7 shows the experiment results for this scenario. Since there is no target domain data for these tasks, we cannot compare to a baseline that does not use bridge data at all since such a baseline would have no data for these tasks. However, we do include the “direct transfer” baseline, which utilizes a policy trained only on the bridge data. The results indicate that the jointly trained policy, which obtains 44% success averaged over tasks indeed attains a very significant increase in performance over direct transfer (30% success), suggesting that the zero-shot transfer with target support scenario offers a viable way for users to “import” tasks from the bridge dataset into their domain.</p>
<h1 id="boosting-generalization-of-new-tasks">Boosting generalization of new tasks</h1>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/novel_task_with_bridge_data_support.resized.png" width="80%" />
<br />
<i>Figure 8: Scenario 3, Boosting generalization of new tasks: Jointly training with bridge data and a new task in a new scene or environment (that is not present in the bridge data) enables significantly higher success rates than training on the target domain data from scratch.</i>
</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/new_task_results.png" width="80%" />
<br />
<i>Figure 9: Experiment results for boosting generalization of new tasks: Jointly training with bridge data (blue) on average leads to a 2x gain in generalization performance compared to only training on target domain data (red).</i>
</p>
<p>In this scenario (depicted in Figure 8), the user provides a small amount of data (50 demonstrations in practice) for a new task that is not present in the bridge data and then utilizes the bridge data to boost the generalization and performance of this task. This scenario most directly reflects our primary goals since it uses the bridge data without requiring either the domains or tasks to match, leveraging the diversity of the data and structural similarity to boost performance and generalization of entirely new tasks.</p>
<p>To enable this kind of generalization boosting, we conjecture that the key features that bridge datasets must have are: (i) a sufficient variety of settings, so as to provide for good generalization; (ii) shared structure between bridge data domains and target domains (i.e., it is unreasonable to expect generalization for a construction robot using bridge data of kitchen tasks); (iii) a sufficient range of tasks that breaks unwanted correlations between tasks and domains.</p>
<p>The experiment results are presented in Figure 9, which shows that training jointly with the bridge data leads to significant improvement on 6 out of 10 tasks across three evaluation environments, yielding 50% success averaged over tasks, whereas single-task policies attain around 22% success – a 2x improvement in overall performance (asterisks denote experiments in which the objects are not contained in the bridge data). The significant improvements obtained from including the bridge data suggest that bridge datasets can be a powerful vehicle for boosting the generalization of new skills and that a single shared bridge dataset can be utilized across a range of domains and applications.</p>
<p>In Figure 10 we show example rollouts for each of the three transfer scenarios.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/bridge_data/matching.gif" width="33%" />
<img src="https://bair.berkeley.edu/static/blog/bridge_data/target.gif" width="33%" />
<img src="https://bair.berkeley.edu/static/blog/bridge_data/novel.gif" width="33%" />
<br />
<i>Figure 10: Example rollouts of policies jointly trained on target domain data and bridge data in each of the three transfer scenarios. <br />
Left: transfer with matching behaviors, scenario 1, put pot in sink; <br />
Middle: zero-shot transfer with target support, scenario 2, put carrot on plate; <br />
Right: boosting generalization of new tasks, scenario 3, wipe plate with sponge. <br />
</i>
</p>
<h1 id="conclusions">Conclusions</h1>
<p>We showed how a large, diverse bridge dataset can be leveraged in three different ways to improve generalization in robotic learning. Our experiments demonstrate that including bridge data when training skills in a new domain can improve performance across a range of scenarios, both for tasks that are present in the bridge data and, perhaps surprisingly, entirely new tasks. This means that bridge data may provide a generic tool to improve generalization in a user’s target domain. In addition, we showed that bridge data can also function as a tool to import tasks from the prior dataset to a target domain, thus increasing the repertoires of skills a user has at their disposal in a particular target domain. This suggests that a large, shared bridge dataset, like the one we have released, could be used by different robotics researchers to boost the generalization capabilities and the number of available skills of their imitation-trained policies.</p>
<p>We hope that by releasing our dataset to the community, we can take a step toward generalizing robotic learning and make it possible for anyone to train robotic policies that quickly generalize to varied environments without repeatedly collecting large and exhaustive datasets.</p>
<p>We encourage interested researchers to visit our <a href="https://sites.google.com/view/bridgedata">project website</a> for more information and instructions for how to contribute to our dataset.</p>
<p>Please find the corresponding paper on arXiv.
We thank Chelsea Finn and Sergey Levine for helpful feedback on the blog post.</p>
<hr />
<p>This post is based on the following paper:</p>
<p><strong>Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets</strong> <br /></p>
<p>Frederik Ebert\(^*\), Yanlai Yang\(^*\), Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, Sergey Levine <br />
<a href="https://arxiv.org/abs/2109.13396">paper</a>, <a href="https://sites.google.com/view/bridgedata">project website</a></p>
Thu, 18 Nov 2021 01:00:00 -0800
http://bair.berkeley.edu/blog/2021/11/18/bridge-data/
http://bair.berkeley.edu/blog/2021/11/18/bridge-data/
Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability<!-- twitter -->
<meta name="twitter:title" content="Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/epistemic_pomdp/epistemic_pomdp/blog_figs.teaser.gif" />
<meta name="keywords" content="reinforcement learning, generalization, deep RL" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Dibya Ghosh" />
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/epistemic_pomdp/blog_figs.teaser.gif" width="90%" />
</p>
<p>Many experimental works have observed that generalization in deep RL appears to be difficult: although RL agents can learn to perform very complex tasks, they don’t seem to generalize over diverse task distributions as well as the excellent generalization of supervised deep nets might lead us to expect. In this blog post, we will aim to explain why generalization in RL is fundamentally harder, and indeed more difficult even in theory.</p>
<p>We will show that attempting to generalize in RL induces implicit partial observability, even when the RL problem we are trying to solve is a standard fully-observed MDP. This induced partial observability can significantly complicate the types of policies needed to generalize well, potentially requiring counterintuitive strategies like information-gathering actions, recurrent non-Markovian behavior, or randomized strategies. Ordinarily, this is not necessary in fully observed MDPs but surprisingly becomes necessary when we consider generalization from a finite training set in a fully observed MDP. This blog post will walk through why partial observability can implicitly arise, what it means for the generalization performance of RL algorithms, and how methods can account for partial observability to generalize well.
<!--more--></p>
<h2 id="learning-by-example">Learning By Example</h2>
<p>Before formally analyzing generalization in RL, let’s begin by walking through two examples that illustrate what can make generalizing well in RL problems difficult.</p>
<p><strong>The Image Guessing Game:</strong> In this game, an RL agent is shown an image each episode, and must guess its label as quickly as possible (Figure 1). Each timestep, the agent makes a guess; if the agent is correct, then the episode ends, but if incorrect, the agent receives a negative reward, and must make another guess <em>for the same image</em> at the next timestep. Since each image has a unique label (that is, there is some “true” labelling function $f_{true}: x \mapsto y$) and the agent receives the image as observation, this is a <em>fully-observable</em> RL environment.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.001.png" width="90%" />
</p>
<p><small>
<i><b>Fig 1.</b> The image guessing game, which requires an agent to repeatedly guess labels for an image until it gets it correct. RL learns policies that guess the same label repeatedly, a strategy that generalizes poorly to test images (bottom row, right). </i>
</small></p>
<p>Suppose we had access to an infinite number of training images, and learned a policy using a standard RL algorithm. This policy will learn to deterministically predict the true label ($y := f_{true}(x)$), since this is the highest return strategy in the MDP (as a sanity check, recall that the optimal policy in an MDP is deterministic and memoryless). If we only have a <em>limited</em> set of training images, an RL algorithm will still learn the same strategy, deterministically predicting the label it believes matches the image. But, does this policy generalize well? On an unseen test image, if the agent’s predicted label is correct, the highest possible reward is attained; if incorrect, the agent receives catastrophically low return, since it never guesses the correct label. This catastrophic failure mode is ever-present, since even though modern deep nets improve generalization and reduce the chance of misclassification, error on the <strong>test set</strong> cannot be completely reduced to 0.</p>
<p>Can we do better than this deterministic prediction strategy? Yes, since the learned RL strategy ignores two salient features of the guessing game: 1) the agent receives feedback through an episode as to whether its guesses are correct, and 2) the agent can change its guess in future timesteps. One strategy that better takes advantage of these features is process of elimination: first selecting the label it considers most likely and, if incorrect, eliminating it and moving on to the next most-likely label, and so on. This type of adaptive memory-based strategy, however, can never be learned by a standard RL algorithm like Q-learning, since such algorithms optimize MDP objectives and <strong>only</strong> learn deterministic and memoryless policies.</p>
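The gap between the two strategies is easy to see in a stripped-down version of the game (our own toy, with label order standing in for the classifier's likelihood ranking):

```python
def play(policy, true_label, num_labels, max_steps=100):
    """Run one episode; returns the number of guesses used, or None."""
    history = []  # the feedback available to the agent: its past wrong guesses
    for t in range(max_steps):
        guess = policy(history, num_labels)
        if guess == true_label:
            return t + 1
        history.append(guess)
    return None

def memoryless(history, num_labels):
    # Deterministic and memoryless: always the top-ranked label, right or wrong.
    return 0

def eliminate(history, num_labels):
    # Process of elimination: the most-likely label not yet ruled out.
    for label in range(num_labels):
        if label not in history:
            return label
```

If the top-ranked label is wrong, the memoryless policy never recovers, while elimination is guaranteed to succeed within `num_labels` guesses.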
<p><strong>Maze-Solving:</strong> A staple of RL generalization benchmarks, the maze-solving problem requires an agent to navigate to a goal in a maze given a birds-eye view of the whole maze. This task is fully-observed, since the agent’s observation shows the whole maze. As a result, the optimal policy is memoryless and deterministic: taking the action that moves the agent along the shortest path to the goal. Just as in the image-guessing game, by maximizing return within the training maze layouts, an RL algorithm will learn policies akin to this “optimal” strategy – at any state, deterministically taking the action that it considers most likely to be on the shortest path to the goal.</p>
<p>This RL policy generalizes poorly: if the learned policy ever chooses an incorrect action, like running into a wall or doubling back on its path, it will keep repeating the same mistake and never solve the maze. This failure mode is completely avoidable, since even when the RL agent initially takes such an “incorrect” action, after attempting to follow it, the agent <em>receives information</em> (e.g. the next observation) about whether or not this was a good action. To generalize as well as possible, an agent should <strong>adapt</strong> its chosen actions if the original actions led to unexpected outcomes, but this behavior eludes standard RL objectives.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.004.gif" width="90%" />
</p>
<p><small>
<i><b>Fig 2.</b> In the maze task, RL policies generalize poorly: when they make an error, they repeatedly make the same error, leading to failure (left). An agent that generalizes well may still make mistakes, but has the capability of adapting and recovering from these mistakes (right). This behavior is not learned by standard RL objectives for generalization.</i>
</small></p>
<h2 id="whats-going-on-rl-and-epistemic-uncertainty">What’s Going On? RL and Epistemic Uncertainty</h2>
<p>In both the guessing game and the maze task, the gap between the behavior learned by standard RL algorithms and the behavior of policies that actually generalize well seemed to arise when the agent misidentified (or could not identify) how the dynamics of the world behave. Let’s dig deeper into this phenomenon.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.002.png" width="90%" />
</p>
<p><small>
<i><b>Fig 3.</b> The limited training dataset prevents an agent from exactly recovering the true environment. Instead, there is an implicit partial observability, as an agent does not know which amongst the set of “consistent” environments is the true environment. </i>
</small></p>
<p>When the agent is given a small training set of contexts, there are many dynamics models that match the provided training contexts but differ on held-out contexts. These conflicting hypotheses epitomize the agent’s <strong>epistemic uncertainty</strong> from the limited training set. While epistemic uncertainty is not specific to RL, how it can be handled in RL is unique because of the sequential decision-making loop. For one, the agent can actively <em>regulate</em> how much epistemic uncertainty it is exposed to, for example by choosing a policy that only visits states where the agent is highly confident about the dynamics. Even more importantly, the agent can <em>change</em> its epistemic uncertainty at evaluation time by accounting for the information that it receives through the trajectory. Suppose for an image in the guessing game, the agent is initially uncertain between the t-shirt / coat labels. If the agent guesses “t-shirt” and receives feedback that this was incorrect, the agent <em>changes its uncertainty</em> and becomes more confident about the “coat” label, meaning it should consequently adapt and guess “coat” instead.</p>
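<p>The t-shirt / coat example can be made concrete with a tiny belief update over labels (illustrative only, not code from the paper):</p>

```python
def update_belief(belief, guess, was_correct):
    """Toy Bayes update for the guessing game: feedback either collapses
    the belief onto the guessed label or rules it out and renormalizes."""
    if was_correct:
        return {guess: 1.0}
    posterior = {k: v for k, v in belief.items() if k != guess}
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()}

belief = {"t-shirt": 0.6, "coat": 0.3, "dress": 0.1}
belief = update_belief(belief, "t-shirt", was_correct=False)
print(belief)
```

<p>After the incorrect “t-shirt” guess, the posterior mass on “coat” rises from 0.3 to 0.75, which is exactly why an adaptive agent should switch its guess.</p>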
<h2 id="epistemic-pomdps-and-implicit-partial-observability">Epistemic POMDPs and <em>Implicit</em> Partial Observability</h2>
<p>Actively steering towards regions of low uncertainty or taking information-gathering actions are two of a multitude of avenues an RL agent has to handle its epistemic uncertainty. Two important questions remain unanswered: is there a “best” way to tackle uncertainty? If so, how can we describe it? From the Bayesian perspective, it turns out there is an optimal solution: generalizing optimally requires us to solve a partially observed MDP (POMDP) that is <em>implicitly created</em> from the agent’s epistemic uncertainty.</p>
<p>This POMDP, which we call the <strong>epistemic POMDP</strong>, works as follows. Recall that because the agent has only seen a limited training set, there are many possible environments that are consistent with the training contexts provided. The set of consistent environments can be encoded by a Bayesian posterior over environments $P(M \mid D)$. In each episode of the epistemic POMDP, the agent is dropped into one of these “consistent” environments $M \sim P(M \mid D)$ and asked to maximize return within it, with one important detail: the agent is not told which environment $M$ it was placed in.</p>
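<p>As a minimal sketch (written for this post, with invented class and environment names), the epistemic POMDP can be thought of as a wrapper that resamples the hidden environment every episode:</p>

```python
import random

class ConstantRewardEnv:
    """Toy one-step environment: action a earns reward 1 iff a == good."""
    def __init__(self, good):
        self.good = good
    def reset(self):
        return 0                                  # a single dummy state
    def step(self, action):
        return 0, int(action == self.good), True  # (obs, reward, done)

class EpistemicPOMDP:
    """Sketch of the epistemic POMDP: every episode an environment is
    drawn from the posterior P(M | D), but its identity is hidden from
    the agent, which only ever observes states."""
    def __init__(self, envs, weights):
        self.envs, self.weights = envs, weights
    def reset(self):
        # Sample M ~ P(M | D); the agent is NOT told which M was drawn.
        self._env = random.choices(self.envs, weights=self.weights)[0]
        return self._env.reset()
    def step(self, action):
        return self._env.step(action)

random.seed(1)
pomdp = EpistemicPOMDP([ConstantRewardEnv(0), ConstantRewardEnv(1)],
                       [0.5, 0.5])
# A deterministic, memoryless policy that always plays action 0 only
# earns reward in episodes where the hidden environment agrees with it.
returns = []
for _ in range(1000):
    pomdp.reset()
    _, reward, _ = pomdp.step(0)
    returns.append(reward)
avg_return = sum(returns) / len(returns)
print(avg_return)   # close to 0.5
```

<p>Even in this two-environment toy, the deterministic policy wins only about half the episodes; a memory-based policy that adapted after observing feedback would do strictly better in multi-step versions of the problem.</p>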
<p>This system corresponds to a POMDP (partially observed MDP), since the relevant information needed to act is only partially observable to the agent: although the state $s$ within the environment is observed, the identity of the environment $M$ that is generating these states is hidden from the agent. The epistemic POMDP provides an instantiation of the generalization problem into the Bayesian RL framework (see survey <a href="https://arxiv.org/abs/1609.04436">here</a>), which more generally studies optimal behavior under distributions over MDPs.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.003.png" width="90%" />
</p>
<p><small>
<i><b>Fig 4.</b> In the epistemic POMDP, an agent interacts with a different “consistent” environment in each episode, but does not know which one it is interacting with, leading to partial observability. To do well, an agent must employ a (potentially memory-based) strategy that works well no matter which of these environments it is placed in. </i>
</small></p>
<p>Let’s walk through an example of what the epistemic POMDP looks like. For the guessing game, the agent is uncertain about exactly how images are labelled, so each possible environment $M \sim P(M \mid D)$ corresponds to a different image labeller that is consistent with the training dataset: $f_M: X \to Y$. In each episode of the epistemic POMDP for the guessing game, an image $x$ and labeller $f_M$ are chosen at random, and the agent is required to output the label that is assigned by the sampled classifier $y = f_M(x)$. The agent cannot do this directly, because the identity of the classifier $f_M$ is <em>not provided</em> to the agent, only the image $x$. If all the labellers $f_M$ in the posterior agree on the label for a certain image, the agent can just output this label (no partial observability). However, if different classifiers assign different labels, the agent must use a strategy that works well on average, regardless of which of the labellers was used to label the data (for example, by adaptive process-of-elimination guessing or randomized guessing).</p>
<p>What makes the epistemic POMDP particularly exciting is the following equivalence:</p>
<blockquote>
<p>An RL agent is <strong>Bayes-optimal for generalization</strong> if and only if it <strong>maximizes expected return in the corresponding epistemic POMDP</strong>. More generally, the performance of an agent in the epistemic POMDP dictates how well it is expected to generalize at evaluation time.</p>
</blockquote>
<p>That generalization performance is dictated by performance in the epistemic POMDP hints at a few lessons for bridging the gap between the “optimal” way to generalize in RL and current practices. For example, it is relatively well-known that the optimal policy in a POMDP is generally non-Markovian (adaptive based on history), and may take information-gathering actions to reduce the degree of partial observability. This means that to generalize optimally, we are likely to need adaptive information-gathering behaviors instead of the static Markovian policies that are usually trained.</p>
<p>The epistemic POMDP also highlights the perils of our predominant approach to learning policies from a limited training set of contexts: running a fully-observable RL algorithm on the training set. These algorithms model the environment as an MDP and learn MDP-optimal strategies, which are deterministic and Markov. These policies do not account for partial observability, and therefore tend to generalize poorly (for example, in the guessing game and maze tasks). This indicates a mismatch between the MDP-based training objectives that are standard in modern algorithms and the epistemic POMDP training objective that actually dictates how well the learned policy generalizes.</p>
<h2 id="moving-forward-with-generalization-in-rl">Moving Forward with Generalization in RL</h2>
<p>The implicit presence of partial observability at test time may explain why standard RL algorithms, which optimize fully-observed MDP objectives, fail to generalize. What should we do instead to learn RL policies that generalize better? The epistemic POMDP provides a prescriptive solution: when the agent’s posterior distribution over environments can be calculated, then constructing the epistemic POMDP and running a POMDP-solving algorithm on it will yield policies that generalize Bayes-optimally.</p>
<p>Unfortunately, in most interesting problems, this cannot be done exactly. Nonetheless, the epistemic POMDP can serve as a lodestar for designing RL algorithms that generalize better. As a first step, in our NeurIPS 2021 paper, we introduce an algorithm called LEEP, which uses statistical bootstrapping to learn a policy in an approximation of the epistemic POMDP. On Procgen, a challenging generalization benchmark for RL agents, LEEP improves significantly in test-time performance over PPO (Figure 5). While only a crude approximation, LEEP provides some indication that attempting to learn a policy in the epistemic POMDP can be a fruitful avenue for developing more generalizable RL algorithms.</p>
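<p>The full LEEP algorithm is described in the paper; purely to illustrate the statistical-bootstrap ingredient, the hypothetical sketch below builds resampled training sets for an ensemble, so that the members disagree exactly where the limited data underdetermines the environment:</p>

```python
import random

def bootstrap_subsets(train_contexts, n_members, seed=0):
    """Resample (with replacement) one training set per ensemble member.
    Members then disagree exactly where the data underdetermines the
    environment -- a crude stand-in for the posterior P(M | D)."""
    rng = random.Random(seed)
    k = len(train_contexts)
    return [[rng.choice(train_contexts) for _ in range(k)]
            for _ in range(n_members)]

contexts = list(range(10))                 # e.g. 10 training levels
subsets = bootstrap_subsets(contexts, n_members=4)
for subset in subsets:
    print(sorted(set(subset)))             # each member's distinct levels
```

<p>Each ensemble member trained on one of these subsets can be viewed as a rough sample from the posterior over environments; LEEP itself combines such members in a particular way detailed in the paper.</p>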
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.005.png" width="90%" />
</p>
<p><small>
<i><b>Fig 5.</b> LEEP, an algorithm based on the epistemic POMDP objective, generalizes better than PPO in four Procgen tasks. </i>
</small></p>
<hr />
<h2 id="if-you-take-one-lesson-from-this-blog-post">If you take one lesson from this blog post…</h2>
<p>In supervised learning, optimizing for performance on the training set translates to good generalization performance, and it is tempting to suppose that generalization in RL can be solved in the same manner. This is surprisingly <strong>not true</strong>; limited training data in RL introduces <em>implicit partial observability</em> into an otherwise fully-observable problem. This implicit partial observability, as formalized by <em>the epistemic POMDP</em>, means that generalizing well in RL necessitates adaptive or stochastic behaviors, hallmarks of POMDP problems.</p>
<p>Ultimately, this highlights the incompatibility that afflicts generalization of our deep RL algorithms: with limited training data, our MDP-based RL objectives are misaligned with the implicit POMDP objective that ultimately dictates generalization performance.</p>
<p><small><em>This post is based on <a href="https://arxiv.org/abs/2107.06277">the paper</a> “Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability,” which is joint work with Jad Rahme (equal contribution), Aviral Kumar, Amy Zhang, Ryan P. Adams, and Sergey Levine. Thanks to Sergey Levine and Katie Kang for helpful feedback on the blog post.</em>
</small></p>
Fri, 05 Nov 2021 02:00:00 -0700
http://bair.berkeley.edu/blog/2021/11/05/epistemic-pomdp/
http://bair.berkeley.edu/blog/2021/11/05/epistemic-pomdp/RECON: Learning to Explore the Real World with a Ground Robot
<!-- twitter -->
<meta name="twitter:title" content="RECON: Learning to Explore the Real World with a Ground Robot" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/recon/recon_blog_overview_overhead.png" />
<meta name="keywords" content="Robotics, Navigation, Reinforcement Learning, Imitation Learning, Exploration" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Dhruv Shah" />
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_teaser.gif" alt="RECON Exploration Teaser" width="100%" /><br />
<i>
An example of our method deployed on a Clearpath Jackal ground robot (left) exploring a suburban environment to find a visual target (inset). (Right) Egocentric observations of the robot.
</i>
</p>
<p>Imagine you’re in an unfamiliar neighborhood with no house numbers and I give you a photo that I took a few days ago of my house, which is not too far away. If you tried to find my house, you might follow the streets and go around the block looking for it. You might take a few wrong turns at first, but eventually you would locate my house. In the process, you would end up with a mental map of my neighborhood. The next time you’re visiting, you will likely be able to navigate to my house right away, without taking any wrong turns.</p>
<p>Such exploration and navigation behavior is easy for humans. What would it take for a robotic learning algorithm to enable this kind of intuitive navigation capability? To build a robot capable of exploring and navigating like this, we need to learn from diverse prior datasets in the real world. While it’s possible to collect a large amount of data from demonstrations, or even with randomized exploration, learning meaningful exploration and navigation behavior from this data can be challenging – the robot needs to generalize to unseen neighborhoods, recognize visual and dynamical similarities across scenes, and learn a representation of visual observations that is robust to distractors like weather conditions and obstacles. Since such factors can be hard to model and transfer from simulated environments, we tackle these problems by teaching the robot to explore using only real-world data.</p>
<!--more-->
<p>Formally, we studied the problem of <em>goal-directed</em> exploration for <em>visual</em> navigation in <em>novel</em> environments. A robot is tasked with navigating to a goal location \(G\), specified by an image \(o_G\) taken at \(G\). Our method uses an offline dataset of trajectories, over 40 hours of interactions in the real world, to learn navigational affordances and builds a compressed representation of perceptual inputs. We deploy our method on a mobile robotic system in industrial and recreational outdoor areas around the city of Berkeley. RECON can discover a new goal in a previously unexplored environment in under 10 minutes, and in the process build a “mental map” of that environment that allows it to then reach goals again in just 20 seconds. Additionally, we make this real-world offline dataset publicly available for use in future research.</p>
<h1 id="rapid-exploration-controllers-for-outcome-driven-navigation">Rapid Exploration Controllers for Outcome-driven Navigation</h1>
<p>RECON, or <strong>R</strong>apid <strong>E</strong>xploration <strong>C</strong>ontrollers for <strong>O</strong>utcome-driven <strong>N</strong>avigation, explores new environments by “imagining” potential goal images and attempting to reach them. This exploration allows RECON to incrementally gather information about the new environment.</p>
<p>Our method consists of two components that enable it to explore new environments. The first component is a learned representation of goals. This representation ignores task-irrelevant distractors, allowing the agent to quickly adapt to novel settings. The second component is a topological graph. Our method learns both components using datasets of real-world robot interactions gathered in prior work. Leveraging such large datasets allows our method to generalize to new environments and scale beyond the original dataset.</p>
<h2 id="learning-to-represent-goals">Learning to Represent Goals</h2>
<p>A useful strategy to learn complex goal-reaching behavior in an unsupervised manner is for an agent to set its own goals, based on its capabilities, and attempt to reach them. <a href="https://pubmed.ncbi.nlm.nih.gov/15811218/">In fact</a>, humans are very proficient at setting abstract goals for themselves in an effort to learn diverse skills. <a href="https://arxiv.org/abs/1807.04742">Recent progress</a> in reinforcement learning and robotics has also shown that teaching agents to set their own goals by “imagining” them can result in impressive unsupervised goal-reaching skills. To be able to “imagine”, or sample, such goals, we need to build a prior distribution over the goals seen during training.</p>
<p>For our case, where goals are represented by high-dimensional images, how should we sample goals for exploration? Instead of explicitly sampling goal images, we instead have the agent learn a compact representation of latent goals, allowing us to perform exploration by sampling new latent goal <em>representations</em>, rather than by sampling images. This representation of goals is learned from context-goal pairs previously seen by the robot. We use a <a href="https://arxiv.org/abs/1612.00410">variational information bottleneck</a> to learn these representations because it provides two important properties. First, it learns representations that throw away irrelevant information, such as lighting and pixel noise. Second, the variational information bottleneck packs the representations together so that they look like a chosen prior distribution. This is useful because we can then sample imaginary representations by sampling from this prior distribution.</p>
<p>The architecture for learning a prior distribution for these representations is shown below. As the encoder and decoder are conditioned on the context, the representation \(Z_t^g\) only encodes information about <em>relative</em> location of the goal from the context – this allows the model to represent feasible goals. If, instead, we had a typical VAE (in which the input images are autoencoded), the samples from the prior over these representations would not necessarily represent goals that are reachable from the current state. This distinction is crucial when exploring new environments, where most states from the training environments are not valid goals.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_architecture.png" alt="Architecture with a latent goal model" width="100%" /><br />
<i>
The architecture for learning a prior over goals in RECON. The context-conditioned embedding learns to represent feasible goals.
</i>
</p>
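<p>Concretely, the training objective for this context-conditioned bottleneck can be sketched in the standard variational information bottleneck form (a generic sketch; the paper’s exact objective and decoder targets may differ in their details):</p>
<p style="text-align:center;">
\( \mathcal{L} = \mathbb{E}\big[ -\log p_\theta(o_g \mid Z_t^g, o_t) \big] + \beta \, D_{\mathrm{KL}}\big( q_\phi(Z_t^g \mid o_t, o_g) \,\|\, p(Z_t^g) \big) \)
</p>
<p>The first term asks the representation \(Z_t^g\) to retain enough information to make goal-relevant predictions given the context \(o_t\); the KL term pulls the encodings toward a fixed prior \(p(Z_t^g)\) (e.g. a unit Gaussian). It is this second term that makes “imagining” goals possible: sampling \(Z_t^g \sim p(Z_t^g)\) and decoding against the current context yields candidate feasible goals.</p>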
<p>To understand the importance of learning this representation, we run a simple experiment where the robot is asked to explore in an undirected manner starting from the yellow circle in the figure below. We find that sampling representations from the learned prior greatly increases the diversity of exploration trajectories, allowing a wider area to be explored. In the absence of a prior over previously seen goals, using random actions to explore the environment can be quite inefficient. Sampling from the prior distribution and attempting to reach these “imagined” goals allows RECON to explore the environment efficiently.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_sampling.png" alt="Goal sampling with RECON" width="80%" /><br />
<i>
Sampling from a learned prior allows the robot to explore 5 times faster than using random actions.
</i>
</p>
<h2 id="goal-directed-exploration-with-a-topological-memory">Goal-Directed Exploration with a Topological Memory</h2>
<p>We combine this goal sampling scheme with a topological memory to incrementally build a “mental map” of the new environment. This map provides an estimate of the exploration <em>frontier</em> as well as guidance for subsequent exploration. In a new environment, RECON encourages the robot to explore at the frontier of the map: when the robot is not at the frontier, RECON directs it to navigate to a previously seen subgoal at the frontier of the map.</p>
<p>At the frontier, RECON uses the learned goal representation to learn a prior over goals that it can reliably navigate to and that are thus <em>feasible</em> to reach. RECON uses this goal representation to sample, or “imagine”, a feasible goal that helps it explore the environment. This effectively means that, when placed in a new environment, if RECON does not know where the target is, it “imagines” a suitable subgoal that it can drive towards to explore and collect information, until it believes it can reach the target goal image. This allows RECON to “search” for the goal in an unknown environment, all the while building up its mental map. Note that the objective of the topological graph is to build a compact map of the environment and encourage the robot to reach the frontier; it does not inform goal sampling once the robot is at the frontier.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_exploration.png" alt="Illustration of the exploration algorithm" width="100%" /><br />
<i>
Illustration of the exploration algorithm of RECON.
</i>
</p>
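<p>To make the drive-to-frontier-then-imagine loop concrete, here is a deliberately abstract, one-dimensional toy written for this post (all names and dynamics are invented; it is not the actual RECON implementation): the “map” is a set of visited positions, the frontier is the farthest one, and an “imagined subgoal” is a sampled step beyond it.</p>

```python
import random

class ToyMap:
    """One-dimensional stand-in for the topological 'mental map'."""
    def __init__(self, start):
        self.nodes = {start}
    def frontier(self):
        return max(self.nodes)           # farthest position seen so far
    def add(self, position):
        self.nodes.add(position)

def explore(goal, sample_subgoal_step, max_steps=200, seed=0):
    """Sketch of the exploration loop: drive back to the frontier when
    away from it, otherwise pursue an 'imagined' subgoal, growing the
    map until the goal is discovered."""
    rng = random.Random(seed)
    position, mental_map = 0, ToyMap(0)
    for step in range(max_steps):
        if position >= goal:
            return step                           # number of moves taken
        if position < mental_map.frontier():
            position = mental_map.frontier()      # return to the frontier
        else:
            position += sample_subgoal_step(rng)  # imagined feasible subgoal
        mental_map.add(position)
    return None

moves = explore(goal=30, sample_subgoal_step=lambda rng: rng.randint(1, 3))
print(moves)
```

<p>In this toy, the agent discovers the goal in a number of moves bounded by the goal distance; the real system replaces scalar positions with image observations and steps with sampled latent goals.</p>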
<h2 id="learning-from-diverse-real-world-data">Learning from Diverse Real-world Data</h2>
<p>We train these models in RECON entirely using offline data collected in a diverse range of outdoor environments. Interestingly, we were able to train this model using data collected for two independent projects in the <a href="https://sites.google.com/view/badgr">fall of 2019</a> and <a href="https://sites.google.com/view/ving-robot/home">spring of 2020</a>, and were successful in deploying the model to explore novel environments and navigate to goals during late 2020 and the spring of 2021. This offline dataset of trajectories consists of over 40 hours of data, including off-road navigation, driving through parks in Berkeley and Oakland, parking lots, sidewalks and more, and is an excellent example of noisy real-world data with visual distractors like changing lighting, weather conditions (rain, twilight, etc.) and dynamic obstacles. The dataset consists of a mixture of teleoperated trajectories (2-3 hours) and trajectories collected autonomously by open-loop safety controllers programmed to gather random data in a self-supervised manner. This dataset presents an exciting benchmark for robotic learning in real-world environments due to the challenges posed by offline learning of control, representation learning from high-dimensional visual observations, generalization to out-of-distribution environments and test-time adaptation.</p>
<p>We are releasing this dataset publicly to support future research in machine learning from real-world interaction datasets, check out the <a href="https://sites.google.com/view/recon-robot/dataset">dataset page</a> for more information.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_envs.png" alt="Sample environments from the offline dataset of trajectories" width="100%" /><br />
<i>
We train from diverse offline data (top) and test in new environments (bottom).
</i>
</p>
<h2 id="recon-in-action">RECON in Action</h2>
<p>Putting these components together, let’s see how RECON performs when deployed in a park near Berkeley. Note that the robot has never seen images from this park before. We placed the robot in a corner of the park and provided a target image of a white cabin door. In the animation below, we see RECON exploring and successfully finding the desired goal. “Run 1” corresponds to the exploration process in a novel environment, guided by a user-specified target image on the left. After it finds the goal, RECON uses the mental map to distill its experience in the environment to find the shortest path for subsequent traversals. In “Run 2”, RECON follows this path to navigate directly to the goal without looking around.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_overview.gif" alt="Animation showing RECON deployed in a novel environment" width="100%" /><br />
<i>
In “Run 1”, RECON explores a new environment and builds a topological mental map. In “Run 2”, it uses this mental map to quickly navigate to a user-specified goal in the environment.
</i>
</p>
<p>An overhead view of this two-step process is shown below, with the paths taken by the robot in subsequent traversals of the environment:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_overview_overhead.png" alt="Overhead view of the exploration experiment above" width="90%" /><br />
<i>
(Left) The goal specified by the user. (Right) The path taken by the robot when exploring for the first time (shown in cyan) to build a mental map with nodes (shown in white), and the path it takes when revisiting the same goal using the mental map (shown in red).
</i>
</p>
<h1 id="deploying-in-novel-environments">Deploying in Novel Environments</h1>
<p>To evaluate the performance of RECON in novel environments, study its behavior under a range of perturbations, and understand the contributions of its components, we run extensive real-world experiments in the hills of Berkeley and Richmond, which offer diverse terrain and a wide variety of testing environments.</p>
<p>We compare RECON to five baselines – <a href="https://arxiv.org/abs/1810.12894">RND</a>, <a href="https://arxiv.org/abs/1901.10902">InfoBot</a>, <a href="https://arxiv.org/abs/2004.05155">Active Neural SLAM</a>, <a href="https://arxiv.org/abs/2012.09812">ViNG</a> and <a href="https://arxiv.org/abs/1810.02274">Episodic Curiosity</a> – each trained on the same offline trajectory dataset as our method, and fine-tuned in the target environment with online interaction. Note that this data is collected from past environments and contains no data from the target environment. The figure below shows the trajectories taken by the different methods for one such environment.</p>
<p>We find that only RECON (and a variant) is able to successfully discover the goal in over 30 minutes of exploration, while all other baselines result in collision (see figure for an overhead visualization). We visualize successful trajectories discovered by RECON in four other environments below.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_overhead.png" alt="Overhead view comparing the different baselines in a novel environment" width="42%" />
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_trajectories.png" alt="Successful trajectories discovered by RECON in 4 different environments" width="57%" /><br />
<i>
(Left) When comparing to other baselines, only RECON is able to successfully find the goal. (Right) Trajectories to goals in four other environments discovered by RECON.
</i>
</p>
<p>Quantitatively, we observe that our method finds goals over 50% faster than the best prior method; after discovering the goal and building a topological map of the environment, it can navigate to goals in that environment over 25% faster than the best alternative method.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_exploration_barplot.png" alt="Quantitative results in novel environments" width="85%" /><br />
<i>
Quantitative results in novel environments. RECON outperforms all baselines by over 50%.
</i>
</p>
<h2 id="exploring-non-stationary-environments">Exploring Non-Stationary Environments</h2>
<p>One of the important challenges in designing real-world robotic navigation systems is handling differences between training scenarios and testing scenarios. Typically, systems are developed in well-controlled environments, but are deployed in less structured environments. Further, the environments where robots are deployed often change over time, so tuning a system to perform well on a cloudy day might degrade performance on a sunny day. RECON uses explicit representation learning to try to handle this sort of non-stationarity.</p>
<p>Our final experiment tested how changes in the environment affected the performance of RECON. We first had RECON explore a new “junkyard” to learn to reach a blue dumpster. Then, without any more supervision or exploration, we evaluated the learned policy when presented with <em>previously unseen obstacles</em> (trash cans, traffic cones, a car) and <em>weather conditions</em> (sunny, overcast, twilight). As shown below, RECON is able to successfully navigate to the goal in these scenarios, showing that the learned representations are invariant to visual distractors that do not affect the robot’s decisions to reach the goal.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_obstacles.gif" alt="Robustness of RECON to novel obstacles" width="90%" /><br />
<img src="https://bair.berkeley.edu/static/blog/recon/recon_blog_weather.gif" alt="Robustness of RECON to variability in weather conditions" width="90%" /><br />
<i>
First-person videos of RECON successfully navigating to a “blue dumpster” in the presence of novel obstacles (above) and varying weather conditions (below).
</i>
</p>
<h1 id="whats-next">What’s Next?</h1>
<p>The problem setup studied in this paper – using past experience to accelerate learning in a new environment – is reflective of several real-world robotics scenarios. RECON provides a robust way to solve this problem by using a combination of goal sampling and topological memory.</p>
<p>A mobile robot capable of reliably exploring and visually observing real-world environments can be a great tool for a wide variety of useful applications: search and rescue, inspecting large offices or warehouses, finding leaks in oil pipelines, making rounds at a hospital, or delivering mail in suburban communities. We demonstrated simplified versions of such applications <a href="https://sites.google.com/view/ving-robot/home">in an earlier project</a>, where the robot has prior experience in the deployment environment; RECON enables these results to scale beyond the training set of environments and results in a truly open-world learning system that can adapt to novel environments on deployment.</p>
<p>We are also releasing the aforementioned offline trajectory dataset, with over 40 hours of real-world interaction of a mobile ground robot in a variety of outdoor environments. We hope that this dataset can support future research in machine learning using real-world data for visual navigation applications. The dataset is also a rich source of sequential data from a multitude of sensors and can be used to test sequence prediction models including, but not limited to, video prediction, LiDAR, GPS etc. More information about the dataset can be found in the full-text article.</p>
<hr />
<p><em>This blog post is based on our paper <a href="https://arxiv.org/abs/2104.05859">Rapid Exploration for Open-World Navigation with Latent Goal Models</a>, which will be presented as an Oral Talk at the 5th Annual Conference on Robot Learning in London, UK on November 8-11, 2021. You can find more information about our results and the dataset release on <a href="https://sites.google.com/view/recon-robot">the project page</a>.</em></p>
<p><em>Big thanks to Sergey Levine and Benjamin Eysenbach for helpful comments on an earlier draft of this article.</em></p>
Wed, 03 Nov 2021 03:00:00 -0700
http://bair.berkeley.edu/blog/2021/11/03/recon/
http://bair.berkeley.edu/blog/2021/11/03/recon/Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability<!-- twitter -->
<meta name="twitter:title" content="Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/epistemic_pomdp/epistemic_pomdp/blog_figs.teaser.gif" />
<meta name="keywords" content="reinforcement learning, generalization, deep RL" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Dibya Ghosh" />
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/epistemic_pomdp/blog_figs.teaser.gif" width="90%" />
</p>
<p>Many experimental works have observed that generalization in deep RL appears to be difficult: although RL agents can learn to perform very complex tasks, they don’t seem to generalize across diverse task distributions as well as the excellent generalization of supervised deep nets might lead us to expect. In this blog post, we will aim to explain why generalization in RL is fundamentally harder, even in theory.</p>
<p>We will show that attempting to generalize in RL induces implicit partial observability, even when the RL problem we are trying to solve is a standard fully-observed MDP. This induced partial observability can significantly complicate the types of policies needed to generalize well, potentially requiring counterintuitive strategies like information-gathering actions, recurrent non-Markovian behavior, or randomized strategies. Such strategies are ordinarily unnecessary in a fully observed MDP, but surprisingly become necessary when we consider generalization from a finite training set in that same MDP. This blog post will walk through why partial observability can implicitly arise, what it means for the generalization performance of RL algorithms, and how methods can account for partial observability to generalize well.
<!--more--></p>
<h2 id="learning-by-example">Learning By Example</h2>
<p>Before formally analyzing generalization in RL, let’s begin by walking through two examples that illustrate what can make generalizing well in RL problems difficult.</p>
<p><strong>The Image Guessing Game:</strong> In this game, an RL agent is shown an image each episode, and must guess its label as quickly as possible (Figure 1). Each timestep, the agent makes a guess; if the agent is correct, then the episode ends, but if incorrect, the agent receives a negative reward, and must make another guess <em>for the same image</em> at the next timestep. Since each image has a unique label (that is, there is some “true” labelling function $f_{true}: x \mapsto y$) and the agent receives the image as observation, this is a <em>fully-observable</em> RL environment.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.001.png" width="90%" />
</p>
<p><small>
<i><b>Fig 1.</b> The image guessing game, which requires an agent to repeatedly guess labels for an image until it gets it correct. RL learns policies that guess the same label repeatedly, a strategy that generalizes poorly to test images (bottom row, right). </i>
</small></p>
<p>Suppose we had access to an infinite number of training images, and learned a policy using a standard RL algorithm. This policy will learn to deterministically predict the true label ($y := f_{true}(x)$), since this is the highest return strategy in the MDP (as a sanity check, recall that the optimal policy in an MDP is deterministic and memoryless). If we only have a <em>limited</em> set of training images, an RL algorithm will still learn the same strategy, deterministically predicting the label it believes matches the image. But, does this policy generalize well? On an unseen test image, if the agent’s predicted label is correct, the highest possible reward is attained; if incorrect, the agent receives catastrophically low return, since it never guesses the correct label. This catastrophic failure mode is ever-present, since even though modern deep nets improve generalization and reduce the chance of misclassification, error on the <strong>test set</strong> cannot be completely reduced to 0.</p>
<p>Can we do better than this deterministic prediction strategy? Yes, since the learned RL strategy ignores two salient features of the guessing game: 1) the agent receives feedback through an episode as to whether its guesses are correct, and 2) the agent can change its guess in future timesteps. One strategy that better takes advantage of these features is process of elimination: first select the label considered most likely; if incorrect, eliminate it and adapt to the next most-likely label, and so on. This type of adaptive, memory-based strategy, however, can never be learned by a standard RL algorithm like Q-learning, since such algorithms optimize MDP objectives and <strong>only</strong> learn deterministic and memoryless policies.</p>
<p><strong>Maze-Solving:</strong> A staple of RL generalization benchmarks, the maze-solving problem requires an agent to navigate to a goal in a maze given a birds-eye view of the whole maze. This task is fully-observed, since the agent’s observation shows the whole maze. As a result, the optimal policy is memoryless and deterministic: taking the action that moves the agent along the shortest path to the goal. Just as in the image-guessing game, by maximizing return within the training maze layouts, an RL algorithm will learn policies akin to this “optimal” strategy – at any state, deterministically taking the action that it considers most likely to be on the shortest path to the goal.</p>
<p>This RL policy generalizes poorly: if the learned policy ever chooses an incorrect action, like running into a wall or doubling back on its old path, it will repeat the same mistake in a loop and never solve the maze. This failure mode is completely avoidable, since even when the RL agent initially takes such an “incorrect” action, after attempting to follow it, the agent <em>receives information</em> (e.g. the next observation) as to whether or not this was a good action. To generalize as well as possible, an agent should <strong>adapt</strong> its chosen actions if the original actions led to unexpected outcomes, but this behavior eludes standard RL objectives.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.004.gif" width="90%" />
</p>
<p><small>
<i><b>Fig 2.</b> In the maze task, RL policies generalize poorly: when they make an error, they repeatedly make the same error, leading to failure (left). An agent that generalizes well may still make mistakes, but has the capability of adapting and recovering from these mistakes (right). This behavior is not learned by standard RL objectives for generalization.</i>
</small></p>
<h2 id="whats-going-on-rl-and-epistemic-uncertainty">What’s Going On? RL and Epistemic Uncertainty</h2>
<p>In both the guessing game and the maze task, the gap between the behavior learned by standard RL algorithms and the behavior of policies that actually generalize well seemed to arise when the agent incorrectly identified (or could not identify) how the dynamics of the world behave. Let’s dig deeper into this phenomenon.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.002.png" width="90%" />
</p>
<p><small>
<i><b>Fig 3.</b> The limited training dataset prevents an agent from exactly recovering the true environment. Instead, there is an implicit partial observability, as an agent does not know which amongst the set of “consistent” environments is the true environment. </i>
</small></p>
<p>When the agent is given a small training set of contexts, there are many dynamics models that match the provided training contexts, but differ on held-out contexts. These conflicting hypotheses epitomize the agent’s <strong>epistemic uncertainty</strong> from the limited training set. While uncertainty is not specific to RL, how it can be handled in RL is unique due to the sequential decision-making loop. The agent can actively <em>regulate</em> how much epistemic uncertainty it is exposed to, for example by choosing a policy that only visits states where it is highly confident about the dynamics. Even more importantly, the agent can <em>change</em> its epistemic uncertainty at evaluation time by accounting for the information that it receives through the trajectory. Suppose for an image in the guessing game, the agent is initially uncertain between the t-shirt / coat labels. If the agent guesses “t-shirt” and receives feedback that this was incorrect, the agent <em>changes its uncertainty</em> and becomes more confident about the “coat” label, meaning it should consequently adapt and guess “coat” instead.</p>
<h2 id="epistemic-pomdps-and-implicit-partial-observability">Epistemic POMDPs and <em>Implicit</em> Partial Observability</h2>
<p>Actively steering towards regions of low uncertainty or taking information-gathering actions are two of a multitude of avenues an RL agent has to handle its epistemic uncertainty. Two important questions remain unanswered: is there a “best” way to tackle uncertainty? If so, how can we describe it? From the Bayesian perspective, it turns out there is an optimal solution: generalizing optimally requires us to solve a partially observed MDP (POMDP) that is <em>implicitly created</em> from the agent’s epistemic uncertainty.</p>
<p>This POMDP, which we call the <strong>epistemic POMDP</strong>, works as follows. Recall that because the agent has only seen a limited training set, there are many possible environments that are consistent with the training contexts provided. The set of consistent environments can be encoded by a Bayesian posterior over environments $P(M \mid D)$. In each episode of the epistemic POMDP, an agent is dropped into one of these “consistent” environments $M \sim P(M \mid D)$ and asked to maximize return within it, but with the following important detail: the agent is not told which environment $M$ it was placed in.</p>
<p>This system corresponds to a POMDP (partially observed MDP), since the relevant information needed to act is only partially observable to the agent: although the state $s$ within the environment is observed, the identity of the environment $M$ that is generating these states is hidden from the agent. The epistemic POMDP situates the generalization problem within the Bayesian RL framework (see a survey <a href="https://arxiv.org/abs/1609.04436">here</a>), which more generally studies optimal behavior under distributions over MDPs.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.003.png" width="90%" />
</p>
<p><small>
<i><b>Fig 4.</b> In the epistemic POMDP, an agent interacts with a different “consistent” environment in each episode, but does not know which one it is interacting with, leading to partial observability. To do well, an agent must employ a (potentially memory-based) strategy that works well no matter which of these environments it is placed in. </i>
</small></p>
<p>Let’s walk through an example of what the epistemic POMDP looks like. For the guessing game, the agent is uncertain about exactly how images are labelled, so each possible environment $M \sim P(M \mid D)$ corresponds to a different image labeller that is consistent with the training dataset: $f_M: X \to Y$. In the epistemic POMDP for the guessing game, in each episode, an image $x$ and labeller $f_M$ are chosen at random, and the agent is required to output the label that is assigned by the sampled classifier $y = f_M(x)$. The agent cannot do this directly, because the identity of the classifier $f_M$ is <em>not provided</em> to the agent, only the image $x$. If all the labellers $f_M$ in the posterior agree on the label for a certain image, the agent can just output this label (no partial observability). However, if different classifiers assign different labels, the agent must use a strategy that works well on average, regardless of which of the labellers was used to label the data (for example, by adaptive process-of-elimination guessing or randomized guessing).</p>
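<p>For intuition, the epistemic POMDP objective for a single ambiguous image can be computed exactly. In this sketch (the posterior weights and the three-step horizon are illustrative assumptions, not values from the paper), two labellers consistent with the training data disagree, and the adaptive process-of-elimination policy achieves higher expected return than any fixed memoryless guess:</p>

```python
# P(M | D): posterior over labellers consistent with the training set
POSTERIOR = {"labeller_A": 0.6, "labeller_B": 0.4}
LABEL = {"labeller_A": "t-shirt", "labeller_B": "coat"}  # hidden from agent
HORIZON = 3

def episode_return(env, policy):
    """policy maps the list of failed guesses so far to the next guess."""
    history, total = [], 0
    for _ in range(HORIZON):
        guess = policy(history)
        if guess == LABEL[env]:
            return total
        total -= 1
        history.append(guess)
    return total

def expected_return(policy):
    # the epistemic POMDP objective: average return over the posterior
    return sum(p * episode_return(env, policy) for env, p in POSTERIOR.items())

memoryless = lambda history: "t-shirt"
adaptive = lambda history: "t-shirt" if "t-shirt" not in history else "coat"
```

<p>Here the memoryless policy earns $0.6 \cdot 0 + 0.4 \cdot (-3) = -1.2$ in expectation, while the adaptive policy earns $0.6 \cdot 0 + 0.4 \cdot (-1) = -0.4$.</p>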
<p>What makes the epistemic POMDP particularly exciting is the following equivalence:</p>
<blockquote>
<p>An RL agent is <strong>Bayes-optimal for generalization</strong> if and only if it <strong>maximizes expected return in the corresponding epistemic POMDP</strong>. More generally, the performance of an agent in the epistemic POMDP dictates how well it is expected to generalize at evaluation time.</p>
</blockquote>
<p>That generalization performance is dictated by performance in the epistemic POMDP hints at a few lessons for bridging the gap between the “optimal” way to generalize in RL and current practices. For example, it is relatively well-known that the optimal policy in a POMDP is generally non-Markovian (adaptive based on history), and may take information-gathering actions to reduce the degree of partial observability. This means that to generalize optimally, we are likely to need adaptive information-gathering behaviors instead of the static Markovian policies that are usually trained.</p>
<p>The epistemic POMDP also highlights the perils of our predominant approach to learning policies from a limited training set of contexts: running a fully-observable RL algorithm on the training set. These algorithms model the environment as an MDP and learn MDP-optimal strategies, which are deterministic and Markov. These policies do not account for partial observability, and therefore tend to generalize poorly (for example, in the guessing game and maze tasks). This indicates a mismatch between the MDP-based training objectives that are standard in modern algorithms and the epistemic POMDP training objective that actually dictates how well the learned policy generalizes.</p>
<h2 id="moving-forward-with-generalization-in-rl">Moving Forward with Generalization in RL</h2>
<p>The implicit presence of partial observability at test time may explain why standard RL algorithms, which optimize fully-observed MDP objectives, fail to generalize. What should we do instead to learn RL policies that generalize better? The epistemic POMDP provides a prescriptive solution: when the agent’s posterior distribution over environments can be calculated, then constructing the epistemic POMDP and running a POMDP-solving algorithm on it will yield policies that generalize Bayes-optimally.</p>
<p>Unfortunately, in most interesting problems, this cannot be exactly done. Nonetheless, the epistemic POMDP can serve as a lodestar for designing RL algorithms that generalize better. As a first step, in our NeurIPS 2021 paper, we introduce an algorithm called LEEP, which uses statistical bootstrapping to learn a policy in an approximation of the epistemic POMDP. On Procgen, a challenging generalization benchmark for RL agents, LEEP improves significantly in test-time performance over PPO (Figure 5). While only a crude approximation, LEEP provides some indication that attempting to learn a policy in the epistemic POMDP can be a fruitful avenue for developing more generalizable RL algorithms.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/epistemic_pomdp/blog_figs.005.png" width="90%" />
</p>
<p><small>
<i><b>Fig 5.</b> LEEP, an algorithm based on the epistemic POMDP objective, generalizes better than PPO in four Procgen tasks. </i>
</small></p>
<hr />
<h2 id="if-you-take-one-lesson-from-this-blog-post">If you take one lesson from this blog post…</h2>
<p>In supervised learning, optimizing for performance on the training set translates to good generalization performance, and it is tempting to suppose that generalization in RL can be solved in the same manner. This is surprisingly <strong>not true</strong>; limited training data in RL introduces <em>implicit partial observability</em> into an otherwise fully-observable problem. This implicit partial observability, as formalized by <em>the epistemic POMDP</em>, means that generalizing well in RL necessitates adaptive or stochastic behaviors, hallmarks of POMDP problems.</p>
<p>Ultimately, this highlights the incompatibility that afflicts generalization of our deep RL algorithms: with limited training data, our MDP-based RL objectives are misaligned with the implicit POMDP objective that ultimately dictates generalization performance.</p>
<p><small><em>This post is based on <a href="https://arxiv.org/abs/2107.06277">the paper</a> “Why Generalization in RL is Difficult: Epistemic POMDPs and Implicit Partial Observability,” which is joint work with Jad Rahme (equal contribution), Aviral Kumar, Amy Zhang, Ryan P. Adams, and Sergey Levine. Thanks to Sergey Levine and Katie Kang for helpful feedback on the blog post.</em>
</small></p>
Mon, 01 Nov 2021 02:00:00 -0700
http://bair.berkeley.edu/blog/2021/11/01/epistemic-pomdp/
http://bair.berkeley.edu/blog/2021/11/01/epistemic-pomdp/Designs from Data: Offline Black-Box Optimization via Conservative Training
<!-- twitter -->
<meta name="twitter:title" content="Designs from Data: Offline Black-Box Optimization via Conservative Training" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/val/image3.gif" />
<meta name="keywords" content="Model-Based Optimization, Offline MBO, Data-Driven Design" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Aviral Kumar, Xinyang (Young) Geng" />
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_253C1E1E33C7401FC87F81206FF716AB292A32AE3297FF43835BB49F0EB175CE_1635171387655_coms_gif.gif" width="100%" />
<br />
<i>Figure 1: Offline Model-Based Optimization (MBO): The goal of offline MBO is to optimize an unknown objective function $f(x)$ with respect to $x$, given access only to a static, previously-collected dataset of designs.</i>
</p>
<p>Machine learning methods have shown tremendous promise on prediction problems: predicting the efficacy of a drug, predicting how a protein will fold, or predicting the strength of a composite material. But can we use machine learning for design? Conventionally, such problems have been tackled with black-box optimization procedures that repeatedly query an objective function. For instance, if designing a drug, the algorithm will iteratively modify the drug, test it, then modify it again. But when evaluating the efficacy of a candidate design involves conducting a real-world experiment, this can quickly become prohibitive. An appealing alternative is to create designs from data. Instead of requiring active synthesis and querying, can we devise a method that simply examines a large dataset of previously tested designs (e.g., drugs that have been evaluated before), and comes up with a new design that is better? We call this <strong>offline model-based optimization (offline MBO)</strong>, and in this post, we discuss offline MBO methods and some recent advances.</p>
<!--more-->
<h1 id="offline-model-based-optimization-offline-mbo">Offline Model-Based Optimization (Offline MBO)</h1>
<p>Formally, the goal in offline model-based optimization is to maximize a black-box objective function $f(x)$ with respect to its input $x$, where access to the true objective function is not available. Instead, the algorithm is provided access to a static dataset $\mathcal{D} = \{(x_i, y_i)\}$ of designs $x_i$ and corresponding objective values $y_i$. The algorithm consumes this dataset and produces an optimized candidate design, which is evaluated against the true objective function. Abstractly, the objective for offline MBO can be written as $\arg\max_{x = \mathcal{A}(\mathcal{D})} f(x)$, where $x = \mathcal{A}(\mathcal{D})$ indicates that the design $x$ is a function of our dataset $\mathcal{D}$.</p>
<h2 id="what-makes-offline-mbo-challenging">What makes offline MBO challenging?</h2>
<p>The offline nature of the problem prevents the algorithm from querying the ground-truth objective, which makes offline MBO much more difficult than its online counterpart. One obvious approach to tackle an offline MBO problem is to learn a model $\hat{f}(x)$ of the objective function from the dataset, and then apply methods from the more standard online optimization setting, treating the learned model as the true objective.</p>
<p style="text-align:center;float:right">
<img src="https://paper-attachments.dropbox.com/s_253C1E1E33C7401FC87F81206FF716AB292A32AE3297FF43835BB49F0EB175CE_1634447516024_Screenshot+2021-10-16+at+10.11.49+PM.png" width="40%" />
<br />
<i>Figure 2: Overestimation at unseen inputs in the naive objective model fools the optimizer. Our conservative model prevents overestimation, and mitigates the optimizer from finding bad designs with erroneously high values.</i>
</p>
<p>However, this generally does not work: optimizing the design against the learned proxy model will produce <strong>out-of-distribution</strong> designs that “fool” the learned objective model into outputting a high value, similar to adversarial examples (see Figure 2 for an illustration). This is because the learned model is trained on the dataset and is therefore only accurate for <strong>in-distribution</strong> designs. A naive strategy to address this out-of-distribution issue is to constrain the design to stay close to the data, but this is also problematic: in order to produce a design that is better than the best training point, it is usually necessary to deviate from the training data, at least somewhat. The conflict between the need to remain close to the data to avoid out-of-distribution inputs and the need to deviate from the data to produce better designs is one of the core challenges of offline MBO. This challenge is often exacerbated in real-world settings by the high dimensionality of the design space and the sparsity of the available data. A good offline MBO method needs to carefully balance these two sides, producing optimized designs that are good, but not too far from the data distribution.</p>
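<p>This failure mode is easy to reproduce. In the hypothetical one-dimensional sketch below (the objective shape, the linear proxy, and the step sizes are all our own illustrative choices), the true objective falls off a cliff just outside the data region, but a proxy fit only on in-distribution points happily extrapolates, so gradient ascent on the proxy walks far out of distribution:</p>

```python
import numpy as np

# True objective: rises on the data region [0, 1], then falls off a
# cliff outside it (a hypothetical shape, chosen to make the point).
def f_true(x):
    return np.where(x <= 1.0, x, 1.0 - 10.0 * (x - 1.0))

# The static dataset covers only the rising region.
xs = np.linspace(0.0, 1.0, 50)
ys = f_true(xs)

# Naive proxy: least-squares linear model y ~ a*x + b.
a, b = np.polyfit(xs, ys, deg=1)

# Gradient ascent on the proxy marches out of distribution.
x, lr = 0.5, 0.1
for _ in range(100):
    x += lr * a        # d/dx (a*x + b) = a

print(f"optimized design x = {x:.2f}")
print(f"proxy value  {a * x + b:.2f}")   # erroneously high
print(f"true value   {f_true(x):.2f}")   # catastrophically low
```

<p>After 100 ascent steps the design lands around $x = 10.5$: the proxy predicts a value above $10$, while the true value is below $-90$.</p>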
<h2 id="what-prevents-offline-mbo-from-simply-copying-over-the-best-design-in-the-dataset">What prevents offline MBO from simply copying over the best design in the dataset?</h2>
<p>One of the fundamental requirements for any effective offline MBO method is that it must improve over the best design observed in the training dataset. If this requirement is not met, one could simply return the best design from the dataset, without needing to run any kind of learning algorithm. When is such an improvement achievable? Offline MBO methods can improve over the best design in the dataset when the underlying design space exhibits “compositional structure”. To gain intuition, consider an example where the objective function can be represented as a sum of functions of independent partitions of the design variables, i.e., $f(x) = f_1(x[1]) + f_2(x[2]) + \cdots + f_N(x[N])$, where $x[1], \cdots, x[N]$ denote disjoint subsets of the design variables $x$. The dataset contains the optimal design variables for each partition, but not their combination. If an algorithm can identify the compositional structure of the problem, it can combine the optimal design variables for each partition to obtain the overall optimal design, thereby improving over the best design in the dataset. To demonstrate this idea, we created a toy problem in 2 dimensions and applied a naive MBO method that learns a model of the objective function via supervised regression and then optimizes the learned estimate, as shown in the figure below. We can clearly see that the algorithm obtains the combined optimal $x$ and $y$, outperforming the best design in the dataset.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_1976FF3EBAEC6A222420E2F13728C8F2EBE37201A6EB11D0C8EB293892AB744D_1634101363476_image.png" width="80%" />
<br />
<i>Figure 3: Offline MBO finds designs better than the best in the observed dataset by exploiting compositional structure of the objective function $f(x, y) = -x^2 - y^2$ . Left: datapoints in a toy quadratic function MBO task over 2D space with optimum at $(0,0)$ in blue, MBO found design in red. Right: Objective value for optimal design is much higher than that observed in the dataset.</i>
</p>
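<p>The toy experiment above can be reproduced in a few lines. This sketch (the particular datapoints, feature set, and step size are our own illustrative choices) fits a quadratic model to a dataset in which the best $x$ and the best $y$ never co-occur, then runs gradient ascent on the learned model:</p>

```python
import numpy as np

# Toy compositional objective f(x, y) = -x^2 - y^2 (optimum at (0, 0)).
# The dataset contains the best x in some rows and the best y in others,
# but never both together.
data = np.array([[0.0, 2.0], [2.0, 0.0], [1.0, 1.0], [2.0, 2.0], [1.0, 2.0]])
values = -data[:, 0] ** 2 - data[:, 1] ** 2    # best observed value: -2

def features(x, y):
    return np.array([1.0, x, x * x, y, y * y])

# Fit a model of the objective by least squares on the static dataset.
Phi = np.array([features(x, y) for x, y in data])
theta, *_ = np.linalg.lstsq(Phi, values, rcond=None)

# Gradient ascent on the learned model, starting from the best datapoint.
x, y, lr = 1.0, 1.0, 0.1
for _ in range(200):
    x += lr * (theta[1] + 2.0 * theta[2] * x)
    y += lr * (theta[3] + 2.0 * theta[4] * y)

print(x, y)                        # close to the unseen optimum (0, 0)
print(-x**2 - y**2, values.max())  # improves on the best design in the data
```

<p>Gradient ascent converges near $(0, 0)$, whose value $0$ beats the best observed value of $-2$, precisely because the additive structure lets the learned model stitch together the best $x$ and the best $y$.</p>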
<h1 id="prior-algorithms-for-offline-mbo">Prior Algorithms for Offline MBO</h1>
<p>Given an offline dataset, the obvious starting point is to learn a model $\hat{f}_\theta(x)$ of the objective function from the dataset. Most offline MBO methods indeed employ some form of learned model $\hat{f}_\theta(x)$ trained on the dataset to predict the objective value and guide the optimization process. As discussed previously, a very simple and naive baseline for offline MBO is to treat $\hat{f}_\theta(x)$ as a proxy for the true objective and use <strong>gradient ascent</strong> to optimize $\hat{f}_\theta(x)$ with respect to $x$. However, this method often fails in practice, as gradient ascent can easily find designs that “fool” the model into predicting a high objective value, similar to how adversarial examples are generated. Therefore, a successful approach using the learned model must prevent out-of-distribution designs that cause the model to overestimate the objective values, and prior works have adopted different strategies to accomplish this.</p>
<p>A straightforward idea for preventing out-of-distribution designs is to explicitly model the data distribution and constrain our designs to be within the distribution. This is often done with a generative model: <a href="https://arxiv.org/abs/1901.10060">CbAS</a> and <a href="https://arxiv.org/abs/2006.08052">Autofocusing CbAS</a> use a variational auto-encoder to model the distribution of designs, and <a href="https://arxiv.org/abs/1912.13464">MINs</a> use a conditional generative adversarial network to model the distribution of designs conditioned on the objective value. However, generative modeling is a difficult problem. Furthermore, to be effective, generative models must be accurate near the tails of the data distribution, since offline MBO must deviate from the dataset to find improved designs. This imposes a strong feasibility requirement on such generative models.</p>
<h1 id="conservative-objective-models">Conservative Objective Models</h1>
<p>Can we devise an offline MBO method that does not utilize generative models, but also avoids the problems with the naive gradient-ascent based MBO method? To prevent this simple gradient ascent optimizer from getting “fooled” by the erroneously high values $\hat{f}_\theta(x)$ at out-of-distribution inputs, our approach, conservative objective models (COMs), performs a simple modification to the naive approach of training a model of the objective function. Instead of training a model $\hat{f}_\theta(x)$ via standard supervised regression, COMs applies an additional regularizer that minimizes the value of the learned model $\hat{f}_\theta(x^-)$ on <em>adversarial</em> designs $x^-$ that are likely to attain erroneously overestimated values. Such adversarial designs are the ones that appear falsely optimistic under the learned model, and by minimizing their values $\hat{f}_\theta(x^-)$, COMs prevents the optimizer from finding poor designs. This procedure superficially resembles a form of adversarial training.</p>
<p><strong>How can we obtain such adversarial designs</strong> $x^-$? A straightforward approach for finding such adversarial designs is to run the optimizer that will eventually be used to obtain the final optimized designs, but on a partially trained function $\hat{f}_\theta$. For example, in our experiments on continuous-dimensional design spaces, we utilize a gradient-ascent optimizer, and hence run a few iterations of gradient ascent on a given snapshot of the learned function to obtain $x^-$. Given these designs, the regularizer in COMs pushes down the learned value $\hat{f}_\theta(x^-)$. To counterbalance this push towards minimizing function values, COMs additionally maximizes the learned $\hat{f}_\theta(x)$ on the designs observed in the dataset, $x \sim \mathcal{D}$, for which the ground truth value of $f(x)$ is known. This idea is illustratively depicted below.</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_253C1E1E33C7401FC87F81206FF716AB292A32AE3297FF43835BB49F0EB175CE_1634081225294_Screenshot+2021-10-12+at+4.26.58+PM.png" width="80%" />
<br />
<i>Figure 4: A schematic procedure depicting training in COMs: COM performs supervised regression on the training data, pushes down the value of adversarially generated designs and counterbalances the effect by pushing up the value of the learned objective model on the observed datapoints</i>
</p>
<p>Denoting the samples found by running gradient-ascent in the inner loop as coming from a distribution $\mu(x)$, the training objective for COMs is given by:</p>
\[\theta^* \leftarrow \arg \min_\theta {\alpha \left(\mathbb{E}_{x^- \sim \mu(x)}[\hat{f}_\theta(x^-)] - \mathbb{E}_{x \sim \mathcal{D}}[\hat{f}_\theta(x)] \right)} + \frac{1}{2} \mathbb{E}_{(x, y) \sim \mathcal{D}} [(\hat{f}_\theta(x) - y)^2].\]
<p>This objective can be implemented as shown in the following (python) code snippet:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">mine_adversarial</span><span class="p">(</span><span class="n">x_0</span><span class="p">,</span> <span class="n">current_model</span><span class="p">):</span>
<span class="n">x_i</span> <span class="o">=</span> <span class="n">x_0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">T</span><span class="p">):</span>
<span class="c1"># gradient of current_model w.r.t. x_i
</span> <span class="n">x_i</span> <span class="o">=</span> <span class="n">x_i</span> <span class="o">+</span> <span class="n">grad</span><span class="p">(</span><span class="n">current_model</span><span class="p">,</span> <span class="n">x_i</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x_i</span>
<span class="k">def</span> <span class="nf">coms_training_loss</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="n">mse_loss</span> <span class="o">=</span> <span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>
<span class="n">regularizer</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">mine_adversarial</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">model</span><span class="p">))</span> <span class="o">-</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">mse_loss</span> <span class="o">*</span> <span class="mf">0.5</span> <span class="o">+</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">regularizer</span>
</code></pre></div></div>
<p>Non-generative offline MBO methods can also be designed in other ways. For example, instead of training a conservative model as in COMs, we can instead train a model that captures uncertainty in its predictions. One example of this is <a href="https://arxiv.org/abs/2102.07970">NEMO</a>, which uses a normalized maximum likelihood (<a href="https://arxiv.org/abs/2011.02696">NML</a>) formulation to provide uncertainty estimates.</p>
<h1 id="how-do-coms-perform-in-practice">How do COMs Perform in Practice?</h1>
<p>We evaluated COMs on a number of design problems in biology (designing a <a href="https://www.nature.com/articles/nature17995">GFP protein to maximize fluorescence</a>, designing <a href="https://www.science.org/doi/10.1126/science.aad2257">DNA sequences to maximize binding affinity to various transcription factors</a>), materials design (designing a <a href="https://arxiv.org/abs/1803.10260">superconducting material</a> with the highest critical temperature), robot morphology design (designing the morphology of <a href="https://arxiv.org/abs/1909.11639">D’Kitty</a> and <a href="https://gym.openai.com/">Ant</a> robots to maximize performance) and robot controller design (optimizing the parameters of a neural network controller for the <a href="https://gym.openai.com/">Hopper</a> domain in OpenAI Gym). These tasks span domains with both discrete and continuous design spaces and include both low- and high-dimensional problems. We found that COMs outperform several prior approaches on these tasks, a subset of which is shown below. Observe that COMs consistently find a better design than the best in the dataset, and outperform other generative modeling based prior MBO approaches (MINs, CbAS, Autofocusing CbAS), which pay a price for modeling the manifold of the design space, especially in problems such as Hopper Controller ($\geq 5000$ dimensions).</p>
<p style="text-align:center;">
<img src="https://paper-attachments.dropbox.com/s_253C1E1E33C7401FC87F81206FF716AB292A32AE3297FF43835BB49F0EB175CE_1634179059934_Screenshot+2021-10-13+at+7.37.34+PM.png" width="100%" />
<br />
<i>Table 1: Comparing the performance of COMs with prior offline MBO methods. Note that COMs generally outperform prior approaches, including those based on generative models, which especially struggle in high-dimensional problems such as Hopper Controller.</i>
</p>
<p>Empirical results on other domains can be found in <a href="https://arxiv.org/abs/2107.06882">our paper</a>. To conclude our discussion of empirical results, we note that a <a href="https://arxiv.org/abs/2110.11346">recent paper</a> devises an offline MBO approach to optimize hardware accelerators in a real hardware-design workflow, building on COMs. As shown in <a href="https://arxiv.org/abs/2110.11346">Kumar et al. 2021</a> (Tables 3, 4), this COMs-inspired approach finds better designs than various prior state-of-the-art online MBO methods that access the simulator through time-consuming simulation. While, in principle, one can always design an online method that performs better than any offline MBO method (for example, by wrapping an offline MBO method within an active data collection strategy), the good performance of offline MBO methods inspired by COMs indicates the efficacy and potential of offline MBO approaches in solving design problems.</p>
<h1 id="discussion-open-problems-and-future-work">Discussion, Open Problems and Future Work</h1>
<p>While COMs present a simple and effective approach for tackling offline MBO problems, several important open questions remain. Perhaps the most straightforward is to devise better algorithms that combine the benefits of both generative approaches and COMs-style conservative approaches. Beyond algorithm design, perhaps one of the most important open problems is designing effective <strong>cross-validation strategies:</strong> in supervised <em>prediction</em> problems, a practitioner can adjust model capacity, add regularization, tune hyperparameters and make design decisions by simply looking at validation performance. Improving the validation performance will likely also improve the test performance, because validation and test samples are distributed identically and generalization guarantees for ERM theoretically quantify this. However, such a workflow cannot be applied directly to offline MBO, because cross-validation in offline MBO requires assessing the accuracy of counterfactual predictions under distributional shift. Some recent <a href="https://arxiv.org/abs/2110.11346">work</a> utilizes practical heuristics, such as validation performance computed on a held-out dataset consisting of only “special” designs (e.g., only the top-k best designs), for cross-validation of COMs-inspired methods, which seems to perform reasonably well in practice. However, it is not clear that this is the optimal strategy for cross-validation. We expect that much more effective strategies can be developed by understanding the effects of various factors (such as the capacity of the neural network representing $\hat{f}_\theta(x)$, the hyperparameter $\alpha$ in COMs, etc.) on the dynamics of optimization of COMs and other MBO methods.</p>
<p>Another important open question is <strong>characterizing properties of datasets and data distributions</strong> that are amenable to effective offline MBO methods. The success of deep learning indicates that good performance requires not only better methods and algorithms, but also training data with the right distribution. Analogously, we expect that the performance of offline MBO methods depends heavily on the quality of the data used. For instance, in the didactic example in Figure 3, no improvement would have been possible via offline MBO if the data were localized along a thin line parallel to the x-axis. This means that understanding the relationship between offline MBO solutions and the data distribution, along with effective dataset design based on such principles, is likely to have a large impact. We hope that research in these directions, combined with advances in offline MBO methods, will enable us to solve challenging design problems in various domains.</p>
<hr />
<p><i> We thank Sergey Levine for valuable feedback on this post. We thank Brandon Trabucco for making Figures 1 and 2 of this post. This blog post is based on the following paper:</i></p>
<p><strong>Conservative Objective Models for Effective Offline Model-Based Optimization</strong><br />
Brandon Trabucco*, Aviral Kumar*, Xinyang Geng, Sergey Levine.<br />
<em>In International Conference on Machine Learning (ICML), 2021.</em> <a href="https://arxiv.org/abs/2107.06882">arXiv</a> <a href="https://github.com/brandontrabucco/design-baselines">code</a> <a href="https://sites.google.com/berkeley.edu/coms">website</a><br />
Short descriptive video: <a href="https://youtu.be/bMIlHl3KIfU">https://youtu.be/bMIlHl3KIfU</a></p>
Mon, 25 Oct 2021 06:00:00 -0700
http://bair.berkeley.edu/blog/2021/10/25/coms_mbo/
A First-Principles Theory of Neural<br>Network Generalization<!-- twitter -->
<meta name="twitter:title" content="A First-Principles Theory of Neural Network Generalization" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/eigenlearning/eigenlearning_blog_post_fig1.mp4" />
<meta name="keywords" content="deep learning, generalization, neural tangent kernel" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Jamie Simon" />
<p style="text-align:center;">
<video autoplay="" loop="" muted="" playsinline="" width="80%" style="display:block; margin: 0 auto;">
<source src="https://bair.berkeley.edu/static/blog/eigenlearning/eigenlearning_blog_post_fig1.mp4" type="video/mp4" />
</video>
</p>
<p><small>
<i><b>Fig 1.</b> Measures of generalization performance for neural networks trained on four different boolean functions (colors) with varying training set size. For both MSE (left) and learnability (right), theoretical predictions (curves) closely match true performance (dots). </i>
</small></p>
<p>Deep learning has proven a stunning success for countless problems of interest, but this success belies the fact that, at a fundamental level, we do not understand why it works so well. Many empirical phenomena, well-known to deep learning practitioners, remain mysteries to theoreticians. Perhaps the greatest of these mysteries has been the question of generalization: <em>why do the functions learned by neural networks generalize so well to unseen data?</em> From the perspective of classical ML, neural nets’ high performance is a surprise given that they are so overparameterized that they could easily represent countless poorly-generalizing functions.</p>
<!--more-->
<p>Questions beginning in “why” are difficult to get a grip on, so we instead take up the following quantitative problem: <em>given a network architecture, a target function $f$, and a training set of $n$ random examples, can we efficiently predict the generalization performance of the network’s learned function $\hat{f}$?</em> A theory doing this would not only explain why neural networks generalize well on certain functions but would also tell us which function classes a given architecture is well-suited for and potentially even let us choose the best architecture for a given problem from first principles, as well as serving as a general framework for addressing a slew of other deep learning mysteries.</p>
<p>It turns out this is possible: in our recent <a href="https://arxiv.org/abs/2110.03922">paper</a>, <em>we derive a first-principles theory that allows one to make accurate predictions of neural network generalization</em> (at least in certain settings). To do so, we make a chain of approximations, first approximating a real network as an idealized infinite-width network, which is known to be equivalent to kernel regression, then deriving new approximate results for the generalization of kernel regression to yield a few simple equations that, despite these approximations, closely predict the generalization performance of the original network.</p>
<h2 id="finite-network-approx-infinite-width-network--kernel-regression"><strong>Finite network $\approx$ infinite-width network $=$ kernel regression</strong></h2>
<p>A major vein of deep learning theory in the last few years has studied neural networks of infinite width. One might guess that adding more parameters to a network would only make it harder to understand, but, by results akin to central limit theorems for neural nets, infinite-width nets actually take very simple analytical forms. In particular, a wide network trained by gradient descent to zero MSE loss will always learn the function</p>
\[\hat{f}(x) = K(x, \mathcal{D}) K(\mathcal{D}, \mathcal{D})^{-1} f(\mathcal{D}),\]
<p>where $\mathcal{D}$ is the dataset, $f$ and $\hat{f}$ are the target and learned functions respectively, and $K$ is the network’s <a href="https://arxiv.org/abs/1806.07572">“neural tangent kernel” (NTK)</a>. This is a matrix equation: $K(x, \mathcal{D})$ is a row vector, $K(\mathcal{D}, \mathcal{D})$ is the “kernel matrix,” and $f(\mathcal{D})$ is a column vector. The NTK is different for every architecture class but (at least for wide nets) the same every time you initialize. Because of this equation’s similarity to the normal equation of linear regression, it goes by the name of “kernel regression.”</p>
<p>The sheer simplicity of this equation might make one suspect that an infinite-width net is an absurd idealization with little resemblance to useful networks, but experiments show that, as with the regular central limit theorem, infinite-width results usually kick in sooner than you’d expect, at widths in only the hundreds. Trusting that this first approximation will bear weight, our challenge now is to understand kernel regression.</p>
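As a concrete illustration, the kernel regression equation above takes only a few lines of code. In this sketch, an ordinary RBF kernel stands in for the NTK (a hypothetical choice; for a real network one would substitute its architecture-specific NTK):

```python
import numpy as np

# Sketch of the kernel regression solution f_hat(x) = K(x, D) K(D, D)^{-1} f(D).
# An RBF kernel is used as a hypothetical stand-in for the NTK.
def kernel(A, B, bandwidth=0.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def kernel_regression(X_train, y_train, X_test):
    K_DD = kernel(X_train, X_train)   # kernel matrix K(D, D)
    K_xD = kernel(X_test, X_train)    # row vectors K(x, D)
    return K_xD @ np.linalg.solve(K_DD, y_train)

# Fit a 1-D target function from 15 training points.
X_train = np.linspace(-3, 3, 15)[:, None]
y_train = np.sin(X_train[:, 0])
pred = kernel_regression(X_train, y_train, np.array([[0.5]]))
```

Like the idealized infinite-width network it models, this predictor drives training MSE to zero: it interpolates the training data exactly.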
<h2 id="approximating-the-generalization-of-kernel-regression"><strong>Approximating the generalization of kernel regression</strong></h2>
<p>In deriving the generalization of kernel regression, we get a lot of mileage from a simple trick: we look at the learning problem in the eigenbasis of the kernel. Viewed as a linear operator, the kernel has eigenvalue/vector pairs $(\lambda_i, \phi_i)$ defined by the condition that</p>
\[\int\limits_{\text{input space}} \! \! \! \! \! \! K(x, x') \phi_i(x') \, dx' = \lambda_i \phi_i(x).\]
<p>Intuitively speaking, a kernel is a similarity function, and we can interpret its high-eigenvalue eigenfunctions as mapping “similar” points to similar values.</p>
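Numerically, this eigendecomposition can be approximated by discretizing the integral on a grid, in the style of the Nyström method; the exponential kernel below is just an illustrative stand-in:

```python
import numpy as np

# Discretize the eigenvalue equation on a uniform grid over [0, 1]:
#   integral K(x, x') phi_i(x') dx'  ->  sum_j K[m, j] * phi_i[j] * dx
xs = np.linspace(0.0, 1.0, 200)
dx = xs[1] - xs[0]
K = np.exp(-np.abs(xs[:, None] - xs[None, :]) / 0.2)  # stand-in kernel

# Eigendecompose the dx-weighted kernel matrix (eigh returns ascending order).
eigvals, eigvecs = np.linalg.eigh(K * dx)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]    # sort descending
```

The leading eigenfunctions vary slowly across the input space, matching the intuition that high-eigenvalue eigenfunctions map similar inputs to similar values.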
<p>The centerpiece of our analysis is a measure of generalization we call “learnability” which quantifies the alignment of $f$ and $\hat{f}$. With a few minor approximations, we derive the extremely simple result that the learnability of each eigenfunction is given by</p>
\[\mathcal{L}(\phi_i) = \frac{\lambda_i}{\lambda_i + C},\]
<p>where $C$ is a constant. Higher learnability is better, and thus this formula tells us that <em>higher-eigenvalue eigenfunctions are easier to learn!</em> Moreover, we show that, as examples are added to the training set, $C$ gradually decreases from $\infty$ to $0$, which means that each mode’s $\mathcal{L}(\phi_i)$ gradually increases from $0$ to $1$, with higher eigenmodes learned first. Models of this form have a strong inductive bias towards learning higher eigenmodes.</p>
<p>We ultimately derive expressions for not just learnability but for <em>all first- and second-order statistics of the learned function,</em> including recovering previous expressions for MSE. We find that these expressions are quite accurate for not just kernel regression but finite networks, too, as illustrated in Fig 1.</p>
<h2 id="no-free-lunch-for-neural-networks"><strong>No free lunch for neural networks</strong></h2>
<p>In addition to approximations for generalization performance, we also prove a simple exact result we call the “no-free-lunch theorem for kernel regression.” The classical <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.390.9412&rep=rep1&type=pdf">no-free-lunch theorem for learning algorithms</a> roughly states that, averaged over all possible target functions $f$, any supervised learning algorithm has the same expected generalization performance. This makes intuitive sense - after all, most functions look like white noise, with no discernible patterns - but it is also not very useful since the set of “all functions” is usually enormous. Our extension, specific to kernel regression, essentially states that</p>
\[\begin{align}
\sum_i \mathcal{L}(\phi_i) = \text{[training set size]}.
\end{align}\]
<p>That is, the sum of learnabilities across all kernel eigenfunctions equals the training set size. This exact result paints a vivid picture of a kernel’s inductive bias: the kernel has exactly $\text{[training set size]}$ units of learnability to parcel out to its eigenmodes - no more, no less - and thus eigenmodes are locked in a zero-sum competition to be learned. As shown in Fig 2, we find that this basic conservation law holds exactly for NTK regression and even approximately for finite networks. To our knowledge, this is the first result quantifying such a tradeoff in kernel regression or deep learning. It also applies to linear regression, a special case of kernel regression.</p>
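Taken together, the learnability formula and this conservation law pin down the constant $C$: it must satisfy $\sum_i \lambda_i / (\lambda_i + C) = n$ for a trainset of size $n$. The following sketch (with a hypothetical power-law spectrum) solves for $C$ numerically and checks the bookkeeping:

```python
import numpy as np

def learnability(eigvals, C):
    # L(phi_i) = lambda_i / (lambda_i + C)
    return eigvals / (eigvals + C)

def solve_C(eigvals, n, iters=200):
    # Total learnability decreases monotonically in C, so bisect
    # (geometrically, since C can span many orders of magnitude).
    lo, hi = 1e-15, 1e9
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if learnability(eigvals, mid).sum() > n:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

eigvals = 1.0 / np.arange(1, 1001) ** 2   # hypothetical power-law spectrum
C_small, C_large = solve_C(eigvals, 10), solve_C(eigvals, 500)
```

As the trainset grows from 10 to 500 examples, the solved $C$ decreases, every eigenmode's learnability rises, and the total always sums to the trainset size.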
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/eigenlearning/eigenlearning_blog_post_fig2.png" width="70%" />
</p>
<p><small>
<i><b>Fig 2.</b> For four different network architectures (fully-connected $\text{ReLU}$ and $\text{tanh}$ nets with one or four hidden layers), total learnability summed across all eigenfunctions is equal to the size of the training set. Colored components show learnabilities of individual eigenfunctions. For kernel regression with the network’s NTK (left bar in each pair), the sum is exactly the trainset size, while real trained networks (right bar in each pair) sum to approximately the trainset size. </i>
</small></p>
<h2 id="conclusion"><strong>Conclusion</strong></h2>
<p>These results show that, despite neural nets’ notorious inscrutability, we can nonetheless hope to understand when and why they work well. As in other fields of science, if we take a step back, we can find simple rules governing what naively appear to be systems of incomprehensible complexity. More work certainly remains to be done before we truly understand deep learning - our theory only applies to MSE loss, and the NTK’s eigensystem is yet unknown in all but the simplest cases - but our results so far suggest we have the makings of a bona fide theory of neural network generalization on our hands.</p>
<hr />
<p><em>This post is based on <a href="https://arxiv.org/abs/2110.03922">the paper</a> “Neural Tangent Kernel Eigenvalues Accurately Predict Generalization,” which is joint work with labmate Maddie Dickens and advisor Mike DeWeese. We provide <a href="https://github.com/james-simon/eigenlearning">code</a> to reproduce all our results. We’d be delighted to field your questions or comments.</em></p>
Mon, 25 Oct 2021 02:00:00 -0700
http://bair.berkeley.edu/blog/2021/10/25/eigenlearning/
Making RL Tractable by Learning More Informative Reward Functions: Example-Based Control, Meta-Learning, and Normalized Maximum Likelihood
<!-- twitter -->
<meta name="twitter:title" content="Making RL Tractable by Learning More Informative Reward Functions: Example-Based Control, Meta-Learning, and Normalized Maximum Likelihood" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:image" content="https://bair.berkeley.edu/static/blog/val/image3.gif" />
<meta name="keywords" content="Reward Inference, Reinforcement Learning, Robotics" />
<meta name="description" content="The BAIR Blog" />
<meta name="author" content="Abhishek Gupta, Kevin Li, Sergey Levine" />
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_1.png" width="100%" />
<br />
<i> Diagram of MURAL, our method for learning uncertainty-aware rewards for RL. After the user provides a few examples of desired outcomes, MURAL automatically infers a reward function that takes into account these examples and the agent’s uncertainty for each state.
</i>
</p>
<p>Although reinforcement learning has shown success in domains <a href="https://arxiv.org/abs/1504.00702">such</a> <a href="https://arxiv.org/abs/2104.11203">as</a> <a href="https://arxiv.org/abs/1909.11652">robotics</a>, chip <a href="https://arxiv.org/abs/2004.10746">placement</a> and <a href="https://www.nature.com/articles/s41586-019-1724-z">playing</a> <a href="https://arxiv.org/abs/1912.06680">video</a> <a href="https://www.nature.com/articles/nature16961">games</a>, it is usually intractable in its most general form. In particular, deciding when and how to visit new states in the hopes of learning more about the environment can be challenging, especially when the reward signal is uninformative. These questions of reward specification and exploration are closely connected — the more directed and “well shaped” a reward function is, the easier the problem of exploration becomes. The answer to the question of how to explore most effectively is likely to be closely informed by the particular choice of how we specify rewards.</p>
<p>For unstructured problem settings such as robotic manipulation and navigation — areas where RL holds substantial promise for enabling better real-world intelligent agents — reward specification is often the key factor preventing us from tackling more difficult tasks. The challenge of effective reward specification is two-fold: we require reward functions that can be specified in the real world without significantly instrumenting the environment, but also effectively guide the agent to solve difficult exploration problems. In our recent work, we address this challenge by designing a reward specification technique that naturally incentivizes exploration and enables agents to explore environments in a directed way.</p>
<!--more-->
<h1 id="outcome-driven-rl-and-classifier-based-rewards">Outcome Driven RL and Classifier Based Rewards</h1>
<p>While RL in its most general form can be quite difficult to tackle, we can consider a more controlled set of subproblems which are more tractable while still encompassing a significant set of interesting problems. In particular, we consider a subclass of problems which has been referred to as <a href="https://proceedings.neurips.cc/paper/2018/file/c9319967c038f9b923068dabdf60cfe3-Paper.pdf">outcome driven RL</a>. In outcome driven RL problems, the agent is not simply tasked with exploring the environment until it chances upon reward, but instead is provided with examples of successful outcomes in the environment. These successful outcomes can then be used to infer a suitable reward function that can be optimized to solve the desired problems in new scenarios.</p>
<p>More concretely, in outcome driven RL problems, a human supervisor first provides a set of successful outcome examples $\{s_g^i\}_{i=1}^N$, representing states in which the desired task has been accomplished. Given these outcome examples, a suitable reward function $r(s, a)$ can be inferred that encourages an agent to achieve the desired outcome examples. In many ways, this problem is analogous to that of inverse reinforcement learning, but only requires examples of successful states rather than full expert demonstrations.</p>
<p>When thinking about how to actually infer the desired reward function $r(s, a)$ from successful outcome examples $\{s_g^i\}_{i=1}^N$, the simplest technique that comes to mind is to simply treat the reward inference problem as a classification problem - “Is the current state a successful outcome or not?” <a href="https://proceedings.neurips.cc/paper/2018/file/c9319967c038f9b923068dabdf60cfe3-Paper.pdf">Prior</a> <a href="https://arxiv.org/abs/1904.07854">work</a> has implemented this intuition, inferring rewards by training a simple binary classifier to distinguish whether a particular state $s$ is a successful outcome or not, using the set of provided goal states as positives, and all on-policy samples as negatives. The algorithm then assigns rewards to a particular state using the success probabilities from the classifier. This has been shown to have a close connection to the framework of inverse reinforcement learning.</p>
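As a minimal illustration of this classifier-based recipe, the sketch below fits a simple Gaussian class-conditional classifier (a stand-in for the neural network classifiers used in the prior work; all names, shapes, and constants are illustrative) and uses its success probability as the reward:

```python
import numpy as np

def fit_success_classifier(goal_examples, visited_states):
    # Fit an isotropic Gaussian to each class; with equal class priors,
    # Bayes' rule then gives p(success | s). A stand-in for a learned classifier.
    mu_pos, mu_neg = goal_examples.mean(0), visited_states.mean(0)
    var = np.concatenate([goal_examples - mu_pos,
                          visited_states - mu_neg]).var() + 1e-8
    def p_success(s):
        log_pos = -((s - mu_pos) ** 2).sum() / (2 * var)
        log_neg = -((s - mu_neg) ** 2).sum() / (2 * var)
        m = max(log_pos, log_neg)                       # numerical stability
        e_pos, e_neg = np.exp(log_pos - m), np.exp(log_neg - m)
        return e_pos / (e_pos + e_neg)
    return p_success

rng = np.random.default_rng(0)
goal_examples = rng.normal(5.0, 0.2, size=(10, 2))    # successful outcomes s_g
visited_states = rng.normal(0.0, 1.0, size=(50, 2))   # on-policy negatives
p_success = fit_success_classifier(goal_examples, visited_states)
reward_at_goal = p_success(np.array([5.0, 5.0]))
reward_at_start = p_success(np.array([0.0, 0.0]))
```

The success probability serves directly as the reward: high near the provided outcome examples and low in well-visited regions far from them.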
<p>Classifier-based methods provide a much more intuitive way to specify desired outcomes, removing the need for hand-designed reward functions or demonstrations:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_2.png" width="100%" />
<br />
</p>
<p>These classifier-based methods have achieved promising results on robotics tasks such as fabric placement, mug pushing, bead and screw manipulation, and more. However, these successes tend to be limited to simple shorter-horizon tasks, where relatively little exploration is required to find the goal.</p>
<h1 id="whats-missing">What’s Missing?</h1>
<p>Standard success classifiers in RL suffer from the key issue of overconfidence, which prevents them from providing useful shaping for hard exploration tasks. To understand why, let’s consider a toy 2D maze environment where the agent must navigate in a zigzag path from the top left to the bottom right corner. During training, classifier-based methods would label all on-policy states as negatives and user-provided outcome examples as positives. A typical neural network classifier would easily assign success probabilities of 0 to all visited states, resulting in uninformative rewards in the intermediate stages when the goal has not been reached.</p>
<p>Since such rewards would not be useful for guiding the agent in any particular direction, prior works tend to regularize their classifiers using methods like weight decay or mixup, which allow for more smoothly increasing rewards as we approach the successful outcome states. However, while this works on many shorter-horizon tasks, such methods can actually produce very misleading rewards. For example, on the 2D maze, a regularized classifier would assign relatively high rewards to states on the opposite side of the wall from the true goal, since they are close to the goal in x-y space. This causes the agent to get stuck in a local optimum, never bothering to explore beyond the final wall!</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_3.png" width="100%" />
<br />
</p>
<p>In fact, this is exactly what happens in practice:</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_4.gif" width="70%" />
<br />
</p>
<h1 id="uncertainty-aware-rewards-through-cnml">Uncertainty-Aware Rewards through CNML</h1>
<p>As discussed above, the key issue with unregularized success classifiers for RL is overconfidence — by immediately assigning rewards of 0 to all visited states, we close off many paths that might eventually lead to the goal. Ideally, we would like our classifier to have an appropriate notion of uncertainty when outputting success probabilities, so that we can avoid excessively low rewards without suffering from the misleading local optima that result from regularization.</p>
<p><strong>Conditional Normalized Maximum Likelihood (CNML)</strong>
<br />
One method particularly well-suited for this task is Conditional Normalized Maximum Likelihood (CNML). The concept of normalized maximum likelihood (NML) has typically been used in the Bayesian inference literature for model selection, to implement the minimum description length principle. In more recent work, NML has been adapted to the conditional setting to produce models that are much better calibrated and maintain a <a href="https://arxiv.org/abs/1812.09520">notion</a> of <a href="https://arxiv.org/abs/2011.02696">uncertainty</a>, while achieving optimal worst case classification regret. Given the challenges of overconfidence described above, this is an ideal choice for the problem of reward inference.</p>
<p>Rather than simply training models via maximum likelihood, CNML performs a more complex inference procedure to produce likelihoods for any point that is being queried for its label. Intuitively, CNML constructs a set of different maximum likelihood problems by labeling a particular query point $x$ with every possible label value that it might take, then outputs a final prediction based on how easily it was able to adapt to each of those proposed labels given the entire dataset observed thus far. Given a particular query point $x$, and a prior dataset $\mathcal{D} = \left[x_0, y_0, \ldots, x_N, y_N\right]$, CNML solves $k$ different maximum likelihood problems and normalizes them to produce the desired label likelihood $p(y \mid x)$, where $k$ represents the number of possible values that the label may take. Formally, given a model $f(x)$, loss function $\mathcal{L}$, training dataset $\mathcal{D}$ with classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$, and a new query point $x_q$, CNML solves the following $k$ maximum likelihood problems:</p>
\[\theta_i = \text{arg}\min_{\theta} \mathbb{E}_{\mathcal{D} \cup (x_q, C_i)}\left[ \mathcal{L}(f_{\theta}(x), y)\right]\]
<p>It then generates predictions for each of the $k$ classes using their corresponding models, and normalizes the results for its final output:</p>
\[p_\text{CNML}(C_i|x) = \frac{f_{\theta_i}(x)}{\sum \limits_{j=1}^k f_{\theta_j}(x)}\]
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_7.png" width="100%" />
<br />
<i>Comparison of outputs from a standard classifier and a CNML classifier. CNML outputs more conservative predictions on points that are far from the training distribution, indicating uncertainty about those points’ true outputs. (Credit: Aurick Zhou, BAIR Blog)</i>
</p>
<p>Intuitively, if the query point is farther from the original training distribution represented by $\mathcal{D}$, CNML will be able to more easily adapt to any arbitrary label in $\mathcal{C}_1, \ldots, \mathcal{C}_k$, making the resulting predictions closer to uniform. In this way, CNML is able to produce better calibrated predictions, and maintain a clear notion of uncertainty based on which data point is being queried.</p>
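To make this procedure concrete, here is a hypothetical sketch of CNML for binary classification, with a small logistic regression model trained by gradient descent standing in for the function approximator (all hyperparameters and data are illustrative):

```python
import numpy as np

def train_logreg(X, y, steps=3000, lr=0.05):
    # Minimal logistic regression fit by full-batch gradient descent.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y                            # gradient of the log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def cnml_predict(X, y, x_query):
    # Solve one maximum likelihood problem per candidate label of x_query,
    # then normalize the resulting likelihoods of each label.
    likelihoods = []
    for label in (0, 1):
        X_aug = np.vstack([X, x_query])
        y_aug = np.append(y, label)
        w, b = train_logreg(X_aug, y_aug)
        p1 = 1.0 / (1.0 + np.exp(-(x_query @ w + b)))
        likelihoods.append(p1 if label == 1 else 1.0 - p1)
    likelihoods = np.array(likelihoods)
    return likelihoods / likelihoods.sum()   # p_CNML(C_i | x_query)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, (20, 2)),     # class-0 cluster
               rng.normal(2, 0.3, (20, 2))])     # class-1 cluster
y = np.array([0] * 20 + [1] * 20)

p_near = cnml_predict(X, y, np.array([2.0, 2.0]))    # inside class-1 cluster
p_far = cnml_predict(X, y, np.array([30.0, -30.0]))  # far from both clusters
```

Near the training data the normalized prediction stays confident, while for the faraway query both candidate labels are easy to fit, so the prediction falls back toward uniform: exactly the calibrated uncertainty that reward inference needs.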
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_8.png" width="100%" />
<br />
</p>
<p><strong>Leveraging CNML-based classifiers for Reward Inference</strong>
<br />
Given the above background on CNML as a means to produce better calibrated classifiers, it becomes clear that this provides us a straightforward way to address the overconfidence problem with classifier based rewards in outcome driven RL. By replacing a standard maximum likelihood classifier with one trained using CNML, we are able to capture a notion of uncertainty and obtain directed exploration for outcome driven RL. In fact, in the discrete case, CNML corresponds to imposing a uniform prior on the output space — in an RL setting, this is equivalent to using a count-based exploration bonus as the reward function. This turns out to give us a very appropriate notion of uncertainty in the rewards, and solves many of the exploration challenges present in classifier based RL.</p>
<p>However, we don’t usually operate in the discrete case. In most cases, we use expressive function approximators, and the resulting representations of different states in the world share similarities. When a CNML-based classifier is learned in this scenario, with expressive function approximation, we see that it can provide more than just task-agnostic exploration. In fact, it can provide a directed notion of reward shaping, which guides an agent towards the goal rather than simply encouraging it to naively expand the visited region. As visualized below, CNML encourages exploration by giving optimistic success probabilities in less-visited regions, while also providing better shaping towards the goal.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_9.png" width="100%" />
<br />
</p>
<p>As we will show in our experimental results, this intuition scales to higher dimensional problems and more complex state and action spaces, enabling CNML based rewards to solve significantly more challenging tasks than is possible with typical classifier based rewards.</p>
<p>However, on closer inspection of the CNML procedure, a major challenge becomes apparent. Each time a query is made to the CNML classifier, $k$ different maximum likelihood problems need to be solved to convergence, and their resulting likelihoods normalized to produce the desired prediction. As the size of the dataset increases, as it naturally does in reinforcement learning, this becomes a prohibitively slow process. In fact, as seen in Table 1, RL with standard CNML-based rewards takes around 4 hours to train a single epoch (1000 timesteps); following this procedure blindly would take over a month to train a single RL agent, necessitating a more time-efficient solution. This is where we find meta-learning to be a crucial tool.</p>
<h1 id="meta-learning-cnml-classifiers">Meta-Learning CNML Classifiers</h1>
<p>Meta-learning is a tool that has seen wide use in few-shot image classification, in learning faster optimizers, and even in learning more efficient RL algorithms. In essence, the idea behind meta-learning is to leverage a set of “meta-training” tasks to learn a model (and often an adaptation procedure) that can very quickly adapt to a new task drawn from the same distribution of problems.</p>
<p>Meta-learning techniques are particularly well suited to our class of computational problems, since evaluating the CNML likelihood involves quickly solving multiple different maximum likelihood problems. Each of these maximum likelihood problems shares significant similarities with the others, enabling a meta-learning algorithm to adapt very quickly and produce solutions for each individual problem. In doing so, meta-learning provides an effective tool for producing estimates of normalized maximum likelihood significantly more quickly than was previously possible.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_10.gif" width="100%" />
<br />
</p>
<p>The intuition behind applying meta-learning to CNML (meta-NML) can be understood via the graphic above. For a dataset of $N$ points, meta-NML first constructs $2N$ tasks, corresponding to the positive and negative maximum likelihood problems for each datapoint in the dataset. Given these constructed tasks as a (meta-)training set, a <a href="https://arxiv.org/abs/1703.03400">meta</a>-<a href="https://arxiv.org/abs/1703.05175">learning</a> algorithm can be applied to learn a model that can be rapidly adapted to produce a solution to any of these $2N$ maximum likelihood problems. Equipped with this scheme for quickly solving maximum likelihood problems, we can produce CNML predictions around $400\times$ faster than was previously possible. Prior work studied this problem from a Bayesian perspective, but we found that it often scales poorly for the problems we considered.</p>
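The task-construction step can be sketched in a few lines (the helper name and task format are ours for illustration, not from the released code): each of the $N$ datapoints is relabeled in turn as a positive and as a negative example, with the rest of the dataset left unchanged.

```python
import numpy as np

def build_meta_nml_tasks(X, y):
    """Build the 2N meta-NML tasks for an N-point binary dataset.

    Task (i, c) is the original dataset with point i relabeled to class c;
    solving it to convergence is exactly the maximum likelihood problem CNML
    needs for query point i and candidate label c. A meta-learner trained on
    all 2N such tasks can then adapt to any one of them very quickly."""
    tasks = []
    for i in range(len(X)):
        for c in (0, 1):
            y_task = y.copy()
            y_task[i] = c
            tasks.append({"X": X, "y": y_task, "query": i, "label": c})
    return tasks
```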
<p>Equipped with a tool for efficiently producing predictions from the CNML distribution, we can now return to the goal of solving outcome-driven RL with uncertainty aware classifiers, resulting in an algorithm we call MURAL.</p>
<h1 id="mural-meta-learning-uncertainty-aware-rewards-for-automated-reinforcement-learning">MURAL: Meta-Learning Uncertainty-Aware Rewards for Automated Reinforcement Learning</h1>
<p>To more effectively solve outcome-driven RL problems, we incorporate meta-NML into the standard classifier-based procedure as follows:</p>
<ol>
<li>After each epoch of RL, we sample a batch of $n$ points from the replay buffer and use them to construct $2n$ meta-tasks.</li>
<li>We then run $1$ iteration of meta-training on our model.</li>
<li>We assign rewards using NML, where the NML outputs are approximated using only one gradient step for each input point.</li>
</ol>
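The reward-assignment step can be sketched as follows, again with a linear classifier standing in for the meta-learned network and with hypothetical helper names (`adapt_one_step`, `meta_nml_reward`); the meta-training that produces the initialization `w_meta` (e.g. via MAML) is omitted here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def adapt_one_step(w_meta, X, y, lr=0.1):
    """A single logistic-regression gradient step from the meta-learned
    parameters, approximating a full maximum likelihood fit."""
    p = sigmoid(X @ w_meta)
    return w_meta - lr * X.T @ (p - y) / len(y)

def meta_nml_reward(w_meta, X_batch, y_batch, x_query, lr=0.1):
    """Approximate NML probability that x_query is a successful outcome.

    Adapt the meta-model once toward labeling the query a failure (0) and
    once toward labeling it a success (1), then normalize the two adapted
    likelihoods to obtain the reward."""
    scores = []
    for label in (0.0, 1.0):
        X_aug = np.vstack([X_batch, x_query])
        y_aug = np.append(y_batch, label)
        w = adapt_one_step(w_meta, X_aug, y_aug, lr)
        p1 = sigmoid(x_query @ w)
        scores.append(p1 if label == 1.0 else 1.0 - p1)
    return scores[1] / (scores[0] + scores[1])
```

This reward replaces the success probability of a standard classifier, so states the adapted model can easily relabel as successes (i.e. uncertain, less-visited states) receive optimistic rewards.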
<p>The resulting algorithm, which we call MURAL, replaces the classifier portion of standard classifier-based RL algorithms with a meta-NML model instead. Although meta-NML can only evaluate input points one at a time instead of in batches, it is substantially faster than naive CNML, and MURAL is still comparable in runtime to standard classifier-based RL, as shown in Table 1 below.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_11.png" width="60%" />
<br />
<i>Table 1. Runtimes for a single epoch of RL on the 2D maze task.</i>
</p>
<p>We evaluate MURAL on a variety of navigation and robotic manipulation tasks, which present several challenges including local optima and difficult exploration. MURAL solves all of these tasks successfully, outperforming prior classifier-based methods as well as standard RL with exploration bonuses.</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_12.gif" width="20%" />
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_13.gif" width="20%" />
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_14.gif" width="20%" />
<br />
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_15.gif" width="20%" />
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_16.gif" width="20%" />
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_17.gif" width="20%" />
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_18.gif" width="20%" />
<br />
<i>Visualization of behaviors learned by MURAL. MURAL is able to perform a variety of behaviors in navigation and manipulation tasks, inferring rewards from outcome examples.</i>
</p>
<p style="text-align:center;">
<img src="https://bair.berkeley.edu/static/blog/mural/MURAL_19.png" width="100%" />
<br />
<i>Quantitative comparison of MURAL to baselines. MURAL outperforms baselines that perform task-agnostic exploration as well as those that use standard maximum likelihood classifiers.</i>
</p>
<p>This suggests that meta-NML-based classifiers offer an effective way to provide rewards for outcome-driven RL problems, with benefits in terms of both exploration and directed reward shaping.</p>
<h1 id="takeaways">Takeaways</h1>
<p>In conclusion, we showed how outcome driven RL can define a class of more tractable RL problems. Standard methods using classifiers can often fall short in these settings as they are unable to provide any benefits of exploration or guidance towards the goal. Leveraging a scheme for training uncertainty aware classifiers via conditional normalized maximum likelihood allows us to more effectively solve this problem, providing benefits in terms of exploration and reward shaping towards successful outcomes. The general principles defined in this work suggest that considering tractable approximations to the general RL problem may allow us to simplify the challenge of reward specification and exploration in RL while still encompassing a rich class of control problems.</p>
<hr />
<p><i> This post is based on the paper “<a href="https://arxiv.org/abs/2107.07184">MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning</a>”, which was presented at ICML 2021. You can see results <a href="https://sites.google.com/view/mural-rl">on our website</a>, and we <a href="https://github.com/mural-rl/mural">provide code</a> to reproduce our experiments.</i></p>
Fri, 22 Oct 2021 03:00:00 -0700
http://bair.berkeley.edu/blog/2021/10/22/mural/