Our method learns a task in a fixed, simulated environment and quickly adapts
to new environments (e.g. the real world) solely from online interaction during
The ability of humans to generalize their knowledge and experiences to new
situations is remarkable, yet poorly understood. For example, imagine a human
driver who has only ever driven around their city in clear weather. Even
though they never encountered true diversity in driving conditions, they have
acquired the fundamental skill of driving, and can adapt reasonably fast to
driving in neighboring cities, in rainy or windy weather, or even driving a
different car, without much practice or additional driving lessons. While
humans excel at adaptation, building intelligent systems with common-sense
knowledge and the ability to quickly adapt to new situations is a long-standing
problem in artificial intelligence.
The Successor Representation, Gamma-Models, and Infinite-Horizon Prediction
Standard single-step models have a horizon of one. This post describes a method for training predictive dynamics models in continuous state spaces with an infinite, probabilistic horizon.
Reinforcement learning algorithms are frequently categorized by whether they predict future states at any point in their decision-making process. Those that do are called model-based, and those that do not are dubbed model-free. This classification is so common that we mostly take it for granted these days; I am guilty of using it myself. However, this distinction is not as clear-cut as it may initially seem.
In this post, I will talk about an alternative view that emphasizes the mechanism of prediction instead of the content of prediction. This shift in focus brings into relief a space between model-based and model-free methods that contains exciting directions for reinforcement learning. The first half of this post describes some of the classic tools in this space, including
generalized value functions and the successor representation. The latter half is based on our recent paper about infinite-horizon predictive models, for which code is available here.
Most likely not.
Yet, OpenAI’s GPT-2 language model does know how to reach a certain Peter W--- (name redacted for privacy). When prompted with a short snippet of Internet text, the model accurately generates Peter’s contact information, including his work address, email, phone, and fax:
In our recent paper, we evaluate how large language models memorize and regurgitate such rare snippets of their training data. We focus on GPT-2 and find that at least 0.1% of its text generations (a very conservative estimate) contain long verbatim strings that are “copy-pasted” from a document in its training set.
Such memorization would be an obvious issue for language models that are trained on private data, e.g., on users’ emails, as the model might inadvertently output a user’s sensitive conversations. Yet, even for models that are trained on public data from the Web (e.g., GPT-2, GPT-3, T5, RoBERTa, TuringNLG), memorization of training data raises multiple challenging regulatory questions, ranging from misuse of personally identifiable information to copyright infringement.
Deep reinforcement learning has made significant progress in the last few years, with success stories in robotic control, game playing, and science problems. While RL methods present a general paradigm where an agent learns from its own interaction with an environment, this requirement for “active” data collection is also a major hindrance in the application of RL methods to real-world problems, since active data collection is often expensive and potentially unsafe. An alternative “data-driven” paradigm of RL, referred to as offline RL (or batch RL), has recently regained popularity as a viable path towards effective real-world RL. As shown in the figure below, offline RL requires learning skills solely from previously collected datasets, without any active environment interaction. It provides a way to utilize previously collected datasets from a variety of sources, including human demonstrations, prior experiments, domain-specific solutions, and even data from different but related problems, to build complex decision-making engines.
Many tasks that we do on a regular basis, such as navigating a city, cooking a
meal, or loading a dishwasher, require planning over extended periods of time.
Accomplishing these tasks may seem simple to us; however, reasoning over long
time horizons remains a major challenge for today’s Reinforcement Learning (RL)
algorithms. While unable to plan over long horizons, deep RL algorithms excel
at learning policies for short horizon tasks, such as robotic grasping,
directly from pixels. At the same time, classical planning methods such as
Dijkstra’s algorithm and A$^*$ search can plan over long time horizons, but
they require hand-specified or task-specific abstract representations of the
environment as input.
To achieve the best of both worlds, state-of-the-art visual navigation methods
have applied classical search methods to learned graphs. In particular, SPTM
and SoRB use a replay buffer of observations as nodes in a graph and learn
a parametric distance function to draw edges in the graph. These methods have
been successfully applied to long-horizon simulated navigation tasks that were
too challenging for previous methods to solve.
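
To make the idea of searching over a learned graph concrete, here is a minimal sketch. It is not the SPTM or SoRB implementation: the observations are stand-in 2-D points, the "learned" distance function is replaced by plain Euclidean distance, and the names `build_graph`, `dijkstra`, and `max_edge` are hypothetical choices made for this illustration.

```python
import heapq
import math


def build_graph(nodes, distance_fn, max_edge):
    """Connect each ordered pair of nodes whose predicted distance is at most max_edge."""
    graph = {i: [] for i in range(len(nodes))}
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            if i != j:
                d = distance_fn(a, b)
                if d <= max_edge:
                    graph[i].append((j, d))
    return graph


def dijkstra(graph, start, goal):
    """Return (cost, path) for the shortest path, or (inf, []) if goal is unreachable."""
    best = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            # Walk parent pointers back to the start to recover the path.
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost, nxt))
    return math.inf, []


def euclidean(a, b):
    """Assumed stand-in for a learned parametric distance function."""
    return math.dist(a, b)


# Assumed stand-in for a replay buffer of observations.
replay_buffer = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
graph = build_graph(replay_buffer, euclidean, max_edge=1.5)
cost, path = dijkstra(graph, start=0, goal=3)
print(cost, path)
```

The edge threshold matters in this sketch: only nearby observations are connected, so the planner must chain several short hops to cover a long distance, and each hop stays within the range where a short-horizon policy could plausibly succeed.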
Multi-agent interacting systems are prevalent in the world, from purely physical systems to complicated social dynamic systems. The interactions between entities or components can give rise to very complex behavior patterns at the level of both individuals and the multi-agent system as a whole. Because usually only the trajectories of individual entities are observed, without any knowledge of the underlying interaction patterns, and each agent usually has multiple possible modalities with uncertainty, it is challenging to model their dynamics and forecast their future behaviors.
Figure 1. Typical multi-agent interacting systems.
In many real-world applications (e.g. autonomous vehicles, mobile robots), an effective understanding of the situation and accurate trajectory prediction of interactive agents play a significant role in downstream tasks, such as decision making and planning. We introduce a generic trajectory forecasting framework (named EvolveGraph) with explicit relational structure recognition and prediction via latent interaction graphs among multiple heterogeneous, interactive agents. Considering the uncertainty of future behaviors, the model is designed to provide multi-modal prediction hypotheses. Since the underlying interactions may evolve even with abrupt changes over time, and different modalities of evolution may lead to different outcomes, we address the necessity of dynamic relational reasoning and adaptively evolving the interaction graphs.
Current machine learning methods provide unprecedented accuracy across a range
of domains, from computer vision to natural language processing. However, in
many important high-stakes applications, such as medical diagnosis or
autonomous driving, rare mistakes can be extremely costly, and thus effective
deployment of learned models requires not only high accuracy, but also a way to
measure the certainty in a model’s predictions. Reliable uncertainty
quantification is especially important when faced with out-of-distribution
inputs, as model accuracy tends to degrade heavily on inputs that differ
significantly from those seen during training. In this blog post, we will
discuss how we can get reliable uncertainty estimation with a strategy that
does not simply rely on a learned model to extrapolate to out-of-distribution
inputs, but instead asks: “given my training data, which labels would make
sense for this input?”.
Goodhart’s Law is an adage which states the following:
“When a measure becomes a target, it ceases to be a good measure.”
This is particularly pertinent in machine learning, where the source of many of
our greatest achievements comes from optimizing a target in the form of a loss
function. The most prominent way to do so is with stochastic gradient descent
(SGD), which applies a simple rule, follow the gradient:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t)$$
for some step size $\alpha$. Updates of this form have led to a series of
breakthroughs from computer vision to reinforcement learning, and it is easy to
see why it is so popular: 1) it is relatively cheap to compute using backprop,
2) it is guaranteed to locally reduce the loss at every step, and finally 3) it
has an amazing track record empirically.
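
As a small illustration of the "follow the gradient" rule, the sketch below assumes the update takes its usual form, where the parameter moves by the step size $\alpha$ times the negative gradient of the loss. It uses plain (non-stochastic) gradient descent on a one-dimensional quadratic chosen for this example; the function names and the loss are assumptions, not taken from the post.

```python
def gradient_descent(grad_fn, theta, alpha, steps):
    """Repeatedly apply theta <- theta - alpha * grad_fn(theta); return all iterates."""
    history = [theta]
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
        history.append(theta)
    return history


def loss(theta):
    """Assumed toy loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def grad(theta):
    """Analytic gradient of the toy loss."""
    return 2.0 * (theta - 3.0)


history = gradient_descent(grad, theta=0.0, alpha=0.1, steps=100)
losses = [loss(t) for t in history]
print(history[-1], losses[-1])
```

On this toy problem with a small step size, the loss shrinks at every update. That is the local-improvement property the list above refers to, here in its exact-gradient form.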
Imagine that you are building the next generation machine learning model for handwriting transcription. Based on previous iterations of your product, you have identified a key challenge for this rollout: after deployment, new end users often have different and unseen handwriting styles, leading to distribution shift. One solution for this challenge is to learn an adaptive model that can specialize and adjust to each user’s handwriting style over time. This solution seems promising, but it must be balanced against concerns about ease of use: requiring users to provide feedback to the model may be cumbersome and hinder adoption. Is it possible instead to learn a model that can adapt to new users without labels?
The two most common perspectives on reinforcement learning (RL) are optimization and dynamic programming. Methods that compute the gradients of the non-differentiable expected reward objective, such as the REINFORCE trick, are commonly grouped into the optimization perspective, whereas methods that employ TD-learning or Q-learning are dynamic programming methods. While these methods have shown considerable success in recent years, they are still quite challenging to apply to new problems. In contrast, deep supervised learning has been extremely successful, and we may hence ask: Can we use supervised learning to perform RL?
In this blog post we discuss a mental model for RL, based on the idea that RL can be viewed as doing supervised learning on the “good data”. What makes RL challenging is that, unless you’re doing imitation learning, actually acquiring that “good data” is hard. Therefore, RL might be viewed as a joint optimization problem over both the policy and the data. Seen from this supervised learning perspective, many RL algorithms can be viewed as alternating between finding good data and doing supervised learning on that data. It turns out that finding “good data” is much easier in the multi-task setting, or in settings that can be converted to a different problem for which obtaining “good data” is easy. In fact, we will discuss how techniques such as hindsight relabeling and inverse RL can be viewed as optimizing data.
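
One way to picture hindsight relabeling as "optimizing data" is the sketch below. It assumes a simple goal-reaching setting with sparse rewards and relabels a failed trajectory with the state it actually reached, so that the same experience becomes a successful example for a different goal. The `Transition` type, the reward convention, and the function name are hypothetical choices for illustration, not the formulation used in the post.

```python
from typing import List, NamedTuple, Tuple

State = Tuple[int, int]


class Transition(NamedTuple):
    state: State
    action: str
    next_state: State
    goal: State
    reward: float


def relabel_with_final_state(trajectory: List[Transition]) -> List[Transition]:
    """Replace the commanded goal with the state the trajectory actually reached."""
    if not trajectory:
        raise ValueError("trajectory must contain at least one transition")
    achieved = trajectory[-1].next_state
    return [
        # Assumed sparse reward: 1.0 only on transitions that land on the new goal.
        t._replace(goal=achieved, reward=1.0 if t.next_state == achieved else 0.0)
        for t in trajectory
    ]


# The agent was asked to reach (5, 5) but only got to (2, 0): no reward anywhere.
original_goal = (5, 5)
trajectory = [
    Transition((0, 0), "right", (1, 0), original_goal, 0.0),
    Transition((1, 0), "right", (2, 0), original_goal, 0.0),
]
relabeled = relabel_with_final_state(trajectory)
print(relabeled)
```

After relabeling, the trajectory is "good data" for the goal `(2, 0)`, and a goal-conditioned policy could be trained on it with ordinary supervised learning.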