Learning State Abstractions for Long-Horizon Planning


Many tasks that we do on a regular basis, such as navigating a city, cooking a meal, or loading a dishwasher, require planning over extended periods of time. Accomplishing these tasks may seem simple to us; however, reasoning over long time horizons remains a major challenge for today’s Reinforcement Learning (RL) algorithms. While unable to plan over long horizons, deep RL algorithms excel at learning policies for short horizon tasks, such as robotic grasping, directly from pixels. At the same time, classical planning methods such as Dijkstra’s algorithm and A$^*$ search can plan over long time horizons, but they require hand-specified or task-specific abstract representations of the environment as input.

To achieve the best of both worlds, state-of-the-art visual navigation methods have applied classical search methods to learned graphs. In particular, SPTM [2] and SoRB [3] use a replay buffer of observations as nodes in a graph and learn a parametric distance function to draw edges in the graph. These methods have been successfully applied to long-horizon simulated navigation tasks that were too challenging for previous methods to solve.


EvolveGraph: Dynamic Neural Relational Reasoning for Interacting Systems


Multi-agent interacting systems are prevalent in the world, from purely physical systems to complicated social dynamic systems. The interactions between entities / components can give rise to very complex behavior patterns at the level of both individuals and the multi-agent system as a whole. Since usually only the trajectories of individual entities are observed without any knowledge of the underlying interaction patterns, and there are usually multiple possible modalities for each agent with uncertainty, it is challenging to model their dynamics and forecast their future behaviors.

Figure 1. Typical multi-agent interacting systems.

In many real-world applications (e.g. autonomous vehicles, mobile robots), an effective understanding of the situation and accurate trajectory prediction of interactive agents play a significant role in downstream tasks, such as decision making and planning. We introduce a generic trajectory forecasting framework (named EvolveGraph) with explicit relational structure recognition and prediction via latent interaction graphs among multiple heterogeneous, interactive agents. Considering the uncertainty of future behaviors, the model is designed to provide multi-modal prediction hypotheses. Since the underlying interactions may evolve even with abrupt changes over time, and different modalities of evolution may lead to different outcomes, we address the necessity of dynamic relational reasoning and adaptively evolving the interaction graphs.


Training on Test Inputs with Amortized Conditional Normalized Maximum Likelihood


Current machine learning methods provide unprecedented accuracy across a range of domains, from computer vision to natural language processing. However, in many important high-stakes applications, such as medical diagnosis or autonomous driving, rare mistakes can be extremely costly, and thus effective deployment of learned models requires not only high accuracy, but also a way to measure the certainty in a model’s predictions. Reliable uncertainty quantification is especially important when faced with out-of-distribution inputs, as model accuracy tends to degrade heavily on inputs that differ significantly from those seen during training. In this blog post, we will discuss how we can get reliable uncertainty estimation with a strategy that does not simply rely on a learned model to extrapolate to out-of-distribution inputs, but instead asks: “given my training data, which labels would make sense for this input?”.


Goodhart’s Law, Diversity and a Series of Seemingly Unrelated Toy Problems


Goodhart’s Law is an adage which states the following:

“When a measure becomes a target, it ceases to be a good measure.”

This is particularly pertinent in machine learning, where the source of many of our greatest achievements comes from optimizing a target in the form of a loss function. The most prominent way to do so is with stochastic gradient descent (SGD), which applies a simple rule, follow the gradient:

For some step size $\alpha$. Updates of this form have led to a series of breakthroughs from computer vision to reinforcement learning, and it is easy to see why it is so popular: 1) it is relatively cheap to compute using backprop 2) it is guaranteed to locally reduce the loss at every step and finally 3) it has an amazing track record empirically.


Adapting on the Fly to Test Time Distribution Shift


Imagine that you are building the next generation machine learning model for handwriting transcription. Based on previous iterations of your product, you have identified a key challenge for this rollout: after deployment, new end users often have different and unseen handwriting styles, leading to distribution shift. One solution for this challenge is to learn an adaptive model that can specialize and adjust to each user’s handwriting style over time. This solution seems promising, but it must be balanced against concerns about ease of use: requiring users to provide feedback to the model may be cumbersome and hinder adoption. Is it possible instead to learn a model that can adapt to new users without labels?


Reinforcement learning is supervised learning on optimized data


The two most common perspectives on Reinforcement learning (RL) are optimization and dynamic programming. Methods that compute the gradients of the non-differentiable expected reward objective, such as the REINFORCE trick are commonly grouped into the optimization perspective, whereas methods that employ TD-learning or Q-learning are dynamic programming methods. While these methods have shown considerable success in recent years, these methods are still quite challenging to apply to new problems. In contrast deep supervised learning has been extremely successful and we may hence ask: Can we use supervised learning to perform RL?

In this blog post we discuss a mental model for RL, based on the idea that RL can be viewed as doing supervised learning on the “good data”. What makes RL challenging is that, unless you’re doing imitation learning, actually acquiring that “good data” is quite challenging. Therefore, RL might be viewed as a joint optimization problem over both the policy and the data. Seen from this supervised learning perspective, many RL algorithms can be viewed as alternating between finding good data and doing supervised learning on that data. It turns out that finding “good data” is much easier in the multi-task setting, or settings that can be converted to a different problem for which obtaining “good data” is easy. In fact, we will discuss how techniques such as hindsight relabeling and inverse RL can be viewed as optimizing data.


Plan2Explore: Active Model-Building for Self-Supervised Visual Reinforcement Learning


This post is cross-listed on the CMU ML blog.

To operate successfully in unstructured open-world environments, autonomous intelligent agents need to solve many different tasks and learn new tasks quickly. Reinforcement learning has enabled artificial agents to solve complex tasks both in simulation and real-world. However, it requires collecting large amounts of experience in the environment, and the agent learns only that particular task, much like a student memorizing a lecture without understanding. Self-supervised reinforcement learning has emerged as an alternative, where the agent only follows an intrinsic objective that is independent of any individual task, analogously to unsupervised representation learning. After experimenting with the environment without supervision, the agent builds an understanding of the environment, which enables it to adapt to specific downstream tasks more efficiently.

In this post, we explain our recent publication that develops Plan2Explore. While many recent papers on self-supervised reinforcement learning have focused on model-free agents that can only capture knowledge by remembering behaviors practiced during self-supervision, our agent learns an internal world model that lets it extrapolate beyond memorized facts by predicting what will happen as a consequence of different potential actions. The world model captures general knowledge, allowing Plan2Explore to quickly solve new tasks through planning in its own imagination. In contrast to the model-free prior work, the world model further enables the agent to explore what it expects to be novel, rather than repeating what it found novel in the past. Plan2Explore obtains state-of-the-art zero-shot and few-shot performance on continuous control benchmarks with high-dimensional input images. To make it easy to experiment with our agent, we are open-sourcing the complete source code.


AWAC: Accelerating Online Reinforcement Learning with Offline Datasets


Our method learns complex behaviors by training offline from prior datasets (expert demonstrations, data from previous experiments, or random exploration data) and then fine-tuning quickly with online interaction.

Robots trained with reinforcement learning (RL) have the potential to be used across a huge variety of challenging real world problems. To apply RL to a new problem, you typically set up the environment, define a reward function, and train the robot to solve the task by allowing it to explore the new environment from scratch. While this may eventually work, these “online” RL methods are data hungry and repeating this data inefficient process for every new problem makes it difficult to apply online RL to real world robotics problems. What if instead of repeating the data collection and learning process from scratch every time, we were able to reuse data across multiple problems or experiments? By doing so, we could greatly reduce the burden of data collection with every new problem that is encountered. With hundreds to thousands of robot experiments being constantly run, it is of crucial importance to devise an RL paradigm that can effectively use the large amount of already available data while still continuing to improve behavior on new tasks.

The first step towards moving RL towards a data driven paradigm is to consider the general idea of offline (batch) RL. Offline RL considers the problem of learning optimal policies from arbitrary off-policy data, without any further exploration. This is able to eliminate the data collection problem in RL, and incorporate data from arbitrary sources including other robots or teleoperation. However, depending on the quality of available data and the problem being tackled, we will often need to augment offline training with targeted online improvement. This problem setting actually has unique challenges of its own. In this blog post, we discuss how we can move RL from training from scratch with every new problem to a paradigm which is able to reuse prior data effectively, with some offline training followed by online finetuning.


AI Will Change the World.
Who Will Change AI?
We Will.


Editor’s Note: The following blog is a special guest post by a recent graduate of Berkeley BAIR’s AI4ALL summer program for high school students.

AI4ALL is a nonprofit dedicated to increasing diversity and inclusion in AI education, research, development, and policy.

The idea for AI4ALL began in early 2015 with Prof. Olga Russakovsky, then a Stanford University Ph.D. student, AI researcher Prof. Fei-Fei Li, and Rick Sommer – Executive Director of Stanford Pre-Collegiate Studies. They founded SAILORS as a summer outreach program for high school girls to learn about human-centered AI, which later became AI4ALL. In 2016, Prof. Anca Dragan started the Berkeley/BAIR AI4ALL camp, geared towards high school students from underserved communities.


Estimating the fatality rate is difficult but doable with better data


The case fatality rate quantifies how dangerous COVID-19 is, and how risk of death varies with strata like geography, age, and race. Current estimates of the COVID-19 case fatality rate (CFR) are biased for dozens of reasons, from under-testing of asymptomatic cases to government misreporting. We provide a careful and comprehensive overview of these biases and show how statistical thinking and modeling can combat such problems. Most importantly, data quality is key to unbiased CFR estimation. We show that a relatively small dataset collected via careful contact tracing would enable simple and potentially more accurate CFR estimation.