Our method learns complex behaviors by training offline from prior datasets
(expert demonstrations, data from previous experiments, or random exploration
data) and then fine-tuning quickly with online interaction.
Robots trained with reinforcement learning (RL) have the potential to be used
across a huge variety of challenging real world problems. To apply RL to a new
problem, you typically set up the environment, define a reward function, and
train the robot to solve the task by allowing it to explore the new environment
from scratch. While this may eventually work, these “online” RL methods are
data-hungry, and repeating this data-inefficient process for every new problem
makes it difficult to apply online RL to real-world robotics problems. What if
instead of repeating the data collection and learning process from scratch
every time, we were able to reuse data across multiple problems or experiments?
By doing so, we could greatly reduce the burden of data collection with every
new problem that is encountered. With hundreds to thousands of robot
experiments being constantly run, it is of crucial importance to devise an RL
paradigm that can effectively use the large amount of already available data
while still continuing to improve behavior on new tasks.
The first step toward a data-driven RL paradigm is to consider
the general idea of offline (batch) RL. Offline RL considers the problem of
learning optimal policies from arbitrary off-policy data, without any further
exploration. This can eliminate the data collection problem in RL and
incorporate data from arbitrary sources including other robots or
teleoperation. However, depending on the quality of available data and the
problem being tackled, we will often need to augment offline training with
targeted online improvement. This problem setting actually has unique
challenges of its own. In this blog post, we discuss how we can move RL from
training from scratch with every new problem to a paradigm which is able to
reuse prior data effectively, with some offline training followed by online
Editor’s Note: The following blog is a special guest post by a recent graduate
of Berkeley BAIR’s AI4ALL summer program for high school students.
AI4ALL is a nonprofit dedicated to increasing diversity and inclusion in AI
education, research, development, and policy.
The idea for AI4ALL began in early 2015 with Prof. Olga Russakovsky, then
a Stanford University Ph.D. student, AI researcher Prof. Fei-Fei Li, and Rick
Sommer – Executive Director of Stanford Pre-Collegiate Studies. They founded
SAILORS as a summer outreach program for high school girls to learn about
human-centered AI, which later became AI4ALL. In 2016, Prof. Anca Dragan
started the Berkeley/BAIR AI4ALL camp, geared towards high school students from
The case fatality rate quantifies how dangerous COVID-19 is, and how risk of death varies with strata
like geography, age, and race. Current estimates of the COVID-19 case fatality rate (CFR) are biased
for dozens of reasons, from under-testing of asymptomatic cases to government misreporting. We provide
a careful and comprehensive overview of these biases and show how statistical thinking and modeling can
combat such problems. Most importantly, data quality is key to unbiased CFR estimation. We show that a
relatively small dataset collected via careful contact tracing would enable simple and potentially more
accurate CFR estimation.
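As a rough illustration of the under-testing bias mentioned above, the sketch below computes a naive CFR as reported deaths divided by confirmed cases. The numbers and the names naive_cfr and detection_rate are hypothetical and are not taken from the paper. The sketch also assumes that every death is counted, which real data rarely guarantees.

```python
def naive_cfr(deaths: int, confirmed_cases: int) -> float:
    """Naive case fatality rate: reported deaths / confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return deaths / confirmed_cases


# Hypothetical population: 10,000 people infected, 100 of whom die.
true_infections = 10_000
deaths = 100

# Assumption: only a quarter of infections are ever confirmed by a test
# (for example, because asymptomatic cases are rarely tested).
detection_rate = 0.25
confirmed_cases = int(true_infections * detection_rate)

naive_estimate = naive_cfr(deaths, confirmed_cases)
fatality_among_all_infected = deaths / true_infections

print(f"naive CFR from confirmed cases: {naive_estimate:.2%}")
print(f"fatality among all infections:  {fatality_among_all_infected:.2%}")
```

In this toy setting the naive estimate is inflated by a factor of 1 / detection_rate, which is one reason a small but carefully traced dataset can be more informative than a large, unevenly tested one.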
Despite recent advances in artificial intelligence (AI) research, human
children are still by far the best learners we know of, learning impressive
skills like language and high-level reasoning from very little data. Children’s
learning is supported by highly efficient, hypothesis-driven exploration: in
fact, they explore so well that many machine learning researchers have been
inspired to put videos like the one below in their talks to motivate research
into exploration methods. However, because applying results from studies in
developmental psychology can be difficult, this video is often the extent to
which such research actually connects with human cognition.
A time-lapse of a baby playing with toys.
A remarkable characteristic of human intelligence is our ability to learn tasks
quickly. Most humans can learn reasonably complex skills like tool-use and
gameplay within just a few hours, and understand the basics after only a few
attempts. This suggests that data-efficient learning may be a meaningful part
of developing broader intelligence.
On the other hand, Deep Reinforcement Learning (RL) algorithms can achieve
superhuman performance on games like Atari, Starcraft, Dota, and Go, but
require large amounts of data to get there. Achieving superhuman performance on
Dota took over 10,000 human years of gameplay. Unlike in simulation, skill
acquisition in the real world is constrained by wall-clock time. In order to
see similar breakthroughs to AlphaGo in real-world settings, such as robotic
manipulation and autonomous vehicle navigation, RL algorithms need to be
data-efficient — they need to learn effective policies within a reasonable
amount of time.
To date, it has been commonly assumed that RL operating on coordinate state is
significantly more data-efficient than pixel-based RL. However, coordinate
state is just a human crafted representation of visual information. In
principle, if the environment is fully observable, we should also be able to
learn representations that capture the state.
Many neural network architectures that underlie various artificial intelligence systems today bear an interesting similarity to the early computers of a century ago.
Just as early computers were specialized circuits for specific purposes like solving linear systems or cryptanalysis, so too does the trained neural network generally function as a specialized circuit for performing a specific task, with all parameters coupled together in the same global scope.
One might naturally wonder what it might take for learning systems to scale in complexity in the same way as programmed systems have.
And if the history of how abstraction enabled computer science to scale gives any indication, one possible place to start would be to consider what it means to build complex learning systems at multiple levels of abstraction, where each level of learning is the emergent consequence of learning from the layer below.
This post discusses our recent paper that introduces a framework for societal decision-making, a perspective on reinforcement learning through the lens of a self-organizing society of primitive agents.
We prove the optimality of an incentive mechanism for engineering the society to optimize a collective objective.
Our work also provides suggestive evidence that the local credit assignment scheme of the decentralized reinforcement learning algorithms we develop to train the society facilitates more efficient transfer to new tasks.
In the last decade, one of the biggest drivers of success in machine learning has arguably been the rise of high-capacity models such as neural networks, along with large datasets such as ImageNet, to produce accurate models. While we have seen deep neural networks applied successfully to reinforcement learning (RL) in domains such as robotics, poker, board games, and team-based video games, a significant barrier to getting these methods working on real-world problems is the difficulty of large-scale online data collection. Not only is online data collection time-consuming and expensive, it can also be dangerous in safety-critical domains such as driving or healthcare. For example, it would be unreasonable to allow reinforcement learning agents to explore, make mistakes, and learn while controlling an autonomous vehicle or treating patients in a hospital. This makes learning from pre-collected experience enticing, and we are fortunate that in many of these domains there already exist large datasets for applications such as self-driving cars, healthcare, or robotics. Therefore, the ability of RL algorithms to learn offline from these datasets (a setting referred to as offline or batch RL) has an enormous potential impact in shaping the way we build machine learning systems for the future.
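To make the offline (batch) setting concrete, here is a minimal sketch of tabular Q-learning run only on a fixed log of transitions. It is a toy illustration under stated assumptions, not the method of any paper discussed here: the four-state chain, the step helper that generates the log, and all hyperparameters are hypothetical. The learner never calls the environment; it only replays the dataset.

```python
from collections import defaultdict


def step(state, action):
    """Toy chain used only to build the log: action 0 moves left, 1 moves right.
    Reaching state 3 ends the episode with reward 1."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == 3
    return (1.0 if done else 0.0), next_state, done


# Pre-collected dataset of (state, action, reward, next_state, done) tuples.
# Assumption: the log happens to cover every state-action pair.
dataset = [(s, a, *step(s, a)) for s in range(3) for a in range(2)]


def offline_q_learning(dataset, n_actions, gamma=0.9, lr=0.5, sweeps=200):
    """Q-learning updates applied repeatedly to a fixed dataset (no exploration)."""
    q = defaultdict(lambda: [0.0] * n_actions)
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += lr * (target - q[s][a])
    return q


q = offline_q_learning(dataset, n_actions=2)
policy = {s: max(range(2), key=lambda a: q[s][a]) for s in range(3)}
print(policy)
```

The greedy policy recovered from the log moves right in every state. In this sketch the dataset covers every state-action pair; when coverage is partial, value estimates for unseen actions are unconstrained by the data, which is part of what makes the offline setting hard.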
The World is Continuously Varying
Imagine we want to train a self-driving car in New York so that we can take it
all the way to Seattle without tediously driving it for over 48 hours. We hope
our car can handle all kinds of environments on the trip and send us safely to
the destination. We know that road conditions and views can be very different.
It is intuitive to simply collect road data of this trip, let the car learn
from every possible condition, and hope it becomes the perfect self-driving car
for our New York to Seattle trip. It needs to understand the traffic and
skyscrapers in big cities like New York and Chicago, more unpredictable weather
in Seattle, mountains and forests in Montana, and all kinds of country views,
farmlands, animals, etc. However, how much data is enough? How many cities
should we collect data from? How many weather conditions should we consider? We
never know, and these questions never stop.
Figure 1: Domain boundaries are rarely clear. Therefore, it is hard to set up
definite domain descriptions for all possible domains.
Human thumb next to our OmniTact sensor, and a US penny for scale.
Touch has been shown to be important for dexterous manipulation in
robotics. Recently, the GelSight sensor has caught significant interest
for learning-based robotics due to its low cost and rich signal. For example,
GelSight sensors have been used for learning to insert USB cables (Li et al.
2014), roll a die (Tian et al. 2019), or grasp objects (Calandra
et al. 2017).
The reason why learning-based methods work well with GelSight sensors is that
they output high-resolution tactile images from which a variety of features
such as object geometry, surface texture, normal and shear forces can be
estimated that often prove critical to robotic control. The tactile images
can be fed into standard CNN-based computer vision pipelines allowing the use
of a variety of different learning-based techniques: in Calandra et al.
2017, a grasp-success classifier is trained on GelSight data collected in
a self-supervised manner; in Tian et al. 2019, Visual Foresight, a
video-prediction-based control algorithm, is used to make a robot roll a die
purely based on tactile images; and in Lambeta et al. 2020, a model-based
RL algorithm is applied to in-hand manipulation using GelSight images.
Unfortunately, applying GelSight sensors in practical real-world scenarios is
still challenging due to their large size and the fact that they are only sensitive
on one side. Here we introduce a new, more compact tactile sensor design based
on GelSight that allows for omnidirectional sensing, i.e. making the sensor
sensitive on all sides like a human finger, and show how this opens up new
possibilities for sensorimotor learning. We demonstrate this by teaching a
robot to pick up electrical plugs and insert them purely based on tactile
Humans manipulate 2D deformable structures such as fabric on a daily basis,
from putting on clothes to making beds. Can robots learn to perform similar
tasks? Successful approaches can advance applications such as dressing
assistance for senior care, folding of laundry, fabric upholstery, bed-making,
manufacturing, and other tasks. Fabric manipulation is challenging, however,
because of the difficulty in modeling system states and dynamics, meaning that
when a robot manipulates fabric, it is hard to predict the fabric’s resulting
state or visual appearance.
In this blog post, we review four recent papers from two research labs (Pieter
Abbeel’s and Ken Goldberg’s) at Berkeley AI Research (BAIR) that
investigate the following hypothesis: is it possible to apply learning-based
approaches to the problem of fabric manipulation?
We demonstrate promising results in support of this hypothesis by using a
variety of learning-based methods with fabric simulators to train smoothing
(and even folding) policies in simulation. We then perform sim-to-real transfer
to deploy the policies on physical robots. Examples of the learned policies in
action are shown in the GIFs above.
We show that deep model-free methods trained from exploration or from
demonstrations work reasonably well for specific tasks like smoothing, but it
is unclear how well they generalize to related tasks such as folding. On the
other hand, we show that deep model-based methods have more potential for
generalization to a variety of tasks, provided that the learned models are
sufficiently accurate. In the rest of this post, we summarize the papers,
emphasizing the techniques and tradeoffs in each approach.