An earlier version of this post was published on Off the Convex
Path. It is reposted here with permission.
In the last few years, deep learning practitioners have proposed a litany of
different sequence models. Although recurrent neural networks were once the
tool of choice, now models like the autoregressive
Wavenet or the Transformer
are replacing RNNs on a diverse set of tasks. In this post, we explore the
trade-offs between recurrent and feed-forward models. Feed-forward models can
offer improvements in training stability and speed, while recurrent models are
strictly more expressive. Intriguingly, this added expressivity does not seem to
boost the performance of recurrent models. Several groups have shown
feed-forward networks can match the results of the best recurrent models on
benchmark sequence tasks. This phenomenon raises an interesting question:
When and why can feed-forward networks replace recurrent neural networks
without a loss in performance?
We discuss several proposed answers to this question and highlight our
recent work that offers an explanation in
terms of a fundamental stability property.
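As a rough intuition for how a feed-forward model could stand in for a recurrent one, consider the toy sketch below. It is an illustration under stated assumptions, not the analysis from the work mentioned above: the scalar linear recurrence, the names `recurrent_output` and `windowed_output`, and the input sequence are all made up. The assumption is that the recurrence forgets old inputs (here `|a| < 1`), so a model that sees only the last `k` inputs ends up close to the full recurrent computation.

```python
def recurrent_output(xs, a=0.5, b=1.0):
    """Scalar linear recurrence h_t = a * h_{t-1} + b * x_t, run over the full sequence."""
    h = 0.0
    for x in xs:
        h = a * h + b * x
    return h


def windowed_output(xs, k, a=0.5, b=1.0):
    """Feed-forward stand-in: the same computation, but only the last k inputs are visible."""
    return recurrent_output(xs[-k:], a, b)


xs = [1.0] * 50  # made-up input sequence
full = recurrent_output(xs)

# Gap between the full recurrence and the finite-window version for several window sizes.
gaps = {k: abs(full - windowed_output(xs, k)) for k in (1, 5, 10, 20)}
for k, gap in gaps.items():
    print(f"window={k:2d}  gap={gap:.2e}")
```

In this toy setting the gap shrinks geometrically as the window grows. If `|a|` were 1 or larger, old inputs would never fade and no finite window would suffice.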
Learning a new skill by observing another individual, the ability to imitate, is
a key part of intelligence in humans and animals. Can we enable a robot to do the
same, learning to manipulate a new object by simply watching a human
manipulating the object just as in the video below?
The robot learns to place the peach into the red bowl after watching the human do so.
We are excited by the interest and excitement generated by our BDD100K dataset.
Our data release and blog post were covered in an unsolicited article by
the UC Berkeley newspaper, the Daily Cal, which was then picked up by other news
services without our prompting or intervention. The paper describing this
dataset is under review at the ECCV 2018 conference, and we followed the rules
of that conference (as communicated to us by the Program Chairs in a prompt email
response when we asked for clarification following the reporter’s request; the
ECCV PCs replied that ECCV follows CVPR’s long-standing policy). We thus
declined to speak to the reporters after they reached out to us. We did not, and
have not, communicated with any media outlets regarding this story.
While the Daily Cal article was accurate, other media outlets that
followed in reporting the story unfortunately made claims that were attributed to us
incorrectly, and which do not represent our view. In particular, several media
outlets attributed to us a claim that the BDD100K dataset was “800 times” bigger
than other industrial datasets, specifically mentioning Baidu’s ApolloScape.
While it is true our dataset does contain more raw images than other datasets,
including Baidu’s, the stated claim is misleading and we did not put that line
or anything like it in a paper, blog post, or spoken comment to anyone. It
appears that some reporter(s) viewed the data in tables in our paper and came
up with this conclusory comment themselves as it made an exciting headline, yet
attributed it to us. In fact, it is inappropriate in our view to summarize the
difference between our dataset and Baidu’s in a single comment that ours is 800x
larger. Comparing the number of raw images directly is not the most appropriate
way to compare these types of datasets.
Importantly, different datasets focus on different aspects of the autonomous
driving challenge. Our dataset is crowd-sourced, and covers a very large area
and diverse visual phenomena (indeed significantly more diverse than previous
efforts, in our view), but it is very clearly limited to monocular RGB image
data and associated mobile device metadata. Other dataset collection efforts are
complementary in our view. Baidu’s, KITTI, and CityScapes each contain important
additional sensing modalities and are collected with fully calibrated apparatus
including actuation channels. (The dataset from Mapillary is also notable, and
similar to ours in being diverse, crowd-sourced, and densely annotated, but
differs in that we include video and dynamic metadata relevant to driving
control.) We look forward to projects at Berkeley and elsewhere that leverage
both BDD100K and these other datasets as the research community brings the
potential of autonomous driving to reality.
Machine learning systems trained to minimize prediction error may often exhibit
discriminatory behavior based on sensitive characteristics such as race and
gender. One reason could be historical bias in the data. In various
application domains including lending, hiring, criminal justice, and
advertising, machine learning has been criticized for its potential to harm
historically underrepresented or disadvantaged groups.
In this post, we talk about our recent work on aligning decisions made by
machine learning with long-term social welfare goals. Commonly, machine learning
models produce a score that summarizes information about an individual in
order to make decisions about them. For example, a credit score summarizes an
individual’s credit history and financial activities in a way that informs the
bank about their creditworthiness. Let us continue to use the lending setting as
a running example.
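To make the running example concrete, here is a minimal sketch of a score-based lending decision. The thresholding rule, the function names `approve` and `approval_rate`, the group labels, and every number are assumptions made up for illustration; the text above only says that a score informs the decision.

```python
def approve(score, threshold):
    """Hypothetical decision rule: lend when the score clears a fixed threshold."""
    return score >= threshold


def approval_rate(scores, threshold):
    """Fraction of applicants in a list of scores who would be approved."""
    return sum(approve(s, threshold) for s in scores) / len(scores)


# Made-up credit scores for two made-up groups of applicants.
applicants = {
    "group_a": [580, 640, 700, 720, 760],
    "group_b": [540, 600, 620, 690, 710],
}
threshold = 650  # made-up cutoff

rates = {group: approval_rate(scores, threshold) for group, scores in applicants.items()}
for group, rate in rates.items():
    print(group, rate)
```

Even this simple rule shows that a single threshold can yield different approval rates across groups whenever their score distributions differ.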
You’ve decided that you want to bike from your house by UC Berkeley to the
Golden Gate Bridge. It’s a nice 20 mile ride, but there’s a problem: you’ve
never ridden a bike before! To make matters worse, you are new to the Bay Area,
and all you have is a good ol’ fashioned map to guide you. How do you get started?
Let’s first figure out how to ride a bike. One strategy would be to do a lot of
studying and planning. Read books on how to ride bicycles. Study physics and
anatomy. Plan out all the different muscle movements that you’ll make in
response to each perturbation. This approach is noble, but anyone who’s ever
learned to ride a bike knows that this strategy is doomed to fail. There’s
only one way to learn how to ride a bike: trial and error. Some tasks like
riding a bike are just too complicated to plan out in your head.
Once you’ve learned how to ride your bike, how would you get to the Golden Gate
Bridge? You could reuse your trial-and-error strategy. Take a few random turns
and see if you end up at the Golden Gate Bridge. Unfortunately, this strategy
would take a very, very long time. For this sort of problem, planning is a much
faster strategy, and requires considerably less real-world experience and
trial-and-error. In reinforcement learning terms, it is more sample-efficient.
Left: some skills you learn by trial and error. Right: other times, planning
ahead is better.
While simple, this thought experiment highlights some important aspects of human
intelligence. For some tasks, we use a trial-and-error approach, and for others
we use a planning approach. A similar phenomenon seems to have emerged in
reinforcement learning (RL). In the parlance of RL, empirical results show that
some tasks are better suited for model-free (trial-and-error) approaches, and
others are better suited for model-based (planning) approaches.
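The navigation half of the thought experiment can be sketched in a few lines. This is a toy illustration, not an RL algorithm from the post: the 6x6 grid, the names `neighbors`, `random_walk_steps`, and `planned_steps`, and the fact that the map is handed to the planner rather than learned are all assumptions. It compares how many real-world moves a random walk needs with the length of a route found by searching the map.

```python
import random
from collections import deque

SIZE = 6                          # hypothetical 6x6 street grid
START, GOAL = (0, 0), (5, 5)
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def neighbors(cell):
    """Cells reachable in one move; this function plays the role of the map."""
    x, y = cell
    return [(x + dx, y + dy) for dx, dy in MOVES
            if 0 <= x + dx < SIZE and 0 <= y + dy < SIZE]


def random_walk_steps(rng, max_steps=100_000):
    """Trial and error: wander at random, counting real moves until the goal (or a cap)."""
    cell, steps = START, 0
    while cell != GOAL and steps < max_steps:
        cell = rng.choice(neighbors(cell))
        steps += 1
    return steps


def planned_steps():
    """Planning: breadth-first search over the map, with no real-world moves while searching."""
    frontier, dist = deque([START]), {START: 0}
    while frontier:
        cell = frontier.popleft()
        if cell == GOAL:
            return dist[cell]
        for nxt in neighbors(cell):
            if nxt not in dist:
                dist[nxt] = dist[cell] + 1
                frontier.append(nxt)
    return None


rng = random.Random(0)
trials = [random_walk_steps(rng) for _ in range(20)]
print("planned route length:", planned_steps())
print("average random-walk length:", sum(trials) / len(trials))
```

The planner needs the map up front, which is exactly what the random walker lacks; the walker pays for that with many more real moves.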
However, the biking analogy also highlights that the two systems are not
completely independent. In particular, to say that learning to ride a bike is
just trial-and-error is an oversimplification. In fact, when learning to
bike by trial-and-error, you’ll employ a bit of planning. Perhaps your plan will
initially be, “Don’t fall over.” As you improve, you’ll make more ambitious
plans, such as, “Bike forwards for two meters without falling over.” Eventually,
your bike-riding skills will be so proficient that you can start to plan in very
abstract terms (“Bike to the end of the road.”) to the point that all there is
left to do is planning and you no longer need to worry about the nitty-gritty
details of riding a bike. We see that there is a gradual transition from the
model-free (trial-and-error) strategy to a model-based (planning) strategy. If
we could develop artificial intelligence algorithms--and specifically RL
algorithms--that mimic this behavior, it could result in an algorithm that both
performs well (by using trial-and-error methods early on) and is sample
efficient (by later switching to a planning approach to achieve more abstract goals).
This post covers temporal difference models (TDMs), a type of RL algorithm that
captures this smooth transition between model-free and model-based RL. Before
describing TDMs, we start by describing how a typical model-based RL
algorithm works.
A blind, autonomous pilot (left), suboptimal human pilot (center), and combined human-machine team (right) play the Lunar Lander game.
Imagine a drone pilot remotely flying a quadrotor, using an onboard camera to navigate and land. Unfamiliar flight dynamics, terrain, and network latency can make this system challenging for a human to control. One approach to this problem is to train an autonomous agent to perform tasks like patrolling and mapping without human intervention. This strategy works well when the task is clearly specified and the agent can observe all the information it needs to succeed. Unfortunately, many real-world applications that involve human users do not satisfy these conditions: the user's intent is often private information that the agent cannot directly access, and the task may be too complicated for the user to precisely define. For example, the pilot may want to track a set of moving objects (e.g., a herd of animals) and change object priorities on the fly (e.g., focus on individuals who unexpectedly appear injured). Shared autonomy addresses this problem by combining user input with automated assistance; in other words, augmenting human control instead of replacing it.
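One simple way to picture "augmenting human control instead of replacing it" is the sketch below. It is a hypothetical blending rule, not necessarily the method this post goes on to describe: the function `assist`, the `tolerance` parameter, the action names, and the value estimates in `q_values` are all made up. The idea is that the user's action is kept whenever the agent estimates it to be nearly as good as the best available action, and is overridden otherwise.

```python
def assist(user_action, q_values, tolerance):
    """Keep the user's action unless the agent estimates it to be much worse than the best one.

    q_values maps each action to the agent's (hypothetical) estimate of how good it is.
    tolerance controls how much estimated value the agent gives up to defer to the user.
    """
    best_action = max(q_values, key=q_values.get)
    if q_values[user_action] >= q_values[best_action] - tolerance:
        return user_action
    return best_action


# Made-up value estimates for a lander with four discrete actions.
q_values = {"left": 0.2, "main": 1.0, "right": 0.9, "noop": -2.0}

print(assist("right", q_values, tolerance=0.3))  # close to the best estimate, so the user keeps control
print(assist("noop", q_values, tolerance=0.3))   # far below the best estimate, so the agent steps in
```

Setting `tolerance` to zero gives a fully autonomous agent, and a very large `tolerance` gives the user full control, so one parameter spans the range between the two.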
Simulated humanoid performing a variety of highly dynamic and acrobatic skills.
Motion control problems have become standard benchmarks for reinforcement
learning, and deep RL methods have been shown to be effective for a diverse
suite of tasks ranging from manipulation to locomotion. However, characters
trained with deep RL often exhibit unnatural behaviours, bearing artifacts such
as jittering, asymmetric gaits, and excessive movement of
limbs. Can we train our characters to produce more natural behaviours?
Left: Given movie poster, Right: New movie title generated by MC-GAN.
Text is a prominent visual element of 2D design. Artists invest significant time
into designing glyphs that are visually compatible with other elements in their
shape and texture. This process is labor intensive and artists often design only
the subset of glyphs that are necessary for a title or an annotation, which
makes it difficult to alter the text after the design is created, or to transfer
an observed instance of a font to your own project.
Early research on glyph synthesis focused on geometric modeling of outlines,
which is limited to particular glyph topologies (e.g., cannot be applied to
decorative or hand-written glyphs) and cannot be used with image input.
With the rise of deep neural networks, researchers have looked at modeling
glyphs from images. On the other hand, synthesizing data consistent with
partial observations is an interesting problem in computer vision and graphics,
with examples such as multi-view image generation, completing missing regions in images,
and generating 3D shapes. Font data is an example that provides a clean factorization
of style and content.
Recent advances in conditional generative adversarial networks (cGANs) have
been successful in many generative applications. However, they do best only with
fairly specialized domains and not with general or multi-domain style transfer.
Similarly, when directly used to generate fonts, cGAN models produce significant
artifacts. For instance, given the following five letters,
a conditional GAN model is not successful in generating all 26 letters with the same style:
Humans physically interact with each other every day. From grabbing someone’s hand when they are about to spill their drink to giving a friend a nudge to steer them in the right direction, physical interaction is an intuitive way to convey information about personal preferences and how to perform a task correctly.
So why aren’t we physically interacting with current robots the way we do with each other? Seamless physical interaction between a human and a robot requires a lot: lightweight robot designs, reliable torque or force sensors, safe and reactive control schemes, the ability to predict the intentions of human collaborators, and more! Luckily, robotics has made many advances in the design of personal robots specifically developed with humans in mind.
However, consider the example from the beginning where you grab your friend’s hand as they are about to spill their drink. Instead of your friend who is spilling, imagine it was a robot. Because state-of-the-art robot planning and control algorithms typically assume human physical interventions are disturbances, once you let go of the robot, it will resume its erroneous trajectory and continue spilling the drink. The key to this gap comes from how robots reason about physical interaction: instead of thinking about why the human physically intervened and replanning in accordance with what the human wants, most robots simply resume their original behavior after the interaction ends.
We argue that robots should treat physical human interaction as useful information about how they should be doing the task. We formalize reacting to physical interaction as an objective (or reward) learning problem and propose a solution that enables robots to change their behaviors while they are performing a task according to the information gained during these interactions.
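To give a feel for what learning an objective from a physical correction could look like, here is a minimal sketch. It rests on assumptions that the text above does not confirm: a reward that is linear in hand-picked trajectory features, a simple additive weight update, and made-up feature values. The names `reward`, `update_weights`, and `best_trajectory` are hypothetical.

```python
def reward(theta, features):
    """Assumed linear reward: a weighted sum of trajectory features."""
    return sum(t * f for t, f in zip(theta, features))


def update_weights(theta, planned_features, corrected_features, step_size=1.0):
    """Shift weights toward the features that the human's correction increased."""
    return [t + step_size * (c - p)
            for t, p, c in zip(theta, planned_features, corrected_features)]


# Made-up features per trajectory: [efficiency, cup_upright].
candidates = {"tilted": [0.9, 0.2], "upright": [0.6, 0.9]}


def best_trajectory(theta):
    """Pick the candidate trajectory with the highest reward under the current weights."""
    return max(candidates, key=lambda name: reward(theta, candidates[name]))


theta = [1.0, 0.0]                      # the robot initially cares only about efficiency
before = best_trajectory(theta)

# The human pushes the arm so the cup stays upright; features of the corrected motion are made up.
theta = update_weights(theta, candidates["tilted"], [0.7, 0.8])
after = best_trajectory(theta)

print(before, "->", after)
```

In this sketch the robot does not merely resume its old plan once the person lets go: the correction changes the weights, and the changed weights change which trajectory it prefers.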