Fig 1. A learned neural network dynamics model enables a hexapod robot to learn
to run and follow desired trajectories, using just 17 minutes of real-world
Enabling robots to act autonomously in the real world is difficult. Really,
really difficult. Even with expensive robots and teams of world-class
researchers, robots still have difficulty autonomously navigating and
interacting in complex, unstructured environments.
Why are autonomous robots not out in the world among us? Engineering systems
that can cope with all the complexities of our world is hard. From nonlinear
dynamics and partial observability to unpredictable terrain and sensor
malfunctions, robots are particularly susceptible to Murphy’s law: everything
that can go wrong, will go wrong. Instead of fighting Murphy’s law by coding
each possible scenario that our robots may encounter, we could instead choose to
embrace this possibility for failure, and enable our robots to learn from it.
Learning control strategies from experience is advantageous because, unlike
hand-engineered controllers, learned controllers can adapt and improve with more
data. Therefore, when presented with a scenario in which everything does go
wrong, although the robot will still fail, the learned controller will hopefully
correct its mistake the next time it is presented with a similar scenario. In
order to deal with the complexities of tasks in the real world, current
learning-based methods often use deep neural networks, which are powerful but
not data-efficient: these trial-and-error based learners will most often still
fail a second time, and a third time, and often thousands to millions of times.
The sample inefficiency of modern deep reinforcement learning methods is one of
the main bottlenecks to leveraging learning-based methods in the real world.
We have been investigating sample-efficient learning-based approaches with
neural networks for robot control. For complex and contact-rich simulated
robots, as well as real-world robots (Fig. 1), our approach is able to learn
trajectory-following locomotion skills using only minutes of data collected
from the robot randomly acting in the environment. In this blog post, we’ll
provide an overview of our approach and results. More details can be found in
our research papers listed at the bottom of this post, including this paper
with code here.
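As a rough illustration of learning a dynamics model from randomly collected data, the sketch below fits a model to transitions gathered with random actions. It is only a toy: the 1-D system, its coefficients, and all function names are assumptions made for this example, and a linear least-squares fit stands in for the neural network used in the work described here.

```python
import random

def true_dynamics(state, action):
    """Hypothetical 1-D system, unknown to the learner."""
    return 0.9 * state + 0.5 * action

def collect_random_data(steps=200, seed=0):
    """Record (state, action, next_state) tuples while acting randomly."""
    rng = random.Random(seed)
    state, data = 0.0, []
    for _ in range(steps):
        action = rng.uniform(-1.0, 1.0)
        next_state = true_dynamics(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data

def fit_linear_model(data):
    """Least-squares fit of next_state = a * state + b * action (normal equations)."""
    sss = sum(s * s for s, _, _ in data)
    ssu = sum(s * u for s, u, _ in data)
    suu = sum(u * u for _, u, _ in data)
    ssn = sum(s * n for s, _, n in data)
    sun = sum(u * n for _, u, n in data)
    det = sss * suu - ssu * ssu
    a = (ssn * suu - ssu * sun) / det
    b = (sss * sun - ssu * ssn) / det
    return a, b

a_hat, b_hat = fit_linear_model(collect_random_data())
```

Because the toy data is noise-free, the fit recovers the hypothetical coefficients almost exactly; real robot data would be noisy and far higher-dimensional.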
Why we need Attention
What we see through our eyes is only a very small part of the world around us. At any given time our eyes are sampling only a fraction of the surrounding light field. Even within this fraction, most of the resolution is dedicated to the center of gaze which has the highest concentration of ganglion cells. These cells are responsible for conveying a retinal image from our eyes to our brain. Unlike a camera, the spatial distribution of ganglion cells is highly non-uniform. As a result, our brain receives a foveated image:
A foveated image with a center of gaze on the bee (left) and butterfly (right)
Despite the fact that these cells cover only a fraction of our visual field, roughly 30% of our cortex is still dedicated to processing the signal that they provide. You can imagine our brain would have to be impractically large to handle the full visual field at high resolution. Suffice it to say, the amount of neural processing dedicated to vision is rather large and it would be beneficial to survival if it were used efficiently.
Attention is a fundamental property of many intelligent systems. Since the resources of any physical system are limited, it is important to allocate them in an effective manner. Attention involves the dynamic allocation of information processing resources to best accomplish a specific task. In nature, we find this very apparent in the design of animal visual systems. By moving gaze rapidly within the scene, limited neural resources are effectively spread over the entire visual scene.
Toyota HSR Trained with DART to Make a Bed.
In Imitation Learning (IL), also known as Learning from Demonstration (LfD), a
robot learns a control policy from analyzing demonstrations of the policy
performed by an algorithmic or human supervisor. For example, to teach a robot
to make a bed, a human would tele-operate the robot to perform the task, providing
examples. The robot then learns a control policy, mapping from images/states to
actions which we hope will generalize to states that were not encountered during
There are two variants of IL. In the Off-Policy variant, or Behavior Cloning, the
demonstrations are given independently of the robot’s policy. However, when the
robot encounters novel risky states, it may not have learned corrective actions.
This occurs because of “covariate shift”, a known challenge in which the states
encountered during training differ from the states encountered during testing,
reducing robustness. Common approaches to reduce covariate shift are On-Policy
methods, such as DAgger, where the robot’s evolving policy is executed and the
supervisor provides corrective feedback. However, On-Policy methods can be
difficult for human supervisors, potentially dangerous, and computationally
This post presents a robust Off-Policy algorithm called DART and summarizes how
injecting noise into the supervisor’s actions can improve robustness. The
injected noise allows the supervisor to provide corrective examples for the type
of errors the trained robot is likely to make. However, because the optimized
noise is small, it alleviates the difficulties of On-Policy methods. Details on
DART are in a paper that will be presented at the 1st Conference on Robot Learning in
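The idea of injecting noise into the supervisor's actions can be sketched on a toy problem. Everything here is an assumption for illustration: the 1-D state, the `supervisor` rule, the dynamics, and the fixed noise level (the post describes an optimized noise level, which this sketch does not attempt). The rollout executes a noisy action, while the recorded label is the supervisor's intended action, so the dataset contains corrections from perturbed states.

```python
import random

def supervisor(state):
    """Hypothetical supervisor: steer a 1-D state toward zero."""
    return -0.5 * state

def collect_demo(noise_std, steps=20, seed=0):
    """Roll out the supervisor, executing noisy actions but recording clean ones."""
    rng = random.Random(seed)
    state, data = 1.0, []
    for _ in range(steps):
        clean_action = supervisor(state)
        data.append((state, clean_action))      # label is the intended action
        executed = clean_action + rng.gauss(0.0, noise_std)
        state = state + executed                # toy dynamics: x' = x + a
    return data

clean_demo = collect_demo(noise_std=0.0)
noisy_demo = collect_demo(noise_std=0.3)
```

With `noise_std=0.0` this reduces to ordinary Behavior Cloning data collection; with noise, the visited states drift away from the supervisor's nominal trajectory while every label remains a corrective action.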
We evaluate DART in simulation with an algorithmic supervisor on MuJoCo tasks
(Walker, Humanoid, Hopper, Half-Cheetah) and physical experiments with human
supervisors training a Toyota HSR robot to perform grasping in clutter, where a
robot must search through clutter for a goal object. Finally, we show how
DART can be applied in a complex system that leverages both classical robotics
and learning techniques to teach the first robot to make a bed. For
researchers who want to study and use robust Off-Policy approaches, we
additionally announce the release of
Deep imitation learning and deep reinforcement learning have the potential to learn
robot control policies that map high-dimensional sensor inputs to controls.
While these approaches have been very successful at learning short duration tasks, such
as grasping (Pinto and Gupta 2016, Levine et al. 2016) and peg insertion (Levine
et al. 2016), scaling learning to longer time horizons can require a prohibitive
amount of demonstration data—whether acquired from experts or self-supervised.
Long-duration sequential tasks suffer from the classic problem of “temporal
credit assignment”, namely, the difficulty in assigning credit (or blame) to
actions under uncertainty of the time when their consequences are observed
(Sutton 1984). However, long-term behaviors are often composed of short-term
skills that solve decoupled subtasks. Consider designing a controller for
parallel parking where the overall task can be decomposed into three phases:
pulling up, reversing, and adjusting. Similarly, assembly tasks can often be
decomposed into individual steps based on which parts need to be manipulated.
These short-term skills can be parametrized more concisely—as an analogy,
consider locally linear approximations to an overall nonlinear function—and
this reduced parametrization can be substantially easier to learn.
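The locally linear analogy can be made concrete with a small sketch. The function `x * x`, the interval, and the helper names are assumptions chosen for illustration; the point is only that a few simple local pieces can approximate a nonlinear function much better than one global piece of the same form.

```python
def piecewise_linear(f, lo, hi, pieces):
    """Return a function that linearly interpolates f between evenly spaced knots."""
    width = (hi - lo) / pieces
    knots = [lo + i * width for i in range(pieces + 1)]
    values = [f(k) for k in knots]

    def approx(x):
        i = min(int((x - lo) / width), pieces - 1)  # segment containing x
        t = (x - knots[i]) / width                  # position within the segment
        return (1 - t) * values[i] + t * values[i + 1]

    return approx

def max_error(f, approx, lo, hi, samples=301):
    """Largest absolute error of approx on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - approx(x)) for x in xs)

def square(x):
    return x * x

err_one = max_error(square, piecewise_linear(square, 0.0, 3.0, 1), 0.0, 3.0)
err_three = max_error(square, piecewise_linear(square, 0.0, 3.0, 3), 0.0, 3.0)
```

On this example the single line is off by as much as 2.25, while three local lines are never off by more than 0.25.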
This post summarizes results from three recent papers that propose algorithms
that learn to decompose a longer task into shorter subtasks. We report
experiments in the context of autonomous surgical subtasks and we believe the
results apply to a variety of applications from manufacturing to home robotics.
We present three algorithms: Transition State Clustering (TSC), Sequential
Windowed Inverse Reinforcement Learning (SWIRL), and Deep Discovery of
Continuous Options (DDCO). TSC considers robustly learning important switching
events (significant changes in motion) that occur across all demonstrations.
SWIRL proposes an algorithm that approximates a value function by a sequence of
shorter term quadratic rewards. DDCO is a general framework for imitation
learning with a hierarchical representation of the action space. In retrospect,
all three algorithms are special cases of the same general framework, where the
demonstrator’s behavior is generatively modeled as a sequential composition of
unknown closed-loop policies that switch when reaching parameterized “transition
Deep reinforcement learning (deep RL) has achieved success in many tasks, such as playing video games from raw pixels (Mnih et al., 2015), playing the game of Go (Silver et al., 2016), and simulated robotic locomotion (e.g. Schulman et al., 2015). Standard deep RL algorithms aim to master a single way to solve a given task, typically the first way that seems to work well. Therefore, training is sensitive to randomness in the environment, initialization of the policy, and the algorithm implementation. This phenomenon is illustrated in Figure 1, which shows two policies trained to optimize a reward function that encourages forward motion: while both policies have converged to a high-performing gait, these gaits are substantially different from each other.
Figure 1: Trained simulated walking robots.
[credit: John Schulman and Patrick Coady (OpenAI Gym)]
Why might finding only a single solution be undesirable? Knowing only one way to act makes agents vulnerable to environmental changes that are common in the real world. For example, consider a robot (Figure 2) navigating its way to the goal (blue cross) in a simple maze. At training time (Figure 2a), there are two passages that lead to the goal. The agent will likely commit to the solution via the upper passage as it is slightly shorter. However, if we change the environment by blocking the upper passage with a wall (Figure 2b), the solution the agent has found becomes infeasible. Since the agent focused entirely on the upper passage during learning, it has almost no knowledge of the lower passage. Therefore, adapting to the new situation in Figure 2b requires the agent to relearn the entire task from scratch.
Figure 2: A robot navigating a maze.
Since we posted our paper on “Learning to Optimize” last year, the area of optimizer learning has received growing attention. In this article, we provide an introduction to this line of work and share our perspective on the opportunities and challenges in this area.
Machine learning has enjoyed tremendous success and is being applied to a wide variety of areas, both in AI and beyond. This success can be attributed to the data-driven philosophy that underpins machine learning, which favours automatic discovery of patterns from data over manual design of systems using expert knowledge.
Yet, there is a paradox in the current paradigm: the algorithms that power machine learning are still designed manually. This raises a natural question: can we learn these algorithms instead? This could open up exciting possibilities: we could find new algorithms that perform better than manually designed algorithms, which could in turn improve learning capability.
Consider looking at a photograph of a chair.
We humans have the remarkable capacity of inferring properties about the 3D shape of the chair from this single photograph even if we might not have seen such a chair ever before.
A more representative example of our experience though is being in the same physical space as the chair and accumulating information from various viewpoints around it to build up our hypothesis of the chair’s 3D shape.
How do we solve this complex 2D to 3D inference task? What kind of cues do we use?
How do we seamlessly integrate information from just a few views to build up a holistic 3D model of the scene?
A vast body of work in computer vision has been devoted to developing algorithms which leverage various cues from images that enable this task of 3D reconstruction.
They range from monocular cues such as shading, linear perspective, and size constancy to binocular and even multi-view stereopsis.
The dominant paradigm for integrating multiple views has been to leverage stereopsis, i.e. if a point in the 3D world is viewed from multiple viewpoints, its location in 3D can be determined by triangulating its projections in the respective views.
This family of algorithms has led to work on Structure from Motion (SfM) and Multi-view Stereo (MVS) and has been used to produce city-scale 3D models and enable rich visual experiences such as 3D flyover maps.
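To make the triangulation idea concrete, here is a minimal two-view sketch using the midpoint method. It assumes the camera centers are known and that each image projection has already been back-projected into a viewing ray; the function names and the example coordinates are hypothetical. SfM and MVS pipelines solve a much larger version of this problem, with many points, many views, and noisy measurements.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b            # zero only when the rays are parallel
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]

# Two cameras observe the same 3D point from different positions.
point = [1.0, 2.0, 5.0]
cam1, cam2 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
ray1 = [p - c for p, c in zip(point, cam1)]
ray2 = [p - c for p, c in zip(point, cam2)]
estimate = triangulate_midpoint(cam1, ray1, cam2, ray2)
```

With exact rays the two closest points coincide and the estimate equals the original point; with noisy rays the midpoint gives a reasonable compromise between the two views.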
With the advent of deep neural networks and their immense power in modelling visual data, the focus has recently shifted to modelling monocular cues implicitly with a CNN and predicting 3D from a single image as depth/surface orientation maps or 3D voxel grids.
In our recent work, we tried to unify these paradigms of single and multi-view 3D reconstruction.
We proposed a novel system called a Learnt Stereo Machine (LSM) that can leverage monocular/semantic cues for single-view 3D reconstruction while also being able to integrate information from multiple viewpoints using stereopsis - all within a single end-to-end learnt deep neural network.
This post was initially published on Off the Convex Path. It is reposted here with authors’ permission.
A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see Rong Ge’s and Ben Recht’s blog posts), the critical open problem is one of efficiency: is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade that provides the first provable positive answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!
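The effect of perturbations can be seen on a toy problem. The sketch below is a simplified illustration, not the algorithm analyzed in the work: the test function, step size, perturbation radius, gradient threshold, and waiting period are all assumed values. Plain GD started on the line y = 0 slides into the saddle at the origin and stays there, while GD with an occasional small random kick escapes to a minimum.

```python
import math
import random

def grad(x, y):
    """Gradient of f(x, y) = x**2 / 2 + y**4 / 4 - y**2 / 2 (saddle at the origin)."""
    return x, y ** 3 - y

def descend(perturb, steps=500, lr=0.1, radius=1e-3, g_thresh=1e-4, wait=50, seed=0):
    """Gradient descent, optionally kicking the iterate when the gradient is tiny."""
    rng = random.Random(seed)
    x, y = 1.0, 0.0                  # this start leads straight to the saddle
    last_kick = -wait
    for step in range(steps):
        gx, gy = grad(x, y)
        if perturb and math.hypot(gx, gy) < g_thresh and step - last_kick >= wait:
            x += rng.uniform(-radius, radius)
            y += rng.uniform(-radius, radius)
            last_kick = step
            gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

plain_x, plain_y = descend(perturb=False)
kicked_x, kicked_y = descend(perturb=True)
```

The minima of this function lie at (0, 1) and (0, -1). Kicks also fire near a minimum because the gradient is small there too, but they are harmless: descent pulls the iterate back within a few steps.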
Digitally reconstructing 3D geometry from images is a core problem in computer vision. There are various applications, such as movie productions, content generation for video games, virtual and augmented reality, 3D printing and many more. The task discussed in this blog post is reconstructing high quality 3D geometry from a single color image of an object as shown in the figure below.
Humans have the ability to effortlessly reason about the shapes of objects and scenes even if we only see a single image. Note that the binocular arrangement of our eyes allows us to perceive depth, but it is not required to understand 3D geometry. Even if we only see a photograph of an object we have a good understanding of its shape. Moreover, we are also able to reason about the unseen parts of objects such as the back, which is an important ability for grasping objects. The question that immediately arises is: how are humans able to reason about geometry from a single image? And in terms of artificial intelligence: how can we teach machines this ability?
Be careful what you reward
“Be careful what you wish for!” – we’ve all heard it! The story of King Midas
is there to warn us of what might happen when we’re not. Midas, a king who loves
gold, runs into a satyr and wishes that everything he touches would turn to gold.
Initially, this is fun and he walks around turning items to gold. But his
happiness is short lived. Midas realizes the downsides of his wish when he hugs
his daughter and she turns into a golden statue.
We humans have a notoriously difficult time specifying what we actually want,
and the AI systems we build suffer from it. With AI, this warning actually
becomes “Be careful what you reward!”. When we design and deploy an AI agent
for some application, we need to specify what we want it to do, and this
typically takes the form of a reward function: a function that tells the agent
which state and action combinations are good. A car reaching its destination is
good, and a car crashing into another car is not so good.
AI research has made a lot of progress on algorithms for generating AI behavior
that performs well according to the stated reward function, from classifiers
that correctly label images with what’s in them, to cars that are starting to
drive on their own. But, as the example of King Midas teaches us, it’s not the
stated reward function that matters: what we really need are algorithms for
generating AI behavior that performs well according to the designer’s or user’s
intended reward function.
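The gap between a stated and an intended reward function can be shown with a deliberately tiny sketch of the Midas story. The two candidate behaviors, their outcomes, and the numbers are all made up for illustration, and rewards are scored on outcomes rather than on individual state and action pairs to keep the example short.

```python
# Hypothetical outcomes of two behaviors available to Midas.
OUTCOMES = {
    "touch_everything": {"gold_items": 10, "daughter_is_statue": True},
    "touch_only_objects": {"gold_items": 9, "daughter_is_statue": False},
}

def stated_reward(outcome):
    """What Midas asked for: more gold is better."""
    return outcome["gold_items"]

def intended_reward(outcome):
    """What Midas actually wanted: gold, but never at his daughter's expense."""
    penalty = 1000 if outcome["daughter_is_statue"] else 0
    return outcome["gold_items"] - penalty

def best_behavior(reward):
    """Pick the behavior whose outcome scores highest under the given reward."""
    return max(OUTCOMES, key=lambda name: reward(OUTCOMES[name]))

chosen_by_stated = best_behavior(stated_reward)
chosen_by_intended = best_behavior(intended_reward)
```

An optimizer that sees only `stated_reward` prefers `touch_everything`, while the intended objective prefers `touch_only_objects`: the optimizer did its job perfectly and the result is still wrong.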
Our recent work on Cooperative
Inverse Reinforcement Learning formalizes and investigates optimal
solutions to this value alignment problem — the joint problem of eliciting
and optimizing a user’s intended objective.