Ray: A Distributed System for AI

    

As machine learning algorithms and techniques have advanced, more and more machine learning applications require multiple machines and must exploit parallelism. However, the infrastructure for doing machine learning on clusters remains ad-hoc. While good solutions for specific use cases (e.g., parameter servers or hyperparameter search) and high-quality distributed systems outside of AI do exist (e.g., Hadoop or Spark), practitioners developing algorithms at the frontier often build their own systems infrastructure from scratch. This amounts to a lot of redundant effort.

As an example, take a conceptually simple algorithm like Evolution Strategies for reinforcement learning. The algorithm is about a dozen lines of pseudocode, and its Python implementation doesn’t take much more than that. However, running the algorithm efficiently on a larger machine or cluster requires significantly more software engineering. The authors’ implementation involves thousands of lines of code and must define communication protocols, message serialization and deserialization strategies, and various data handling strategies.

One of Ray’s goals is to enable practitioners to turn a prototype algorithm that runs on a laptop into a high-performance distributed application that runs efficiently on a cluster (or on a single multi-core machine) with relatively few additional lines of code. Such a framework should include the performance benefits of a hand-optimized system without requiring the user to reason about scheduling, data transfers, and machine failures.
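To make this concrete, here is a minimal sketch of the flavor of API Ray exposes, parallelizing many noisy evaluations in the style of Evolution Strategies. It uses Ray's actual remote-function primitives (ray.init, @ray.remote, .remote(), ray.get), but the noisy_rollout worker and its placeholder objective are hypothetical stand-ins, not code from the Ray project:

    import numpy as np
    import ray

    ray.init()  # starts Ray locally; the same script can attach to a multi-node cluster

    @ray.remote
    def noisy_rollout(params, seed):
        # Hypothetical ES-style worker: perturb the parameters with seeded Gaussian
        # noise and return a stand-in score for that perturbation.
        noise = np.random.RandomState(seed).randn(*params.shape)
        return float(-np.sum((params + 0.1 * noise) ** 2))

    params = np.zeros(10)

    # Each .remote() call returns a future immediately, so all rollouts run in parallel.
    futures = [noisy_rollout.remote(params, seed) for seed in range(100)]

    # Block until the rollouts finish and collect their scores.
    scores = ray.get(futures)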

Continue

Physical Adversarial Examples Against Deep Neural Networks

    

This post is based on recent research by Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, Dawn Song, and Florian Tramèr.

Deep neural networks (DNNs) have enabled great progress in a variety of application areas, including image processing, text analysis, and speech recognition. DNNs are also being incorporated as an important component in many cyber-physical systems. For instance, the vision system of a self-driving car can take advantage of DNNs to better recognize pedestrians, vehicles, and road signs. However, recent research has shown that DNNs are vulnerable to adversarial examples: adding carefully crafted perturbations to the inputs can mislead the target DNN into mislabeling them at run time. Such adversarial examples raise security and safety concerns when DNNs are deployed in the real world. For example, adversarially perturbed inputs could mislead the perceptual systems of an autonomous vehicle into misclassifying road signs, with potentially catastrophic consequences.

There have been several techniques proposed to generate adversarial examples and to defend against them. In this blog post we will briefly introduce state-of-the-art algorithms to generate digital adversarial examples, and discuss our algorithm to generate physical adversarial examples on real objects under varying environmental conditions. We will also provide an update on our efforts to generate physical adversarial examples for object detectors.
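For reference, one widely used method for the digital setting is the Fast Gradient Sign Method (FGSM). The sketch below is a generic illustration of FGSM for a differentiable PyTorch classifier (the model, inputs, and labels are assumed to be supplied by the caller); it is not the physical-attack algorithm described in this post:

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, epsilon):
        """Return an adversarially perturbed copy of x (untargeted FGSM).

        model: differentiable classifier, x: input batch in [0, 1],
        y: true labels, epsilon: maximum per-pixel perturbation.
        """
        x_adv = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        # Step in the direction that increases the loss, then clip to the image range.
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        return x_adv.clamp(0.0, 1.0).detach()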

Continue

Reverse Curriculum Generation for Reinforcement Learning Agents

    

Reinforcement Learning (RL) is a powerful technique capable of solving complex tasks such as locomotion, Atari games, racing games, and robotic manipulation, all by training an agent to optimize its behavior with respect to a reward function. For many tasks, however, it is hard to design a reward function that is both easy to train on and that yields the desired behavior once optimized. Suppose we want a robotic arm to learn how to place a ring onto a peg. The most natural reward function gives the agent a reward of 1 at the desired end configuration and 0 everywhere else. However, the required motion for this task, aligning the ring at the top of the peg and then sliding it to the bottom, is impractical to learn under such a binary reward, because the usual random exploration of our initial policy is unlikely to ever reach the goal, as seen in Video 1a. Alternatively, one can try to shape the reward function to alleviate this problem, but finding a good shaping requires considerable expertise and experimentation. For example, directly minimizing the distance between the center of the ring and the bottom of the peg leads to an unsuccessful policy that smashes the ring against the peg, as in Video 1b. We propose a method to learn efficiently without modifying the reward function, by automatically generating a curriculum over start positions.


Video 1a: A randomly initialized policy is unable to reach the goal from most start positions, and therefore cannot learn.

Video 1b: Shaping the reward with a penalty on the distance from the ring center to the peg bottom yields an undesired behavior.
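The proposed idea can be sketched as a simple training loop that grows the set of start states backward from the goal. This is only a conceptual sketch: sample_nearby, train_policy, and success_rate are hypothetical caller-supplied components, and the thresholds below are illustrative, not the exact procedure from the paper:

    def reverse_curriculum(goal_state, sample_nearby, train_policy, success_rate,
                           iterations=10, low=0.1, high=0.9):
        """Grow a curriculum of start states outward from the goal.

        sample_nearby(starts) proposes new starts reachable from the current ones,
        train_policy(policy, starts) trains on episodes initialized at those starts,
        and success_rate(policy, start) estimates how often the goal is reached.
        """
        policy = None
        starts = [goal_state]  # begin by training from starts right at the goal
        for _ in range(iterations):
            # Expand the start distribution outward, e.g. via short random rollouts.
            starts = starts + sample_nearby(starts)
            policy = train_policy(policy, starts)
            # Keep starts of intermediate difficulty: not yet mastered, not hopeless.
            feasible = [s for s in starts if low < success_rate(policy, s) < high]
            starts = feasible or [goal_state]
        return policy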

Continue

Towards Intelligent Industrial Co-robots

    

Democratization of Robots in Factories

In modern factories, human workers and robots are the two major workforces. For safety reasons, the two are normally kept apart, with robots confined in metal cages, which limits both the productivity and the flexibility of production lines. In recent years, attention has turned to removing the cages so that human workers and robots can collaborate in a human-robot co-existing factory. Manufacturers are interested in combining human flexibility with robot productivity on flexible production lines. The potential benefits of industrial co-robots are large and wide-ranging: for example, robot arms and human workers can cooperate in handling workpieces in human-robot teams, and automated guided vehicles (AGVs) can share the floor with human workers to facilitate factory logistics. In the factories of the future, more and more human-robot interaction is anticipated. Unlike traditional robots that work in structured and deterministic environments, co-robots need to operate in highly unstructured and stochastic environments. The fundamental problem is how to ensure that co-robots operate efficiently and safely in these dynamic, uncertain environments. In this post, we introduce the robot safe interaction system developed in the Mechanical System Control (MSC) lab.


Fig. 1. The factory of the future with human-robot collaborations.

Continue

FaSTrack: Ensuring Safe Real-Time Navigation of Dynamic Systems

    

The Problem: Fast and Safe Motion Planning

Real-time autonomous motion planning and navigation is hard, especially when we care about safety. This becomes even more difficult when we have systems with complicated dynamics, external disturbances (like wind), and a priori unknown environments. Our goal in this work is to “robustify” existing real-time motion planners to guarantee safety during navigation of dynamic systems.

Continue

Model-based Reinforcement Learning with Neural Network Dynamics

    

Fig. 1. A learned neural network dynamics model enables a hexapod robot to learn to run and follow desired trajectories, using just 17 minutes of real-world experience.

Enabling robots to act autonomously in the real world is difficult. Really, really difficult. Even with expensive robots and teams of world-class researchers, robots still have difficulty autonomously navigating and interacting in complex, unstructured environments.

Why are autonomous robots not out in the world among us? Engineering systems that can cope with all the complexities of our world is hard. From nonlinear dynamics and partial observability to unpredictable terrain and sensor malfunctions, robots are particularly susceptible to Murphy’s law: everything that can go wrong, will go wrong. Instead of fighting Murphy’s law by coding each possible scenario that our robots may encounter, we could instead embrace this possibility of failure and enable our robots to learn from it. Learning control strategies from experience is advantageous because, unlike hand-engineered controllers, learned controllers can adapt and improve with more data. Therefore, when presented with a scenario in which everything does go wrong, although the robot will still fail, the learned controller will hopefully correct its mistake the next time it encounters a similar scenario. To deal with the complexities of real-world tasks, current learning-based methods often use deep neural networks, which are powerful but not data-efficient: these trial-and-error learners will most often still fail a second time, and a third time, and often thousands to millions of times. The sample inefficiency of modern deep reinforcement learning methods is one of the main bottlenecks to leveraging learning-based methods in the real world.

We have been investigating sample-efficient learning-based approaches that use neural networks for robot control. For complex and contact-rich simulated robots, as well as real-world robots (Fig. 1), our approach is able to learn locomotion skills for trajectory following using only minutes of data collected by the robot acting randomly in the environment. In this blog post, we’ll provide an overview of our approach and results. More details can be found in our research papers listed at the bottom of this post, including this paper with code here.
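At a high level, control with a learned dynamics model can be done by pairing the model with a simple sampling-based planner. The sketch below shows a generic random-shooting model-predictive control (MPC) step in that spirit; dynamics_model and reward_fn are assumed, hypothetical callables rather than the exact implementation from our papers:

    import numpy as np

    def mpc_action(state, dynamics_model, reward_fn, action_dim,
                   horizon=10, num_candidates=1000):
        """One step of random-shooting MPC with a learned dynamics model.

        dynamics_model(states, actions) -> next_states and
        reward_fn(states, actions) -> rewards are batched, hypothetical callables.
        """
        # Sample candidate action sequences uniformly in [-1, 1].
        candidates = np.random.uniform(-1, 1, size=(num_candidates, horizon, action_dim))
        states = np.repeat(state[None, :], num_candidates, axis=0)
        returns = np.zeros(num_candidates)

        # Roll every candidate sequence forward through the learned model.
        for t in range(horizon):
            returns += reward_fn(states, candidates[:, t])
            states = dynamics_model(states, candidates[:, t])

        # Execute only the first action of the best sequence, then replan next step.
        return candidates[np.argmax(returns), 0]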

Continue

The Emergence of a Fovea while Learning to Attend

    

Why we need Attention

What we see through our eyes is only a very small part of the world around us. At any given time our eyes are sampling only a fraction of the surrounding light field. Even within this fraction, most of the resolution is dedicated to the center of gaze, which has the highest concentration of ganglion cells. These cells are responsible for conveying a retinal image from our eyes to our brain. Unlike the uniform pixel grid of a camera, the spatial distribution of ganglion cells is highly non-uniform. As a result, our brain receives a foveated image:

A foveated image with a center of gaze on the bee (left) and butterfly (right) (source).

Despite the fact that these cells cover only a fraction of our visual field, roughly 30% of our cortex is still dedicated to processing the signal that they provide. You can imagine our brain would have to be impractically large to handle the full visual field at high resolution. Suffice it to say, the amount of neural processing dedicated to vision is rather large and it would be beneficial to survival if it were used efficiently.

Attention is a fundamental property of many intelligent systems. Since the resources of any physical system are limited, it is important to allocate them in an effective manner. Attention involves the dynamic allocation of information processing resources to best accomplish a specific task. In nature, we find this very apparent in the design of animal visual systems. By moving gaze rapidly within the scene, limited neural resources are effectively spread over the entire visual scene.
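To make the notion of a foveated input concrete, here is a small toy sketch (an illustration only, not the model described in the post) that extracts multi-resolution crops around a gaze location: full detail over a small window at the center, progressively coarser detail over larger windows:

    import numpy as np

    def glimpse(image, cy, cx, patch=16, num_scales=3):
        """Multi-resolution, retina-like crops centered at pixel (cy, cx).

        Scale 0 keeps full resolution over a small window; each further scale
        covers a larger window but is subsampled, mimicking a foveated image.
        """
        crops = []
        for s in range(num_scales):
            half = (patch * 2 ** s) // 2
            y0, y1 = max(cy - half, 0), min(cy + half, image.shape[0])
            x0, x1 = max(cx - half, 0), min(cx + half, image.shape[1])
            window = image[y0:y1, x0:x1]
            crops.append(window[::2 ** s, ::2 ** s])  # coarser sampling farther out
        return crops

    # Example: three glimpse scales centered on a random grayscale image.
    patches = glimpse(np.random.rand(128, 128), cy=64, cx=64)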

Continue

DART: Noise Injection for Robust Imitation Learning

    

Toyota HSR Trained with DART to Make a Bed.

In Imitation Learning (IL), also known as Learning from Demonstration (LfD), a robot learns a control policy from analyzing demonstrations of the policy performed by an algorithmic or human supervisor. For example, to teach a robot to make a bed, a human supervisor tele-operates the robot through the task to provide examples. The robot then learns a control policy, mapping from images/states to actions, which we hope will generalize to states that were not encountered during training.

There are two variants of IL. In Off-Policy IL, or Behavior Cloning, the demonstrations are collected independently of the robot’s policy; however, when the robot encounters novel risky states, it may not have learned corrective actions. This failure arises from “covariate shift,” a known challenge in which the states encountered during training differ from the states encountered during testing, reducing robustness. A common approach to reducing covariate shift is On-Policy learning, as in DAgger, where the evolving robot policy is executed and the supervisor provides corrective feedback. However, On-Policy methods can be difficult for human supervisors, potentially dangerous, and computationally expensive.

This post presents a robust Off-Policy algorithm called DART and summarizes how injecting noise into the supervisor’s actions can improve robustness. The injected noise allows the supervisor to provide corrective examples for the type of errors the trained robot is likely to make. However, because the optimized noise is small, it alleviates the difficulties of On-Policy methods. Details on DART are in a paper that will be presented at the 1st Conference on Robot Learning in November.
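The data-collection recipe can be sketched roughly as follows. This is a simplified illustration: in DART the noise level is optimized to match the learner's error, whereas here noise_std is a fixed assumption; env, supervisor_action, and the downstream regression step are hypothetical placeholders, and whether to regress on the intended or the executed action is a detail specified in the paper (the sketch makes the simplest choice):

    import numpy as np

    def collect_noisy_demos(env, supervisor_action, noise_std, num_episodes, horizon):
        """Collect demonstrations while injecting Gaussian noise into the
        supervisor's actions, so the data includes recovery behavior from
        slightly off-distribution states."""
        states, actions = [], []
        for _ in range(num_episodes):
            s = env.reset()
            for _ in range(horizon):
                a = supervisor_action(s)                  # supervisor's intended action
                a_noisy = a + noise_std * np.random.randn(*np.shape(a))
                states.append(s)
                actions.append(a)                         # regress on the intended action
                s, _, done, _ = env.step(a_noisy)         # but execute the noisy one
                if done:
                    break
        return np.array(states), np.array(actions)

    # Off-Policy learning then reduces to supervised regression on (states, actions).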

We evaluate DART in simulation with an algorithmic supervisor on MuJoCo tasks (Walker, Humanoid, Hopper, Half-Cheetah) and physical experiments with human supervisors training a Toyota HSR robot to perform grasping in clutter, where a robot must search through clutter for a goal object. Finally, we show how DART can be applied in a complex system that leverages both classical robotics and learning techniques to teach the first robot to make a bed. For researchers who want to study and use robust Off-Policy approaches, we additionally announce the release of our codebase on GitHub.

Continue

Learning Long Duration Sequential Task Structure From Demonstrations with Application in Surgical Robotics

    


Deep imitation learning and deep reinforcement learning have the potential to learn robot control policies that map high-dimensional sensor inputs to controls. While these approaches have been very successful at learning short-duration tasks, such as grasping (Pinto and Gupta 2016, Levine et al. 2016) and peg insertion (Levine et al. 2016), scaling learning to longer time horizons can require a prohibitive amount of demonstration data, whether acquired from experts or through self-supervision. Long-duration sequential tasks suffer from the classic problem of “temporal credit assignment”, namely, the difficulty in assigning credit (or blame) to actions under uncertainty about when their consequences are observed (Sutton 1984). However, long-term behaviors are often composed of short-term skills that solve decoupled subtasks. Consider designing a controller for parallel parking, where the overall task can be decomposed into three phases: pulling up, reversing, and adjusting. Similarly, assembly tasks can often be decomposed into individual steps based on which parts need to be manipulated. These short-term skills can be parametrized more concisely (as an analogy, consider locally linear approximations to an overall nonlinear function), and this reduced parametrization can be substantially easier to learn.

This post summarizes results from three recent papers that propose algorithms that learn to decompose a longer task into shorter subtasks. We report experiments in the context of autonomous surgical subtasks and we believe the results apply to a variety of applications from manufacturing to home robotics. We present three algorithms: Transition State Clustering (TSC), Sequential Windowed Inverse Reinforcement Learning (SWIRL), and Deep Discovery of Continuous Options (DDCO). TSC considers robustly learning important switching events (significant changes in motion) that occur across all demonstrations. SWIRL proposes an algorithm that approximates a value function by a sequence of shorter term quadratic rewards. DDCO is a general framework for imitation learning with a hierarchical representation of the action space. In retrospect, all three algorithms are special cases of the same general framework, where the demonstrator’s behavior is generatively modeled as a sequential composition of unknown closed-loop policies that switch when reaching parameterized “transition states”.
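To give a flavor of the transition-state idea (a rough illustration only; the actual TSC algorithm involves pruning, hierarchical clustering over states and time, and other details from the paper), one can flag time steps where the motion changes sharply in each demonstration and then cluster those points across demonstrations:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def candidate_transition_states(trajectory, k=2.0):
        """Flag time steps in one (T, d) demonstration where velocity changes sharply."""
        velocity = np.diff(trajectory, axis=0)
        accel = np.linalg.norm(np.diff(velocity, axis=0), axis=1)
        cutoff = accel.mean() + k * accel.std()
        return np.where(accel > cutoff)[0] + 1

    def cluster_transitions(trajectories, n_clusters=4):
        """Pool candidate transition states across demonstrations and cluster them,
        so that switching events shared by the demonstrations emerge as clusters."""
        points = np.vstack([traj[candidate_transition_states(traj)]
                            for traj in trajectories])
        return GaussianMixture(n_components=n_clusters).fit(points)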

Continue

Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning

    

Deep reinforcement learning (deep RL) has achieved success in many tasks, such as playing video games from raw pixels (Mnih et al., 2015), playing the game of Go (Silver et al., 2016), and simulated robotic locomotion (e.g. Schulman et al., 2015). Standard deep RL algorithms aim to master a single way to solve a given task, typically the first way that seems to work well. As a result, the particular solution found is sensitive to randomness in the environment, the initialization of the policy, and the implementation of the algorithm. This phenomenon is illustrated in Figure 1, which shows two policies trained to optimize a reward function that encourages forward motion: while both policies have converged to a high-performing gait, the two gaits are substantially different from each other.

Figure 1: Trained simulated walking robots.
[credit: John Schulman and Patrick Coady (OpenAI Gym)]
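For contrast with the standard objective, the maximum entropy RL formulation referenced in the title augments the expected reward at each visited state with the entropy of the policy, so that the agent is rewarded for keeping multiple good options open (this is the standard objective from the maximum entropy RL literature, with temperature α trading off reward against entropy):

    J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

The conventional RL objective is recovered by setting α = 0, in which case there is no incentive to preserve more than one way of solving the task.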

Why might finding only a single solution be undesirable? Knowing only one way to act makes agents vulnerable to environmental changes that are common in the real world. For example, consider a robot (Figure 2) navigating its way to the goal (blue cross) in a simple maze. At training time (Figure 2a), there are two passages that lead to the goal. The agent will likely commit to the solution via the upper passage, as it is slightly shorter. However, if we change the environment by blocking the upper passage with a wall (Figure 2b), the solution the agent has found becomes infeasible. Since the agent focused entirely on the upper passage during learning, it has almost no knowledge of the lower passage. Therefore, adapting to the new situation in Figure 2b requires the agent to relearn the entire task from scratch.

Figure 2: A robot navigating a maze. Figure 2a shows the training environment, with two passages leading to the goal; Figure 2b shows the modified environment, with the upper passage blocked.

Continue