Delayed Impact of Fair Machine Learning


Machine learning systems trained to minimize prediction error may often exhibit discriminatory behavior based on sensitive characteristics such as race and gender. One reason could be due to historical bias in the data. In various application domains including lending, hiring, criminal justice, and advertising, machine learning has been criticized for its potential to harm historically underrepresented or disadvantaged groups.

In this post, we talk about our recent work on aligning decisions made by machine learning with long term social welfare goals. Commonly, machine learning models produce a score that summarizes information about an individual in order to make decisions about them. For example, a credit score summarizes an individual’s credit history and financial activities in a way that informs the bank about their creditworthiness. Let us continue to use the lending setting as a running example.


TDM: From Model-Free to Model-Based Deep Reinforcement Learning


You’ve decided that you want to bike from your house by UC Berkeley to the Golden Gate Bridge. It’s a nice 20 mile ride, but there’s a problem: you’ve never ridden a bike before! To make matters worse, you are new to the Bay Area, and all you have is a good ol’ fashion map to guide you. How do you get started?

Let’s first figure out how to ride a bike. One strategy would be to do a lot of studying and planning. Read books on how to ride bicycles. Study physics and anatomy. Plan out all the different muscle movements that you’ll make in response to each perturbation. This approach is noble, but for anyone who’s ever learned to ride a bike, they know that this strategy is doomed to fail. There’s only one way to learn how to ride a bike: trial and error. Some tasks like riding a bike are just too complicated to plan out in your head.

Once you’ve learned how to ride your bike, how would you get to the Golden Gate Bridge? You could reuse your trial-and-error strategy. Take a few random turns and see if you end up at the Golden Gate Bridge. Unfortunately, this strategy would take a very, very long time. For this sort of problem, planning is a much faster strategy, and requires considerably less real-world experience and trial-and-error. In reinforcement learning terms, it is more sample-efficient.

Left: some skills you learn by trial and error. Right: other times, planning ahead is better.

While simple, this thought experiment highlights some important aspects of human intelligence. For some tasks, we use a trial-and-error approach, and for others we use a planning approach. A similar phenomena seems to have emerged in reinforcement learning (RL). In the parlance of RL, empirical results show that some tasks are better suited for model-free (trial-and-error) approaches, and others are better suited for model-based (planning) approaches.

However, the biking analogy also highlights that the two systems are not completely independent. In particularly, to say that learning to ride a bike is just trial-and-error is an oversimplification. In fact, when learning to bike by trial-and-error, you’ll employ a bit of planning. Perhaps your plan will initially be, “Don’t fall over.” As you improve, you’ll make more ambitious plans, such as, “Bike forwards for two meters without falling over.” Eventually, your bike-riding skills will be so proficient that you can start to plan in very abstract terms (“Bike to the end of the road.”) to the point that all there is left to do is planning and you no longer need to worry about the nitty-gritty details of riding a bike. We see that there is a gradual transition from the model-free (trial-and-error) strategy to a model-based (planning) strategy. If we could develop artificial intelligence algorithms--and specifically RL algorithms--that mimic this behavior, it could result in an algorithm that both performs well (by using trial-and-error methods early on) and is sample efficient (by later switching to a planning approach to achieve more abstract goals).

This post covers temporal difference model (TDM), which is a RL algorithm that captures this smooth transition between model-free and model-based RL. Before describing TDMs, we start by first describing how a typical model-based RL algorithm works.


Shared Autonomy via Deep Reinforcement Learning


A blind, autonomous pilot (left), suboptimal human pilot (center), and combined human-machine team (right) play the Lunar Lander game.

Imagine a drone pilot remotely flying a quadrotor, using an onboard camera to navigate and land. Unfamiliar flight dynamics, terrain, and network latency can make this system challenging for a human to control. One approach to this problem is to train an autonomous agent to perform tasks like patrolling and mapping without human intervention. This strategy works well when the task is clearly specified and the agent can observe all the information it needs to succeed. Unfortunately, many real-world applications that involve human users do not satisfy these conditions: the user's intent is often private information that the agent cannot directly access, and the task may be too complicated for the user to precisely define. For example, the pilot may want to track a set of moving objects (e.g., a herd of animals) and change object priorities on the fly (e.g., focus on individuals who unexpectedly appear injured). Shared autonomy addresses this problem by combining user input with automated assistance; in other words, augmenting human control instead of replacing it.


Towards a Virtual Stuntman


Simulated humanoid performing a variety of highly dynamic and acrobatic skills.

Motion control problems have become standard benchmarks for reinforcement learning, and deep RL methods have been shown to be effective for a diverse suite of tasks ranging from manipulation to locomotion. However, characters trained with deep RL often exhibit unnatural behaviours, bearing artifacts such as jittering, asymmetric gaits, and excessive movement of limbs. Can we train our characters to produce more natural behaviours?


Transfer Your Font Style with GANs


Left: Given movie poster, Right: New movie title generated by MC-GAN.

Text is a prominent visual element of 2D design. Artists invest significant time into designing glyphs that are visually compatible with other elements in their shape and texture. This process is labor intensive and artists often design only the subset of glyphs that are necessary for a title or an annotation, which makes it difficult to alter the text after the design is created, or to transfer an observed instance of a font to your own project.

Early research on glyph synthesis focused on geometric modeling of outlines, which is limited to particular glyph topology (e.g., cannot be applied to decorative or hand-written glyphs) and cannot be used with image input. With the rise of deep neural networks, researchers have looked at modeling glyphs from images. On the other hand, synthesizing data consistent with partial observations is an interesting problem in computer vision and graphics such as multi-view image generation, completing missing regions in images, and generating 3D shapes. Font data is an example that provides a clean factorization of style and content.

Recent advances in conditional generative adversarial networks (cGANS) [1] have been successful in many generative applications. However, they do best only with fairly specialized domains and not with general or multi-domain style transfer. Similarly, when directly used to generate fonts, cGAN models produce significant artifacts. For instance, given the following five letters,

a conditional GAN model is not successful in generating all 26 letters with the same style:


Learning Robot Objectives from Physical Human Interaction


Humans physically interact with each other every day – from grabbing someone’s hand when they are about to spill their drink, to giving your friend a nudge to steer them in the right direction, physical interaction is an intuitive way to convey information about personal preferences and how to perform a task correctly.

So why aren’t we physically interacting with current robots the way we do with each other? Seamless physical interaction between a human and a robot requires a lot: lightweight robot designs, reliable torque or force sensors, safe and reactive control schemes, the ability to predict the intentions of human collaborators, and more! Luckily, robotics has made many advances in the design of personal robots specifically developed with humans in mind.

However, consider the example from the beginning where you grab your friend’s hand as they are about to spill their drink. Instead of your friend who is spilling, imagine it was a robot. Because state-of-the-art robot planning and control algorithms typically assume human physical interventions are disturbances, once you let go of the robot, it will resume its erroneous trajectory and continue spilling the drink. The key to this gap comes from how robots reason about physical interaction: instead of thinking about why the human physically intervened and replanning in accordance with what the human wants, most robots simply resume their original behavior after the interaction ends.

We argue that robots should treat physical human interaction as useful information about how they should be doing the task. We formalize reacting to physical interaction as an objective (or reward) learning problem and propose a solution that enables robots to change their behaviors while they are performing a task according to the information gained during these interactions.


Kernel Feature Selection via Conditional Covariance Minimization


Feature selection is a common method for dimensionality reduction that encourages model interpretability. With large data sets becoming ever more prevalent, feature selection has seen widespread usage across a variety of real-world tasks in recent years, including text classification, gene selection from microarray data, and face recognition. We study the problem of supervised feature selection, which entails finding a subset of the input features that explains the output well. This practice can reduce the computational expense of downstream learning by removing features that are redundant or noisy, while simultaneously providing insight into the data through the features that remain.

Feature selection algorithms can generally be divided into three main categories: filter methods, wrapper methods, and embedded methods. Filter methods select features based on intrinsic properties of the data, independent of the learning algorithm to be used. For example, we may compute the correlation between each feature and the response variable, and select the variables with the highest correlation. Wrapper methods are more specialized in contrast, aiming to find features that optimize the performance of a specific predictor. For example, we may train multiple SVMs, each with a different subset of features, and choose the subset of features with the lowest loss on the training data. Because there are exponentially many subsets of features, wrapper methods often employ greedy algorithms. Finally, embedded methods are multipurpose techniques that incorporate feature selection and prediction into a single problem, often by optimizing an objective combining a goodness-of-fit term with a penalty on the number of parameters. One example is the LASSO method for constructing a linear model, which penalizes the coefficients with an $\ell_1$ penalty.

In this post, we propose conditional covariance minimization (CCM), a feature selection method that aims to unify the first two perspectives. We first describe our approach in the sections that follow. We then demonstrate through several synthetic experiments that our method is capable of capturing joint nonlinear relationships between collections of features. Finally, we show that our algorithm has performance comparable to or better than several other popular feature selection algorithms on a variety of real-world tasks.


Ray: A Distributed System for AI


As machine learning algorithms and techniques have advanced, more and more machine learning applications require multiple machines and must exploit parallelism. However, the infrastructure for doing machine learning on clusters remains ad-hoc. While good solutions for specific use cases (e.g., parameter servers or hyperparameter search) and high-quality distributed systems outside of AI do exist (e.g., Hadoop or Spark), practitioners developing algorithms at the frontier often build their own systems infrastructure from scratch. This amounts to a lot of redundant effort.

As an example, take a conceptually simple algorithm like Evolution Strategies for reinforcement learning. The algorithm is about a dozen lines of pseudocode, and its Python implementation doesn’t take much more than that. However, running the algorithm efficiently on a larger machine or cluster requires significantly more software engineering. The authors’ implementation involves thousands of lines of code and must define communication protocols, message serialization and deserialization strategies, and various data handling strategies.

One of Ray’s goals is to enable practitioners to turn a prototype algorithm that runs on a laptop into a high-performance distributed application that runs efficiently on a cluster (or on a single multi-core machine) with relatively few additional lines of code. Such a framework should include the performance benefits of a hand-optimized system without requiring the user to reason about scheduling, data transfers, and machine failures.


Physical Adversarial Examples Against Deep Neural Networks


This post is based on recent research by Ivan Evtimov, Kevin Eykholt, Earlence Fernandes, Tadayoshi Kohno, Bo Li, Atul Prakash, Amir Rahmati, Dawn Song, and Florian Tramèr.

Deep neural networks (DNNs) have enabled great progress in a variety of application areas, including image processing, text analysis, and speech recognition. DNNs are also being incorporated as an important component in many cyber-physical systems. For instance, the vision system of a self-driving car can take advantage of DNNs to better recognize pedestrians, vehicles, and road signs. However, recent research has shown that DNNs are vulnerable to adversarial examples: Adding carefully crafted adversarial perturbations to the inputs can mislead the target DNN into mislabeling them during run time. Such adversarial examples raise security and safety concerns when applying DNNs in the real world. For example, adversarially perturbed inputs could mislead the perceptual systems of an autonomous vehicle into misclassifying road signs, with potentially catastrophic consequences.

There have been several techniques proposed to generate adversarial examples and to defend against them. In this blog post we will briefly introduce state-of-the-art algorithms to generate digital adversarial examples, and discuss our algorithm to generate physical adversarial examples on real objects under varying environmental conditions. We will also provide an update on our efforts to generate physical adversarial examples for object detectors.


Reverse Curriculum Generation for Reinforcement Learning Agents


Reinforcement Learning (RL) is a powerful technique capable of solving complex tasks such as locomotion, Atari games, racing games, and robotic manipulation tasks, all through training an agent to optimize behaviors over a reward function. There are many tasks, however, for which it is hard to design a reward function that is both easy to train and that yields the desired behavior once optimized. Suppose we want a robotic arm to learn how to place a ring onto a peg. The most natural reward function would be for an agent to receive a reward of 1 at the desired end configuration and 0 everywhere else. However, the required motion for this task–to align the ring at the top of the peg and then slide it to the bottom–is impractical to learn under such a binary reward, because the usual random exploration of our initial policy is unlikely to ever reach the goal, as seen in Video 1a. Alternatively, one can try to shape the reward function to potentially alleviate this problem, but finding a good shaping requires considerable expertise and experimentation. For example, directly minimizing the distance between the center of the ring and the bottom of the peg leads to an unsuccessful policy that smashes the ring against the peg, as in Video 1b. We propose a method to learn efficiently without modifying the reward function, by automatically generating a curriculum over start positions.

ring_fail_cross ring_shapping_cross

Video 1a: A randomly initialized policy is unable to reach the goal from most start positions, hence being unable to learn.

Video 1b: Shaping the reward with a penalty on the distance from the ring center to the peg bottom yields an undesired behavior.