(Joint work with Ronghang Hu, Marcus Rohrbach, Trevor Darrell, Dan Klein and Kate Saenko.)
Suppose we’re building a household robot, and want it to be able to answer questions about its surroundings. We might ask questions like these:
How can we ensure that the robot can answer these questions correctly? The standard approach in deep learning is to collect a large dataset of questions, images, and answers, and train a single neural network to map directly from questions and images to answers. If most questions look like the one on the left, we have a familiar image recognition problem, and these kinds of monolithic approaches are quite effective:
But things don’t work quite so well for questions like the one on the right:
Here the network we trained has given up and guessed the most common color in the image. What makes this question so much harder? Even though the image is cleaner, the question requires many steps of reasoning: rather than simply recognizing the main object in the image, the model must first find the blue cylinder, locate the other object with the same size, and then determine its color. This is a complicated computation, and it’s a computation specific to the question that was asked. Different questions require different sequences of steps to solve.
The dominant paradigm in deep learning is a "one size fits all" approach: for whatever problem we’re trying to solve, we write down a fixed model architecture that we hope can capture everything about the relationship between the input and output, and learn parameters for that fixed model from labeled training data.
But real-world reasoning doesn’t work this way: it involves a variety of different capabilities, combined and synthesized in new ways for every new challenge we encounter in the wild. What we need is a model that can dynamically determine how to reason about the problem in front of it—a network that can choose its own structure on the fly. In this post, we’ll talk about a new class of models we call neural module networks (NMNs), which incorporate this more flexible approach to problem-solving while preserving the expressive power that makes deep learning so effective.Continue