Figure 1: Our approach (PDDM) can efficiently and effectively learn complex dexterous manipulation skills in both simulation and the real world. Here, the learned model is able to control the 24-DoF Shadow Hand to rotate two free-floating Baoding balls in the palm, using just 4 hours of real-world data with no prior knowledge/assumptions of system or environment dynamics.
Dexterous manipulation with multi-fingered hands is a grand challenge in robotics: the versatility of the human hand is as yet unrivaled by the capabilities of robotic systems, and bridging this gap will enable more general and capable robots. Although some real-world tasks (like picking up a television remote or a screwdriver) can be accomplished with simple parallel jaw grippers, there are countless tasks (like functionally using the remote to change the channel or using the screwdriver to screw in a nail) in which dexterity enabled by redundant degrees of freedom is critical. In fact, dexterous manipulation is defined as being object-centric, with the goal of controlling object movement through precise control of forces and motions — something that is not possible without the ability to simultaneously impact the object from multiple directions. For example, using only two fingers to attempt common tasks such as opening the lid of a jar or hitting a nail with a hammer would quickly encounter the challenges of slippage, complex contact forces, and underactuation. Although dexterous multi-fingered hands can indeed enable flexibility and success of a wide range of manipulation skills, many of these more complex behaviors are also notoriously difficult to control: They require finely balancing contact forces, breaking and reestablishing contacts repeatedly, and maintaining control of unactuated objects. Success in such settings requires a sufficiently dexterous hand, as well as an intelligent policy that can endow such a hand with the appropriate control strategy. We study precisely this in our work on Deep Dynamics Models for Learning Dexterous Manipulation.
Common approaches for control include modeling the system as well as the relevant objects in the environment, planning through this model to produce reference trajectories, and then developing a controller to actually achieve these plans. However, the success and scale of these approaches have been restricted thus far due to their need for accurate modeling of complex details, which is especially difficult for such contact-rich tasks that call for precise fine-motor skills. Learning has thus become a popular approach, offering a promising data-driven method for directly learning from collected data rather than requiring explicit or accurate modeling of the world. Model-free reinforcement learning (RL) methods, in particular, have been shown to learn policies that achieve good performance on complex tasks; however, we will show that these state-of-the-art algorithms struggle when a high degree of flexibility is required, such as moving a pencil to follow arbitrary user-specified strokes, instead of a fixed one. Model-free methods also require large amounts of data, often making them infeasible for real-world applications. Model-based RL methods, on the other hand, can be much more efficient, but have not yet been scaled up to similarly complex tasks. Our work aims to push the boundary on this task complexity, enabling a dexterous manipulator to turn a valve, reorient a cube in-hand, write arbitrary motions with a pencil, and rotate two Baoding balls around the palm. We show that our method of online planning with deep dynamics models (PDDM) addresses both of the aforementioned limitations: Improvements in learned dynamics models, together with improvements in online model-predictive control, can indeed enable efficient and effective learning of flexible contact-rich dexterous manipulation skills — and that too, on a 24-DoF anthropomorphic hand in the real world, using ~4 hours of purely real-world data to coordinate multiple free-floating objects.
Figure 2: Overview of our PDDM algorithm for online planning with deep dynamics models.
Learning complex dexterous manipulation skills on a real-world robotic system requires an algorithm that is (1) data-efficient, (2) flexible, and (3) general-purpose. First, the method must be efficient enough to learn tasks in just a few hours of interaction, in contrast to methods that utilize simulation and require hundreds of hours, days, or even years to learn. Second, the method must be flexible enough to handle a variety of tasks, so that the same model can be used to perform various different tasks. Third, the method must be general and make relatively few assumptions: It should not require a known model of the system, which can be very difficult to obtain for arbitrary objects in the world.
To this end, we adopt a model-based reinforcement learning approach for dexterous manipulation. Model-based RL methods work by learning a predictive model of the world, which predicts the next state given the current state and action. Such algorithms are more efficient than model-free learners because every trial provides rich supervision: even if the robot does not succeed at performing the task, it can use the trial to learn more about the physics of the world. Furthermore, unlike model-free learning, model-based algorithms are “off-policy,” meaning that they can use any (even old) data for learning. Typically, it is believed that this efficiency of model-based RL algorithms comes at a price: since they must go through this intermediate step of learning the model, they might not perform as well at convergence as model-free methods, which more directly optimize the reward. However, our simulated comparative evaluations show that our model-based method actually performs better than model-free alternatives when the desired tasks are very diverse (e.g., writing different characters with a pencil). This separation of modeling from control allows the model to be easily reused for different tasks – something that is not as straightforward with learned policies.
Our complete method (Figure 2), consists of learning a predictive model of the environment (denoted $f_\theta(s,a) = s’$), which can then be used to control the robot by planning a course of action at every time step through a sampling-based planning algorithm. Learning proceeds as follows: data is iteratively collected by attempting the task using the latest model, updating the model using this experience, and repeating. Although the basic design of our model-based RL algorithms has been explored in prior work, the particular design decisions that we made were crucial to its performance. We utilize an ensemble of models, which accurately fits the dynamics of our robotic system, and we also utilize a more powerful sampling-based planner that preferentially samples temporally correlated action sequences as well as performs reward-weighted updates to the sampling distribution. Overall, we see effective learning, a nice separation of modeling and control, and an intuitive mechanism for iteratively learning more about the world while simultaneously reasoning at each time step about what to do.
For a true test of dexterity, we look to the task of Baoding balls. Also referred to as Chinese relaxation balls, these two free-floating spheres must be rotated around each other in the palm. Requiring both dexterity and coordination, this task is commonly used for improving finger coordination, relaxing muscular tensions, and recovering muscle strength and motor skills after surgery. Baoding behaviors evolve in the high dimensional workspace of the hand and exhibit contact-rich (finger-finger, finger-ball, and ball-ball) interactions that are hard to reliably capture, either analytically or even in a physics simulator. Successful baoding behavior on physical hardware requires not only learning about these interactions via real world experiences, but also effective planning to find precise and coordinated maneuvers while avoiding task failure (e.g., dropping the balls).
For our experiments, we use the ShadowHand — a 24-DoF five-fingered anthropomorphic hand. In addition to ShadowHand’s inbuilt proprioceptive sensing at each joint, we use a 280x180 RGB stereo image pair that is fed into a separately pretrained tracker to produce 3D position estimates for the two Baoding balls. To enable continuous experimentation in the real world, we developed an automated reset mechanism (Figure 3) that consists of a ramp and an additional robotic arm: The ramp funnels the dropped Baoding balls to a specific position and then triggers the 7-DoF Franka-Emika arm to use its parallel jaw gripper to pick them up and return them to the ShadowHand’s palm to resume training. We note that the entire training procedure is performed using the hardware setup described above, without the aid of any simulation data.
Figure 3: Automated reset procedure, where the Franka-Emika arm gathers and resets the Baoding Balls, in order for the ShadowHand to continue its training.
During the initial phase of the learning, the hand continues to drop both balls, since that is the very likely outcome before it knows how to solve the task. Later, it learns to keep the balls in the palm to avoid the penalty incurred due to dropping. As learning improves, progress in terms of half-rotations start to emerge around 30 minutes of training. Getting the balls past this 90-degree orientation is a difficult maneuver, and PDDM spends a moderate amount of time here: To get past this point, notice the transition that must happen (in the 3rd video panel of Figure 4), from first controlling the objects with the pinky, and then controlling them indirectly through hand motion, and finally getting to control them with the thumb. By ~2 hours, the hand can reliably make 90-degree turns, frequently make 180-degree turns, and sometimes even make turns with multiple rotations.
Figure 4: Training progress on the ShadowHand hardware. From left to right: 0-0.25 hours, 0.25-0.5 hours, 0.5-1.5 hours, ~2 hours.
Although we presented the PDDM algorithm in light of the Baoding task, it is very generic, and we show it below in Figure 5 working on a suite of simulated dexterous manipulation tasks. These tasks illustrate various challenges presented by contact-rich dexterous manipulation tasks — high dimensionality of the hand, intermittent contact dynamics involving hand and objects, prevalence of constraints that must be respected and utilized to effectively manipulate objects, and catastrophic failures from dropping objects from the hand. These tasks not only require precise understanding of the rich contact interactions but also require carefully coordinated and planned movements.
Figure 5: Result of PDDM solving simulated dexterous manipulation tasks. From left to right: 9 DOF D'Claw turning valve to random (green) targets (~20 min of data), 16 dof D'Hand pulling a weight via the manipulation of a flexible rope (~1 hour of data), 24 DOF ShadowHand performing in-hand reorientation of a free-floating cube to random (shown) targets (~1 hour of data), 24 DOF ShadowHand following desired trajectories with tip of a free-floating pencil (~1-2 hours of data). Note that the amount of data is measured in terms of the real-world equivalent (e.g., 100 data points where each step represents 0.1 seconds would represent 10 seconds worth of data).
Since PDDM learns dynamics models as opposed to task-specific policies or policy-conditioned value functions, a given model can then be reused when planning for different but related tasks. In Figure 6 below, we demonstrate that the model trained for the Baoding task of performing counterclockwise rotations (left) can be repurposed to move a single ball to a goal location in the hand (middle) or to perform clockwise rotations (right) instead of the learned counterclockwise ones.
Figure 6: Model reuse on simulated tasks. Left: train model on CCW Baoding task. Middle: reuse that model for go-to single location task. Right: reuse that same model for CW Baoding task.
We study the flexibility of PDDM by experimenting with handwriting, where the base of the hand is fixed and arbitrary characters need to be written through the coordinated movement of the fingers and wrist. Although even writing a fixed trajectory is challenging, we see that writing arbitrary trajectories requires a degree of flexibility and coordination that is exceptionally challenging for prior methods. PDDM’s separation of modeling and task-specific control allows for generalization across behaviors, as opposed to discovering and memorizing the answer to a specific task/movement. In Figure 7 below, we show PDDM’s handwriting results that were trained on random paths for the green dot but then tested in a zero-shot fashion to write numerical digits.
Figure 7: Flexibility of the learned handwriting model, which was trained to follow random paths of the green dot, but shown here to write some digits.
Our results show that PDDM can be used to learn challenging dexterous manipulation tasks, including controlling free-floating objects, agile finger gaits for repositioning objects in the hand, and precise control of a pencil to write user-specified strokes. In addition to testing PDDM on our simulated suite of tasks to analyze various algorithmic design decisions as well as to perform comparisons to other state-of-the-art model-based and model-free algorithms, we also show PDDM learning the Baoding Balls task on a real-world 24-DoF anthropomorphic hand using just a few hours of entirely real-world interaction. Since model-based techniques do indeed show promise on complex tasks, exciting directions for future work would be to study methods for planning at different levels of abstraction to enable success on sparse-reward or long-horizon tasks, as well as to study the effective integration of additional sensing modalities, such as vision and touch, into these models to better understand the world and expand the boundaries of what our robots can do. Can our robotic hand braid someone’s hair? Crack an egg and carefully handle the shell? Untie a knot? Button up all the buttons of a shirt? Tie shoelaces? With the development of models that can understand the world, along with planners that can effectively use those models, we hope the answer to all of these questions will become ‘yes.’
This work was done at Google Brain, and the authors are Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. The authors would also like to thank Michael Ahn for his frequent software and hardware assistance, and Sherry Moore for her work on setting up the drivers and code for working with our ShadowHand.