Model-based reinforcement learning (MBRL) is a variant of the iterative learning framework, reinforcement learning, that includes a structured component of the system that is solely optimized to model the environment dynamics. Learning a model is broadly motivated from biology, optimal control, and more – it is grounded in natural human intuition of planning before acting. This intuitive grounding, however, results in a more complicated learning process. In this post, we discuss how model-based reinforcement learning is more susceptible to parameter tuning and how AutoML can help in finding very well performing parameter settings and schedules. Below, left is the expected behavior of an agent maximizing velocity on a “Half Cheetah” robotic task, and to the right is what our paper with hyperparameter tuning finds.

## MBRL

Model-based reinforcement learning (MBRL) is an iterative framework for solving tasks in a partially understood environment. There is an agent that repeatedly tries to solve a problem, accumulating state and action data. With that data, the agent creates a structured learning tool – a dynamics model – to reason about the world. With the dynamics model, the agent decides how to act by predicting into the future. With those actions, the agent collects more data, improves said model, and hopefully improves future actions.

## AutoML

Humans are pretty poor at internalizing higher-dimensional relationships. Unfortunately, all ML systems come with hyperparameters that have complex higher-dimensional relationships. Manually searching for configurations or schedules that work well is a tedious and unrewarding task, so let’s let a computer do it for us. Automated Machine Learning (AutoML) is a field dedicated to the study of using machine learning algorithms to tune our machine learning tools. However, there have not been many attempts in using AutoML methods for RL so far, (for more on AutoRL see this blog post) even though, given the success of AutoML in Supervised Learning, one could expect a bigger impact in higher-dimensional RL. This is partly due to the harder problem of dynamic hyperparameter tuning (where hyperparameters can change within a run), but more on that later.

# Why MBRL is challenging to tune

Why may we see even more impact of AutoML on MBRL when compared to model-freel RL? First off, more machine learning parts equals a harder problem, so there is a greater potential to make hyperparameter tuning way more impactful to MBRL. Normally, a graduate student fine tunes one problem at a time with a handful of variables, but in MBRL there are two weirdly intertwined systems: the model and the planner (the objectives are mismatched). So, no human is likely to be able to find the perfect parameters. Much research progress, still, is yielding to luck when it comes to finding the right hyperparameters. Even though it is an open problem to see if computers can find the true optimum, as mentioned before, computers are much better at optimizing in high-dimensional spaces such as RL.

Secondly, the benchmark where most RL algorithms are tested in recent work – a simulator called Mujoco – has been used for years and performance of the best algorithms is close to maxing out the realistic behaviors in the simulator. Such a realistic solution for Half Cheetah can be seen in the video on the left at the beginning of the post.

Mujoco showed up as a favorite of Deep RL because it was available when a massive growth phase was coming through. It is a decent simulator, relatively lightweight, and easy-enough to use (though, Mujoco is expensive, has fairly strict licensing, and not a completely accurate portrayal of the real world). Mujoco has worked for individual researchers and teams growing into a new area, but not necessarily great for the long-term health of the field. Researchers have tracked the so-called state-of-the-art (SOTA) performance of algorithms across the group of tasks available on the benchmark. This has led to relying on this benchmark as a surrogate objective for what humans truly want to optimize – performance in the real world – and left progress in the research field vulnerable to being gamed. And in such problems, where optimal solutions may be unintuitive, computers are even more likely to substantially outperform humans.

# Static vs. Dynamic Parameter Tuning

Reinforcement learning, or any iterative framework for that matter, poses an interesting challenge for AutoML and parameter tuning research: the best parameters might change over time. The ideal parameters shift because the data used to train any model changes over time. This is different from a more classical approach to AutoML – on a Supervised Learning task, static parameter configurations are much more likely to perform well (such as the weights and biases of a deployed vision model).

When the distribution of the data we are using changes over time, we look to dynamic hyperparameter tuning where the hyperparameters of the model or optimizer are adapted over time. In the case of MBRL, this can have an elegant interpretation: as the agent gets more data, it can train a more accurate model, and then it may want to use that model to plan further into the future. This would translate to a dynamic tuning of the model predictive horizon hyperparameter variable used in the action optimization. Static hyperparameters – which most RL algorithms report in tables in the appendix of their papers – are likely not going to be able to deal with shifting distributions. The algorithm used to demonstrate this is Hyperband (more later in the post). It is a static tuning method and the chosen parameters show a low correlation with reward across a longer run or to another task. For more details on the specifics of static tuning, see the paper, but the crucial question is: how much can an RL agent gain with dynamics tuning. Normally, this gain is measured on the performance of an algorithm on one task.

This finding, however, does not mean that static tuning does not have its uses. In this paper, we further studied what aspects of static and dynamic are most important for a practitioner depending on what they want to achieve – transferability (across runs on the same task or across tasks) or final performance. We showed that static parameter configurations learn hyperparameter settings that are more robust to transferring. By design, dynamic configurations make many more choices about the parameter settings than static configurations, thus making it very challenging to tune dynamic configurations by hand and without automatic HPO methods. With an increasing number of decision points, it becomes more and more likely that each choice we make is specifically tailored to the environment and even the current run at hand.

Weak or even negative correlation across different budgets are found in most of the environments with static and transferred tunings, which shows different configurations perform best for different training durations and that dynamic tuning may be needed. Above shows the Spearman rank correlation of hyperparameters sampled by Hyperband across trials when training. $cor$ is the correlation coefficient, $p$ is the p-value for the hypothesis test and $n$ gives the number of configurations trained in both fidelities.

# Breaking the simulator

Combining AutoML with MBRL dramatically beats the SOTA results on a couple of Mujoco tasks by using existing MBRL algorithms with both dynamic and static parameter tuning. With sufficient hyperparameter tuning, the MBRL algorithm (PETS) literally breaks Mujoco. The Mujoco log reads something akin to:

WARNING: Nan, Inf or huge value in QACC at DOF 0. The simulation is unstable. Time = 40.4200.

This correlates to the task being solved in a non-intuitive, exploitative manner: the Half Cheetah cartwheels.

Normally, the Half Cheetah is supposed to run. However, through hyperparameter tuning, the agent can experience that stumbling and flipping over can be converted into a successful cartwheel. Chaining such cartwheels back to back allows it to build up much more speed than otherwise possible. So much, in fact, that the simulator can’t keep up and breaks. This shows that we have not yet explored the full potential of existing agent implementations and that AutoML can be a key component in doing so. The paper has a much, much wider range of results on multiple environments, but I leave that to the reader. The paper has interesting tradeoffs between optimizing the model (learning the dynamics) and the controller (solving the reward-maximization problem). Additionally, it shows how dynamically changing the parameters throughout a trial can be useful, such as increasing your model horizon as the algorithm collects data and the model becomes more accurate. We analyse in-depth the impact of design decisions on the following tuning methods:

• Population Based Training (PBT): An evolutionary approach to hyperparameter tuning, where the best performing members are modified and replace worse performing configurations (link).
• Population Based Training with Backtracking (PBT-BT): Population-based training where the agents can return to past configurations during the learning process.
• Hyperband: A Bandit-Based Approach to Hyperparameter Optimization (link).
• Random Search: A method where configurations are generated within the hyperparameter search space (example).

# Conclusion

Ultimately, this is a paper that can really move the field forward by showing the ceiling is much higher than expected for current deep RL algorithms and that new benchmarking tasks are needed to facilitate continued development of the research area. Knowing that the last generation of MBRL algorithms still has substantial performance to be cultivated motivates more interesting numerical research revisiting past methods as the future is designed.

As the Mujoco simulator seems to be reaching its maximum in terms of logged performance, it is time for a new generation of simulators and tasks to benchmark developments in deep reinforcement learning. In creating new tasks, there is an exciting opportunity to move our research methods closer to real world applications — providing a harder challenge with a growing potential payoff.

This post is based on the following paper: