Improving Policy Optimization with Generalist-Specialist Learning
Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, Hao Su
UC San Diego
Introduction
Generalization in deep RL to unseen environment variations usually requires policy learning over diverse training variations. We empirically observe that an agent trained on many variations (a generalist) tends to learn faster at the beginning, yet its performance plateaus at a suboptimal level for a long time. In contrast, an agent trained on only a few variations (a specialist) can achieve higher returns. To get the best of both worlds, we propose a novel generalist-specialist learning framework (GSL) that combines joint (generalist) training and distributed (specialist) training in a well-principled manner. We show that GSL pushes the envelope of policy learning on several challenging and popular benchmarks, including Procgen, Meta-World and ManiSkill.
Fork-Maze: An Illustrative Example
To illustrate the limitations of a single agent (a generalist), we present a fork maze in which the agent (blue, starting from the left side) needs to reach one of several goals (red stars on the right side). Upon each environment reset, a single goal is specified via a context scalar c. In essence, the maze defines a family of variations that differ in which goal must be reached, as determined by c, but share the same path at the beginning (a minimal sketch of such an environment follows the list below).
We find that an agent jointly trained on all goals (a generalist) can suffer from catastrophic ignorance & catastrophic forgetting, i.e., the agent
ignores the context c and fails to distinguish between different goals, since c plays little role in the early stages of learning
struggles in later stages to solve all the environment variations at once, since memorizing many distinct behaviors is difficult for a single neural network.
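To make the setup concrete, here is a minimal sketch of such a contextual environment, assuming a gymnasium-style interface; the class name ForkMazeEnv, the grid layout, and the sparse reward are hypothetical and only illustrate how the context scalar c selects the goal while all variations share the same start.

```python
# A minimal sketch of a fork-maze-style contextual environment (assumes the
# gymnasium package). ForkMazeEnv, num_goals, corridor_len, and the reward
# are illustrative placeholders, not the exact environment from the paper.
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ForkMazeEnv(gym.Env):
    """All variations start from the same position; the context c picks the goal."""

    def __init__(self, num_goals: int = 4, corridor_len: int = 5):
        self.num_goals = num_goals
        self.corridor_len = corridor_len
        # Observation: agent (x, y) position plus the context scalar c.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Discrete(4)  # up, down, left, right

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = np.zeros(2, dtype=np.float32)           # start at the left end
        self.c = int(self.np_random.integers(self.num_goals))  # one goal per episode
        return self._obs(), {}

    def step(self, action):
        moves = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}
        self.pos += np.array(moves[action], dtype=np.float32)
        goal = np.array([self.corridor_len + 1, float(self.c)], dtype=np.float32)
        reached = bool(np.allclose(self.pos, goal))
        reward = 1.0 if reached else 0.0                   # sparse reward at the goal
        return self._obs(), reward, reached, False, {}

    def _obs(self):
        return np.concatenate([self.pos, [float(self.c)]]).astype(np.float32)
```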
Key Observations: Generalists vs. Specialists
Alternatively, we can train a set of specialist agents, each handling a subset of the variations. Such an approach mitigates the issues of generalists but comes with its own cost. We have discovered key trade-offs between generalists and specialists, summarized below.
Generalists
learn faster (more sample-efficient) in the early stages
reach worse performance in later stages
Specialists
learn slower (less sample-efficient) in the early stages
reach better performance in later stages
Generalist-Specialist Learning (GSL)
A meta-algorithm to get the best of both worlds (3 steps; a code sketch follows the list)
Step 1: Train a generalist jointly on all training env. variations; stop when its performance plateaus according to some criterion H.
Step 2: Train a set of specialists, each initialized from the generalist's checkpoint and handling a selected subset of the env. variations.
Step 3: Collect demonstrations from the specialists as well as the generalist, and fine-tune the generalist with guidance from these demos.
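Below is a minimal Python sketch of this control flow. The helper callables (train_rl, evaluate, plateau_detected, collect_demos, finetune_with_demos) are hypothetical placeholders for whatever base RL algorithm, plateau criterion H, and demonstration-guided fine-tuning method one plugs in; the sketch only illustrates the three steps, not the exact implementation from the paper.

```python
# Control-flow sketch of Generalist-Specialist Learning (GSL).
# All callables below are user-supplied placeholders (assumptions), so the
# same loop can wrap different base RL algorithms such as PPO, PPG, or SAC.
import copy
from typing import Any, Callable, List, Sequence


def generalist_specialist_learning(
    envs: Sequence[Any],                                   # all training env. variations
    partition: Sequence[Sequence[Any]],                    # one subset of variations per specialist
    init_policy: Any,
    train_rl: Callable[[Any, Sequence[Any], int], Any],    # train_rl(policy, envs, steps) -> policy
    evaluate: Callable[[Any, Sequence[Any]], float],       # average return over the given envs
    plateau_detected: Callable[[List[float]], bool],       # the plateau criterion H
    collect_demos: Callable[[Any, Sequence[Any]], List[Any]],
    finetune_with_demos: Callable[[Any, Sequence[Any], List[Any]], Any],
    steps_per_round: int = 10_000,
) -> Any:
    # Step 1: jointly train the generalist on all variations until H fires.
    generalist, returns = init_policy, []  # type: Any, List[float]
    while not plateau_detected(returns):
        generalist = train_rl(generalist, envs, steps_per_round)
        returns.append(evaluate(generalist, envs))

    # Step 2: specialists start from the generalist checkpoint,
    # each trained only on its own subset of variations.
    specialists = [
        train_rl(copy.deepcopy(generalist), subset, steps_per_round)
        for subset in partition
    ]

    # Step 3: collect demonstrations from the specialists (and the generalist),
    # then fine-tune the generalist with guidance from these demos.
    demos: List[Any] = []
    for specialist, subset in zip(specialists, partition):
        demos.extend(collect_demos(specialist, subset))
    demos.extend(collect_demos(generalist, envs))
    return finetune_with_demos(generalist, envs, demos)
```

The choice of the plateau criterion H and of the demonstration-guided fine-tuning method is orthogonal to this control flow, which is what makes GSL a meta-algorithm that can be layered on top of different baselines.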
Experiment Results
We evaluate GSL on multiple challenging benchmarks with popular RL baselines: PPO and PPG on Procgen (left), PPO on Meta-World (middle), and SAC on ManiSkill (right). We demonstrate that, as a meta-algorithm, GSL consistently improves these baselines. Furthermore, we show that GSL is a step toward generalizable policy learning.
Paper
Improving Policy Optimization with Generalist-Specialist Learning
Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, Hao Su
Citation
If you use our framework or find it inspiring, please consider citing the paper as follows:
@InProceedings{jia2022gsl,
title={Improving Policy Optimization with Generalist-Specialist Learning},
author={Jia, Zhiwei and Li, Xuanlin and Ling, Zhan and Liu, Shuang and Wu, Yiran and Su, Hao},
booktitle={Proceedings of the 39th International Conference on Machine Learning},
pages={10104--10119},
year={2022},
}