Improving Policy Optimization with Generalist-Specialist Learning

Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, Hao Su

UC San Diego

Introduction

Generalization in deep RL over unseen environment variations usually requires policy learning over diverse training variations. We empirically observe that an agent trained on many variations (a generalist) tends to learn faster at the beginning, yet its performance plateaus at a suboptimal level for a long time. In contrast, an agent trained only on a few variations (a specialist) can achieve higher returns. To get the best of both worlds, we propose a novel generalist-specialist learning framework (GSL) that combines joint (generalist) training and distributed (specialist) training in a well-principled manner. We show that GSL pushes the envelope of policy learning on several challenging and popular benchmarks including Procgen, Meta-World and ManiSkill.

Fork-Maze: An Illustrative Example

To show the limitation of a single agent (a generalist), we present a fork maze in which the agent (blue, starting from the left side) needs to reach goals (red stars on the right side). Upon each environment reset, only one goal is specified (via a context scalar c). Essentially, the maze consists of variations that differ in which goal to reach, determined by the context c, while sharing the same initial path across all variations.
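As a rough illustration, the sketch below sets up such a contextual fork maze with a gymnasium-style interface. The class name ForkMazeEnv, the observation layout, and the sparse reward are assumptions made for illustration, not the exact environment used in the paper.

# Minimal sketch of a fork-maze-style contextual environment (illustrative only;
# the layout, observation format, and reward are assumptions, not the paper's code).
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ForkMazeEnv(gym.Env):
    """The agent starts on the left; a context scalar c selects which goal on the
    right is active for this episode. All variations share the same initial
    corridor, so c only matters once the agent reaches the fork."""

    def __init__(self, num_goals=4, corridor_len=10):
        self.num_goals = num_goals
        self.corridor_len = corridor_len
        # Observation: (normalized agent position, chosen branch, context scalar c).
        self.observation_space = spaces.Box(0.0, 1.0, shape=(3,), dtype=np.float32)
        # Actions: 0 = move forward; 1..num_goals = commit to branch k at the fork.
        self.action_space = spaces.Discrete(1 + num_goals)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.c = int(self.np_random.integers(self.num_goals))  # active goal for this episode
        self.x, self.branch, self.t = 0, -1, 0
        return self._obs(), {}

    def step(self, action):
        self.t += 1
        if self.x < self.corridor_len:          # shared corridor across all variations
            self.x += int(action == 0)
        elif self.branch < 0 and action > 0:    # at the fork, commit to a branch
            self.branch = int(action) - 1
        terminated = self.branch >= 0
        truncated = self.t >= 4 * self.corridor_len
        reward = float(terminated and self.branch == self.c)
        return self._obs(), reward, terminated, truncated, {}

    def _obs(self):
        return np.array([self.x / self.corridor_len,
                         (self.branch + 1) / self.num_goals,
                         self.c / max(self.num_goals - 1, 1)], dtype=np.float32)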

We find that an agent jointly trained on all goals (a generalist) can suffer from catastrophic ignorance and catastrophic forgetting, i.e., the agent

  • ignores the context c and fails to distinguish between the different goals, since c plays little role in the early stages of learning (catastrophic ignorance);

  • struggles to retain solutions to all the environment variations at once in later stages, since memorizing many distinct behaviors is hard for neural networks (catastrophic forgetting).


Key Observations: Generalists vs. Specialists

Alternatively, we can train a group of specialist agents, each handling a subset of the variations. Such an approach mitigates the issues of the generalist but also comes at a cost. We have discovered key trade-offs between generalists and specialists, summarized below.

Generalists

  • learn faster (more sample-efficient) initially

  • plateau at worse performance in later stages

Specialists

  • learn slower (less sample-efficient) initially

  • reach better performance in later stages

Generalist-Specialist Learning (GSL)

A meta-algorithm that gets the best of both worlds, in three steps:

  1. Train a generalist jointly on all training env. variations.

  2. Stop when its performance plateaus according to some criterion H, then:

    • train a group of specialists, each initialized from the generalist's checkpoint and handling a selected subset of the env. variations;

    • collect demonstrations from the specialists as well as the generalist.

  3. Fine-tune the generalist with guidance from the demos (a code sketch follows below).
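The sketch below lays out these three phases in code. The callables train_policy, collect_demos, and finetune_with_demos are hypothetical stand-ins for an RL trainer (e.g., PPO or SAC), a rollout collector, and a demonstration-guided fine-tuning routine, and a simple moving-average plateau test stands in for the criterion H; it illustrates the control flow of GSL rather than the paper's exact implementation.

# Minimal sketch of the GSL meta-algorithm. train_policy, collect_demos, and
# finetune_with_demos are hypothetical stand-ins, not the paper's code.
import numpy as np


def plateaued(return_history, window=20, tol=0.01):
    """A simple stand-in for the plateau criterion H: the recent average return
    has stopped improving relative to the previous window."""
    if len(return_history) < 2 * window:
        return False
    recent = np.mean(return_history[-window:])
    previous = np.mean(return_history[-2 * window:-window])
    return recent - previous < tol * max(abs(previous), 1.0)


def generalist_specialist_learning(all_variations, num_specialists,
                                   train_policy, collect_demos, finetune_with_demos):
    # Step 1: train a single generalist jointly on all environment variations.
    generalist = train_policy(variations=all_variations, init_params=None, stop_fn=plateaued)

    # Step 2: once the generalist plateaus, spawn specialists from its checkpoint,
    # each handling a subset of the variations, and collect demonstrations from
    # both the specialists and the generalist.
    subsets = np.array_split(np.arange(len(all_variations)), num_specialists)
    demos = collect_demos(generalist, all_variations)
    for idx in subsets:
        subset = [all_variations[i] for i in idx]
        specialist = train_policy(variations=subset, init_params=generalist, stop_fn=plateaued)
        demos += collect_demos(specialist, subset)

    # Step 3: fine-tune the generalist on all variations with guidance from the demos
    # (e.g., behavior cloning or another demonstration-guided objective).
    return finetune_with_demos(generalist, all_variations, demos)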


Experiment Results

We evaluate GSL on multiple challenging benchmarks with popular RL baselines, including PPO & PPG on Procgen (left), PPO on Meta-World (middle), and SAC on ManiSkill (right). We demonstrate that, as a meta-algorithm, GSL consistently improves these baselines. Furthermore, we show that GSL is a step towards generalizable policy learning.

Paper

Improving Policy Optimization with Generalist-Specialist Learning

Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, Hao Su

[ICML 2022] [arXiv]

Citation

If you use our framework or find it inspiring, please consider citing the paper as follows:

@InProceedings{jia2022gsl,
  title     = {Improving Policy Optimization with Generalist-Specialist Learning},
  author    = {Jia, Zhiwei and Li, Xuanlin and Ling, Zhan and Liu, Shuang and Wu, Yiran and Su, Hao},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {10104--10119},
  year      = {2022},
}