Chain-of-Thought Predictive Control

Zhiwei Jia, Fangchen Liu, Vineet Thumuluri, Linghao Chen, Zhiao Huang, Hao Su

UC San Diego, UC Berkeley, Zhejiang University

pdf (updated) | arXiv | Code | Slides





Introduction

We study generalizable policy learning from demonstrations for complex low-level control tasks (e.g., contact-rich object manipulation). We propose an imitation learning method that reformulates the hierarchical principles of policy learning (i.e., temporal abstraction and high-level planning). As a step towards decision foundation models, our design can utilize scalable, albeit sub-optimal, demonstrations. Specifically, we find that certain short subsequences of the demos, which we call the chain-of-thought (CoT), reflect their hierarchical structure by marking the completion of subgoals in the tasks. Our model learns to dynamically predict the entire CoT as coherent and structured long-term action guidance, and it consistently outperforms typical subgoal-conditioned policies. Moreover, such CoTs facilitate generalizable policy learning because they exemplify the decision patterns shared among demos (even those with heavy noise and randomness). Our method, Chain-of-Thought Predictive Control (CoTPC), significantly outperforms existing methods on challenging low-level manipulation tasks.

Sim-to-real transfer with CoTPC





CoTPC Solves Hard Low-level Control Tasks

For instance, peg insertion features contact-rich, multi-stage low-level control.

We study imitation learning from scalable yet sub-optimal demos.


Leveraging Hierarchical Structures via Chain-of-Thought

We define key states as those that mark the completion of subgoals. They usually admit far fewer variations than the rest of a trajectory and share generalizable patterns across trajectories. We call such a sequence of key states a Chain-of-Thought (CoT); it can be easily obtained from the demos without knowing the details of the demonstrators.
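
As a concrete illustration, below is a minimal sketch of how key states could be extracted from a demo: scan the trajectory for the first timestep at which each subgoal-completion predicate fires. The predicate names, the toy state layout, and the function name extract_cot are hypothetical placeholders for illustration, not the exact interface used in the paper.

# Hypothetical sketch of extracting a Chain-of-Thought (key states) from one demo.
# The subgoal predicates and state layout below are illustrative assumptions.
from typing import Callable, Dict, List

import numpy as np

# Each predicate maps an environment state to "is this subgoal completed?"
SubgoalPredicate = Callable[[np.ndarray], bool]

def extract_cot(states: np.ndarray,
                subgoals: Dict[str, SubgoalPredicate]) -> List[int]:
    """Return indices of key states: the first timestep each subgoal is done."""
    key_indices = []
    for name, is_done in subgoals.items():
        for t, s in enumerate(states):
            if is_done(s):
                key_indices.append(t)
                break
    return sorted(set(key_indices))

# Example (peg insertion): three ordered subgoals, checked on a toy state
# layout where s = [gripper_closed, peg_aligned, peg_inserted].
subgoals = {
    "grasped":  lambda s: s[0] > 0.5,
    "aligned":  lambda s: s[1] > 0.5,
    "inserted": lambda s: s[2] > 0.5,
}
demo = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]])
print(extract_cot(demo, subgoals))  # -> [1, 3, 4]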




Chain-of-Thought as Coherent Long-term Action Plans

At each timestep, CoTPC updates its jointly predicted CoT, a coherent long-term action plan that guides the low-level action predictions. It outperforms common subgoal-conditioned policies (e.g., the hierarchical version of Decision Transformer) that only predict the immediate next subgoal every K steps. CoTPC also handles dynamic control better.

An ablation study shows that CoTPC outperforms the alternative subgoal-conditioned policy in terms of generalizability.
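
To make the contrast concrete, here is a minimal sketch of the two inference loops: a subgoal-conditioned policy freezes one predicted subgoal for K steps, whereas a CoTPC-style loop re-predicts the entire CoT at every timestep and conditions the action on the full plan. The ToyEnv/ToyModel stubs and all method names are hypothetical, not the paper's API.

# Hypothetical sketch contrasting subgoal-conditioned control with
# CoTPC-style control; all classes and methods are illustrative stubs.

class ToyEnv:
    def reset(self):
        return 0.0
    def step(self, action):
        return action  # next observation (toy dynamics)

class ToyModel:
    def predict_next_subgoal(self, obs):
        return obs + 1.0  # only the immediate next subgoal
    def predict_cot(self, obs):
        return [obs + k for k in (1.0, 2.0, 3.0)]  # the full remaining CoT
    def act(self, obs, guidance):
        target = guidance[-1] if isinstance(guidance, list) else guidance
        return obs + 0.1 * (target - obs)  # move toward the guidance

def run_subgoal_conditioned(env, model, horizon, K=10):
    """Re-predict only the immediate next subgoal every K steps."""
    obs, subgoal = env.reset(), None
    for t in range(horizon):
        if t % K == 0:  # guidance is frozen between updates
            subgoal = model.predict_next_subgoal(obs)
        obs = env.step(model.act(obs, subgoal))
    return obs

def run_cotpc_style(env, model, horizon):
    """Jointly re-predict the entire CoT at every timestep."""
    obs = env.reset()
    for t in range(horizon):
        cot = model.predict_cot(obs)  # coherent long-term plan
        obs = env.step(model.act(obs, cot))  # action guided by the full plan
    return obs

print(run_subgoal_conditioned(ToyEnv(), ToyModel(), horizon=20))
print(run_cotpc_style(ToyEnv(), ToyModel(), horizon=20))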

Coupled Chain-of-Thought and Action Modeling

The CoT is predicted in latent space and learned jointly with the action predictions by sharing a Transformer network (GPT) under a hybrid masking strategy.
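
The sketch below builds one plausible version of such a hybrid attention mask, assuming the sequence holds T context (state/action) tokens followed by C CoT tokens: context tokens attend causally over past context only, while CoT tokens attend to the full observed context and to each other so the whole chain is predicted jointly. This is an illustrative reading of the design, and the exact masking scheme in the paper may differ.

# Hypothetical hybrid attention mask for a GPT carrying T context
# (state/action) tokens followed by C CoT tokens. One plausible
# construction; the exact masking in CoTPC may differ.
import torch

def hybrid_mask(T: int, C: int) -> torch.Tensor:
    """True = attention allowed. Rows are queries, columns are keys."""
    N = T + C
    mask = torch.zeros(N, N, dtype=torch.bool)
    # Context tokens: standard causal self-attention over context only,
    # so action prediction never peeks at future tokens.
    mask[:T, :T] = torch.tril(torch.ones(T, T, dtype=torch.bool))
    # CoT tokens: attend to the full observed context and to each other,
    # so the whole chain is predicted jointly from shared features.
    mask[T:, :] = True
    return mask

print(hybrid_mask(T=4, C=2).int())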

Experiment Results

We evaluate on both seen and unseen environment configurations (including zero-shot transfer to unseen geometries). Note that, due to the complexity of the tasks and the sub-optimality of the demos, many SoTA baselines struggle. See our paper for experiments on additional tasks.



Below is an illustration of some of the sampled geometric variations used for evaluation in our experiments.

Paper

Chain-of-Thought Predictive Control

Zhiwei Jia, Fangchen Liu, Vineet Thumuluri, Linghao Chen, Zhiao Huang, Hao Su

[ICLR 2023 RRL Workshop] [arXiv] [updated version (under review)]

Citation

If you find our approach useful or inspiring, please consider citing the paper as follows:

@article{jia2023chain,
  title={Chain-of-Thought Predictive Control},
  author={Jia, Zhiwei and Liu, Fangchen and Thumuluri, Vineet and Chen, Linghao and Huang, Zhiao and Su, Hao},
  journal={arXiv preprint arXiv:2304.00776},
  year={2023}
}