A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex, novel images that generalize combinatorially across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Specifically, we cast the sequential decision-making problem as a text-conditioned video generation problem: given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions, from which control actions are then extracted from the generated video. By leveraging text as the underlying goal specification, the resulting policies naturally generalize combinatorially to unseen goals. Our policy-as-video formulation can further represent environments with different state and action spaces in a unified space of images, enabling learning and generalization across a wide range of robotic manipulation tasks. Finally, by leveraging pretrained language embeddings and widely available videos from the internet, our formulation enables knowledge transfer by predicting highly realistic video plans for real robots.
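To make the pipeline concrete, below is a minimal sketch of the planning loop described above, assuming two hypothetical components: a text-conditioned video generator (video_model) and a learned inverse-dynamics model (inverse_dynamics) that maps consecutive frame pairs to actions. The names and interfaces here are illustrative assumptions, not the released implementation.

# A minimal sketch of the policy-as-video planning loop, under the
# assumptions stated above. `video_model` and `inverse_dynamics` are
# hypothetical stand-ins for a text-conditioned video generator and a
# learned inverse-dynamics model; they are not the paper's actual API.

import torch

class UniversalPolicy:
    def __init__(self, video_model, inverse_dynamics):
        self.video_model = video_model            # samples x_1..x_T given x_0 and text
        self.inverse_dynamics = inverse_dynamics  # predicts a_t from (x_t, x_{t+1})

    @torch.no_grad()
    def plan(self, first_frame: torch.Tensor, goal_text: str) -> list:
        # 1) Synthesize a video plan: future frames conditioned on the
        #    current observation and the language-specified goal.
        frames = self.video_model.sample(first_frame, goal_text)  # shape (T, C, H, W)

        # 2) Extract control actions from consecutive frame pairs with
        #    the inverse-dynamics model.
        actions = [
            self.inverse_dynamics(frames[t], frames[t + 1])
            for t in range(len(frames) - 1)
        ]
        return actions

One appeal of this decomposition is that the video model only decides what should happen in image space, while the inverse-dynamics model decides how to realize it in a particular action space, which is consistent with the formulation's ability to unify environments with different state and action spaces.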
Below, we illustrate generated videos on unseen combinations of goals. Our approach is able to synthesize a diverse set of behaviors that satisfy unseen language subgoals.
Below, we illustrate generated videos on unseen tasks. Our approach is further able to synthesize a diverse set of behaviors that satisfy unseen language tasks.
Below, we further illustrate generated videos given language instructions on unseen real images. Our approach is able to synthesize a diverse set of behaviors that satisfy the given language instructions.
Our approach is further able to generate videos of robot behaviors given unseen natural-language instructions. Note that pretraining on a large online dataset of paired text and videos significantly helps generalization to unseen natural-language queries.
@article{du2023learning,
  title={Learning Universal Policies via Text-Guided Video Generation},
  author={Du, Yilun and Yang, Mengjiao and Dai, Bo and Dai, Hanjun and Nachum, Ofir and Tenenbaum, Joshua B and Schuurmans, Dale and Abbeel, Pieter},
  journal={arXiv preprint arXiv:2302.00111},
  year={2023}
}