Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

¹University of California, San Diego  ²Hillbot Inc.
*Equal Contribution


Overview. While the state-of-the-art Diffusion Policy struggles with finer task details, such as precisely inserting a peg into a hole, Policy Decorator refines its performance to a near-100% success rate.


Video (4m39s)

Click the "cc" button at the lower right corner to show captions.


Abstract

Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks—ManiSkill and Adroit—and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies.

Combines Advantages of Base Policy and Online Learning

An intriguing property of Policy Decorator is its ability to combine the advantages of the base policy and online learning.
Base Policy (w/o Online Learning)

The offline-trained base policies can reproduce the natural and smooth motions recorded in demonstrations but may have suboptimal performance.

Ours (Base Policy + Online Residual)

Policy Decorator (ours) not only achieves remarkably high success rates but also preserves the favorable attributes of the base policy.

Online RL Policy (w/o Base Policy)

Policies learned solely by RL, though achieving good success rates, often exhibit jerky actions that render them unsuitable for real-world applications.


Improve Various SOTA Policy Models

Our framework, Policy Decorator, improves various state-of-the-art policy models, such as Behavior Transformer and Diffusion Policy, boosting their success rates to nearly 100% on challenging robotic tasks. It also significantly outperforms top-performing baselines from both finetuning and non-finetuning method families.

Method Overview

Policy Decorator learns a residual policy via reinforcement learning with sparse rewards on top of a frozen base policy, and adds a set of controlled exploration mechanisms. Controlled exploration (Progressive Exploration Schedule + Bounded Residual Actions) keeps the combined agent (base policy + residual policy) close enough to the base policy that it continuously receives sufficient success signals while exploring the environment; see the sketch below.
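To make the two controlled-exploration mechanisms concrete, here is a minimal sketch of the action-composition loop. It assumes a Gymnasium-style environment and generic `base_policy` / `residual_policy` callables; the names `alpha` (residual bound) and `H` (schedule horizon), and the linear ramp-up, are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def bounded_residual_action(a_base, a_res, alpha):
    """Bounded Residual Actions: the residual is clipped and scaled by alpha,
    so the final action stays within an alpha-ball of the base action."""
    return a_base + alpha * np.clip(a_res, -1.0, 1.0)

def residual_prob(step, H):
    """Progressive Exploration Schedule: the probability of applying the residual
    ramps up over the first H environment steps (a linear ramp is assumed here)."""
    return min(1.0, step / H)

def rollout(env, base_policy, residual_policy, alpha=0.1, H=100_000, step=0):
    """Collect one episode with the combined (base + bounded residual) agent."""
    obs, _ = env.reset()
    done = False
    while not done:
        a_base = base_policy(obs)                      # frozen imitation-learning model
        if np.random.rand() < residual_prob(step, H):  # gradually hand more control to the residual
            a_res = residual_policy(obs, a_base)       # residual conditioned on obs (and base action)
            action = bounded_residual_action(a_base, a_res, alpha)
        else:
            action = a_base                            # early on, mostly follow the base policy
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        step += 1
    return step
```

Only the residual policy is trained online; because its contribution is both bounded in magnitude and introduced gradually, the agent stays near the base policy's behavior while still discovering the corrections needed to succeed under sparse rewards.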

More Visualizations

Click the blue text for visualizations.

What are the pros of Policy Decorator compared to base policy?

Our refined policy navigates the task's hardest parts, where the base policy struggles.

What are the pros of Policy Decorator compared to the RL policy?

Our refined policy behaves more smoothly and naturally than the RL policy.

The JSRL policy also performs well on some tasks; why not use it?

JSRL doesn't improve the base policy but learns a new one, failing to preserve desired traits like smooth and natural motion.

What happens if we fine-tune base policy with a randomly initialized critic?

Updating the base policy with a randomly initialized critic causes significant deviations and unlearning.

Any problems with vanilla Residual RL?

Random residual actions in the early stages of training cause the agent to deviate significantly from the base policy.