Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

¹University of California, San Diego  ²Hillbot Inc.
*Equal Contribution


Overview. While the state-of-the-art Diffusion Policy struggles with finer task details, such as precisely inserting a peg into a hole, Policy Decorator refines its performance to a near-100% success rate.


Video (4m39s)

Click the "cc" button at the lower right corner to show captions.


Abstract

Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity, quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks—ManiSkill and Adroit—and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies.

Combines Advantages of Base Policy and Online Learning

An intriguing property of Policy Decorator is its ability to combine the advantages of the base policy and online learning.
Base Policy (w/o Online Learning)

The offline-trained base policies can reproduce the natural and smooth motions recorded in demonstrations but may have suboptimal performance.

Ours (Base Policy + Online Residual)

Policy Decorator (ours) not only achieves remarkably high success rates but also preserves the favorable attributes of the base policy.

Online RL Policy (w/o Base Policy)

Policies learned solely by RL, though achieving good success rates, often exhibit jerky actions that render them unsuitable for real-world applications.


Improve Various SOTA Policy Models

Our framework, Policy Decorator, improves various state-of-the-art policy models, such as Behavior Transformer and Diffusion Policy, boosting their success rates to nearly 100% on challenging robotic tasks. It also significantly outperforms top-performing baselines from both finetuning and non-finetuning method families.

Method Overview

Policy Decorator learns a residual policy via reinforcement learning with sparse rewards on top of a frozen base policy, and adds a set of controlled exploration mechanisms. Controlled exploration (Progressive Exploration Schedule + Bounded Residual Actions) keeps the combined agent (base policy + residual policy) close enough to the base policy that it continuously receives sufficient success signals while exploring the environment; see the sketch below.
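To make the two controlled-exploration mechanisms concrete, here is a minimal sketch of the action-composition loop. It assumes a Gymnasium-style environment and generic `base_policy` / `residual_policy` callables; the names `alpha` (residual bound) and `H` (schedule horizon), and the linear ramp-up, are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def bounded_residual_action(a_base, a_res, alpha):
    """Bounded Residual Actions: the residual is clipped and scaled by alpha,
    so the final action stays within an alpha-ball of the base action."""
    return a_base + alpha * np.clip(a_res, -1.0, 1.0)

def residual_prob(step, H):
    """Progressive Exploration Schedule: the probability of applying the residual
    ramps up over the first H environment steps (a linear ramp is assumed here)."""
    return min(1.0, step / H)

def rollout(env, base_policy, residual_policy, alpha=0.1, H=100_000, step=0):
    """Collect one episode with the combined (base + bounded residual) agent."""
    obs, _ = env.reset()
    done = False
    while not done:
        a_base = base_policy(obs)                      # frozen imitation-learning model
        if np.random.rand() < residual_prob(step, H):  # gradually hand more control to the residual
            a_res = residual_policy(obs, a_base)       # residual conditioned on obs (and base action)
            action = bounded_residual_action(a_base, a_res, alpha)
        else:
            action = a_base                            # early on, mostly follow the base policy
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        step += 1
    return step
```

Only the residual policy is trained online; because its contribution is both bounded in magnitude and introduced gradually, the agent stays near the base policy's behavior while still discovering the corrections needed to succeed under sparse rewards.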

More Visualizations

Click the blue text for visualizations.

What are the pros of Policy Decorator compared to base policy?

Our refined policy navigates the task's hardest parts, where the base policy struggles.

What are the pros of Policy Decorator compared to the RL policy?

Our refined policy behaves more smoothly and naturally than the RL policy.

The JSRL policy also performs well on some tasks; why not use it?

JSRL doesn't improve the base policy but learns a new one, failing to preserve desired traits like smooth and natural motion.

What happens if we fine-tune base policy with a randomly initialized critic?

Updating the base policy with a randomly initialized critic causes significant deviations and unlearning.

Any problems with vanilla Residual RL?

Random residual actions in the early stages of training cause the agent to deviate significantly from the base policy.