This page presents experiments demonstrating how updating the base policy with a randomly initialized critic causes it to deviate significantly and unlearn the task. Once unlearning occurs, relearning is difficult because the policy no longer reaches the sparse reward and therefore receives no useful learning signal.
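The mechanism can be illustrated with a minimal actor-critic sketch (illustrative only; the network sizes, learning rate, and data below are placeholders, not the exact setup of these experiments): the actor is updated to maximize a freshly initialized Q-network whose outputs are essentially noise, so each gradient step pushes the pre-trained policy in an arbitrary direction.

```python
# Minimal sketch (PyTorch) of the failure mode: the actor is updated to
# maximize a *randomly initialized* critic, so its gradients carry no task
# information and the pre-trained behavior drifts away.
# All names, sizes, and data here are illustrative placeholders.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 7  # placeholder dimensions

actor = nn.Sequential(          # stands in for the pre-trained base policy
    nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim), nn.Tanh()
)
critic = nn.Sequential(         # freshly initialized Q(s, a): outputs ~noise
    nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1)
)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

obs = torch.randn(256, obs_dim)        # a batch of states (placeholder data)
ref_actions = actor(obs).detach()      # actions of the original base policy

for step in range(200):
    actions = actor(obs)
    # Deterministic-policy-gradient-style actor loss: maximize Q(s, pi(s)).
    # With an untrained critic this objective is unrelated to task success.
    actor_loss = -critic(torch.cat([obs, actions], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Measure how far the fine-tuned policy has drifted from the base policy.
drift = (actor(obs) - ref_actions).abs().mean().item()
print(f"mean action drift after 200 steps: {drift:.3f}")
```

Because the task reward is sparse, once the drifted policy stops completing the stack it stops collecting successful trajectories, so the critic never receives the signal it would need to guide the policy back.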
In the StackCube task, a robot arm must pick up a red cube and stack it on a green cube. Initially, a pre-trained base policy successfully grasps the red cube and accurately places it on the green cube.
After 100 gradient steps of fine-tuning with a randomly initialized critic, the base policy begins to deviate slightly from its original trajectory: it can still grasp the red cube but fails to place it on the green cube.
After an additional 100 gradient updates (200 in total), the base policy deviates further from the original trajectory and can no longer grasp the red cube, a substantial decline in its performance.