Overview of AMPLIFY. Our approach decomposes policy learning into forward and inverse dynamics, with latent keypoint motion as the intermediate representation. The forward model can be trained on any video data, while the inverse model can be trained on any interaction data. This modular design enables learning from diverse data sources and generalization to tasks for which no action data is available.
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it.
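To make the decomposition concrete, here is a minimal sketch in PyTorch of the three pieces the abstract describes: a tokenizer that turns keypoint trajectories into discrete motion tokens, a forward model that predicts those tokens from observations (trainable on action-free video), and an inverse model that decodes predicted tokens into actions (trained on interaction data). The module names, dimensions, and the simple nearest-neighbor quantizer are illustrative assumptions, not the architecture used in the paper.

```python
# Illustrative sketch of the AMPLIFY decomposition; all sizes and modules are assumptions.
import torch
import torch.nn as nn


class MotionTokenizer(nn.Module):
    """Encodes short keypoint trajectories into discrete motion tokens."""

    def __init__(self, n_keypoints=32, horizon=8, n_codes=256, dim=64):
        super().__init__()
        self.encoder = nn.Linear(n_keypoints * 2 * horizon, dim)
        self.codebook = nn.Embedding(n_codes, dim)  # discrete motion vocabulary

    def forward(self, traj):                         # traj: (B, horizon, n_keypoints, 2)
        z = self.encoder(traj.flatten(1))            # continuous latent motion
        dists = torch.cdist(z, self.codebook.weight) # distance to each code
        return dists.argmin(dim=-1)                  # (B,) discrete motion token ids


class ForwardModel(nn.Module):
    """Predicts motion tokens from an observation; trainable on action-free video."""

    def __init__(self, obs_dim=512, n_codes=256):
        super().__init__()
        self.head = nn.Linear(obs_dim, n_codes)

    def forward(self, obs_feat):                     # obs_feat: (B, obs_dim) image features
        return self.head(obs_feat)                   # logits over motion tokens


class InverseModel(nn.Module):
    """Decodes predicted motion tokens into robot actions; needs interaction data."""

    def __init__(self, n_codes=256, dim=64, action_dim=7):
        super().__init__()
        self.token_emb = nn.Embedding(n_codes, dim)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, token_ids):                    # token_ids: (B,)
        return self.head(self.token_emb(token_ids))  # (B, action_dim) actions


# Action-free video supplies keypoint trajectories, which the tokenizer turns
# into training targets for the forward model.
traj = torch.randn(4, 8, 32, 2)                      # placeholder keypoint trajectories
target_tokens = MotionTokenizer()(traj)              # supervision from video alone

# Policy at test time: observation -> predicted motion tokens -> actions.
obs_feat = torch.randn(4, 512)                       # placeholder image features
forward_model, inverse_model = ForwardModel(), InverseModel()
tokens = forward_model(obs_feat).argmax(dim=-1)      # most likely latent motion
actions = inverse_model(tokens)                      # (4, 7) robot actions
```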
Behavior Cloning (BC) approaches require prohibitively large numbers of action-labeled expert demonstrations. In contrast, AMPLIFY leverages abundant action-free video data: the forward model learns what motion defines a task from video alone, and action labels are needed only to train the inverse model that maps predicted motion to robot actions.
AMPLIFY consists of three stages: