Overview of AMPLIFY. Our approach decomposes policy learning into forward and inverse dynamics, with latent keypoint motion as the intermediate representation. The forward model can be trained on any video data, while the inverse model can be trained on any interaction data. This modular design enables learning from diverse data sources and generalization to tasks for which no action data is available.
Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it.
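To make the decomposition concrete, here is a minimal sketch in PyTorch of the three pieces the abstract describes: a tokenizer that turns keypoint trajectories into discrete motion tokens, a forward model that predicts those tokens from observations (trainable on action-free video), and an inverse model that decodes predicted tokens into actions (trained on interaction data). The module names, dimensions, and the simple nearest-neighbor quantizer are illustrative assumptions, not the architecture used in the paper.

```python
# Illustrative sketch of the AMPLIFY decomposition; all sizes and modules are assumptions.
import torch
import torch.nn as nn


class MotionTokenizer(nn.Module):
    """Encodes short keypoint trajectories into discrete motion tokens."""

    def __init__(self, n_keypoints=32, horizon=8, n_codes=256, dim=64):
        super().__init__()
        self.encoder = nn.Linear(n_keypoints * 2 * horizon, dim)
        self.codebook = nn.Embedding(n_codes, dim)  # discrete motion vocabulary

    def forward(self, traj):                         # traj: (B, horizon, n_keypoints, 2)
        z = self.encoder(traj.flatten(1))            # continuous latent motion
        dists = torch.cdist(z, self.codebook.weight) # distance to each code
        return dists.argmin(dim=-1)                  # (B,) discrete motion token ids


class ForwardModel(nn.Module):
    """Predicts motion tokens from an observation; trainable on action-free video."""

    def __init__(self, obs_dim=512, n_codes=256):
        super().__init__()
        self.head = nn.Linear(obs_dim, n_codes)

    def forward(self, obs_feat):                     # obs_feat: (B, obs_dim) image features
        return self.head(obs_feat)                   # logits over motion tokens


class InverseModel(nn.Module):
    """Decodes predicted motion tokens into robot actions; needs interaction data."""

    def __init__(self, n_codes=256, dim=64, action_dim=7):
        super().__init__()
        self.token_emb = nn.Embedding(n_codes, dim)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, token_ids):                    # token_ids: (B,)
        return self.head(self.token_emb(token_ids))  # (B, action_dim) actions


# Action-free video supplies keypoint trajectories, which the tokenizer turns
# into training targets for the forward model.
traj = torch.randn(4, 8, 32, 2)                      # placeholder keypoint trajectories
target_tokens = MotionTokenizer()(traj)              # supervision from video alone

# Policy at test time: observation -> predicted motion tokens -> actions.
obs_feat = torch.randn(4, 512)                       # placeholder image features
forward_model, inverse_model = ForwardModel(), InverseModel()
tokens = forward_model(obs_feat).argmax(dim=-1)      # most likely latent motion
actions = inverse_model(tokens)                      # (4, 7) robot actions
```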
Behavior Cloning (BC) approaches require prohibitively large numbers of action-labeled expert demonstrations. In contrast, AMPLIFY leverages abundant action-free video data: the forward model learns what motion defines a task from video alone, and action labels are needed only to train the inverse model that maps predicted motion to robot actions.
AMPLIFY consists of three stages: