AMPLIFY: Actionless Motion Priors for Robot Learning from Videos

Anonymous Authors
CoRL 2025
Paper (PDF) Code (Coming Soon)

Overview of AMPLIFY. Our approach decomposes policy learning into forward and inverse dynamics, using latent keypoint motion as an intermediate representation. The forward model can be trained on any video data, while the inverse model can be trained on any interaction data. This modular design enables learning from diverse data sources and generalizing to tasks with zero action data.

Action-labeled data for robotics is scarce and expensive, limiting the generalization of learned policies. In contrast, vast amounts of action-free video data are readily available, but translating these observations into effective policies remains a challenge. We introduce AMPLIFY, a novel framework that leverages large-scale video data by encoding visual dynamics into compact, discrete motion tokens derived from keypoint trajectories. Our modular approach separates visual motion prediction from action inference, decoupling the challenges of learning what motion defines a task from how robots can perform it.
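The decomposition can be made concrete with a small, self-contained sketch. The class names, embedding sizes, token counts, and action dimensions below are illustrative placeholders rather than the released implementation; the point is only the interface: a forward model maps an observation and a task description to discrete motion tokens (trainable on action-free video), and an inverse model maps those tokens plus the observation to an action chunk (trainable on any interaction data).

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two learned modules; all sizes are made up for illustration.
NUM_TOKENS, CODEBOOK, ACT_DIM, HORIZON, EMB = 16, 512, 7, 8, 768

class ToyForwardModel(nn.Module):
    """Observation + language -> discrete motion tokens (trained on action-free video)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(EMB, NUM_TOKENS * CODEBOOK)

    def forward(self, obs_emb, text_emb):
        logits = self.head(obs_emb + text_emb).view(-1, NUM_TOKENS, CODEBOOK)
        return logits.argmax(-1)                       # (B, NUM_TOKENS) token ids

class ToyInverseModel(nn.Module):
    """Motion tokens + observation -> action chunk (trained on interaction data)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK, 64)
        self.head = nn.Linear(NUM_TOKENS * 64 + EMB, HORIZON * ACT_DIM)

    def forward(self, tokens, obs_emb):
        motion = self.embed(tokens).flatten(1)
        return self.head(torch.cat([motion, obs_emb], -1)).view(-1, HORIZON, ACT_DIM)

obs_emb, text_emb = torch.randn(1, EMB), torch.randn(1, EMB)
tokens = ToyForwardModel()(obs_emb, text_emb)          # "what motion defines the task"
actions = ToyInverseModel()(tokens, obs_emb)           # "how the robot performs it"
print(tokens.shape, actions.shape)                     # (1, 16) and (1, 8, 7)
```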

Key Contributions

Motivation: Why Latent Motion Priors?

Behavior Cloning (BC) approaches require prohibitively large amounts of action-labeled expert demonstrations. In contrast, AMPLIFY leverages abundant action-free video data: a forward dynamics model learns from video alone what motion a task requires, while an inverse dynamics model learns from interaction data how the robot realizes that motion.

Method

AMPLIFY consists of three stages; illustrative sketches of each follow the list:

  1. Motion Tokenization: Keypoint tracks are compressed into a discrete latent space using Finite Scalar Quantization (FSQ). For each timestep and keypoint, the decoder reconstructs the motion within a local window.
  2. Forward Dynamics: A transformer model predicts latent motion tokens for the next T timesteps based on the current observation and task description.
  3. Inverse Dynamics: A model decodes predicted latent motions into a sequence of robot actions for execution.
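As a rough illustration of stage 1, the sketch below quantizes a latent with Finite Scalar Quantization: each channel is squashed with tanh, scaled to a fixed number of levels, rounded, and passed through a straight-through estimator. The single-linear-layer encoder/decoder, the level choices, and the track shapes are assumptions made for brevity, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FSQ(nn.Module):
    """Finite Scalar Quantization: bound each latent channel, then round it to one of a
    few fixed levels. The implied codebook size is the product of the per-channel levels."""
    def __init__(self, levels=(8, 5, 5, 5)):
        super().__init__()
        self.register_buffer("half", (torch.tensor(levels, dtype=torch.float) - 1) / 2)

    def forward(self, z):                        # z: (..., len(levels))
        bounded = torch.tanh(z) * self.half      # each channel lies in [-half, half]
        quantized = torch.round(bounded)
        # Straight-through estimator: rounded values forward, identity gradient backward.
        return bounded + (quantized - bounded).detach()

class TrackTokenizer(nn.Module):
    """Toy autoencoder over keypoint tracks with an FSQ bottleneck."""
    def __init__(self, num_points=400, window=16, latent_dim=4):
        super().__init__()
        in_dim = num_points * window * 2         # (x, y) motion per point per timestep
        self.encode = nn.Linear(in_dim, latent_dim)
        self.fsq = FSQ()
        self.decode = nn.Linear(latent_dim, in_dim)

    def forward(self, tracks):                   # tracks: (B, window, num_points, 2)
        codes = self.fsq(self.encode(tracks.flatten(1)))
        recon = self.decode(codes).view_as(tracks)
        return recon, codes

tracks = torch.randn(2, 16, 400, 2)
recon, codes = TrackTokenizer()(tracks)
loss = F.mse_loss(recon, tracks)                 # simple reconstruction objective
```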
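For stage 2, one way to realize the motion prediction is a transformer whose learned queries cross-attend to the observation and language embeddings and emit logits over the tokenizer's codebook for each future motion token. This parallel, query-based decoding and all hyperparameters below are assumptions for illustration; the paper's exact architecture and decoding scheme may differ.

```python
import torch
import torch.nn as nn

class ForwardDynamics(nn.Module):
    """Toy motion-token predictor: learned queries cross-attend to observation and
    language embeddings and are classified against the motion-token codebook."""
    def __init__(self, codebook_size=1000, num_tokens=64, d_model=256, cond_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, d_model))
        self.cond_proj = nn.Linear(cond_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, obs_emb, text_emb):              # both (B, cond_dim)
        cond = self.cond_proj(torch.stack([obs_emb, text_emb], dim=1))  # (B, 2, d_model)
        queries = self.queries.unsqueeze(0).expand(obs_emb.size(0), -1, -1)
        hidden = self.decoder(queries, cond)           # cross-attention to conditioning
        return self.head(hidden)                       # (B, num_tokens, codebook_size)

model = ForwardDynamics()
logits = model(torch.randn(2, 768), torch.randn(2, 768))
# Trained with cross-entropy against motion tokens extracted from video by the tokenizer.
targets = torch.randint(0, 1000, (2, 64))
loss = nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
```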
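Stage 3 can be as simple as a small network that embeds the predicted motion tokens, concatenates the current observation, and regresses a chunk of actions; because it only maps motion to control, it can be trained on any available interaction data. The MLP, sizes, and MSE objective below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamics(nn.Module):
    """Toy inverse-dynamics head: predicted motion tokens + observation -> action chunk."""
    def __init__(self, codebook_size=1000, num_tokens=64, obs_dim=768,
                 act_dim=7, horizon=8):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, 64)
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens * 64 + obs_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim),
        )
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, motion_tokens, obs_emb):         # (B, num_tokens), (B, obs_dim)
        x = torch.cat([self.embed(motion_tokens).flatten(1), obs_emb], dim=-1)
        return self.mlp(x).view(-1, self.horizon, self.act_dim)

model = InverseDynamics()
pred = model(torch.randint(0, 1000, (2, 64)), torch.randn(2, 768))
# Supervised on whatever interaction data is available.
loss = F.mse_loss(pred, torch.randn(2, 8, 7))
```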

Track Prediction Results

Robot Policy Rollouts (4x Speed)

Generalization Results (Tasks Without Action Ground Truth)

Video Prediction Results