Diffusion Policy Training Visualizer

Understanding how a robot learns to pour tea using diffusion models

Robot pouring tea

The Core Idea

Diffusion Policy learns robot actions by treating the action trajectory (a sequence of future robot movements) as something that can be gradually corrupted with noise and then recovered by a neural network.

During training, we add noise to expert action trajectories and train a neural network to predict that noise. During inference, we start from pure noise and iteratively denoise to generate actions.

Our Example: Robot Pouring Tea

A 6-DOF robotic arm must pour tea from a teapot into a cup. We have 50 frames (5 seconds at 10 Hz) of expert demonstration data.

Each frame contains:

Camera Image (96x96)

Top-down view showing teapot, cup, and robot arm

Robot State (6D)

[x, y, z, roll, pitch, gripper]: end-effector position and orientation, plus gripper opening

Action (6D)

[Δx, Δy, Δz, Δroll, Δpitch, Δgripper] — target deltas

Step 1: Expert Demonstration Data

A human teleoperates the robot to pour tea. We record everything.

8 Key Frames of Tea Pouring

Scrub through the demonstration. The dots on the image show the gripper's next 16 planned waypoints — this is the action trajectory the diffusion model will learn to generate.

Current frame

Camera observation with trajectory overlay

Robot State at Frame 1

Action (target delta)

What Gets Stored in the Dataset

For each timestep t, we store a tuple:

dataset[t] = { observation_image_t, robot_state_t, action_t }

The dataset has 50 such tuples for this one episode. In practice, we'd collect hundreds of episodes.
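As a rough sketch of that per-timestep record, the tuple could be represented as a small dataclass. The field names follow the tuple above; the single-channel image, the zero-filled values, and the helper name make_placeholder_episode are assumptions made only to show the shapes.

```python
from dataclasses import dataclass
from typing import List

FPS = 10          # recording rate from the example (10 Hz)
NUM_FRAMES = 50   # 5 seconds of demonstration


@dataclass
class Frame:
    observation_image: List[List[float]]  # 96 x 96, single channel (assumption)
    robot_state: List[float]              # [x, y, z, roll, pitch, gripper]
    action: List[float]                   # [dx, dy, dz, droll, dpitch, dgripper]


def make_placeholder_episode(num_frames: int = NUM_FRAMES, size: int = 96) -> List[Frame]:
    """Build one episode of zero-filled frames with the shapes described above."""
    return [
        Frame(
            observation_image=[[0.0] * size for _ in range(size)],
            robot_state=[0.0] * 6,
            action=[0.0] * 6,
        )
        for _ in range(num_frames)
    ]


dataset = make_placeholder_episode()   # dataset[t] is the tuple for timestep t
```

A real dataset would hold many such episodes and keep track of episode boundaries so that training windows never straddle two demonstrations.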

Step 2: What the Robot Sees

Each observation combines camera images with proprioceptive state.

Observation composition

The policy receives two consecutive camera frames (giving velocity information) plus the robot's 6D state vector for each frame. Together, these form the observation that conditions the diffusion process.

Why two frames? A single snapshot shows position, but two frames reveal direction and speed — critical for smooth, reactive control.

Step 3: Understanding the Three Horizons

Diffusion Policy uses three different time windows. This is the key concept.

Interactive Horizon Explorer

Move the slider to change the "current timestep" and see how the three horizons shift.

Observation Horizon (To=2): Frames the policy sees
Prediction Horizon (Tp=16): Actions the model predicts
Action Horizon (Ta=8): Actions actually executed

Observation Horizon (To = 2)

The policy looks at the last 2 frames of observations. This gives it velocity information — not just where the robot is, but which direction it's moving.

Prediction Horizon (Tp = 16)

The diffusion model outputs 16 action steps at once. Predicting more than we execute provides temporal consistency.

Action Horizon (Ta = 8)

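Of the 16 predicted actions, only 8 are executed: the ones starting at the current timestep t (indices 1 through 8 of the predicted sequence, since index 0 belongs to the already-past step t-1). Then we re-plan with fresh observations. This is receding-horizon control.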
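To make the three windows concrete, here is a small sketch that lists which timesteps each horizon covers for a given t. It assumes the indexing used elsewhere on this page: the predicted sequence starts at t-1 and execution starts at index 1 (the A0[1:9] slice in the inference algorithm). The function name horizon_windows is ours.

```python
T_O, T_P, T_A = 2, 16, 8   # observation, prediction, and action horizons


def horizon_windows(t):
    """Return the timesteps covered by each horizon at current timestep t."""
    first = t - (T_O - 1)                       # oldest observed timestep
    observed = list(range(first, t + 1))        # [t-1, t]
    predicted = list(range(first, first + T_P)) # [t-1, ..., t+14]
    start = T_O - 1                             # skip the already-past step
    executed = predicted[start:start + T_A]     # [t, ..., t+7]
    return observed, predicted, executed


observed, predicted, executed = horizon_windows(10)
print(observed)    # timesteps the policy sees
print(predicted)   # timesteps the model predicts actions for
print(executed)    # timesteps whose actions are actually run
```

With t = 10 the policy sees frames 9 and 10, predicts actions for steps 9 through 24, and executes the actions for steps 10 through 17.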

How a Training Sample is Constructed

For a given current timestep t, we extract from the dataset:

observations = [obs_{t-1}, obs_t]   ← 2 frames (To = 2)
actions = [a_{t-1}, a_t, a_{t+1}, ..., a_{t+14}]   ← 16 actions (Tp = 16)

The observation is the condition. The action trajectory is what gets noised and denoised.
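The extraction above can be sketched as a simple slicing function. One detail the text does not cover is what happens near the start or end of an episode, where the window runs past the recorded frames; the sketch below assumes the window is padded by repeating the first or last frame, which is one common choice and not necessarily what this visualizer does. The dictionary keys "obs" and "action" are hypothetical.

```python
def make_training_sample(episode, t, t_o=2, t_p=16):
    """Cut one (observations, actions) training pair out of an episode.

    episode: list of dicts with hypothetical keys "obs" and "action".
    Indices that fall outside the episode are clamped (edge padding, an assumption).
    """
    last = len(episode) - 1

    def clamp(i):
        return max(0, min(i, last))

    first = t - (t_o - 1)   # window starts at t-1 when t_o = 2
    observations = [episode[clamp(i)]["obs"] for i in range(first, first + t_o)]
    actions = [episode[clamp(i)]["action"] for i in range(first, first + t_p)]
    return observations, actions


# Toy episode in which every frame just stores its own index.
episode = [{"obs": i, "action": i} for i in range(50)]
obs, actions = make_training_sample(episode, t=20)
print(obs)       # frames t-1 and t
print(actions)   # actions t-1 through t+14
```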

Step 4: Forward Diffusion (Adding Noise to Actions)

The key insight: we corrupt the expert action trajectory with Gaussian noise.

Forward diffusion process

Noise Level Controller — Planned Path on Scene

The dots show the gripper's 16 planned waypoints (t-1 through t+14, matching the training sample above). At k=0 the path is a smooth arc from teapot to cup. As k increases, each waypoint scatters: this is what noise does to the action trajectory.

Scene with trajectory

Trajectory overlay on scene — noise jitters the path

6D action dimensions over time

The Math Behind Forward Diffusion

Given a clean action trajectory A_0, the forward process adds noise:

A_k = √(ā_k) · A_0 + √(1 - ā_k) · ε,    where ε ~ N(0, I)

Here, ā_k falls from ~1 at k = 0 (clean) to ~0 at k = 100 (pure noise).

ā_k (cosine schedule): how much signal remains at each step
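The formula can be tried out directly. The sketch below assumes the widely used cosine schedule with offset s = 0.008 and 100 diffusion steps; the exact schedule behind the plot may differ, and practical implementations usually also clip the per-step noise so the last step is not exactly zero signal. The function names are ours.

```python
import math
import random


def cosine_alpha_bar(k, num_steps=100, s=0.008):
    """Fraction of signal remaining at diffusion step k (1 at k=0, ~0 at k=num_steps)."""
    def f(i):
        return math.cos((i / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(k) / f(0)


def add_noise(clean, k, rng, num_steps=100):
    """Apply A_k = sqrt(a_bar_k) * A_0 + sqrt(1 - a_bar_k) * eps to a list of action rows."""
    a_bar = cosine_alpha_bar(k, num_steps)
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in clean]
    noisy = [
        [math.sqrt(a_bar) * a + math.sqrt(1 - a_bar) * e for a, e in zip(row, noise_row)]
        for row, noise_row in zip(clean, noise)
    ]
    return noisy, noise


# A made-up 16 x 6 trajectory, only to exercise the function.
clean = [[0.1 * step] * 6 for step in range(16)]
rng = random.Random(0)
for k in (0, 25, 50, 75, 100):
    noisy, _ = add_noise(clean, k, rng)
    print(k, round(cosine_alpha_bar(k), 4))
```

At k = 0 the trajectory comes back unchanged, around k = 50 roughly half of the signal variance remains, and at k = 100 the output is essentially the sampled noise.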

Step 5: A Single Training Step

This is where the neural network actually learns. Let's trace through one complete training step.

Training Architecture

Training step architecture

Interactive Training Step Walkthrough

Step 6: The Full Training Loop

What happens over thousands of iterations.

Training Loop Simulator

Watch the loss decrease as the network trains. Each iteration samples a random batch, a random diffusion step k, and fresh random noise.

Training Loss (MSE between predicted and actual noise)

Pseudocode

for iteration in range(num_iterations):
    # 1. Sample a batch of training examples
    batch = dataset.sample(batch_size=64)
    obs = batch["observations"]             # (64, 2, obs_dim)
    actions = batch["actions"]              # (64, 16, 6)
    # 2. Encode observations into a conditioning vector
    img_features = resnet(obs.images)
    global_cond = concat(img_features, obs.state)
    # 3. Sample random noise and a random diffusion step
    ε = randn_like(actions)                 # (64, 16, 6)
    k = randint(1, 101, size=(64,))         # one per sample, k in 1..100 to match inference
    # 4. Create noisy actions
    noisy_actions = √(ā_k) * actions + √(1 - ā_k) * ε
    # 5. Predict the noise
    ε_pred = unet(noisy_actions, k, global_cond)
    # 6. Compute loss and update
    loss = MSE(ε_pred, ε)
    optimizer.zero_grad()                   # clear gradients from the previous iteration
    loss.backward()
    optimizer.step()
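To see the loss actually fall without any deep-learning library, the loop can be run on a deliberately tiny stand-in problem. Everything here is a simplification we chose for illustration: the "trajectory" is a single number, there is no observation conditioning, and the U-Net is replaced by one linear model (w, b) per diffusion step, which happens to be enough when the dataset contains a single clean value. The cosine schedule with s = 0.008 is also an assumption.

```python
import math
import random

K = 100        # diffusion steps, as in the text
CLEAN = 0.5    # hypothetical 1-D "expert action" (the real trajectory is 16 x 6)


def alpha_bar(k, s=0.008):
    """Cosine schedule (assumed): fraction of signal left at step k."""
    def f(i):
        return math.cos((i / K + s) / (1 + s) * math.pi / 2) ** 2
    return f(k) / f(0)


def train(num_iterations=20000, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (K + 1)   # toy noise predictor: eps_pred = w[k] * noisy + b[k]
    b = [0.0] * (K + 1)
    losses = []
    for _ in range(num_iterations):
        k = rng.randint(1, K)                 # random diffusion step (inclusive bounds)
        eps = rng.gauss(0.0, 1.0)             # random noise
        a_bar = alpha_bar(k)
        noisy = math.sqrt(a_bar) * CLEAN + math.sqrt(1 - a_bar) * eps
        err = (w[k] * noisy + b[k]) - eps     # predicted noise minus true noise
        losses.append(err * err)              # squared error for this sample
        w[k] -= lr * 2 * err * noisy          # SGD step on w[k]
        b[k] -= lr * 2 * err                  # SGD step on b[k]
    return losses


losses = train()
early = sum(losses[:1000]) / 1000
late = sum(losses[-1000:]) / 1000
print("mean loss, first 1000 iterations:", early)
print("mean loss, last 1000 iterations:", late)
```

The structure mirrors steps 3 to 6 of the pseudocode: sample k and ε, build the noisy input, predict the noise, and take a gradient step on the squared error. Only the model and the data are shrunk.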

Step 7: Inference (Generating Actions)

After training, we reverse the diffusion process to generate actions from noise.

Inference process

Interactive Denoising — A Planned Path Emerges from Noise

At k=100, the 16 waypoints are scattered randomly — the robot has no plan. As the network denoises, the waypoints converge into a smooth arc: the gripper's path from teapot to cup.

Scene for inference

Trajectory denoising on scene image

Action dimensions converging to clean trajectory

Inference Algorithm

# Start from pure noise
A_100 ~ N(0, I)                        # shape: (1, 16, 6)
# Get current observation conditioning
obs = get_latest_observations()
global_cond = encode(obs)
# Iteratively denoise (100 steps for DDPM, ~10 for DDIM)
for k = 100, 99, 98, ..., 1:
    ε_pred = unet(A_k, k, global_cond)
    A_{k-1} = denoise_step(A_k, ε_pred, k)
# Extract executable actions
actions = A_0[0, 1:9]                  # 8 action steps for the robot (batch index 0)
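The algorithm leaves denoise_step undefined. One possible version is the standard DDPM reverse update, sketched below on a flattened trajectory. Several things here are assumptions: the cosine schedule with per-step noise clipped at 0.999, the choice of variance, and above all the noise predictor, which is an oracle that already knows the clean trajectory and stands in for the trained U-Net so the sampler can be checked end to end.

```python
import math
import random

K = 100  # diffusion steps, as in the text


def _f(i, s=0.008):
    return math.cos((i / K + s) / (1 + s) * math.pi / 2) ** 2


# Per-step betas from a cosine schedule, clipped (assumption) so no alpha is zero.
BETAS = [0.0] + [min(1 - _f(k) / _f(k - 1), 0.999) for k in range(1, K + 1)]
ALPHAS = [1 - beta for beta in BETAS]
ALPHA_BARS = [1.0]
for _k in range(1, K + 1):
    ALPHA_BARS.append(ALPHA_BARS[-1] * ALPHAS[_k])


def denoise_step(a_k, eps_pred, k, rng):
    """One DDPM reverse step, A_k -> A_{k-1}, on a flat list of floats."""
    coef = BETAS[k] / math.sqrt(1 - ALPHA_BARS[k])
    mean = [(x - coef * e) / math.sqrt(ALPHAS[k]) for x, e in zip(a_k, eps_pred)]
    if k == 1:
        return mean                               # no noise is added on the final step
    sigma = math.sqrt(BETAS[k])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def sample(predict_noise, size, rng):
    a = [rng.gauss(0.0, 1.0) for _ in range(size)]   # A_100 ~ N(0, I)
    for k in range(K, 0, -1):                        # k = 100, 99, ..., 1
        a = denoise_step(a, predict_noise(a, k), k, rng)
    return a                                         # A_0


def make_oracle(clean):
    """Stand-in for the trained network: exact noise for a known clean target."""
    def predict_noise(a_k, k):
        a_bar = ALPHA_BARS[k]
        return [(x - math.sqrt(a_bar) * c) / math.sqrt(1 - a_bar) for x, c in zip(a_k, clean)]
    return predict_noise


def executable_actions(flat, action_dim=6, start=1, count=8):
    """Reshape the flat trajectory to rows of action_dim and keep rows start..start+count-1."""
    rows = [flat[i:i + action_dim] for i in range(0, len(flat), action_dim)]
    return rows[start:start + count]


target = [math.sin(0.1 * i) for i in range(16 * 6)]   # made-up clean trajectory
a_0 = sample(make_oracle(target), len(target), random.Random(0))
actions = executable_actions(a_0)                     # the 8 steps sent to the robot
```

With a perfect noise predictor the sampler lands on the clean trajectory; a trained network only approximates this, which is why the generated path is smooth but not an exact copy of any demonstration.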

Receding Horizon Control

After executing 8 actions, the robot captures new observations and runs inference again. This creates a closed-loop system.
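The re-planning loop itself is short. In the sketch below, plan_fn is a hypothetical stand-in for a full inference call: instead of real actions it returns the timestep each predicted action is meant for, which makes it easy to check that the executed steps line up. The slice start assumes the predicted sequence begins at t-1, as in the inference algorithm above.

```python
def run_receding_horizon(plan_fn, total_steps=50, t_o=2, t_a=8):
    """Alternate between planning 16 steps and executing 8 until the episode ends."""
    executed = []
    num_plans = 0
    t = 0
    while t < total_steps:
        plan = plan_fn(t)                 # one inference call with fresh observations
        num_plans += 1
        start = t_o - 1                   # index 0 of the plan is the past step t-1
        for action in plan[start:start + t_a]:
            if t >= total_steps:
                break
            executed.append(action)       # a real system would send this to the robot
            t += 1
    return executed, num_plans


def label_plan(t, t_p=16):
    """Stub planner: each 'action' is just the timestep it is intended for."""
    return [t - 1 + i for i in range(t_p)]


executed, num_plans = run_receding_horizon(label_plan)
print(num_plans)
print(executed[:10])
```

For the 50-step tea-pouring episode this gives 7 inference calls: six full blocks of 8 actions and a final block that stops after 2.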