Diffusion Policy Training Visualizer

Understanding how a robot learns to pour tea using diffusion models

The Core Idea

Diffusion Policy learns robot actions by treating the action trajectory (a sequence of future robot movements) as something that can be gradually corrupted with noise and then recovered by a neural network.

During training, we add noise to expert action trajectories and train a neural network to predict that noise. During inference, we start from pure noise and iteratively denoise to generate actions.

Our Example: Robot Pouring Tea

A 6-DOF robotic arm must pour tea from a teapot into a cup. We have 50 frames (5 seconds at 10 Hz) of expert demonstration data.

Each frame contains:

Camera Image (96x96)

Top-down view showing teapot, cup, and robot arm

Robot State (6D)

[x, y, z, roll, pitch, gripper]: end-effector position, orientation, and gripper opening

Action (6D)

[Δx, Δy, Δz, Δroll, Δpitch, Δgripper] — target deltas

Step 1: Expert Demonstration Data

A human teleoperates the robot to pour tea. We record everything.

8 Key Frames of Tea Pouring

Scrub through the demonstration. The dots on the image show the gripper's next 16 planned waypoints — this is the action trajectory the diffusion model will learn to generate.

[Interactive viewer: camera observation with trajectory overlay, plus the robot state and action (target delta) for each of the 8 key frames]

What Gets Stored in the Dataset

For each timestep t, we store a tuple:

dataset[t] = { observation_image_t, robot_state_t, action_t }

The dataset has 50 such tuples for this one episode. In practice, we'd collect hundreds of episodes.
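As a rough illustration of that layout, the sketch below builds one episode as a list of per-timestep records. The class name `Frame`, its field names, and the zero-filled placeholder values are assumptions made for illustration; they are not the visualizer's actual storage format.

```python
from dataclasses import dataclass
from typing import List, Tuple

FPS = 10          # recording rate from the example (10 Hz)
NUM_FRAMES = 50   # 5 seconds of demonstration


@dataclass
class Frame:
    """One dataset entry: what was seen, where the robot was, what it did."""
    image: List[List[int]]      # 96x96 placeholder; real data would be a camera image
    state: Tuple[float, ...]    # [x, y, z, roll, pitch, gripper]
    action: Tuple[float, ...]   # [dx, dy, dz, droll, dpitch, dgripper]


def make_placeholder_episode(num_frames: int = NUM_FRAMES) -> List[Frame]:
    """Build an episode of zero-filled frames with the right shapes."""
    blank = [[0] * 96 for _ in range(96)]
    return [Frame(image=blank, state=(0.0,) * 6, action=(0.0,) * 6)
            for _ in range(num_frames)]


episode = make_placeholder_episode()
duration_s = len(episode) / FPS   # 50 frames at 10 Hz
```

A real pipeline would typically hold these fields as arrays and concatenate many episodes, keeping track of episode boundaries so that training windows never span two demonstrations.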

Step 2: What the Robot Sees

Each observation combines camera images with proprioceptive state.

The policy receives two consecutive camera frames (which together carry velocity information) plus the robot's state vector. Together, these form the observation that conditions the diffusion process.

Why two frames? A single snapshot shows position, but two frames reveal direction and speed — critical for smooth, reactive control.

Step 3: Understanding the Three Horizons

Diffusion Policy uses three different time windows. This is the key concept.

Interactive Horizon Explorer

Move the slider to change the "current timestep" and see how the three horizons shift.


Observation Horizon (T_o=2): Frames the policy sees

Prediction Horizon (T_p=16): Actions the model predicts

Action Horizon (T_a=8): Actions actually executed

Observation Horizon (T_o = 2)

The policy looks at the last 2 frames of observations. This gives it velocity information — not just where the robot is, but which direction it's moving.

Prediction Horizon (T_p = 16)

The diffusion model outputs 16 action steps at once. Predicting more than we execute provides temporal consistency.

Action Horizon (T_a = 8)

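Of the 16 predicted actions, only 8 are executed: those at indices 1 through 8 of the prediction, which start at the current timestep t (index 0 corresponds to the already-past step t-1). Then we re-plan with fresh observations. This is receding-horizon control.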

How a Training Sample is Constructed

For a given current timestep t, we extract from the dataset:

observations = [obs_t-1, obs_t] ← 2 frames (T_o = 2)
actions = [a_t-1, a_t, a_t+1, ..., a_t+14] ← 16 actions (T_p = 16)

The observation is the condition. The action trajectory is what gets noised and denoised.
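The window arithmetic can be sketched as follows. The function name `sample_indices` is hypothetical, and clamping out-of-range indices to the first or last frame is an assumption about how episode boundaries are handled; the page does not say.

```python
from typing import Dict, List

T_O, T_P, T_A = 2, 16, 8   # observation, prediction, action horizons


def sample_indices(t: int, episode_len: int) -> Dict[str, List[int]]:
    """Frame indices used to build one training sample at timestep t."""
    def clip(i: int) -> int:
        # Assumption: indices past either end repeat the first or last frame.
        return max(0, min(episode_len - 1, i))

    start = t - (T_O - 1)                       # window starts at t-1 when T_o = 2
    obs = [clip(i) for i in range(start, t + 1)]
    actions = [clip(i) for i in range(start, start + T_P)]
    executed = actions[T_O - 1:T_O - 1 + T_A]   # the 8 steps run at inference time
    return {"obs": obs, "actions": actions, "executed": executed}


idx = sample_indices(t=10, episode_len=50)
# idx["obs"] covers frames 9 and 10, idx["actions"] covers frames 9 through 24,
# and idx["executed"] covers frames 10 through 17.
```

Only `obs` and `actions` are needed during training; `executed` is included to show which part of the predicted window the robot would actually run.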

Step 4: Forward Diffusion (Adding Noise to Actions)

The key insight: we corrupt the expert action trajectory with Gaussian noise.

Noise Level Controller — Planned Path on Scene

The dots show the gripper's 16 planned waypoints (t-1 through t+14, the same window used to build a training sample). At k=0 the path is a smooth arc from teapot to cup. As k increases, each waypoint scatters: this is what noise does to the action trajectory.

Diffusion timestep k: 0 / 100 (clean actions)

Trajectory overlay on scene — noise jitters the path

6D action dimensions over time

The Math Behind Forward Diffusion

Given a clean action trajectory A₀, the forward process adds noise:

A_k = √(ā_k) · A₀ + √(1 - ā_k) · ε, where ε ~ N(0, I)

Here, ā_k is the cumulative signal coefficient (written ᾱ_k in the DDPM literature). It goes from ~1 at k = 0 (clean) to ~0 at k = 100 (pure noise).

ā_k (cosine schedule) — how much signal remains at each step
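A minimal NumPy sketch of this forward process is shown below. It assumes the cosine schedule of Nichol and Dhariwal with offset s = 0.008; the page only says "cosine schedule", so the exact variant and the helper names `cosine_abar` and `add_noise` are assumptions.

```python
import numpy as np

K = 100  # number of diffusion steps in this example


def cosine_abar(num_steps: int = K, s: float = 0.008) -> np.ndarray:
    """Return abar[0..num_steps], with abar[0] = 1 (clean) and abar[-1] close to 0."""
    k = np.arange(num_steps + 1)
    f = np.cos((k / num_steps + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]


def add_noise(a0: np.ndarray, k: int, abar: np.ndarray, rng: np.random.Generator):
    """Forward process: A_k = sqrt(abar_k) * A_0 + sqrt(1 - abar_k) * eps."""
    eps = rng.standard_normal(a0.shape)
    a_k = np.sqrt(abar[k]) * a0 + np.sqrt(1.0 - abar[k]) * eps
    return a_k, eps


rng = np.random.default_rng(0)
abar = cosine_abar()
clean = np.zeros((16, 6))   # placeholder trajectory: 16 steps x 6 action dims
noisy, eps = add_noise(clean, k=50, abar=abar, rng=rng)
```

Because the noisy sample at any k is computed directly from A₀, training never has to simulate the chain step by step; it can jump straight to a randomly chosen noise level.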

Step 6: The Full Training Loop

What happens over thousands of iterations.

Training Loop Simulator

Watch the loss decrease as the network trains. Each iteration samples a random batch, random timestep, random noise.

[Interactive plot: training loss (MSE between predicted and actual noise) per iteration, with the sampled timestep k and the source frame of the current batch sample]

Pseudocode

for iteration in range(num_iterations):
    # 1. Sample a batch of training examples
    batch = dataset.sample(batch_size=64)
    obs = batch["observations"]   # (64, 2, obs_dim): 2 frames per sample
    actions = batch["actions"]    # (64, 16, 6)

    # 2. Encode observations into a conditioning vector
    img_features = resnet(obs.images)
    global_cond = concat(img_features, obs.state)

    # 3. Sample random noise and a random diffusion timestep
    ε = randn_like(actions)             # (64, 16, 6)
    k = randint(1, 101, size=(64,))     # one per sample, k in {1, ..., 100}

    # 4. Create noisy actions (ā_k reshaped to (64, 1, 1) so it broadcasts)
    ā_k = ā[k].reshape(-1, 1, 1)
    noisy_actions = √(ā_k) * actions + √(1 - ā_k) * ε

    # 5. Predict the noise
    ε_pred = unet(noisy_actions, k, global_cond)

    # 6. Compute loss and update (clear old gradients first)
    loss = MSE(ε_pred, ε)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Step 7: Inference (Generating Actions)

After training, we reverse the diffusion process to generate actions from noise.

Interactive Denoising — A Planned Path Emerges from Noise

At k=100, the 16 waypoints are scattered randomly — the robot has no plan. As the network denoises, the waypoints converge into a smooth arc: the gripper's path from teapot to cup.

Denoising step: 100 → 0

Trajectory denoising on scene image

Action dimensions converging to clean trajectory

Inference Algorithm

# Start from pure noise
A100 ~ N(0, I)           # shape: (1, 16, 6)

# Get current observation conditioning
obs = get_latest_observations()
global_cond = encode(obs)

# Iteratively denoise (100 steps DDPM, ~10 for DDIM)
for k = 100, 99, 98, ..., 1:
    ε_pred = unet(Ak, k, global_cond)
    Ak-1 = denoise_step(Ak, ε_pred, k)

# Extract executable actions
actions = A0[0, 1:9]     # 8 action steps for the robot, shape (8, 6)
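The pseudocode leaves `denoise_step` undefined. One common choice is the DDPM ancestral update, sketched below in NumPy. The cosine schedule with clipped betas, the variance choice σ_k² = β_k, and all helper names are assumptions. The `oracle` function is a stand-in for the trained network: it computes the exact noise for a known target path, which makes it easy to check that the chain ends on that path.

```python
import numpy as np

K = 100


def make_schedule(num_steps: int = K, s: float = 0.008, max_beta: float = 0.999):
    """Cosine schedule; returns beta[0..K] and abar[0..K] with abar[0] = 1."""
    k = np.arange(num_steps + 1)
    f = np.cos((k / num_steps + s) / (1 + s) * np.pi / 2) ** 2
    beta = np.zeros(num_steps + 1)
    beta[1:] = np.clip(1.0 - f[1:] / f[:-1], 0.0, max_beta)  # keep alpha_K > 0
    abar = np.cumprod(1.0 - beta)
    return beta, abar


def denoise_step(a_k, eps_pred, k, beta, abar, rng):
    """One DDPM ancestral step from A_k to A_{k-1}."""
    mean = (a_k - beta[k] / np.sqrt(1.0 - abar[k]) * eps_pred) / np.sqrt(1.0 - beta[k])
    if k == 1:
        return mean   # final step: no noise is added
    return mean + np.sqrt(beta[k]) * rng.standard_normal(a_k.shape)


def sample(eps_model, shape, beta, abar, rng):
    """Run the reverse chain k = K, ..., 1 starting from Gaussian noise."""
    a = rng.standard_normal(shape)
    for k in range(len(beta) - 1, 0, -1):
        a = denoise_step(a, eps_model(a, k), k, beta, abar, rng)
    return a


beta, abar = make_schedule()
# Hypothetical clean path: every action dim ramps from 0 to 1 over 16 steps.
target = np.linspace(0.0, 1.0, 16)[None, :, None] * np.ones((1, 16, 6))


def oracle(a, k):
    """Stand-in for unet(A_k, k, global_cond): the exact noise given the target."""
    return (a - np.sqrt(abar[k]) * target) / np.sqrt(1.0 - abar[k])


a0 = sample(oracle, (1, 16, 6), beta, abar, np.random.default_rng(0))
actions = a0[0, 1:9]   # 8 executable steps, shape (8, 6)
```

With a real network in place of `oracle`, the loop is unchanged. A DDIM-style sampler would instead visit a shorter subsequence of k values with a different update rule.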

Receding Horizon Control

After executing 8 actions, the robot captures new observations and runs inference again. This creates a closed-loop system.
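The control loop can be sketched as below. `dummy_policy` and `dummy_env_step` are hypothetical stand-ins so the loop runs without a robot or a trained model, and padding the observation history with the first frame is an assumption.

```python
from collections import deque

T_O, T_P, T_A = 2, 16, 8   # observation, prediction, action horizons


def run_episode(policy, env_step, first_obs, num_steps=50):
    """Closed-loop rollout: plan 16 actions, execute 8, observe, re-plan."""
    history = deque([first_obs] * T_O, maxlen=T_O)   # pad with the first frame
    executed, replans = 0, 0
    while executed < num_steps:
        plan = policy(list(history))                 # T_P actions, starting at t-1
        replans += 1
        for action in plan[T_O - 1:T_O - 1 + T_A]:   # indices 1 through 8
            if executed == num_steps:
                break
            history.append(env_step(action))         # keep only the latest T_O frames
            executed += 1
    return {"executed": executed, "replans": replans}


def dummy_policy(obs_history):
    """Placeholder for the full denoising loop; returns 16 zero actions."""
    return [0.0] * T_P


def dummy_env_step(action):
    """Placeholder for sending an action to the robot and reading a new observation."""
    return 0.0


stats = run_episode(dummy_policy, dummy_env_step, first_obs=0.0)
```

With these horizons, a 50-step episode needs 7 planning calls: six full chunks of 8 actions and a final chunk of 2.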
