
Quickstart

This guide shows you how to run inference with Alpamayo 1 and generate trajectory predictions with Chain-of-Causation reasoning.

Prerequisites

Before starting, ensure you have:
  • Completed the installation steps
  • Activated your virtual environment
  • Authenticated with HuggingFace
  • An NVIDIA GPU with ≥24 GB VRAM

Run the test inference script

The simplest way to get started is to run the provided test script:
python src/alpamayo_r1/test_inference.py
The first run downloads example data and model weights (about 22 GB); subsequent runs use the cached copies.
This script will:
  1. Load a sample clip from the PhysicalAI-AV dataset
  2. Run inference to predict trajectories
  3. Generate Chain-of-Causation reasoning traces
  4. Compute the minADE (minimum Average Displacement Error) metric

Understanding the code

Here’s how the inference pipeline works:
Step 1: Load the dataset

Load a specific clip from the PhysicalAI-AV dataset:
from alpamayo_r1.load_physical_aiavdataset import load_physical_aiavdataset

clip_id = "030c760c-ae38-49aa-9ad8-f5650a545d26"
data = load_physical_aiavdataset(clip_id, t0_us=5_100_000)
The dataset includes multi-camera images, ego vehicle history (position and rotation), and ground truth trajectories.
Step 2: Load the model and processor

Load the pre-trained Alpamayo 1 model:
import torch
from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
from alpamayo_r1 import helper

model = AlpamayoR1.from_pretrained(
    "nvidia/Alpamayo-R1-10B", 
    dtype=torch.bfloat16
).to("cuda")

processor = helper.get_processor(model.tokenizer)
The model uses bfloat16 precision for efficient GPU memory usage.
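As rough arithmetic (assuming the "10B" in the model name means roughly ten billion parameters), bfloat16 halves the weight memory relative to float32, which is why the weights fit on a 24 GB GPU at all:

```python
# Back-of-envelope weight-memory estimate for a ~10B-parameter model.
# Weights only; activations and the KV cache need additional memory.
num_params = 10e9
bytes_per_param_bf16 = 2   # bfloat16: 2 bytes per parameter
bytes_per_param_fp32 = 4   # float32: 4 bytes per parameter

gb_bf16 = num_params * bytes_per_param_bf16 / 1e9
gb_fp32 = num_params * bytes_per_param_fp32 / 1e9
print(f"weights in bfloat16: {gb_bf16:.0f} GB, in float32: {gb_fp32:.0f} GB")
```

This estimate is also consistent with the roughly 22 GB download mentioned above.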
Step 3: Prepare the inputs

Create message format and tokenize inputs:
messages = helper.create_message(data["image_frames"].flatten(0, 1))

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    continue_final_message=True,
    return_dict=True,
    return_tensors="pt",
)

model_inputs = {
    "tokenized_data": inputs,
    "ego_history_xyz": data["ego_history_xyz"],
    "ego_history_rot": data["ego_history_rot"],
}
model_inputs = helper.to_device(model_inputs, "cuda")
Step 4: Run inference

Generate trajectory predictions with Chain-of-Causation reasoning:
torch.cuda.manual_seed_all(42)
with torch.autocast("cuda", dtype=torch.bfloat16):
    pred_xyz, pred_rot, extra = model.sample_trajectories_from_data_with_vlm_rollout(
        data=model_inputs,
        top_p=0.98,
        temperature=0.6,
        num_traj_samples=1,
        max_generation_length=256,
        return_extra=True,
    )

# View Chain-of-Causation reasoning
print("Chain-of-Causation (per trajectory):\n", extra["cot"][0])
You can increase num_traj_samples to generate multiple trajectory hypotheses, but this requires more GPU memory.
Step 5: Evaluate predictions

Compare predictions against ground truth:
import numpy as np

gt_xy = data["ego_future_xyz"].cpu()[0, 0, :, :2].T.numpy()          # shape (2, T)
pred_xy = pred_xyz.cpu().numpy()[0, 0, :, :, :2].transpose(0, 2, 1)  # shape (K, 2, T)
diff = np.linalg.norm(pred_xy - gt_xy[None, ...], axis=1).mean(-1)   # ADE per sample, shape (K,)
min_ade = diff.min()                                                 # best (lowest-error) sample
print("minADE:", min_ade, "meters")
The minADE (minimum Average Displacement Error) is the mean Euclidean distance between predicted and ground-truth waypoints, evaluated for the closest of the sampled trajectories.
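To see the computation in isolation, here is a toy illustration with synthetic numbers (not model output): three sampled trajectories with known lateral offsets from a straight-ahead ground truth.

```python
import numpy as np

# Synthetic example of the minADE computation: K=3 sampled
# trajectories vs. one ground-truth trajectory of T=64 waypoints.
T = 64
t = np.linspace(0.1, 6.4, T)
gt_xy = np.stack([t, np.zeros(T)])       # ground truth, shape (2, T): straight ahead

# Each sample is the ground truth shifted laterally by a known amount.
offsets = [0.0, 1.0, 3.0]                # lateral error per sample, in meters
pred_xy = np.stack([gt_xy + np.array([[0.0], [off]]) for off in offsets])  # (K, 2, T)

diff = np.linalg.norm(pred_xy - gt_xy[None, ...], axis=1).mean(-1)  # ADE per sample
min_ade = diff.min()                     # the perfect sample wins: 0.0
```

Each sample's ADE equals its constant lateral offset, so `diff` recovers `[0.0, 1.0, 3.0]` and `min_ade` is 0.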

Complete example

Here’s the full inference script:
import torch
import numpy as np

from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
from alpamayo_r1.load_physical_aiavdataset import load_physical_aiavdataset
from alpamayo_r1 import helper

# Load dataset
clip_id = "030c760c-ae38-49aa-9ad8-f5650a545d26"
print(f"Loading dataset for clip_id: {clip_id}...")
data = load_physical_aiavdataset(clip_id, t0_us=5_100_000)
print("Dataset loaded.")
messages = helper.create_message(data["image_frames"].flatten(0, 1))

# Load model and processor
model = AlpamayoR1.from_pretrained("nvidia/Alpamayo-R1-10B", dtype=torch.bfloat16).to("cuda")
processor = helper.get_processor(model.tokenizer)

# Prepare inputs
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    continue_final_message=True,
    return_dict=True,
    return_tensors="pt",
)
model_inputs = {
    "tokenized_data": inputs,
    "ego_history_xyz": data["ego_history_xyz"],
    "ego_history_rot": data["ego_history_rot"],
}
model_inputs = helper.to_device(model_inputs, "cuda")

# Run inference
torch.cuda.manual_seed_all(42)
with torch.autocast("cuda", dtype=torch.bfloat16):
    pred_xyz, pred_rot, extra = model.sample_trajectories_from_data_with_vlm_rollout(
        data=model_inputs,
        top_p=0.98,
        temperature=0.6,
        num_traj_samples=1,
        max_generation_length=256,
        return_extra=True,
    )

# View reasoning and metrics
print("Chain-of-Causation (per trajectory):\n", extra["cot"][0])

gt_xy = data["ego_future_xyz"].cpu()[0, 0, :, :2].T.numpy()
pred_xy = pred_xyz.cpu().numpy()[0, 0, :, :, :2].transpose(0, 2, 1)
diff = np.linalg.norm(pred_xy - gt_xy[None, ...], axis=1).mean(-1)
min_ade = diff.min()
print("minADE:", min_ade, "meters")

Understanding the outputs

Alpamayo 1 produces two key outputs:

Trajectory predictions

  • Format: pred_xyz with shape [batch_size, num_traj_sets, num_traj_samples, 64, 3]
  • Content: 64 waypoints representing 6.4 seconds of predicted vehicle motion (10 Hz)
  • Coordinates: XYZ positions in the ego vehicle’s coordinate frame
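Given the 10 Hz / 6.4 s horizon described above, mapping a waypoint index to its future timestamp is simple arithmetic (a sketch; the index-to-time convention is an assumption, not documented API):

```python
import numpy as np

# 64 waypoints, one every 0.1 s (10 Hz), spanning 6.4 s into the future.
NUM_WAYPOINTS = 64
DT = 0.1  # seconds between consecutive waypoints

timestamps = (np.arange(NUM_WAYPOINTS) + 1) * DT  # 0.1 s .. 6.4 s

# e.g. pred_xyz[0, 0, 0, i] would be the predicted ego position
# at timestamps[i] seconds after t0.
```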

Chain-of-Causation reasoning

  • Format: Natural language text in extra["cot"]
  • Content: Explanations of the causal factors influencing the predicted trajectory
  • Example: “The vehicle ahead is slowing down due to traffic. I should reduce speed and maintain safe following distance.”

Interactive notebook

For visual exploration and trajectory visualization, use the included Jupyter notebook:
jupyter notebook notebooks/inference.ipynb
The notebook includes:
  • Multi-camera image visualization
  • Trajectory plotting (predicted vs. ground truth)
  • Interactive parameter tuning
  • Matplotlib-based visualizations

Inference parameters

You can customize inference behavior with these parameters:
| Parameter             | Default | Description                                     |
| --------------------- | ------- | ----------------------------------------------- |
| top_p                 | 0.98    | Nucleus sampling threshold for token generation |
| temperature           | 0.6     | Sampling temperature (higher = more diverse)    |
| num_traj_samples      | 1       | Number of trajectory samples to generate        |
| max_generation_length | 256     | Maximum length for reasoning text generation    |
Increasing num_traj_samples generates multiple trajectory hypotheses but significantly increases GPU memory usage. Start with 1 and increase gradually.
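To build intuition for how top_p and temperature interact, here is a minimal numpy sketch of temperature-scaled nucleus (top-p) sampling. It illustrates the general technique only, not the model's internal sampler:

```python
import numpy as np

def nucleus_probs(logits, temperature=0.6, top_p=0.98):
    """Temperature-scale logits, then keep the smallest set of tokens
    whose cumulative probability reaches top_p (the 'nucleus')."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]         # most likely token first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()                # renormalize over the nucleus

logits = np.array([2.0, 1.0, 0.2, -1.0])
p = nucleus_probs(logits, temperature=0.6, top_p=0.9)
# Only tokens inside the nucleus keep nonzero probability.
```

Lowering temperature sharpens the distribution and lowering top_p shrinks the nucleus, so both push generation toward more deterministic outputs; the defaults (0.6, 0.98) keep most of the probability mass available.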

Expected variability

Vision-Language-Action models produce non-deterministic outputs due to:
  • Trajectory sampling during inference
  • Hardware differences across GPUs
  • Floating-point precision variations
With num_traj_samples=1, you may observe variance in minADE metrics across runs. This is expected behavior. For more stable evaluation, increase num_traj_samples or use the interactive notebook for visual sanity checks.

Next steps

Model architecture

Learn about the Vision-Language-Action architecture and Chain-of-Causation reasoning

HuggingFace model card

Read comprehensive details on inputs, outputs, and licensing

Research paper

Explore the technical details in the arXiv paper

Dataset

Browse the PhysicalAI-AV dataset on HuggingFace
