This page provides comprehensive specifications for Alpamayo-R1's input and output formats, including tensor shapes, data types, coordinate systems, and example usage.
Alpamayo-R1 requires two primary types of input: multi-camera video frames and egomotion history.
Multi-Camera Video
Alpamayo-R1 processes video from multiple camera viewpoints to build a comprehensive understanding of the driving scene.
```python
# From alpamayo_r1/test_inference.py:30-36
data = load_physical_aiavdataset(clip_id, t0_us=5_100_000)
messages = helper.create_message(data["image_frames"].flatten(0, 1))
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```
Expected format:

- Type: RGB images
- Channels: 3 (RGB)
- Preprocessing: handled by the `AutoProcessor` from Qwen3-VL
- Resolution: variable (controlled by the `min_pixels` and `max_pixels` config)
- Cameras: multiple viewpoints (e.g., front, left, right, rear)
The Qwen3-VL processor automatically handles image resizing, normalization, and patch tokenization. You don’t need to manually preprocess images.
Processor Configuration
```python
# From alpamayo_r1/models/base_model.py:251-259
processor_kwargs = {}
if self.min_pixels is not None:
    processor_kwargs["min_pixels"] = self.min_pixels
if self.max_pixels is not None:
    processor_kwargs["max_pixels"] = self.max_pixels
processor = AutoProcessor.from_pretrained(
    self.vlm_name_or_path, **processor_kwargs
)
```
Resolution parameters:

- `min_pixels`: minimum image resolution (default depends on the Qwen3-VL config)
- `max_pixels`: maximum image resolution (default depends on the Qwen3-VL config)

Higher resolutions provide more detail but increase memory usage.
Egomotion History
Egomotion history provides the vehicle’s past trajectory, enabling the model to infer current velocity and predict smooth future motion.
History XYZ Positions
`ego_history_xyz: torch.Tensor`

- Shape: `(batch_size, num_trajectories, history_length, 3)`

Typical values:

- `batch_size`: usually 1 for inference
- `num_trajectories`: 1 (single ego vehicle)
- `history_length`: variable (e.g., 20 timesteps = 2 seconds at 10 Hz)
- `3`: (x, y, z) coordinates in meters
Coordinate frame:

- Ego-centric: the last history position (t=0) should be at the origin (0, 0, 0)
- X-axis: forward
- Y-axis: left
- Z-axis: up
The last history waypoint `ego_history_xyz[..., -1, :]` represents the current position and should be `[0, 0, 0]` in the ego frame.
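If your logged history is in world coordinates, the convention can be enforced by subtracting the last waypoint from every position. A minimal sketch with a random dummy tensor (the shapes follow the specification above; the data is made up):

```python
import torch

# Dummy world-frame history: batch=1, 1 trajectory, 20 timesteps, (x, y, z)
ego_history_xyz = torch.randn(1, 1, 20, 3)

# Shift so the last waypoint (the current position) sits at the origin
ego_history_xyz = ego_history_xyz - ego_history_xyz[..., -1:, :]

# The current position is now exactly [0, 0, 0]
assert torch.allclose(ego_history_xyz[..., -1, :], torch.zeros(3))
```

Note that this only recenters positions; headings in `ego_history_rot` would need the corresponding rotation applied as well.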
History Rotations
`ego_history_rot: torch.Tensor`

- Shape: `(batch_size, num_trajectories, history_length, 3, 3)`
- Format: SO(3) rotation matrices (not quaternions or Euler angles)

Typical values:

- Same batch dimensions as `ego_history_xyz`
- `3 × 3`: rotation matrix representing vehicle orientation
Orientation convention:

- Rotation from the world frame to the ego frame at each timestep
- The last rotation `ego_history_rot[..., -1, :, :]` represents the current heading
```python
# Example: Extract yaw from rotation matrix
# From alpamayo_r1/action_space/unicycle_accel_curvature.py:215
theta = so3_to_yaw_torch(traj_history_rot)
```
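If you want to sanity-check that conversion, the yaw of a rotation about the z (up) axis can be recovered with the standard `atan2` formula. This is a sketch of the underlying math, not the library's implementation of `so3_to_yaw_torch`:

```python
import math
import torch

def yaw_from_so3(rot: torch.Tensor) -> torch.Tensor:
    # For a rotation about z: yaw = atan2(R[1, 0], R[0, 0])
    return torch.atan2(rot[..., 1, 0], rot[..., 0, 0])

# A pure yaw rotation round-trips through the formula
theta = 0.5  # radians
rot = torch.tensor([
    [math.cos(theta), -math.sin(theta), 0.0],
    [math.sin(theta),  math.cos(theta), 0.0],
    [0.0,              0.0,             1.0],
])
recovered = yaw_from_so3(rot)
assert abs(recovered.item() - theta) < 1e-5
```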
After processing images and creating chat messages, the input is tokenized:
```python
# From alpamayo_r1/test_inference.py:38-50
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    continue_final_message=True,
    return_dict=True,
    return_tensors="pt",
)
model_inputs = {
    "tokenized_data": inputs,
    "ego_history_xyz": data["ego_history_xyz"],
    "ego_history_rot": data["ego_history_rot"],
}
```
Tokenized data contents:

- `input_ids`: token IDs for images and text
- `attention_mask`: attention mask for padding
- `pixel_values`: processed image tensors (if applicable)
- Additional processor-specific keys
```python
import torch

from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
from alpamayo_r1 import helper

# Load model
model = AlpamayoR1.from_pretrained(
    "nvidia/Alpamayo-R1-10B",
    dtype=torch.bfloat16,
).to("cuda")
processor = helper.get_processor(model.tokenizer)

# Prepare inputs
image_frames = ...  # Your multi-camera images
ego_history_xyz = torch.zeros(1, 1, 20, 3)  # Example: 2s history
ego_history_rot = torch.eye(3).unsqueeze(0).unsqueeze(0).unsqueeze(0).repeat(
    1, 1, 20, 1, 1
)  # Example: no rotation

messages = helper.create_message(image_frames)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
model_inputs = {
    "tokenized_data": inputs,
    "ego_history_xyz": ego_history_xyz.to("cuda"),
    "ego_history_rot": ego_history_rot.to("cuda"),
}
```
Model Outputs
Alpamayo-R1 produces three types of outputs: trajectory predictions, rotation predictions, and Chain-of-Causation reasoning traces.
Trajectory Predictions (XYZ)
- Shape: `(batch_size, num_traj_sets, num_traj_samples, num_waypoints, 3)`
- Example shape: `(1, 1, 6, 64, 3)` for 6 samples

Dimensions:

- `batch_size`: number of input scenes (typically 1)
- `num_traj_sets`: number of independent sampling runs (typically 1)
- `num_traj_samples`: number of trajectory samples per input (e.g., 1-10)
- `num_waypoints`: 64 (fixed at 10 Hz for 6.4 seconds)
- `3`: (x, y, z) coordinates in meters
Coordinate system:

- Origin: current ego position (last history waypoint)
- Frame: ego-centric (not world coordinates)
- X-axis: forward direction
- Y-axis: left direction
- Z-axis: up direction (often constant for ground vehicles)
```python
# From alpamayo_r1/test_inference.py:68-69
gt_xy = data["ego_future_xyz"].cpu()[0, 0, :, :2].T.numpy()
pred_xy = pred_xyz.cpu().numpy()[0, 0, :, :, :2].transpose(0, 2, 1)
```
Important: Trajectories are in the ego frame at the current timestep, not world coordinates. You must transform predictions to world coordinates using the ego vehicle's current pose.
Rotation Predictions (SO(3))
- Shape: `(batch_size, num_traj_sets, num_traj_samples, num_waypoints, 3, 3)`
- Example shape: `(1, 1, 6, 64, 3, 3)` for 6 samples
- Format: SO(3) rotation matrices
  - Each 3×3 matrix is orthogonal with determinant +1
  - Represents vehicle heading at each waypoint
  - Not quaternions or Euler angles
Coordinate convention:

- Rotation from the ego frame to the waypoint frame
- Consistent with the `ego_history_rot` format
```python
# Convert to yaw angle if needed
from alpamayo_r1.geometry.rotation import so3_to_yaw_torch

yaw_angles = so3_to_yaw_torch(pred_rot)  # Extract heading angles
```
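If you want to verify that predicted matrices really are valid SO(3) rotations (orthogonal, determinant +1), a small validation sketch (the helper name and tolerance are assumptions, not part of the alpamayo_r1 package):

```python
import torch

def is_valid_so3(rot: torch.Tensor, atol: float = 1e-4) -> bool:
    """Check that (..., 3, 3) matrices are orthogonal with determinant +1."""
    eye = torch.eye(3).expand_as(rot)
    orthogonal = torch.allclose(rot @ rot.transpose(-1, -2), eye, atol=atol)
    proper = torch.allclose(
        torch.linalg.det(rot), torch.ones(rot.shape[:-2]), atol=atol
    )
    return orthogonal and proper

# Identity headings for a dummy (1, 1, 6, 64) batch of waypoints
pred_rot = torch.eye(3).expand(1, 1, 6, 64, 3, 3)
assert is_valid_so3(pred_rot)
```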
Chain-of-Causation Reasoning
`extra: dict[str, np.ndarray]`

Returned when `return_extra=True` in the inference call.
Contents:

```python
extra = {
    "cot": np.ndarray,  # Chain-of-Causation traces
    # Additional keys for other extracted tokens
}
```

- Shape: `extra["cot"]` has shape `(batch_size, num_traj_sets, num_traj_samples)`
- Type: each element is a string containing the reasoning trace
```python
# From alpamayo_r1/test_inference.py:66
print("Chain-of-Causation (per trajectory):\n", extra["cot"][0])
```
Example output:

```
[
  [
    [
      "The vehicle ahead is braking, indicated by brake lights. "
      "There is a pedestrian on the right sidewalk approaching the crosswalk. "
      "The ego vehicle should decelerate smoothly to maintain safe following distance "
      "and prepare to stop if the pedestrian enters the crosswalk."
    ]
  ]
]
```
Each trajectory sample gets its own CoC trace when num_traj_samples > 1 due to stochastic generation.
The main inference method returns different outputs based on return_extra:
```python
# From alpamayo_r1/models/alpamayo_r1.py:122-328
def sample_trajectories_from_data_with_vlm_rollout(
    self,
    data: dict[str, Any],
    top_p: float = 0.98,
    top_k: int | None = None,
    temperature: float = 0.6,
    num_traj_samples: int = 6,
    num_traj_sets: int = 1,
    diffusion_kwargs: dict[str, Any] | None = None,
    *args: Any,
    **kwargs: Any,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
```
Return types:

Without extra (`return_extra=False`, default):

```python
pred_xyz, pred_rot = model.sample_trajectories_from_data_with_vlm_rollout(
    data=model_inputs,
    return_extra=False,
)
# Returns: (torch.Tensor, torch.Tensor)
```

With extra (`return_extra=True`):

```python
pred_xyz, pred_rot, extra = model.sample_trajectories_from_data_with_vlm_rollout(
    data=model_inputs,
    return_extra=True,
)
# Returns: (torch.Tensor, torch.Tensor, dict)
```
Data Types & Precision
```python
# Recommended configuration
model = AlpamayoR1.from_pretrained(
    "nvidia/Alpamayo-R1-10B",
    dtype=torch.bfloat16,  # Memory-efficient 16-bit precision
).to("cuda")

with torch.autocast("cuda", dtype=torch.bfloat16):
    pred_xyz, pred_rot, extra = model.sample_trajectories_from_data_with_vlm_rollout(
        data=model_inputs,
        # ...
    )
```
Supported dtypes:

- `torch.bfloat16`: recommended (~24 GB VRAM)
- `torch.float16`: alternative (may have numerical issues)
- `torch.float32`: full precision (requires >40 GB VRAM)
Output Data Types
Outputs match the model's dtype:

```python
print(pred_xyz.dtype)  # torch.bfloat16 (if model is bfloat16)
print(pred_rot.dtype)  # torch.bfloat16
```

Convert to float32 for downstream processing if needed:

```python
pred_xyz_fp32 = pred_xyz.float()  # torch.float32
```
Temporal Specifications
| Property | Value | Notes |
|---|---|---|
| Frequency | 10 Hz | Waypoints sampled every 0.1 seconds |
| Waypoints | 64 | Fixed number |
| Horizon | 6.4 seconds | Total prediction duration |
| Time interval | dt = 0.1 s | Defined in action space config |
Waypoint timestamps:

```python
import numpy as np

timestamps = np.arange(1, 65) * 0.1  # [0.1, 0.2, ..., 6.4] seconds
# Note: First waypoint is 0.1s in the future, not current time
```
History vs. Future
```
      History               Current        Future Prediction
<-------------------------->|<-------------------------->
t=-2s        ...        t=0s| t=0.1s       ...       t=6.4s
                            ^
                            Origin for predictions
```
- History length: variable (e.g., 20 timesteps = 2 seconds)
- Current time: t=0, origin of the prediction frame
- Future horizon: 64 timesteps = 6.4 seconds
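The timeline above can be reproduced numerically. A sketch assuming 20 history steps (the history length is variable) and the fixed 64-step future, both at dt = 0.1 s:

```python
import numpy as np

dt = 0.1
history_len, future_len = 20, 64  # history length is an assumption; future is fixed

# History timestamps end at t=0 (the current time, origin of predictions)
t_history = np.arange(-(history_len - 1), 1) * dt  # [-1.9, ..., -0.1, 0.0]

# Future timestamps start one step ahead, at t=0.1 s
t_future = np.arange(1, future_len + 1) * dt       # [0.1, 0.2, ..., 6.4]

assert t_history[-1] == 0.0
assert np.isclose(t_future[-1], 6.4)
```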
Complete Usage Example
```python
import torch
import numpy as np

from alpamayo_r1.models.alpamayo_r1 import AlpamayoR1
from alpamayo_r1.load_physical_aiavdataset import load_physical_aiavdataset
from alpamayo_r1 import helper

# Load data
clip_id = "030c760c-ae38-49aa-9ad8-f5650a545d26"
data = load_physical_aiavdataset(clip_id, t0_us=5_100_000)

# Load model
model = AlpamayoR1.from_pretrained(
    "nvidia/Alpamayo-R1-10B",
    dtype=torch.bfloat16,
).to("cuda")
processor = helper.get_processor(model.tokenizer)

# Prepare inputs
messages = helper.create_message(data["image_frames"].flatten(0, 1))
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
model_inputs = {
    "tokenized_data": inputs,
    "ego_history_xyz": data["ego_history_xyz"].to("cuda"),
    "ego_history_rot": data["ego_history_rot"].to("cuda"),
}

# Run inference
torch.cuda.manual_seed_all(42)
with torch.autocast("cuda", dtype=torch.bfloat16):
    pred_xyz, pred_rot, extra = model.sample_trajectories_from_data_with_vlm_rollout(
        data=model_inputs,
        top_p=0.98,
        temperature=0.6,
        num_traj_samples=6,
        max_generation_length=256,
        return_extra=True,
    )

# Print output shapes
print(f"Predicted XYZ shape: {pred_xyz.shape}")        # (1, 1, 6, 64, 3)
print(f"Predicted rotations shape: {pred_rot.shape}")  # (1, 1, 6, 64, 3, 3)
print(f"CoC shape: {extra['cot'].shape}")              # (1, 1, 6)

# Access individual samples
batch_idx = 0
set_idx = 0
for sample_idx in range(6):
    trajectory = pred_xyz[batch_idx, set_idx, sample_idx]     # (64, 3)
    reasoning = extra["cot"][batch_idx, set_idx, sample_idx]  # str
    print(f"\nSample {sample_idx}:")
    print(f"First waypoint (t=0.1s): {trajectory[0]}")
    print(f"Last waypoint (t=6.4s): {trajectory[-1]}")
    print(f"Reasoning: {reasoning[:100]}...")  # First 100 chars

# Evaluate minADE
gt_xy = data["ego_future_xyz"].cpu()[0, 0, :, :2].T.numpy()
pred_xy = pred_xyz.cpu().numpy()[0, 0, :, :, :2].transpose(0, 2, 1)
diff = np.linalg.norm(pred_xy - gt_xy[None, ...], axis=1).mean(-1)
min_ade = diff.min()
print(f"\nMinimum ADE: {min_ade:.3f} meters")
```
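The minADE computation at the end of the example can be illustrated on synthetic data: each sample's ADE is its mean Euclidean error over waypoints, and minADE keeps the best sample. Here one candidate is set equal to the ground truth, so the metric comes out to zero (all data below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_waypoints = 6, 64

gt = np.zeros((num_waypoints, 2))                         # ground-truth XY path
preds = rng.normal(size=(num_samples, num_waypoints, 2))  # 6 candidate paths
preds[0] = gt                                             # sample 0 matches exactly

# ADE per sample: mean Euclidean error over waypoints; minADE takes the best
ade = np.linalg.norm(preds - gt[None], axis=-1).mean(-1)
min_ade = ade.min()
assert min_ade == 0.0
```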
Common issues and how to avoid them:
Shape Mismatches
Error: `ego_history_xyz` and `ego_history_rot` must have compatible shapes.

✅ Correct: both have shape `(B, N, T, ...)`
❌ Wrong: different batch sizes or trajectory counts
```python
# Correct
ego_history_xyz.shape  # (1, 1, 20, 3)
ego_history_rot.shape  # (1, 1, 20, 3, 3)

# Wrong - mismatched history length
ego_history_xyz.shape  # (1, 1, 20, 3)
ego_history_rot.shape  # (1, 1, 15, 3, 3)  # Error!
```
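A small hypothetical helper (not part of the alpamayo_r1 package) can catch these mismatches before they reach the model:

```python
import torch

def check_history_shapes(xyz: torch.Tensor, rot: torch.Tensor) -> None:
    """Hypothetical validator: leading (B, N, T) dims of xyz and rot must agree."""
    if xyz.shape[:-1] != rot.shape[:-2]:
        raise ValueError(f"Mismatched history shapes: {xyz.shape} vs {rot.shape}")
    if xyz.shape[-1] != 3 or rot.shape[-2:] != (3, 3):
        raise ValueError("Expected (..., 3) positions and (..., 3, 3) rotations")

check_history_shapes(torch.zeros(1, 1, 20, 3), torch.zeros(1, 1, 20, 3, 3))  # OK
```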
Device Mismatches
```python
# Ensure all inputs are on the same device as the model
model_inputs = helper.to_device(model_inputs, "cuda")
```
Origin Convention
The last history position should be at the origin (0, 0, 0) in the ego frame. If your data uses a different convention, transform it before passing to the model.
```python
# Transform to ego frame
ego_history_xyz_ego = ego_history_xyz - ego_history_xyz[:, :, -1:, :]
# Now ego_history_xyz_ego[..., -1, :] == [0, 0, 0]
```
Memory Requirements
Input size affects memory usage:
| Configuration | VRAM Usage | Notes |
|---|---|---|
| Base model (bfloat16) | ~20 GB | No inference |
| + 1 sample inference | ~22 GB | Minimum configuration |
| + 6 samples inference | ~24 GB | Typical multi-sample |
| + 10 samples inference | ~26 GB | High coverage |
| Higher image resolution | +2-4 GB | Depends on `max_pixels` |
GPUs with less than 24 GB VRAM will likely encounter CUDA out-of-memory errors with multi-sample generation.
Reducing memory usage:

- Use `num_traj_samples=1`
- Lower `max_pixels` in the processor config
- Process smaller batches
- Enable gradient checkpointing (training only)
Next Steps
- Architecture: understand how inputs flow through the model
- Trajectory Prediction: learn how outputs are generated
- Chain-of-Causation: interpret reasoning outputs