Quickstart
This guide shows you how to run inference with Alpamayo 1 and generate trajectory predictions with Chain-of-Causation reasoning.

Prerequisites
Before starting, ensure you have:

- Completed the installation steps
- Activated your virtual environment
- Authenticated with HuggingFace
- An NVIDIA GPU with ≥24 GB VRAM
Run the test inference script
The simplest way to get started is to run the provided test script. The first run will download example data and model weights (22 GB); subsequent runs use cached weights. The script will:
- Load a sample clip from the PhysicalAI-AV dataset
- Run inference to predict trajectories
- Generate Chain-of-Causation reasoning traces
- Compute the minADE (minimum Average Displacement Error) metric
Understanding the code
Here’s how the inference pipeline works, step by step.

Load the dataset
Load a specific clip from the PhysicalAI-AV dataset. The dataset includes multi-camera images, ego vehicle history (position and rotation), and ground-truth trajectories.
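The exact loader ships with the Alpamayo 1 codebase; as a rough sketch of fetching one clip from the HuggingFace Hub (the repo id, function name, and clip layout below are assumptions for illustration):

```python
def load_clip(clip_id: str) -> str:
    """Sketch: download a single PhysicalAI-AV clip and return its local path.

    The repo id and per-clip directory layout are assumptions; the real
    loader in the Alpamayo 1 codebase may differ.
    """
    # Lazy import: only needed (and only hits the network) when called.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="nvidia/PhysicalAI-AV",      # assumed dataset repo id
        repo_type="dataset",
        allow_patterns=[f"{clip_id}/*"],     # fetch only the requested clip
    )
    # A clip bundles multi-camera images, ego history (position + rotation),
    # and the ground-truth trajectory later used for minADE.
    return local_dir
```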
Load the model and processor
Load the pre-trained Alpamayo 1 model and its processor. The model uses bfloat16 precision for efficient GPU memory usage.
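A minimal loading sketch using the standard `transformers` pattern. The model id and the `Auto*` classes are assumptions (check the model card for the exact entry points); the bfloat16 dtype matches what this guide describes:

```python
def load_model_and_processor(model_id: str = "nvidia/Alpamayo-1"):
    """Sketch: load model + processor in bfloat16 (model id is an assumption)."""
    # Lazy imports: torch and transformers are only required at call time.
    import torch
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = (
        AutoModel.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,  # roughly halves memory vs. float32
            trust_remote_code=True,
        )
        .to("cuda")
        .eval()
    )
    return model, processor
```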
Run inference
Generate trajectory predictions together with Chain-of-Causation reasoning.
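A sketch of the inference call. The method name `generate_trajectories` and the processor signature are hypothetical placeholders; the parameter names and defaults (`top_p`, `temperature`, `num_traj_samples`, `max_generation_length`) and the output structure come from this guide:

```python
def predict(model, processor, clip, num_traj_samples: int = 1):
    """Hypothetical inference call; the real entry point may differ."""
    import torch  # lazy import: only needed at call time

    inputs = processor(clip)  # tensorize camera images + ego history (assumed)
    with torch.inference_mode():
        pred_xyz, extra = model.generate_trajectories(  # hypothetical method
            **inputs,
            top_p=0.98,
            temperature=0.6,
            num_traj_samples=num_traj_samples,
            max_generation_length=256,
        )
    # pred_xyz: [batch_size, num_traj_sets, num_traj_samples, 64, 3]
    # extra["cot"]: Chain-of-Causation reasoning text
    return pred_xyz, extra
```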
You can increase `num_traj_samples` to generate multiple trajectory hypotheses, but this requires more GPU memory.

Complete example
The full inference script combines the steps above: load a clip, load the model, run inference, and compute minADE.

Understanding the outputs
Alpamayo 1 produces two key outputs.

Trajectory predictions
- Format: `pred_xyz` with shape `[batch_size, num_traj_sets, num_traj_samples, 64, 3]`
- Content: 64 waypoints representing 6.4 seconds of predicted vehicle motion (10 Hz)
- Coordinates: XYZ positions in the ego vehicle’s coordinate frame
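The shape and timing can be verified with a dummy array of the documented dimensions (the assumption that the first waypoint falls at t = 0.1 s is ours, for illustration):

```python
import numpy as np

# Dummy prediction with the documented shape:
# [batch_size, num_traj_sets, num_traj_samples, 64, 3]
pred_xyz = np.zeros((1, 1, 3, 64, 3))

# 64 waypoints sampled at 10 Hz cover a 6.4-second horizon.
num_waypoints = pred_xyz.shape[3]
horizon_s = num_waypoints / 10.0  # 6.4

# Per-waypoint timestamps, assuming the first waypoint is at t = 0.1 s:
timestamps = (np.arange(num_waypoints) + 1) / 10.0

# One candidate trajectory in the ego frame: [64, 3] XYZ waypoints.
first_sample = pred_xyz[0, 0, 0]
```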
Chain-of-Causation reasoning
- Format: Natural language text in `extra["cot"]`
- Content: Explanations of the causal factors influencing the predicted trajectory
- Example: “The vehicle ahead is slowing down due to traffic. I should reduce speed and maintain safe following distance.”
Interactive notebook
For visual exploration and trajectory visualization, use the included Jupyter notebook. It provides:

- Multi-camera image visualization
- Trajectory plotting (predicted vs. ground truth)
- Interactive parameter tuning
- Matplotlib-based visualizations
Inference parameters
You can customize inference behavior with these parameters:

| Parameter | Default | Description |
|---|---|---|
| `top_p` | 0.98 | Nucleus sampling threshold for token generation |
| `temperature` | 0.6 | Sampling temperature (higher = more diverse) |
| `num_traj_samples` | 1 | Number of trajectory samples to generate |
| `max_generation_length` | 256 | Maximum length for reasoning text generation |
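To build intuition for `top_p` and `temperature`, here is a generic nucleus-sampling step in numpy. This is a textbook illustration, not Alpamayo 1's actual sampler:

```python
import numpy as np

def sample_token(logits, top_p=0.98, temperature=0.6, rng=None):
    """Generic nucleus sampling: scale logits by temperature, keep the
    smallest set of tokens whose cumulative probability exceeds top_p,
    then sample from that set."""
    rng = rng or np.random.default_rng(0)
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))   # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # most likely first
    cumulative = np.cumsum(probs[order])
    nucleus = order[: np.searchsorted(cumulative, top_p) + 1]
    kept = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=kept))
```

Lower `temperature` concentrates probability on the top tokens; lower `top_p` shrinks the nucleus. With a sharply peaked distribution and a small `top_p`, only the argmax token survives.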
Expected variability
Vision-Language-Action models produce non-deterministic outputs due to:

- Trajectory sampling during inference
- Hardware differences across GPUs
- Floating-point precision variations
Even with `num_traj_samples=1`, you may observe variance in minADE metrics across runs. This is expected behavior. For more stable evaluation, increase `num_traj_samples` or use the interactive notebook for visual sanity checks.
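minADE itself is straightforward to compute: for each sampled trajectory, average the Euclidean distance to the ground truth over all waypoints, then take the minimum over samples. A self-contained sketch (array shapes simplified to one clip):

```python
import numpy as np

def min_ade(pred_xyz, gt_xyz):
    """minADE over trajectory samples.

    pred_xyz: [num_traj_samples, 64, 3] predicted waypoints
    gt_xyz:   [64, 3] ground-truth waypoints
    """
    # Per-waypoint Euclidean displacement: [num_traj_samples, 64]
    displacements = np.linalg.norm(pred_xyz - gt_xyz, axis=-1)
    # Average Displacement Error per sample, then the best (minimum) one.
    ade_per_sample = displacements.mean(axis=-1)
    return float(ade_per_sample.min())
```

This also shows why more samples stabilize the metric: the minimum over a larger set fluctuates less from run to run.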
Next steps
Model architecture
Learn about the Vision-Language-Action architecture and Chain-of-Causation reasoning
HuggingFace model card
Read comprehensive details on inputs, outputs, and licensing
Research paper
Explore the technical details in the arXiv paper
Dataset
Browse the PhysicalAI-AV dataset on HuggingFace