## Architecture Components

The Alpamayo 1 architecture consists of four primary components working in concert:

### 1. Vision-Language-Action Backbone (VLM)

The foundation of Alpamayo 1 is built on the Qwen3-VL architecture, a state-of-the-art vision-language model:

- Model: Qwen3-VL-8B-Instruct (10B parameters total)
- Implementation: `Qwen3VLForConditionalGeneration` from HuggingFace Transformers
- Purpose: processes multi-camera video streams and generates Chain-of-Causation reasoning traces
- Discrete trajectory tokens: 768 tokens (`<i0>` through `<i767>`)
- Special tokens: `<|traj_history_start|>`, `<|traj_future_start|>`, `<|cot_start|>`, etc.
- Total vocabulary: original vocabulary + 768 discrete trajectory tokens + special tokens
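The vocabulary layout described above can be sketched as follows. The trajectory and special token strings come from the list above; the base vocabulary size used here is an illustrative assumption, not the model's actual figure.

```python
# Sketch of the extended vocabulary layout. The 768 trajectory tokens and
# the special token names follow the docs; base_vocab_size is a
# hypothetical placeholder for the underlying tokenizer's size.

# Discrete trajectory tokens <i0> ... <i767>
traj_tokens = [f"<i{k}>" for k in range(768)]

# Special structural tokens (only the three named in the docs are listed)
special_tokens = ["<|traj_history_start|>", "<|traj_future_start|>", "<|cot_start|>"]

base_vocab_size = 151_936  # hypothetical base tokenizer vocabulary size
total_vocab_size = base_vocab_size + len(traj_tokens) + len(special_tokens)

print(traj_tokens[0], traj_tokens[-1])  # <i0> <i767>
```

In practice such tokens would be appended to the tokenizer (and the embedding matrix resized accordingly), so the trajectory tokens occupy a contiguous block at the end of the vocabulary.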
### 2. Expert Model

The expert model is a specialized decoder that processes the VLM's output to predict actions:

- Shares its text architecture configuration with the VLM
- Uses non-causal attention by default (`expert_non_causal_attention=True`)
- Reuses the VLM's KV cache for efficient inference
- Has no separate embedding layer (embeddings come from the action projection)
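One way to picture the non-causal default is to compare attention masks. This minimal NumPy sketch uses the convention 1 = "may attend", which is an assumption for illustration, not the model's actual mask format:

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a (seq_len, seq_len) mask where 1 means 'may attend'."""
    if causal:
        # Standard decoder mask: position i sees only positions <= i
        return np.tril(np.ones((seq_len, seq_len), dtype=np.int64))
    # Non-causal (expert_non_causal_attention=True): every position in
    # the denoised action block can attend to every other position
    return np.ones((seq_len, seq_len), dtype=np.int64)

print(attention_mask(4, causal=True))
print(attention_mask(4, causal=False))
```

Bidirectional attention makes sense here because the expert denoises a whole trajectory block at once rather than generating it token by token.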
### 3. Action Space

Alpamayo 1 uses a unicycle kinematic model with acceleration and curvature as control inputs:

- Dimensions: `(64, 2)`, i.e. 64 waypoints with 2 controls per waypoint
- Controls: `[acceleration, curvature]` at each waypoint
- Temporal resolution: 10 Hz (0.1 s intervals via `dt=0.1`)
- Prediction horizon: 6.4 seconds (64 waypoints × 0.1 s)
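The unicycle model above can be sketched as a simple rollout that integrates the `(64, 2)` controls into XY waypoints. The Euler integration order, the initial speed, and the starting pose are illustrative assumptions:

```python
import numpy as np

def rollout_unicycle(controls: np.ndarray, v0: float = 0.0, dt: float = 0.1) -> np.ndarray:
    """Integrate (64, 2) [acceleration, curvature] controls into XY waypoints.

    Sketch of unicycle kinematics: speed integrates acceleration, heading
    integrates curvature * speed (yaw rate), position integrates velocity.
    Euler stepping and the v0 = 0 start are illustrative assumptions.
    """
    x = y = theta = 0.0
    v = v0
    waypoints = []
    for accel, kappa in controls:
        v += accel * dt               # acceleration updates speed
        theta += kappa * v * dt       # curvature * speed = yaw rate
        x += v * np.cos(theta) * dt
        y += v * np.sin(theta) * dt
        waypoints.append((x, y))
    return np.array(waypoints)        # shape (64, 2), spanning 6.4 s at 10 Hz

# Constant gentle acceleration, zero curvature -> straight line along +x
straight = rollout_unicycle(np.column_stack([np.full(64, 0.5), np.zeros(64)]))
print(straight.shape)  # (64, 2)
```

A nonzero curvature column would bend the same rollout into an arc, which is why this two-control parameterization is enough to express smooth driving maneuvers.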
### 4. Diffusion Decoder

Alpamayo 1 employs flow matching, a continuous normalizing flow approach, for trajectory generation:

- Integration method: Euler integration
- Inference steps: 10 (configurable)
- Input: random Gaussian noise → Output: trajectory actions
- Denoising: the expert model predicts a velocity field at each timestep
## Information Flow

The complete inference pipeline follows this sequence:

### Phase 1: VLM Processing

- Input tokenization: multi-camera images + egomotion history → token sequence
- History trajectory fusion: discrete trajectory tokens are inserted into the input
- Autoregressive generation: the VLM generates the Chain-of-Causation text
- KV cache extraction: the attention cache is stored for the expert model
### Phase 2: Expert Processing & Diffusion

- Diffusion initialization: sample random noise `x ~ N(0, I)` in action space
- Iterative denoising (10 steps):
    - Project the noisy action to token embeddings via `action_in_proj`
    - Run the expert model with the cached VLM context
    - Predict the velocity field via `action_out_proj`
    - Update the action: `x = x + dt * velocity_field`
- Action-to-trajectory conversion: map the final actions to XYZ positions + rotations
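The denoising loop above can be sketched end to end. The projection names `action_in_proj` and `action_out_proj` come from the text; the expert model itself is replaced by a stub here, and the random weights and hidden size are hypothetical stand-ins for the learned modules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for learned modules; in the real model the
# expert transformer also attends to the cached VLM KV context.
W_in = rng.normal(size=(2, 64))    # action_in_proj: 2 controls -> hidden size 64
W_out = rng.normal(size=(64, 2))   # action_out_proj: hidden -> velocity field

def expert_stub(h: np.ndarray) -> np.ndarray:
    """Placeholder for the expert model (identity here)."""
    return h

def denoise(num_steps: int = 10) -> np.ndarray:
    x = rng.normal(size=(64, 2))   # diffusion init: x ~ N(0, I) in action space
    dt = 1.0 / num_steps           # Euler step size over t in [0, 1]
    for _ in range(num_steps):
        h = x @ W_in               # project noisy action -> embeddings
        h = expert_stub(h)         # expert with cached VLM context
        velocity = h @ W_out       # predict velocity field
        x = x + dt * velocity      # Euler update
    return x                       # final (64, 2) [acceleration, curvature] actions

actions = denoise()
print(actions.shape)  # (64, 2)
```

The final actions would then be rolled through the unicycle model to obtain XYZ waypoints and rotations.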
### Phase 3: Output Generation

Outputs (per input sample):

- Trajectories: `(num_traj_samples, 64, 3)` XYZ waypoints at 10 Hz
- Rotations: `(num_traj_samples, 64, 3, 3)` SO(3) rotation matrices
- Chain-of-Causation: a text reasoning trace explaining the prediction
## Model Configuration

Key configuration parameters from `AlpamayoR1Config`:

| Parameter | Default | Description |
|---|---|---|
| `vlm_name_or_path` | `"Qwen/Qwen3-VL-8B-Instruct"` | Vision-language backbone |
| `traj_vocab_size` | `768` | Number of discrete trajectory tokens |
| `tokens_per_history_traj` | `16` | Tokens encoding egomotion history |
| `tokens_per_future_traj` | `64` | Tokens for the future trajectory |
| `model_dtype` | `"bfloat16"` | Model precision |
| `attn_implementation` | `"flash_attention_2"` | Attention mechanism |
| `expert_non_causal_attention` | `True` | Expert uses bidirectional attention |
| `keep_same_dtype` | `True` | Diffusion/action modules match expert dtype |
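The defaults in the table can be captured in a small sketch. This dataclass is a hypothetical mirror for illustration; the real `AlpamayoR1Config` lives in the model's codebase and may carry additional fields:

```python
from dataclasses import dataclass

# Hypothetical mirror of the AlpamayoR1Config defaults listed in the table;
# not the actual class from the model's codebase.
@dataclass
class AlpamayoR1ConfigSketch:
    vlm_name_or_path: str = "Qwen/Qwen3-VL-8B-Instruct"
    traj_vocab_size: int = 768
    tokens_per_history_traj: int = 16
    tokens_per_future_traj: int = 64
    model_dtype: str = "bfloat16"
    attn_implementation: str = "flash_attention_2"
    expert_non_causal_attention: bool = True
    keep_same_dtype: bool = True

cfg = AlpamayoR1ConfigSketch()
print(cfg.vlm_name_or_path, cfg.traj_vocab_size)
```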
## Memory & Compute Requirements

Model footprint:

- VLM backbone: ~8B parameters
- Expert model: ~2B parameters
- Action projections + diffusion: <100M parameters
- Total: ~10B parameters

Efficiency features:

- Flash Attention 2 for efficient attention computation
- KV cache reuse between the VLM and the expert
- bfloat16 precision to reduce memory usage
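A quick back-of-the-envelope check on the weight memory implied by these numbers, assuming all ~10B parameters are stored in bfloat16:

```python
# Weight memory only: ~10B parameters at 2 bytes each (bfloat16).
# Activations, the KV cache, and framework overhead are extra.
params = 10e9
bytes_per_param = 2  # bfloat16
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.0f} GB of weights")  # ~20 GB
```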
## Next Steps

- **Chain-of-Causation**: learn how Alpamayo 1 generates reasoning traces
- **Trajectory Prediction**: understand trajectory generation and diffusion sampling
- **Inputs & Outputs**: detailed specifications for model I/O