# Performance Tuning
## Replica Counts and GPU Distribution
The number of service replicas and their GPU assignments are configured in deployment configs located in `src/wizard/configs/deploy/`. For a local workstation, use `local_oss.yaml`.
### Understanding the Configuration
Each service has two key parameters:

- `gpus`: One container per GPU (or one container total if `gpus: null`)
- `replicas_per_container`: Each container runs this many service instances

Total replicas = `nr_gpus * replicas_per_container`. For example:

- `gpus: [0, 1, 2, 3]` → 4 containers (one per GPU)
- `replicas_per_container: 4` → 4 replicas per container
- Total: 4 × 4 = 16 service replicas
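As a sketch, these parameters sit together in a service's deploy-config entry. The service name `renderer` and the exact schema below are assumptions; check `src/wizard/configs/deploy/` for the real layout.

```yaml
# Hypothetical fragment - service name and schema are illustrative.
renderer:
  gpus: [0, 1, 2, 3]         # 4 containers, one per GPU
  replicas_per_container: 4  # 4 service instances per container
# Total: 4 containers x 4 instances = 16 renderer replicas
```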
### Balancing Replicas and Concurrent Rollouts
`n_concurrent_rollouts` is the number of rollouts (simulation episodes) each service replica can process simultaneously. Total simulation throughput capacity is therefore determined by the number of replicas together with this value.

Scaling up a configuration like `local_oss.yaml`:
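A hypothetical sketch of such a scaled-up config (values are illustrative, not the actual contents of `local_oss.yaml`, and the placement of `n_concurrent_rollouts` in the schema is an assumption):

```yaml
# Hypothetical values - not copied from local_oss.yaml.
driver:
  gpus: [0, 1]               # 2 containers
  replicas_per_container: 2  # 2 x 2 = 4 driver replicas
  n_concurrent_rollouts: 8   # rollouts each replica handles at once
# Throughput capacity: 4 replicas x 8 = 32 concurrent rollouts
```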
## Changing Inference Frequency
Changing inference frequency requires coordinating multiple timing parameters.

### Understanding Timing Parameters
The simulator has multiple synchronized "clocks":

- `control_timestep_us` (driver inference): How often the model makes decisions
- `frame_interval_us` (camera frames): How often cameras capture images
- `egopose_interval_us` (GPS/pose updates): How often position is updated
- `time_start_offset_us` (simulation start): Initial offset to avoid artifacts
For correct operation, these parameters must be mathematically aligned.
### Scenario 1: Simple Frequency Change
To change to 5Hz inference (200ms between decisions), set `control_timestep_us=200000`, match `egopose_interval_us=200000`, and use `time_start_offset_us=600000` (3 × the control timestep).
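Expressed as a config fragment (values come from the Common Frequencies table below; where these keys live in the config tree is an assumption):

```yaml
# 5Hz inference - 200 ms between decisions.
control_timestep_us: 200000
egopose_interval_us: 200000   # pose updates match inference
time_start_offset_us: 600000  # 3 x control_timestep_us
```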
### Scenario 2: High-Rate Camera with Lower Inference
To use 30Hz cameras (33.3ms) but 10Hz inference (100ms):

- Camera captures at 30Hz: `frame_interval_us=33334` (33.3ms)
- Inference runs at 10Hz: `control_timestep_us=100002` (must be 3 × 33334)
- Subsample frames: `driver.inference.frames_subsample=3` (use every 3rd frame)
- Egopose matches inference: `egopose_interval_us=100002`
- Time offset aligns: `time_start_offset_us=300006` (3 × 100002)
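Collected as a config fragment (the key locations in the config tree are assumptions):

```yaml
# 30Hz cameras, 10Hz inference.
frame_interval_us: 33334      # 33.3 ms camera period
control_timestep_us: 100002   # 3 x 33334 - decide on every 3rd frame
egopose_interval_us: 100002   # pose updates match inference
time_start_offset_us: 300006  # 3 x control_timestep_us
driver:
  inference:
    frames_subsample: 3       # use every 3rd camera frame
```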
### Common Frequencies
| Frequency | control_timestep_us | egopose_interval_us | time_start_offset_us | Notes |
|---|---|---|---|---|
| 2Hz | 500000 (500ms) | 500000 | 500000 or 1500000 | VaVAM default |
| 5Hz | 200000 (200ms) | 200000 | 600000 (3×) | Example config |
| 10Hz | 100000 (100ms) | 100000 | 300000 (3×) | Base default |
| 30Hz | 33334 (33.3ms) | 33334 | 100002 (3×) | High frequency |
Most configs use `time_start_offset_us = 3 × control_timestep_us` to avoid artifacts at scene start.

### Validation
The `assert_zero_decision_delay` flag (enabled by default) validates timing synchronization at runtime:
- Camera frames complete exactly at decision time
- Egopose updates complete exactly at decision time
# Viewing Results and Metrics
## Results Directory Structure
After a run completes, results are in `wizard.log_dir` (e.g., `runs/{RUN_DIR}/`):
### rollouts/

Simulation logs organized by scene and batch:

- `rollouts/{scene_id}/{batch_uuid}/rollout.asl` - Full simulation log
- `rollouts/{scene_id}/{batch_uuid}/metrics.parquet` - Per-rollout metrics
- `rollouts/{scene_id}/{batch_uuid}/{clipgt_id}_{batch_id}_{rollout_id}.mp4` - Evaluation video
- `_complete` - Marker file indicating successful completion
### aggregate/

Aggregated results across all rollouts:

- `metrics_results.txt` - Formatted table of driving scores (mean, std, quantiles)
- `metrics_results.png` - Visual summary of driving quality metrics
- `metrics_unprocessed.parquet` - Combined metrics from all rollouts
- `videos/` - Videos organized by violation type (collision_at_fault, offroad, etc.)
### telemetry/

Performance profiling data:

- `metrics.prom` - Prometheus metrics from simulation
- `metrics_plot.png` - Performance visualization (CPU/GPU/RPC metrics)
### txt-logs/

Per-service debug logs for troubleshooting
### wizard-config.yaml

Resolved configuration used for this run (after Hydra inheritance)
## Understanding Driving Quality Metrics
The simulation evaluates driving quality across multiple dimensions. Results are in `aggregate/metrics_results.txt` and visualized in `aggregate/metrics_results.png`.
### Safety Metrics (Binary)
0 = pass, 1 = fail

- `collision_at_fault`: Driver caused a collision (front/lateral impact)
- `collision_rear`: Rear-end collision (not at fault)
- `offroad`: Vehicle drove off the road
### Performance Metrics (Continuous)
- `dist_to_gt_trajectory`: Maximum distance from the ground-truth path (meters)
  - Lower is better; indicates how closely the driver follows expected routes
  - Aggregated using MAX over time (worst deviation during the drive)
- `duration_frac_20s`: Fraction of the 20s drive completed before any failure
  - 1.0 = completed the full 20s without issues
  - Less than 1.0 = failed early (collision, off-road, or excessive deviation)
### Distance Between Incidents
- `avg_dist_between_incidents`: Average km traveled per incident (collision or offroad)
  - Higher is better; measures safety over distance (e.g., 10 km driven with 2 incidents → 5 km between incidents)
- `avg_dist_between_incidents_at_fault`: Average km traveled per at-fault incident
  - Higher is better; excludes rear-end collisions not caused by the driver
### Interpreting Results
The `aggregate/metrics_results.txt` file shows statistics (mean, std, quantiles) across all rollouts, and videos under `aggregate/videos/violations/` are organized by failure type for easy review.
# Performance Metrics
## Automatically Generated Metrics Plot
After each simulation run, AlpaSim automatically generates a comprehensive performance visualization at `runs/{RUN_DIR}/metrics/metrics_plot.png`.
This 3×3 grid plot includes:
**Row 1: RPC Performance**
- RPC Duration histogram - Total time from call start to coroutine resumption
- RPC Blocking histogram - Event loop scheduler delay
- RPC Queue Depth histogram - Service saturation levels
- Rollout Duration histogram - Total time per rollout
- Step Duration histogram - Time per simulation step
- Service Configuration table - Shows replica counts and capacity
- CPU Utilization boxplots - Per-service CPU usage
- GPU Utilization boxplots - GPU compute usage
- GPU Memory boxplots - Memory usage with capacity line
- Async worker idle percentage - How much time runtime spent idle
- Sim seconds per rollout - Wallclock time per simulation
## Identifying Bottlenecks
**High Queue Depth**: Service is saturated → increase `replicas_per_container` or `n_concurrent_rollouts`

**High RPC Duration**: Service is slow → consider optimization or scaling

**Low GPU Utilization (<50%)**: Underutilized → can increase load by scaling concurrent rollouts

**High GPU Utilization (>90%)**: May be saturated → check for throttling, consider adding GPUs

**Unbalanced Service Config**: Total capacity should match across all services to avoid bottlenecks
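For example, a balanced two-service deployment might look like this (service names, schema, and the placement of `n_concurrent_rollouts` are assumptions for illustration):

```yaml
# Hypothetical sketch - keep total capacity equal across services.
renderer:
  gpus: [0, 1]
  replicas_per_container: 4  # 2 x 4 = 8 replicas
  n_concurrent_rollouts: 4   # capacity: 8 x 4 = 32 rollouts
driver:
  gpus: [2, 3]
  replicas_per_container: 2  # 2 x 2 = 4 replicas
  n_concurrent_rollouts: 8   # capacity: 4 x 8 = 32, matches renderer
```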
## Performance Indicators
- Low idle percentage (less than 20%) → Runtime is busy, good utilization
- High idle percentage (greater than 80%) → Lots of waiting, check for bottlenecks
- Consistent rollout times → Good stability
- Wide rollout time variance → Investigate outliers in logs
# Simulation Configuration
## Enabling/Disabling Services
Use `runtime.endpoints.<service>.skip` to disable services:
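For example (the endpoint name `driver` is illustrative; actual service names depend on your deploy config):

```yaml
# Hypothetical sketch: skip an endpoint so its service never starts.
runtime:
  endpoints:
    driver:
      skip: true
```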
## Changing the Model
By default, the VaVAM driver and model are used. Model weights are downloaded using `data/download_vavam_assets.sh` and stored in `data/vavam-driver/`.
### Using a Different Model
Place a custom `vavam-driver` directory at `data/vavam-driver/` (in the repository root). The wizard mounts `defines.vavam_driver` as `/mnt/vavam_driver` in the container, and the driver loads the model from that path.
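To point the mount somewhere else, an override of `defines.vavam_driver` might look like the following (the override mechanism and key path shown here are assumptions):

```yaml
# Hypothetical override - redirect the model directory.
defines:
  vavam_driver: /path/to/custom-vavam-driver  # mounted as /mnt/vavam_driver
```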
### Using a Different Driver/Inference Code
To customize driver/inference code: files in `src/driver/` are automatically mounted into containers at runtime, so changes there do not require rebuilding the driver container image.
# Troubleshooting
## Common Issues
**Rollouts directory not appearing**

Cause: Simulation failed to start or complete

Solution:

- Check console logs for the first error message
- Verify all services started successfully
- Check `txt-logs/` for service-specific errors
- Ensure scenes downloaded correctly to `data/nre-artifacts/all-usdzs/`
**Out of memory errors**

Cause: GPU memory exhausted

Solution:

- Reduce `n_concurrent_rollouts` per service
- Reduce `replicas_per_container`
- Use smaller batch sizes
- Check GPU memory usage in `metrics/metrics_plot.png`
**Timing synchronization errors**

Cause: Misaligned timing parameters

Solution:

- Verify `egopose_interval_us` equals `control_timestep_us`
- Ensure `time_start_offset_us` is a multiple of `control_timestep_us`
- Check that the camera `frame_interval_us` aligns with the control timestep
- Review the Changing Inference Frequency section
**Slow simulation performance**

Cause: Service bottleneck or misconfiguration

Solution:

- Check `metrics/metrics_plot.png` for queue depths and utilization
- Identify the bottleneck service (high queue depth)
- Increase replicas or concurrent rollouts for that service
- Verify all services have balanced total capacity