Effective experiment tracking is essential for RL training, where reward curves can be noisy, training can diverge unexpectedly, and runs often last many hours. verl integrates with multiple tracking backends out of the box — from cloud-hosted services like Weights & Biases and MLflow to local options like TensorBoard. This page explains how to configure them and which metrics to watch.

Supported Loggers

verl supports the following logging backends, which can be enabled simultaneously:
| Backend | Value in Config | Notes |
| --- | --- | --- |
| Console | "console" | Always-available stdout logging |
| Weights & Biases | "wandb" | Requires WANDB_API_KEY environment variable |
| SwanLab | "swanlab" | Alternative experiment tracker |
| MLflow | "mlflow" | Requires a tracking server or local directory |
| TensorBoard | "tensorboard" | Logs to tensorboard_log/{project_name}/{experiment_name} (override with TENSORBOARD_DIR env var) |
| Trackio | "trackio" | Lightweight alternative tracker |

Configuration

All logging is configured under the trainer section:
trainer:
  project_name: my-rl-project
  experiment_name: ppo-qwen25-gsm8k
  logger: ["wandb", "tensorboard"]
  log_val_generations: 10    # number of validation samples to log each validation step

trainer.logger (list, default: ["console", "wandb"])
List of active logging backends. Provide multiple values to log to several backends simultaneously. Example: ["wandb", "tensorboard", "console"].

trainer.project_name (string, default: "verl_examples")
Project name used as the top-level grouping in wandb, SwanLab, and MLflow.

trainer.experiment_name (string, default: "gsm8k")
Run name used to identify this specific experiment within the project. Also used as a component of the checkpoint directory path.

trainer.log_val_generations (int, default: 0)
Number of validation generations (prompt + response pairs) to log at each validation step. Logging generations lets you qualitatively inspect model behavior alongside quantitative metrics. Set to 0 to disable for maximum throughput.
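
The same trainer settings can also be passed on the command line. The sketch below assumes the Hydra-style `key=value` override form that this page uses elsewhere (for example `+ray_kwargs.timeline_json_file=...`); the helper name `trainer_overrides` is hypothetical and not part of verl.

```python
import json


def trainer_overrides(project, experiment, loggers, log_val_generations=0):
    """Build key=value override strings for the trainer section (hypothetical helper)."""
    return [
        f"trainer.project_name={project}",
        f"trainer.experiment_name={experiment}",
        # Compact JSON gives the bracketed, double-quoted list form shown in the config.
        f"trainer.logger={json.dumps(list(loggers), separators=(',', ':'))}",
        f"trainer.log_val_generations={log_val_generations}",
    ]


overrides = trainer_overrides(
    "my-rl-project", "ppo-qwen25-gsm8k", ["wandb", "tensorboard"], 10
)
print("\n".join(overrides))
```

Passing the resulting strings as separate arguments (for instance through `subprocess.run`) avoids shell quoting problems with the bracketed list.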

Weights & Biases

Set the WANDB_API_KEY environment variable to authenticate. All metrics and logged generations are uploaded automatically:
export WANDB_API_KEY=your_api_key_here
If your cluster requires a proxy to reach the wandb API, set a targeted proxy without affecting other HTTP traffic:
# In your training launch script:
+trainer.wandb_proxy=http://<your-proxy-host>:<port>
This approach avoids interfering with other HTTP requests such as the vLLM/SGLang chat completion scheduler.
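
A missing API key is easier to catch before the job is scheduled than hours into a queue wait. The following is a minimal preflight sketch that only checks the environment variable described above; `check_wandb_env` is a hypothetical helper, not a verl function.

```python
import os


def check_wandb_env(loggers, env=None):
    """Return a list of problems to fix before launching (empty if none were found)."""
    env = os.environ if env is None else env
    problems = []
    # Only complain when wandb is actually one of the requested backends.
    if "wandb" in loggers and not env.get("WANDB_API_KEY"):
        problems.append("WANDB_API_KEY is not set but 'wandb' is in trainer.logger")
    return problems


for problem in check_wandb_env(["console", "wandb"]):
    print("preflight:", problem)
```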

MLflow

Point verl at a tracking server or a local directory via the standard MLFLOW_TRACKING_URI environment variable:
# Remote tracking server
export MLFLOW_TRACKING_URI=http://mlflow-server:5000

# Local directory
export MLFLOW_TRACKING_URI=file:///path/to/mlruns
Then set logger: ["mlflow"] (or include it alongside other loggers).
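
To confirm which kind of MLflow target a job will write to, the URI scheme can be inspected before launch. This is a small sketch using only the Python standard library; `describe_tracking_uri` is a hypothetical helper, and it recognizes only the two URI forms shown above.

```python
from urllib.parse import urlparse


def describe_tracking_uri(uri):
    """Classify an MLFLOW_TRACKING_URI value as 'remote', 'local', or 'other'."""
    scheme = urlparse(uri).scheme
    if scheme in ("http", "https"):
        return "remote"  # tracking server reached over HTTP(S)
    if scheme == "file":
        return "local"   # local directory given as a file:// URI
    return "other"       # anything this sketch does not recognize


print(describe_tracking_uri("http://mlflow-server:5000"))
print(describe_tracking_uri("file:///path/to/mlruns"))
```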

TensorBoard

TensorBoard events are written to tensorboard_log/{project_name}/{experiment_name} by default (override with the TENSORBOARD_DIR environment variable). Launch the viewer pointing at that directory:
tensorboard --logdir tensorboard_log/my-rl-project/ppo-qwen25-gsm8k
TensorBoard logging adds minimal overhead and works entirely offline, making it a good choice alongside wandb for redundancy or when running on air-gapped clusters.
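
The directory rule described above can be reproduced in a few lines, which is handy for scripts that launch the viewer automatically. This sketch mirrors the rule as documented on this page rather than verl's actual implementation; `tensorboard_log_dir` is a hypothetical helper.

```python
import os


def tensorboard_log_dir(project_name, experiment_name, env=None):
    """Resolve the TensorBoard event directory using the rule documented above."""
    env = os.environ if env is None else env
    override = env.get("TENSORBOARD_DIR")
    if override:
        # The environment variable replaces the default path entirely.
        return override
    return f"tensorboard_log/{project_name}/{experiment_name}"


print(tensorboard_log_dir("my-rl-project", "ppo-qwen25-gsm8k"))
```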

Key Metrics to Monitor

verl logs a rich set of metrics at each training step. Below are the most important ones to watch.

Reward Metrics

| Metric | Description |
| --- | --- |
| reward/mean | Average reward per step; the primary training metric |
| reward/std | Standard deviation of reward across the batch |
| reward/max | Maximum reward in the batch |
| reward/min | Minimum reward in the batch |
The reward/mean curve should trend upward over training. Sudden plateaus or drops often indicate reward function issues, KL coefficient problems, or optimizer instability.
Monitor reward/max and reward/min alongside reward/mean. A widening gap between max and min with a flat mean often signals reward hacking — the model exploits edge cases in your reward function rather than genuinely improving. If reward/max saturates quickly while reward/mean stays low, the reward function may have a ceiling issue.
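
The "flat mean, widening gap" pattern can be checked automatically on logged histories. The sketch below compares the first and second half of a recent window; the function name, window size, and both thresholds are illustrative assumptions that need tuning to the scale of your reward.

```python
def reward_gap_widening(means, maxes, mins, window=20, flat_tol=0.01, gap_growth=0.1):
    """Flag a flat reward/mean combined with a growing reward/max - reward/min gap."""
    if min(len(means), len(maxes), len(mins)) < window:
        return False  # not enough history yet

    def avg(values):
        return sum(values) / len(values)

    half = window // 2
    recent_means = means[-window:]
    gaps = [hi - lo for hi, lo in zip(maxes[-window:], mins[-window:])]

    # Compare the older half of the window with the newer half.
    mean_shift = abs(avg(recent_means[half:]) - avg(recent_means[:half]))
    gap_shift = avg(gaps[half:]) - avg(gaps[:half])
    return mean_shift <= flat_tol and gap_shift >= gap_growth


means = [0.5] * 20
maxes = [0.6 + 0.02 * i for i in range(20)]
mins = [0.4 - 0.02 * i for i in range(20)]
print(reward_gap_widening(means, maxes, mins))
```

A positive result is only a prompt to inspect logged generations; it does not by itself prove reward hacking.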

Response Length Metrics

| Metric | Description |
| --- | --- |
| response_length/mean | Average number of tokens in generated responses |
| response_length/std | Standard deviation of response length |
| response_length/max | Maximum response length observed |
Watch response_length/mean carefully throughout training. A steady increase often signals length reward hacking — the model discovers that longer responses receive higher rewards regardless of quality. If your reward function rewards verbosity, cap response length and penalize unnecessary padding.
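
A steady increase is easier to spot as a fitted slope than by eye on a noisy curve. This sketch fits an ordinary least-squares line to a history of response_length/mean values; the helper name and the `min_slope` threshold are assumptions chosen for illustration.

```python
def response_length_slope(lengths):
    """Least-squares slope of response_length/mean, in tokens per training step."""
    n = len(lengths)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = sum(lengths) / n
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(lengths))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def length_is_drifting_up(lengths, min_slope=1.0):
    """True when responses grow by at least `min_slope` tokens per step on average."""
    return response_length_slope(lengths) >= min_slope


history = [100 + 5 * step for step in range(10)]
print(response_length_slope(history), length_is_drifting_up(history))
```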

Policy Metrics

| Metric | Description |
| --- | --- |
| actor/loss | Actor policy loss (PPO objective) |
| actor/pg_loss | Policy gradient component of actor loss |
| actor/kl | KL divergence from the reference policy |
| actor/grad_norm | Actor gradient norm; watch for spikes indicating instability |
| actor/entropy | Policy entropy; should not collapse to near zero |
| kl_coef | Current KL coefficient (changes when using adaptive KL controller) |
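
The two warning signs named in the table, gradient-norm spikes and entropy collapse, can be turned into simple checks over logged histories. The sketch below is one possible heuristic: the function names, the median baseline, and the default thresholds are all assumptions, not values recommended by verl.

```python
from statistics import median


def grad_norm_spike(history, window=50, factor=3.0):
    """True if the latest actor/grad_norm exceeds `factor` times the recent median."""
    if len(history) < 2:
        return False
    # Baseline excludes the latest value so a spike cannot mask itself.
    baseline = median(history[-window - 1:-1])
    return history[-1] > factor * baseline


def entropy_collapsed(history, floor=0.05):
    """True if the latest actor/entropy has fallen below `floor`."""
    return bool(history) and history[-1] < floor


print(grad_norm_spike([1.0, 1.1, 0.9, 1.0, 5.0]))
print(entropy_collapsed([0.8, 0.3, 0.01]))
```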

Critic Metrics (PPO Only)

| Metric | Description |
| --- | --- |
| critic/loss | Critic value function loss (MSE against returns) |
| critic/values/mean | Mean predicted value |
| critic/returns/mean | Mean actual returns used for critic training |

Throughput Metrics

| Metric | Description |
| --- | --- |
| rollout/throughput | Tokens per second during rollout generation |
| train/throughput | Tokens per second during actor/critic update |
| timing/rollout | Wall-clock time for the rollout stage (seconds) |
| timing/update_actor | Wall-clock time for actor parameter update (seconds) |

Precision Diagnostics

When actor_rollout_ref.rollout.calculate_log_probs=True, verl logs:
| Metric | Description |
| --- | --- |
| training/rollout_probs_diff_mean | Mean absolute difference between log probs from the rollout engine and the training engine. Values below 0.005 are normal; above 0.01 suggests a precision mismatch between inference and training. |
A high rollout_probs_diff_mean can cause actor/grad_norm to grow continuously. See the FAQ for remediation steps.
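
The thresholds above translate directly into a small classifier. The sketch below also shows a masked mean absolute difference over per-token values from the two engines; this illustrates the general shape of such a metric and is not verl's implementation. The mask over response tokens, the "borderline" label for values between the two thresholds, and both function names are assumptions.

```python
def masked_mean_abs_diff(rollout_values, train_values, mask):
    """Mean absolute difference over positions where mask is truthy."""
    diffs = [
        abs(r - t)
        for r, t, keep in zip(rollout_values, train_values, mask)
        if keep
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


def classify_probs_diff(value):
    """Apply the thresholds from the table above to rollout_probs_diff_mean."""
    if value < 0.005:
        return "normal"
    if value > 0.01:
        return "mismatch"
    return "borderline"  # between the two documented thresholds


diff = masked_mean_abs_diff([0.5, 0.25, 0.75], [0.5, 0.5, 0.0], [1, 1, 0])
print(diff, classify_probs_diff(diff))
```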

Grafana and Prometheus Cluster Monitoring

For cluster-level hardware monitoring alongside training metrics, verl supports Prometheus metric exposition from the rollout engine, with Grafana dashboards for visualization.

Rollout Engine Prometheus Metrics

The vLLM/SGLang rollout server can expose Prometheus metrics over HTTP. Enable this in the rollout config:
actor_rollout_ref:
  rollout:
    prometheus:
      enable: true
      port: 9090
      file: /tmp/ray/session_latest/metrics/prometheus/prometheus.yml
      served_model_name: my-model-name  # shorter name for Grafana display

actor_rollout_ref.rollout.prometheus.enable (boolean, default: false)
Expose Prometheus metrics from the rollout engine over HTTP. Useful for monitoring KV cache utilization, request queue depth, and inference throughput at the server level.

actor_rollout_ref.rollout.prometheus.port (int, default: 9090)
HTTP port for the Prometheus /metrics endpoint.

actor_rollout_ref.rollout.prometheus.served_model_name (string)
Short model name displayed in Grafana instead of the full model path. Useful when model paths are very long.
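
Before wiring up Grafana, it can help to read the /metrics endpoint directly and confirm that samples are being exposed. The sketch below parses the Prometheus text exposition format with the standard library only. The host argument, the helper names, and the sample text (including its metric name and label) are illustrative assumptions; the series your rollout engine actually exposes may differ.

```python
from urllib.request import urlopen

# Illustrative sample in Prometheus text format; not captured from a real server.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="my-model-name"} 3.0
"""


def parse_prometheus_text(text):
    """Parse exposition lines into {series: value}; optional timestamps are ignored."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Series is the metric name plus its {labels}; the value follows.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        if rest:
            samples[series] = float(rest[0])
    return samples


def fetch_metrics(host, port=9090, timeout=5.0):
    """Fetch and parse http://<host>:<port>/metrics (host is deployment-specific)."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))


print(parse_prometheus_text(SAMPLE))
```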

Ray Timeline Profiling

To generate a Ray timeline for performance analysis of a training job, set ray_kwargs.timeline_json_file:
python -m verl.trainer.main_ppo \
    +ray_kwargs.timeline_json_file=/tmp/ray_timeline.json \
    ...
The JSON file is written at the end of the training run. Load it in Perfetto UI or chrome://tracing for a flame graph of all Ray tasks across the cluster.
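
For a quick summary without opening a trace viewer, the timeline file can be aggregated in Python. This sketch assumes the file follows the Chrome trace event format that chrome://tracing reads, where complete events have `"ph": "X"` and a `dur` field in microseconds; the helper names and the sample events are hypothetical.

```python
import json
from collections import defaultdict


def load_timeline(path):
    """Load trace events from a timeline JSON file (bare list or traceEvents wrapper)."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def total_duration_by_task(events):
    """Sum `dur` (microseconds) per event name, sorted from longest to shortest."""
    totals = defaultdict(float)
    for event in events:
        # Only complete events ("X") carry a duration.
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Synthetic events for illustration; a real run would use load_timeline(path).
events = [
    {"name": "generate", "ph": "X", "ts": 0, "dur": 4000},
    {"name": "update_actor", "ph": "X", "ts": 4000, "dur": 2500},
    {"name": "generate", "ph": "X", "ts": 7000, "dur": 3000},
    {"name": "marker", "ph": "i", "ts": 100},
]
for name, micros in total_duration_by_task(events):
    print(f"{name}: {micros / 1e6:.4f} s")
```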

Setup Reference

Full Grafana and Prometheus setup instructions (including dashboard templates and scrape configurations) are documented in docs/advance/grafana_prometheus.md in the verl repository.

Logging Overhead and Performance Tips

Logging validation generations (log_val_generations) involves serializing prompt and response text and uploading it to the tracking backend. If you have large validation batches or are running on a slow network, reduce this value or set it to 0 during initial hyperparameter sweeps and re-enable it for final runs.
Additional tips for minimizing logging overhead:
  • Use only "console" during initial debugging; add "wandb" or "tensorboard" for full runs.
  • The rollout/throughput and train/throughput metrics are only accurate when actor_rollout_ref.rollout.disable_log_stats=False, so set that option while tuning performance.
  • For Ray timeline analysis, set timeline_json_file only for profiling runs, as it adds file I/O at job completion.
