Experiment Tracking: wandb, MLflow, and TensorBoard

Effective experiment tracking is essential for RL training, where reward curves can be noisy, training can diverge unexpectedly, and runs often last many hours. verl integrates with multiple tracking backends out of the box — from cloud-hosted services like Weights & Biases and MLflow to local options like TensorBoard. This page explains how to configure them and which metrics to watch.

Supported Loggers

verl supports the following logging backends, which can be enabled simultaneously:

Backend	Value in Config	Notes
Console	`"console"`	Always-available stdout logging
Weights & Biases	`"wandb"`	Requires `WANDB_API_KEY` environment variable
SwanLab	`"swanlab"`	Alternative experiment tracker
MLflow	`"mlflow"`	Requires a tracking server or local directory
TensorBoard	`"tensorboard"`	Logs to `tensorboard_log/{project_name}/{experiment_name}` (override with `TENSORBOARD_DIR` env var)
Trackio	`"trackio"`	Lightweight alternative tracker

Configuration

All logging is configured under the trainer section:

trainer:
  project_name: my-rl-project
  experiment_name: ppo-qwen25-gsm8k
  logger: ["wandb", "tensorboard"]
  log_val_generations: 10    # number of validation samples to log each validation step

trainer.logger

list

default:["console", "wandb"]

List of active logging backends. Provide multiple values to log to several backends simultaneously. Example: ["wandb", "tensorboard", "console"].

trainer.project_name

string

default:"verl_examples"

Project name used as the top-level grouping in wandb, SwanLab, and MLflow.

trainer.experiment_name

string

default:"gsm8k"

Run name used to identify this specific experiment within the project. Also used as a component of the checkpoint directory path.

trainer.log_val_generations

int

default:"0"

Number of validation generations (prompt + response pairs) to log at each validation step. Logging generations lets you qualitatively inspect model behavior alongside quantitative metrics. Set to 0 to disable for maximum throughput.
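The same trainer settings can also be passed on the command line. The sketch below assumes verl accepts Hydra-style dotted overrides, as the `+ray_kwargs.timeline_json_file=...` launch example on this page suggests. `to_overrides` is a hypothetical helper written for illustration, not part of verl.

```python
import json


def to_overrides(config, prefix=""):
    """Flatten a nested dict into Hydra-style `key.path=value` strings."""
    overrides = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        elif isinstance(value, list):
            # Compact JSON keeps the list in one token: ["wandb","tensorboard"]
            compact = json.dumps(value, separators=(",", ":"))
            overrides.append(f"{path}={compact}")
        else:
            overrides.append(f"{path}={value}")
    return overrides


# Values mirror the YAML example above.
trainer_config = {
    "trainer": {
        "project_name": "my-rl-project",
        "experiment_name": "ppo-qwen25-gsm8k",
        "logger": ["wandb", "tensorboard"],
        "log_val_generations": 10,
    }
}

overrides = to_overrides(trainer_config)
for item in overrides:
    print(item)
```

If you paste the list override into a shell script, wrap it in single quotes so the shell does not strip the inner double quotes.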

Weights & Biases

Set the WANDB_API_KEY environment variable to authenticate. All metrics and logged generations are uploaded automatically:

export WANDB_API_KEY=your_api_key_here

If your cluster requires a proxy to reach the wandb API, set a targeted proxy without affecting other HTTP traffic:

# In your training launch script:
+trainer.wandb_proxy=http://<your-proxy-host>:<port>

This approach avoids interfering with other HTTP requests such as the vLLM/SGLang chat completion scheduler.

MLflow

Point verl at a tracking server or a local directory via the standard MLFLOW_TRACKING_URI environment variable:

# Remote tracking server
export MLFLOW_TRACKING_URI=http://mlflow-server:5000

# Local directory
export MLFLOW_TRACKING_URI=file:///path/to/mlruns

Then set logger: ["mlflow"] (or include it alongside other loggers).
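A preflight check before launching a long run can catch a missing credential early. The sketch below is a minimal illustration based only on the environment variables named on this page. `check_logger_env` and the two lookup tables are hypothetical, and treating `MLFLOW_TRACKING_URI` as a warning rather than an error is an assumption.

```python
import os

# Environment variables this page associates with each backend.
# Assumption for illustration: wandb is treated as a hard requirement,
# mlflow as a warning only.
REQUIRED_ENV = {"wandb": "WANDB_API_KEY"}
OPTIONAL_ENV = {"mlflow": "MLFLOW_TRACKING_URI"}


def check_logger_env(loggers, environ=None):
    """Return (errors, warnings) for the selected logger backends."""
    environ = os.environ if environ is None else environ
    errors, warnings = [], []
    for name in loggers:
        required = REQUIRED_ENV.get(name)
        if required and not environ.get(required):
            errors.append(f"{name}: {required} is not set")
        optional = OPTIONAL_ENV.get(name)
        if optional and not environ.get(optional):
            warnings.append(f"{name}: {optional} is not set")
    return errors, warnings


# Example with an explicit environment so the result does not depend on the host.
errors, warnings = check_logger_env(
    ["console", "wandb", "mlflow"],
    environ={"MLFLOW_TRACKING_URI": "file:///path/to/mlruns"},
)
print(errors)
print(warnings)
```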

TensorBoard

TensorBoard events are written to tensorboard_log/{project_name}/{experiment_name} by default (override with the TENSORBOARD_DIR environment variable). Launch the viewer pointing at that directory:

tensorboard --logdir tensorboard_log/my-rl-project/ppo-qwen25-gsm8k

TensorBoard logging adds minimal overhead and works entirely offline, making it a good choice alongside wandb for redundancy or when running on air-gapped clusters.
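To find the event directory from a script, the path rule described above can be sketched as follows. `tensorboard_log_dir` is a hypothetical helper. It assumes that `TENSORBOARD_DIR`, when set, replaces the whole default path, which this page implies but does not spell out.

```python
import os


def tensorboard_log_dir(project_name, experiment_name, environ=None):
    """Return the directory to pass to `tensorboard --logdir`."""
    environ = os.environ if environ is None else environ
    override = environ.get("TENSORBOARD_DIR")
    if override:
        # Assumption: the override replaces the full default path.
        return override
    return f"tensorboard_log/{project_name}/{experiment_name}"


default_dir = tensorboard_log_dir("my-rl-project", "ppo-qwen25-gsm8k", environ={})
print(default_dir)
```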

Key Metrics to Monitor

verl logs a rich set of metrics at each training step. Below are the most important ones to watch.

Reward Metrics

Metric	Description
`reward/mean`	Average reward per step — the primary training metric
`reward/std`	Standard deviation of reward across the batch
`reward/max`	Maximum reward in the batch
`reward/min`	Minimum reward in the batch

The reward/mean curve should trend upward over training. Plateaus or sudden drops often point to reward function issues, a poorly tuned KL coefficient, or optimizer instability.

Monitor reward/max and reward/min alongside reward/mean. A widening gap between max and min with a flat mean often signals reward hacking — the model exploits edge cases in your reward function rather than genuinely improving. If reward/max saturates quickly while reward/mean stays low, the reward function may have a ceiling issue.
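The "widening gap with a flat mean" pattern can be checked automatically on exported metric histories. The sketch below compares the two most recent windows of steps. The function name, the window size, and both thresholds are illustrative assumptions and should be tuned to your reward scale.

```python
def gap_widening_with_flat_mean(mean, rmax, rmin, window=50,
                                mean_tol=0.02, gap_growth=0.25):
    """Compare the last two windows of reward/mean, reward/max, reward/min.

    Returns True when the mean moved by at most `mean_tol` while the average
    max-min gap grew by at least `gap_growth` (both in reward units).
    """
    if min(len(mean), len(rmax), len(rmin)) < 2 * window:
        return False  # not enough history to compare two windows

    def avg(values):
        return sum(values) / len(values)

    prev, last = slice(-2 * window, -window), slice(-window, None)
    prev_gap = avg([hi - lo for hi, lo in zip(rmax[prev], rmin[prev])])
    last_gap = avg([hi - lo for hi, lo in zip(rmax[last], rmin[last])])
    mean_is_flat = abs(avg(mean[last]) - avg(mean[prev])) <= mean_tol
    gap_is_wider = (last_gap - prev_gap) >= gap_growth
    return mean_is_flat and gap_is_wider


# Synthetic history: the mean stays at 0.5 while the max-min gap grows.
flat_mean = [0.5] * 10
growing_max = [0.6] * 5 + [1.0] * 5
shrinking_min = [0.4] * 5 + [0.0] * 5
suspicious = gap_widening_with_flat_mean(
    flat_mean, growing_max, shrinking_min, window=5
)
print(suspicious)
```

A positive result is only a prompt to inspect logged generations, not proof of reward hacking.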

Response Length Metrics

Metric	Description
`response_length/mean`	Average number of tokens in generated responses
`response_length/std`	Standard deviation of response length
`response_length/max`	Maximum response length observed

Watch response_length/mean carefully throughout training. A steady increase often signals length reward hacking — the model discovers that longer responses receive higher rewards regardless of quality. If your reward function rewards verbosity, cap response length and penalize unnecessary padding.
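A steady increase is easier to spot with a fitted trend than by eye on a noisy curve. The sketch below fits a least-squares line to the most recent response_length/mean values and flags growth above a relative threshold. The helper names, the window, and the 20% threshold are assumptions chosen for illustration.

```python
def linear_slope(values):
    """Least-squares slope of `values` against their step index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def length_is_drifting_up(lengths, window=100, min_relative_growth=0.2):
    """True if the fitted growth over the window exceeds the threshold."""
    recent = lengths[-window:]
    if len(recent) < 2:
        return False
    window_mean = sum(recent) / len(recent)
    if window_mean <= 0:
        return False
    fitted_growth = linear_slope(recent) * (len(recent) - 1)
    return fitted_growth / window_mean >= min_relative_growth


# Synthetic history: mean length grows by two tokens per step.
growing = [200 + 2 * step for step in range(100)]
print(linear_slope(growing))
print(length_is_drifting_up(growing))
```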

Policy Metrics

Metric	Description
`actor/loss`	Actor policy loss (PPO objective)
`actor/pg_loss`	Policy gradient component of actor loss
`actor/kl`	KL divergence from the reference policy
`actor/grad_norm`	Actor gradient norm — watch for spikes indicating instability
`actor/entropy`	Policy entropy — should not collapse to near zero
`kl_coef`	Current KL coefficient (changes when using adaptive KL controller)
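Two of the checks in the table above, gradient-norm spikes and entropy collapse, can be sketched as simple rules over exported metric histories. Both functions and all thresholds below are hypothetical starting points, not values taken from verl.

```python
import statistics


def grad_norm_spikes(grad_norms, window=20, factor=3.0):
    """Return step indices where actor/grad_norm exceeds
    `factor` times the median of the preceding `window` steps."""
    spikes = []
    for step in range(window, len(grad_norms)):
        baseline = statistics.median(grad_norms[step - window:step])
        if baseline > 0 and grad_norms[step] > factor * baseline:
            spikes.append(step)
    return spikes


def entropy_collapsed(entropies, floor=0.05, patience=10):
    """True if the last `patience` actor/entropy values are all below `floor`."""
    recent = entropies[-patience:]
    return len(recent) == patience and all(value < floor for value in recent)


# Synthetic histories for illustration.
norms = [1.0] * 20 + [10.0] + [1.0] * 5
entropy = [1.0] * 5 + [0.01] * 10
print(grad_norm_spikes(norms))
print(entropy_collapsed(entropy))
```

Using a trailing median rather than a mean keeps a single earlier spike from inflating the baseline.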

Critic Metrics (PPO Only)

Metric	Description
`critic/loss`	Critic value function loss (MSE against returns)
`critic/values/mean`	Mean predicted value
`critic/returns/mean`	Mean actual returns used for critic training

Throughput Metrics

Metric	Description
`rollout/throughput`	Tokens per second during rollout generation
`train/throughput`	Tokens per second during actor/critic update
`timing/rollout`	Wall-clock time for the rollout stage (seconds)
`timing/update_actor`	Wall-clock time for actor parameter update (seconds)

Precision Diagnostics

When actor_rollout_ref.rollout.calculate_log_probs=True, verl logs:

Metric	Description
`training/rollout_probs_diff_mean`	Mean absolute difference between log probs from the rollout engine and the training engine. Values below `0.005` are normal; above `0.01` suggests a precision mismatch between inference and training.

A high rollout_probs_diff_mean can cause actor/grad_norm to grow continuously. See the FAQ for remediation steps.
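To make the thresholds concrete, the sketch below computes a mean absolute difference over two lists of per-token log probs, following the description in the table, and maps the result onto the ranges stated above. It is an illustration only. The exact computation inside verl may differ, and both function names are hypothetical.

```python
def mean_abs_diff(rollout_log_probs, train_log_probs):
    """Mean absolute difference between two equally long sequences."""
    if len(rollout_log_probs) != len(train_log_probs):
        raise ValueError("sequences must have the same length")
    if not rollout_log_probs:
        raise ValueError("sequences must not be empty")
    total = sum(abs(a - b) for a, b in zip(rollout_log_probs, train_log_probs))
    return total / len(rollout_log_probs)


def classify_probs_diff(diff, normal_below=0.005, mismatch_above=0.01):
    """Map a diff value onto the ranges described on this page."""
    if diff < normal_below:
        return "normal"
    if diff > mismatch_above:
        return "possible precision mismatch"
    return "borderline"


# Made-up per-token values from the two engines.
rollout = [-0.100, -1.200, -0.500]
train = [-0.101, -1.202, -0.499]
diff = mean_abs_diff(rollout, train)
print(classify_probs_diff(diff))
```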

Grafana and Prometheus Cluster Monitoring

For cluster-level hardware monitoring alongside training metrics, verl supports Prometheus metric exposition from the rollout engine, with Grafana dashboards for visualization.

Rollout Engine Prometheus Metrics

The vLLM/SGLang rollout server can expose Prometheus metrics over HTTP. Enable this in the rollout config:

actor_rollout_ref:
  rollout:
    prometheus:
      enable: true
      port: 9090
      file: /tmp/ray/session_latest/metrics/prometheus/prometheus.yml
      served_model_name: my-model-name  # shorter name for Grafana display

actor_rollout_ref.rollout.prometheus.enable

boolean

default:"false"

Expose Prometheus metrics from the rollout engine over HTTP. Useful for monitoring KV cache utilization, request queue depth, and inference throughput at the server level.

actor_rollout_ref.rollout.prometheus.port

int

default:"9090"

HTTP port for the Prometheus /metrics endpoint.

actor_rollout_ref.rollout.prometheus.served_model_name

string

Short model name displayed in Grafana instead of the full model path. Useful when model paths are very long.
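Once the endpoint is enabled, its output can be inspected without a full Prometheus server. The sketch below parses the Prometheus text exposition format into tuples. `parse_prometheus_text` is a hypothetical helper that covers only the simple `name{labels} value` form. The sample metric names and the localhost URL in the comment are illustrative assumptions, so check your own /metrics output for the names your rollout engine actually emits.

```python
def parse_prometheus_text(text):
    """Parse exposition-format lines into (name, labels, value) tuples.

    `labels` is kept as the raw string between the braces ("" if absent).
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[:line.index("{")]
            labels = line[line.index("{") + 1:line.rindex("}")]
            value_text = line[line.rindex("}") + 1:].split()[0]
        else:
            name, value_text = line.split()[:2]
            labels = ""
        samples.append((name, labels, float(value_text)))
    return samples


# Illustrative payload. To read a live endpoint instead (host is an assumption):
#   from urllib.request import urlopen
#   text = urlopen("http://localhost:9090/metrics").read().decode()
text = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="my-model-name"} 3.0
vllm:num_requests_running{model_name="my-model-name"} 8.0
process_open_fds 42
"""

samples = parse_prometheus_text(text)
for name, labels, value in samples:
    print(name, labels, value)
```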

Ray Timeline Profiling

To generate a Ray timeline for performance analysis of a training job, set ray_kwargs.timeline_json_file:

python -m verl.trainer.main_ppo \
    +ray_kwargs.timeline_json_file=/tmp/ray_timeline.json \
    ...

The JSON file is written at the end of the training run. Load it in Perfetto UI or chrome://tracing to view a timeline of all Ray tasks across the cluster.
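For a quick text summary before opening a trace viewer, the file can be aggregated in a few lines. The sketch below assumes the timeline is a JSON array of Chrome trace events in which complete events carry `"ph": "X"` and a `dur` field in microseconds. The task names in the sample are made up, and `total_duration_by_task` is a hypothetical helper.

```python
import json
from collections import defaultdict


def total_duration_by_task(events):
    """Sum the duration (in seconds) of complete events per task name."""
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"] / 1e6
    # Longest-running task names first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Stand-in for the file contents. For a real run, use:
#   with open("/tmp/ray_timeline.json") as f:
#       events = json.load(f)
sample_json = """[
  {"name": "rollout", "ph": "X", "ts": 0, "dur": 2000000},
  {"name": "update_actor", "ph": "X", "ts": 2000000, "dur": 1500000},
  {"name": "rollout", "ph": "X", "ts": 3500000, "dur": 1000000},
  {"name": "process_name", "ph": "M"}
]"""
events = json.loads(sample_json)

summary = total_duration_by_task(events)
for name, seconds in summary:
    print(f"{name}: {seconds:.1f}s")
```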

Setup Reference

Full Grafana and Prometheus setup instructions (including dashboard templates and scrape configurations) are documented in docs/advance/grafana_prometheus.md in the verl repository.

Logging Overhead and Performance Tips

Logging validation generations (log_val_generations) involves serializing prompt and response text and uploading it to the tracking backend. If you have large validation batches or are running on a slow network, reduce this value or set it to 0 during initial hyperparameter sweeps and re-enable it for final runs.

Additional tips for minimizing logging overhead:

Use "console" only during initial debugging; switch to "wandb" or "tensorboard" for full runs.
The rollout/throughput and train/throughput metrics are only accurate when actor_rollout_ref.rollout.disable_log_stats=False, so keep that setting while tuning performance.
For Ray timeline analysis, set timeline_json_file only for profiling runs, as it adds file I/O at job completion.
