To ensure long-term, stable RL training at scale, slime provides fault tolerance mechanisms that automatically detect and recover from failures during rollout and training. This is critical for multi-day training runs on large clusters.

Enabling Fault Tolerance

To enable fault tolerance in slime, add the following flag to your training command:
--use-fault-tolerance

Rollout Fault Tolerance

During the rollout process, slime implements a health check system that monitors all SGLang servers and automatically handles failures.

How It Works

Slime periodically sends heartbeat requests (/health_generate) to all SGLang servers. The fault tolerance system:
  1. Detects failures when heartbeat requests timeout
  2. Stops unhealthy servers to prevent them from receiving new requests
  3. Completes the current rollout round using remaining healthy servers
  4. Restarts failed servers after the rollout round completes
  5. Updates parameters on restarted servers to match the current training state
This ensures that temporary failures don’t cause training to crash and that all servers maintain synchronized model weights.
The health check system is designed to be non-intrusive. Servers are only marked as unhealthy after multiple consecutive failures, preventing false positives from transient network issues.
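The heartbeat loop can be sketched roughly as follows. This is an illustrative sketch, not slime's actual implementation: `check_server` and `partition_servers` are hypothetical names, and only the `/health_generate` endpoint is taken from the description above.

```python
import urllib.request
import urllib.error

HEALTH_ENDPOINT = "/health_generate"  # heartbeat endpoint polled by slime

def check_server(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers one heartbeat within the timeout."""
    try:
        with urllib.request.urlopen(base_url + HEALTH_ENDPOINT, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def partition_servers(servers, probe=check_server):
    """Split servers into (healthy, unhealthy) after one heartbeat round."""
    healthy, unhealthy = [], []
    for url in servers:
        (healthy if probe(url) else unhealthy).append(url)
    return healthy, unhealthy
```

Unhealthy servers would then be withheld from new requests until the current rollout round finishes, per steps 2-4 above.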

Configuration Parameters

You can fine-tune the fault tolerance behavior with these parameters:

--rollout-health-check-first-wait

  • Default: 300 seconds
  • Description: Initial wait time before starting health checks
  • Use case: Large MoE models may require compilation on their first run. This parameter ensures slime waits long enough for initial compilation before beginning health monitoring.
--rollout-health-check-first-wait 600  # Wait 10 minutes for large models

--rollout-health-check-interval

  • Default: 10 seconds
  • Description: Interval between consecutive health check requests
  • Use case: Adjust based on your cluster’s network latency and stability
--rollout-health-check-interval 15  # Check every 15 seconds

--rollout-health-check-timeout

  • Default: 5 seconds
  • Description: Timeout limit for each individual heartbeat request
  • Use case: Increase for high-latency networks or decrease for faster failure detection
--rollout-health-check-timeout 10  # Allow up to 10 seconds per health check
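Taken together, these parameters bound how quickly a dead server is noticed: each failed probe can take up to one timeout, probes are spaced one interval apart, and a server must fail several consecutive checks before being marked unhealthy. A rough worst-case estimate, assuming a hypothetical consecutive-failure threshold of 3 (the actual threshold is configured separately):

```python
def worst_case_detection_seconds(interval: float, timeout: float, failure_threshold: int) -> float:
    """Upper bound on time to flag a dead server: each failed probe costs up to
    one timeout, and probes are spaced `interval` seconds apart."""
    return failure_threshold * (interval + timeout)

# With the defaults above (interval=10, timeout=5) and a threshold of 3:
print(worst_case_detection_seconds(10, 5, 3))  # prints 45
```

Lowering the interval and timeout speeds up detection at the cost of more false positives on a slow network.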

Example Configuration

Here’s a complete example configuration for a fault-tolerant training setup:
MISC_ARGS=(
   --use-fault-tolerance
   --rollout-health-check-first-wait 300
   --rollout-health-check-interval 10
   --rollout-health-check-timeout 5
)

ray job submit --address="http://127.0.0.1:8265" \
   -- python3 train.py \
   "${MODEL_ARGS[@]}" \
   "${CKPT_ARGS[@]}" \
   "${ROLLOUT_ARGS[@]}" \
   "${MISC_ARGS[@]}"

Failure Scenarios

The fault tolerance system handles various failure scenarios:
If an SGLang server crashes during rollout:
  1. The health check detects the failure within a few multiples of rollout-health-check-interval, once consecutive heartbeat failures cross the threshold
  2. The server is marked as unhealthy and removed from the active pool
  3. The current rollout continues using remaining healthy servers
  4. After the rollout completes, the server is restarted
  5. Model weights are synchronized before resuming normal operation
If network issues cause health check timeouts:
  1. The server is temporarily marked as unhealthy
  2. If the next health check succeeds, the server is restored to the pool
  3. Only after slime-router-health-check-failure-threshold consecutive failures is the server quarantined until it is restarted
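The reset-on-success semantics above can be sketched as follows. This is an illustrative sketch, not slime's implementation; `FailureTracker` and its methods are hypothetical names.

```python
class FailureTracker:
    """Count consecutive heartbeat failures per worker; a single success
    resets the count, and crossing the threshold quarantines the worker."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failure_counts: dict[str, int] = {}
        self.dead: set[str] = set()

    def record(self, url: str, healthy: bool) -> None:
        if healthy:
            # A single successful heartbeat clears transient failures.
            self.failure_counts[url] = 0
        else:
            self.failure_counts[url] = self.failure_counts.get(url, 0) + 1
            if self.failure_counts[url] >= self.threshold:
                # Quarantined until an explicit restart.
                self.dead.add(url)
```

A transient network blip therefore never quarantines a worker, as long as at least one heartbeat succeeds before the threshold is reached.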
If an SGLang server runs out of memory:
  1. The server typically becomes unresponsive to health checks
  2. The fault tolerance system detects the unresponsive state
  3. The server is stopped and restarted with a fresh memory allocation
  4. Model weights are reloaded after the restart
If a server encounters CUDA errors (e.g., illegal memory access):
  1. The server process typically exits or becomes unresponsive
  2. Health checks fail and trigger the recovery process
  3. The server is restarted with a clean CUDA context
  4. Training continues after weight synchronization
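Across all four scenarios, the post-rollout recovery pass follows the same shape: restart every failed server, then push the current training weights before resuming. A minimal sketch, where `restart`, `sync_weights`, and `is_healthy` are hypothetical callables standing in for slime's internals:

```python
def recover_after_rollout(servers, restart, sync_weights, is_healthy):
    """Illustrative post-rollout recovery pass: restart failed servers, then
    push current training weights so every server is back in sync."""
    restarted = []
    for url in servers:
        if not is_healthy(url):
            restart(url)          # fresh process / clean CUDA context
            restarted.append(url)
    for url in restarted:
        sync_weights(url)         # match the current training state
    return restarted
```

Restarting first and syncing second matters: a restarted server comes up with checkpoint weights that may lag the optimizer state, so it must not serve requests until the sync completes.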

SlimeRouter Integration

When using SlimeRouter (see Slime Router), additional fault tolerance features are available:

Worker Quarantine

SlimeRouter maintains a quarantine list of unhealthy workers:
router.py
class SlimeRouter:
    def __init__(self, args, verbose=False):
        # URL -> Consecutive Failures
        self.worker_failure_counts: dict[str, int] = {}
        # Quarantined workers excluded from routing pool
        self.dead_workers: set[str] = set()
Workers are quarantined once they reach the failure threshold and are automatically removed from the routing pool:
router.py
if failures >= threshold:
    logger.warning(
        f"Worker {url} failed {threshold} consecutive health checks. Marking as DEAD."
    )
    self.dead_workers.add(url)

Automatic Load Balancing

The router automatically redistributes load among healthy workers:
router.py
def _use_url(self):
    """Select worker URL with minimal active requests."""
    if not self.dead_workers:
        # Healthy path: select from all workers
        url = min(self.worker_request_counts, key=self.worker_request_counts.get)
    else:
        # Degraded path: select from workers not in dead_workers
        valid_workers = (w for w in self.worker_request_counts if w not in self.dead_workers)
        url = min(valid_workers, key=self.worker_request_counts.get)
    return url
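To illustrate the selection rule, here is a standalone version of the same min-by-active-requests logic with the router state mocked out (`pick_worker` is a hypothetical helper, not part of SlimeRouter; unlike `_use_url`, it also raises if every worker is quarantined):

```python
def pick_worker(request_counts: dict[str, int], dead_workers: set[str]) -> str:
    """Least-loaded selection, skipping quarantined workers."""
    candidates = [w for w in request_counts if w not in dead_workers]
    if not candidates:
        raise RuntimeError("no healthy workers available")
    return min(candidates, key=request_counts.get)

counts = {"http://w1": 3, "http://w2": 1, "http://w3": 0}
print(pick_worker(counts, dead_workers=set()))          # http://w3 (least loaded)
print(pick_worker(counts, dead_workers={"http://w3"}))  # http://w2
```

When a worker is quarantined, its in-flight load simply stops being a candidate, so new requests spread across the remaining workers with no explicit rebalancing step.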

Best Practices

Set Conservative Timeouts

Use conservative timeout values initially, then tune based on your cluster’s characteristics. It’s better to tolerate occasional slowness than to falsely mark healthy servers as failed.

Monitor Health Metrics

Track health check success/failure rates in your monitoring system. Patterns of failures can indicate underlying infrastructure issues.

Account for Compilation

Large models with MoE layers may take 5-10 minutes to compile on first run. Set rollout-health-check-first-wait accordingly.

Test Failure Recovery

Periodically test fault tolerance by manually killing servers during training. Verify that training continues smoothly and weights stay synchronized.
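One simple drill is to kill a server process by command-line pattern and watch the logs for the quarantine → restart → weight-sync sequence. A hedged helper sketch (`chaos_kill` is a hypothetical name, and the SGLang process pattern on your cluster may differ):

```python
import os
import signal
import subprocess

def chaos_kill(pattern: str) -> str:
    """Kill the first process whose command line matches `pattern`,
    e.g. an SGLang server during a test run."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern], capture_output=True, text=True
        )
    except FileNotFoundError:
        return "pgrep not available"
    pids = result.stdout.split()
    if not pids:
        return "no matching process"
    os.kill(int(pids[0]), signal.SIGKILL)  # simulate a hard crash
    return f"killed {pids[0]}"

# Example (during a supervised test run):
# chaos_kill("sglang.launch_server")
```

After the kill, the current rollout should finish on the remaining servers, and the victim should reappear with synchronized weights before the next round.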

Limitations

Current limitations of the fault tolerance system:
  • Weight synchronization: Reconnecting ‘dead’ workers requires a mechanism to sync model versions to avoid off-policy issues from stale weights. This is currently under development.
  • Training failures: The current implementation focuses on rollout fault tolerance. Training-side failures (e.g., GPU failures during backward pass) are not yet automatically recovered.
  • Data consistency: If a server fails mid-rollout, partial data from that server is discarded. The rollout batch may be slightly smaller than configured.

Future Improvements

Planned enhancements to the fault tolerance system:
  • Automatic weight version synchronization for restarted workers
  • Training-side failure recovery with checkpoint rollback
  • Predictive failure detection using hardware telemetry
  • Configurable retry policies for different failure types
