To ensure long-term, stable RL training at scale, slime provides fault tolerance mechanisms that automatically detect and recover from failures during rollout and training. This is critical for multi-day training runs on large clusters.

Enabling Fault Tolerance

To enable fault tolerance in slime, add the following flag to your training command:
--use-fault-tolerance

Rollout Fault Tolerance

During the rollout process, slime implements a health check system that monitors all SGLang servers and automatically handles failures.

How It Works

Slime periodically sends heartbeat requests (/health_generate) to all SGLang servers. The fault tolerance system:
  1. Detects failures when heartbeat requests timeout
  2. Stops unhealthy servers to prevent them from receiving new requests
  3. Completes the current rollout round using remaining healthy servers
  4. Restarts failed servers after the rollout round completes
  5. Updates parameters on restarted servers to match the current training state
This ensures that temporary failures don’t cause training to crash and that all servers maintain synchronized model weights.
The health check system is designed to be non-intrusive. Servers are only marked as unhealthy after multiple consecutive failures, preventing false positives from transient network issues.
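The heartbeat loop can be sketched roughly as follows. This is an illustrative sketch, not slime's actual implementation: `check_server` and `partition_servers` are hypothetical names, and only the `/health_generate` endpoint is taken from the description above.

```python
import urllib.request
import urllib.error

HEALTH_ENDPOINT = "/health_generate"  # heartbeat endpoint polled by slime

def check_server(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers one heartbeat within the timeout."""
    try:
        with urllib.request.urlopen(base_url + HEALTH_ENDPOINT, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def partition_servers(servers, probe=check_server):
    """Split servers into (healthy, unhealthy) after one heartbeat round."""
    healthy, unhealthy = [], []
    for url in servers:
        (healthy if probe(url) else unhealthy).append(url)
    return healthy, unhealthy
```

Unhealthy servers would then be withheld from new requests until the current rollout round finishes, per steps 2-4 above.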

Configuration Parameters

You can fine-tune the fault tolerance behavior with these parameters:

--rollout-health-check-first-wait

  • Default: 300 seconds
  • Description: Initial wait time before starting health checks
  • Use case: Large MoE models may require compilation on their first run. This parameter ensures slime waits long enough for initial compilation before beginning health monitoring.
--rollout-health-check-first-wait 600  # Wait 10 minutes for large models

--rollout-health-check-interval

  • Default: 10 seconds
  • Description: Interval between consecutive health check requests
  • Use case: Adjust based on your cluster’s network latency and stability
--rollout-health-check-interval 15  # Check every 15 seconds

--rollout-health-check-timeout

  • Default: 5 seconds
  • Description: Timeout limit for each individual heartbeat request
  • Use case: Increase for high-latency networks or decrease for faster failure detection
--rollout-health-check-timeout 10  # Allow up to 10 seconds per health check
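Taken together, these parameters bound how quickly a dead server is noticed: each failed probe can take up to one timeout, probes are spaced one interval apart, and a server must fail several consecutive checks before being marked unhealthy. A rough worst-case estimate, assuming a hypothetical consecutive-failure threshold of 3 (the actual threshold is configured separately):

```python
def worst_case_detection_seconds(interval: float, timeout: float, failure_threshold: int) -> float:
    """Upper bound on time to flag a dead server: each failed probe costs up to
    one timeout, and probes are spaced `interval` seconds apart."""
    return failure_threshold * (interval + timeout)

# With the defaults above (interval=10, timeout=5) and a threshold of 3:
print(worst_case_detection_seconds(10, 5, 3))  # prints 45
```

Lowering the interval and timeout speeds up detection at the cost of more false positives on a slow network.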

Example Configuration

Here’s a complete example configuration for a fault-tolerant training setup:
MISC_ARGS=(
   --use-fault-tolerance
   --rollout-health-check-first-wait 300
   --rollout-health-check-interval 10
   --rollout-health-check-timeout 5
)

ray job submit --address="http://127.0.0.1:8265" \
   -- python3 train.py \
   "${MODEL_ARGS[@]}" \
   "${CKPT_ARGS[@]}" \
   "${ROLLOUT_ARGS[@]}" \
   "${MISC_ARGS[@]}"

Failure Scenarios

The fault tolerance system handles various failure scenarios:
If an SGLang server crashes during rollout:
  1. The health check detects the failure within a few multiples of rollout-health-check-interval, once consecutive heartbeat failures cross the threshold
  2. The server is marked as unhealthy and removed from the active pool
  3. The current rollout continues using remaining healthy servers
  4. After the rollout completes, the server is restarted
  5. Model weights are synchronized before resuming normal operation
If network issues cause health check timeouts:
  1. The server is temporarily marked as unhealthy
  2. If the next health check succeeds, the server is restored to the pool
  3. Only after slime-router-health-check-failure-threshold consecutive failures is the server quarantined until it is restarted
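The reset-on-success semantics above can be sketched as follows. This is an illustrative sketch, not slime's implementation; `FailureTracker` and its methods are hypothetical names.

```python
class FailureTracker:
    """Count consecutive heartbeat failures per worker; a single success
    resets the count, and crossing the threshold quarantines the worker."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failure_counts: dict[str, int] = {}
        self.dead: set[str] = set()

    def record(self, url: str, healthy: bool) -> None:
        if healthy:
            # A single successful heartbeat clears transient failures.
            self.failure_counts[url] = 0
        else:
            self.failure_counts[url] = self.failure_counts.get(url, 0) + 1
            if self.failure_counts[url] >= self.threshold:
                # Quarantined until an explicit restart.
                self.dead.add(url)
```

A transient network blip therefore never quarantines a worker, as long as at least one heartbeat succeeds before the threshold is reached.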
If an SGLang server runs out of memory:
  1. The server typically becomes unresponsive to health checks
  2. The fault tolerance system detects the unresponsive state
  3. The server is stopped and restarted with a fresh memory allocation
  4. Model weights are reloaded after the restart
If a server encounters CUDA errors (e.g., illegal memory access):
  1. The server process typically exits or becomes unresponsive
  2. Health checks fail and trigger the recovery process
  3. The server is restarted with a clean CUDA context
  4. Training continues after weight synchronization
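Across all four scenarios, the post-rollout recovery pass follows the same shape: restart every failed server, then push the current training weights before resuming. A minimal sketch, where `restart`, `sync_weights`, and `is_healthy` are hypothetical callables standing in for slime's internals:

```python
def recover_after_rollout(servers, restart, sync_weights, is_healthy):
    """Illustrative post-rollout recovery pass: restart failed servers, then
    push current training weights so every server is back in sync."""
    restarted = []
    for url in servers:
        if not is_healthy(url):
            restart(url)          # fresh process / clean CUDA context
            restarted.append(url)
    for url in restarted:
        sync_weights(url)         # match the current training state
    return restarted
```

Restarting first and syncing second matters: a restarted server comes up with checkpoint weights that may lag the optimizer state, so it must not serve requests until the sync completes.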

SlimeRouter Integration

When using SlimeRouter (see Slime Router), additional fault tolerance features are available:

Worker Quarantine

SlimeRouter maintains a quarantine list of unhealthy workers:
router.py
class SlimeRouter:
    def __init__(self, args, verbose=False):
        # URL -> Consecutive Failures
        self.worker_failure_counts: dict[str, int] = {}
        # Quarantined workers excluded from routing pool
        self.dead_workers: set[str] = set()
Workers are quarantined once they reach the failure threshold and are automatically removed from the routing pool:
router.py
if failures >= threshold:
    logger.warning(
        f"Worker {url} failed {threshold} consecutive health checks. Marking as DEAD."
    )
    self.dead_workers.add(url)

Automatic Load Balancing

The router automatically redistributes load among healthy workers:
router.py
def _use_url(self):
    """Select worker URL with minimal active requests."""
    if not self.dead_workers:
        # Healthy path: select from all workers
        url = min(self.worker_request_counts, key=self.worker_request_counts.get)
    else:
        # Degraded path: select from workers not in dead_workers
        valid_workers = (w for w in self.worker_request_counts if w not in self.dead_workers)
        url = min(valid_workers, key=self.worker_request_counts.get)
    return url
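To illustrate the selection rule, here is a standalone version of the same min-by-active-requests logic with the router state mocked out (`pick_worker` is a hypothetical helper, not part of SlimeRouter; unlike `_use_url`, it also raises if every worker is quarantined):

```python
def pick_worker(request_counts: dict[str, int], dead_workers: set[str]) -> str:
    """Least-loaded selection, skipping quarantined workers."""
    candidates = [w for w in request_counts if w not in dead_workers]
    if not candidates:
        raise RuntimeError("no healthy workers available")
    return min(candidates, key=request_counts.get)

counts = {"http://w1": 3, "http://w2": 1, "http://w3": 0}
print(pick_worker(counts, dead_workers=set()))          # http://w3 (least loaded)
print(pick_worker(counts, dead_workers={"http://w3"}))  # http://w2
```

When a worker is quarantined, its in-flight load simply stops being a candidate, so new requests spread across the remaining workers with no explicit rebalancing step.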

Best Practices

Set Conservative Timeouts

Use conservative timeout values initially, then tune based on your cluster’s characteristics. It’s better to tolerate occasional slowness than to falsely mark healthy servers as failed.

Monitor Health Metrics

Track health check success/failure rates in your monitoring system. Patterns of failures can indicate underlying infrastructure issues.

Account for Compilation

Large models with MoE layers may take 5-10 minutes to compile on first run. Set rollout-health-check-first-wait accordingly.

Test Failure Recovery

Periodically test fault tolerance by manually killing servers during training. Verify that training continues smoothly and weights stay synchronized.
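One simple drill is to kill a server process by command-line pattern and watch the logs for the quarantine → restart → weight-sync sequence. A hedged helper sketch (`chaos_kill` is a hypothetical name, and the SGLang process pattern on your cluster may differ):

```python
import os
import signal
import subprocess

def chaos_kill(pattern: str) -> str:
    """Kill the first process whose command line matches `pattern`,
    e.g. an SGLang server during a test run."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern], capture_output=True, text=True
        )
    except FileNotFoundError:
        return "pgrep not available"
    pids = result.stdout.split()
    if not pids:
        return "no matching process"
    os.kill(int(pids[0]), signal.SIGKILL)  # simulate a hard crash
    return f"killed {pids[0]}"

# Example (during a supervised test run):
# chaos_kill("sglang.launch_server")
```

After the kill, the current rollout should finish on the remaining servers, and the victim should reappear with synchronized weights before the next round.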

Limitations

Current limitations of the fault tolerance system:
  • Weight synchronization: Reconnecting ‘dead’ workers requires a mechanism to sync model versions to avoid off-policy issues from stale weights. This is currently under development.
  • Training failures: The current implementation focuses on rollout fault tolerance. Training-side failures (e.g., GPU failures during backward pass) are not yet automatically recovered.
  • Data consistency: If a server fails mid-rollout, partial data from that server is discarded. The rollout batch may be slightly smaller than configured.

Future Improvements

Planned enhancements to the fault tolerance system:
  • Automatic weight version synchronization for restarted workers
  • Training-side failure recovery with checkpoint rollback
  • Predictive failure detection using hardware telemetry
  • Configurable retry policies for different failure types
