Enabling Fault Tolerance
To enable fault tolerance in slime, add the corresponding flags to your training command; the available flags are described under Configuration Parameters below.

Rollout Fault Tolerance
During the rollout process, slime implements a health check system that monitors all SGLang servers and automatically handles failures.

How It Works
Slime periodically sends heartbeat requests (/health_generate) to all SGLang servers. The fault tolerance system:
- Detects failures when heartbeat requests timeout
- Stops unhealthy servers to prevent them from receiving new requests
- Completes the current rollout round using remaining healthy servers
- Restarts failed servers after the rollout round completes
- Updates parameters on restarted servers to match the current training state
The health check system is designed to be non-intrusive. Servers are only marked as unhealthy after multiple consecutive failures, preventing false positives from transient network issues.
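The consecutive-failure rule can be sketched as follows (a hypothetical illustration with invented names, not slime's actual implementation):

```python
class HealthTracker:
    """Marks a server unhealthy only after `threshold` consecutive
    heartbeat failures, so a single transient timeout is not enough."""

    def __init__(self, servers, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = {s: 0 for s in servers}
        self.unhealthy = set()

    def record(self, server, heartbeat_ok):
        if heartbeat_ok:
            # One successful heartbeat clears the failure streak.
            self.consecutive_failures[server] = 0
            return
        self.consecutive_failures[server] += 1
        if self.consecutive_failures[server] >= self.threshold:
            self.unhealthy.add(server)

    def healthy_servers(self):
        return [s for s in self.consecutive_failures if s not in self.unhealthy]
```

A timeout followed by a success leaves the server in the pool; only an unbroken run of failures removes it.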
Configuration Parameters
You can fine-tune the fault tolerance behavior with these parameters:

--rollout-health-check-first-wait
- Default: 300 seconds
- Description: Initial wait time before starting health checks
- Use case: Large MoE models may require compilation on their first run. This parameter ensures slime waits long enough for initial compilation before beginning health monitoring.
--rollout-health-check-interval
- Default: 10 seconds
- Description: Interval between consecutive health check requests
- Use case: Adjust based on your cluster’s network latency and stability
--rollout-health-check-timeout
- Default: 5 seconds
- Description: Timeout limit for each individual heartbeat request
- Use case: Increase for high-latency networks or decrease for faster failure detection
Example Configuration
Here’s a complete example configuration for a fault-tolerant training setup.
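The following launch sketch combines the three health-check flags documented above; the script name, model flag, and values are hypothetical placeholders:

```shell
# Hypothetical launch sketch: only the three --rollout-health-check-* flags
# are documented on this page; the script name and model flag are placeholders.
python train.py \
  --model-path /path/to/model \
  --rollout-health-check-first-wait 600 \
  --rollout-health-check-interval 10 \
  --rollout-health-check-timeout 5
```

Here the first-wait is raised to 600 seconds to leave room for first-run compilation of a large model, while the interval and timeout keep their defaults.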
Failure Scenarios
The fault tolerance system handles various failure scenarios:
SGLang Server Crash
If an SGLang server crashes during rollout:
- The health check detects the failure within --rollout-health-check-interval seconds
- The server is marked as unhealthy and removed from the active pool
- The current rollout continues using remaining healthy servers
- After the rollout completes, the server is restarted
- Model weights are synchronized before resuming normal operation
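The five steps above amount to the following control flow (a hypothetical sketch with invented function names, not slime's code):

```python
def finish_round_with_recovery(pool, run_rollout, restart, sync_weights):
    """Complete one rollout round on healthy servers, then revive failures.

    `pool` maps server id -> healthy flag; `run_rollout` consumes the
    healthy servers; `restart` and `sync_weights` act on one server id.
    """
    healthy = [s for s, ok in pool.items() if ok]
    failed = [s for s, ok in pool.items() if not ok]

    # Steps 1-3: the current round continues on the surviving servers only.
    result = run_rollout(healthy)

    # Steps 4-5: failed servers are restarted and re-synced after the round.
    for s in failed:
        restart(s)
        sync_weights(s)  # bring weights up to the current training state
        pool[s] = True
    return result
```

Restarting only after the round completes means an in-flight rollout never blocks on a dead server.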
Network Timeout
If network issues cause health check timeouts:
- The server is temporarily marked as unhealthy
- If the next health check succeeds, the server is restored to the pool
- Only after slime-router-health-check-failure-threshold consecutive failures is the server quarantined until it is restarted
Out of Memory (OOM)
If an SGLang server runs out of memory:
- The server typically becomes unresponsive to health checks
- Fault tolerance system detects the unresponsive state
- Server is stopped and restarted with fresh memory allocation
- Model weights are reloaded after restart
CUDA Errors
If a server encounters CUDA errors (e.g., illegal memory access):
- The server process typically exits or becomes unresponsive
- Health checks fail and trigger the recovery process
- Server is restarted with a clean CUDA context
- Training continues after weight synchronization
SlimeRouter Integration
When using SlimeRouter (see Slime Router), additional fault tolerance features are available.

Worker Quarantine
SlimeRouter maintains a quarantine list of unhealthy workers (see router.py):
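A minimal sketch of what such a quarantine list might look like (invented names, not the actual router.py):

```python
import time

class WorkerQuarantine:
    """Keeps failed workers out of routing until they are explicitly
    released, e.g. after a restart and weight sync."""

    def __init__(self):
        self.quarantined = {}  # worker url -> time it was quarantined

    def quarantine(self, worker):
        self.quarantined[worker] = time.monotonic()

    def release(self, worker):
        self.quarantined.pop(worker, None)

    def routable(self, workers):
        # Only workers not currently quarantined receive new requests.
        return [w for w in workers if w not in self.quarantined]
```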
Automatic Load Balancing
The router automatically redistributes load among healthy workers (see router.py):
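A hypothetical least-loaded selection among healthy workers (invented names, not the actual router.py):

```python
def pick_worker(in_flight, quarantined):
    """Route to the healthy worker with the fewest in-flight requests.

    `in_flight` maps worker url -> current request count; quarantined
    workers are skipped entirely.
    """
    candidates = {w: n for w, n in in_flight.items() if w not in quarantined}
    if not candidates:
        raise RuntimeError("no healthy workers available")
    return min(candidates, key=candidates.get)
```

When a worker is quarantined, its share of traffic falls to the remaining workers automatically, since it is simply never selected.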
Best Practices
Set Conservative Timeouts
Use conservative timeout values initially, then tune based on your cluster’s characteristics. It’s better to tolerate occasional slowness than to falsely mark healthy servers as failed.
Monitor Health Metrics
Track health check success/failure rates in your monitoring system. Patterns of failures can indicate underlying infrastructure issues.
Account for Compilation
Large models with MoE layers may take 5-10 minutes to compile on first run. Set --rollout-health-check-first-wait accordingly.

Test Failure Recovery
Periodically test fault tolerance by manually killing servers during training. Verify that training continues smoothly and weights stay synchronized.
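One way to exercise this, assuming your SGLang servers were started via `sglang.launch_server` (adjust the process pattern to your deployment):

```shell
# Kill one SGLang server mid-rollout to exercise fault tolerance.
# The process pattern is an assumption; match however you launch SGLang.
pkill --oldest -f "sglang.launch_server"
```

Then watch the training logs: the rollout should finish on the remaining servers, and the killed server should be restarted and re-synced after the round completes.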
Limitations
Future Improvements
Planned enhancements to the fault tolerance system:
- Automatic weight version synchronization for restarted workers
- Training-side failure recovery with checkpoint rollback
- Predictive failure detection using hardware telemetry
- Configurable retry policies for different failure types