What is SlimeRouter?
SlimeRouter is a small FastAPI service that provides:
- Worker registration: Registers SGLang HTTP servers into a local pool
- Load balancing: Routes requests using simple least-inflight load balancing
- Request proxying: Proxies arbitrary paths (e.g. /generate) to selected workers
- Health monitoring: Runs periodic health checks and quarantines unhealthy workers
- Middleware plugins: Supports middleware via --slime-router-middleware-paths for rollout-specific processing (e.g. caching, request/response transforms)
How It Is Launched
In distributed training, slime automatically starts a router when --sglang-router-ip is not provided:
- If --use-slime-router is set, slime starts SlimeRouter
- Otherwise, slime starts SGLang Model Gateway
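The launch decision above can be sketched as a small selector. This is a hypothetical illustration; the function and return values are not slime's actual internals:

```python
from typing import Optional

def choose_router(sglang_router_ip: Optional[str], use_slime_router: bool) -> str:
    """Illustrative sketch of slime's router-selection rule."""
    if sglang_router_ip is not None:
        return "external"            # reuse the user-provided router
    if use_slime_router:
        return "slime_router"        # --use-slime-router set: start SlimeRouter
    return "sglang_model_gateway"    # default: start SGLang Model Gateway
```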
Why We Need SlimeRouter
Unlike production inference, RL rollout needs to capture additional metadata for training:
- Token-level log probabilities
- Loss masks
- Expert routing decisions (for MoE models)
Radix-Tree Cache
Use this when your rollout pipeline is text-in/text-out and you cannot reliably persist token IDs. If you already control token-in/token-out (e.g. search r1, multiturn VLM examples), you likely don’t need the radix-tree cache.
How It Works
- Intercepts text-based requests and tokenizes them
- Stores trajectories (text, token IDs, logprobs, loss masks) keyed by text prefix in a radix tree
- Uses longest-prefix matching to reuse cached token sequences
- Allows insertion of new text continuations as rollout proceeds (multiple trajectories per prompt for GRPO)
- Periodically cleans up stale nodes to control memory usage
- After rollout finishes, calling /retrieve_from_text returns exact token sequences with aligned metadata
Implementation
The radix tree is a string-based trie structure optimized for prefix matching; see radix_tree.py.
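As a rough illustration of what radix_tree.py provides, here is a minimal text-keyed trie with longest-prefix matching. This is a sketch, not slime's implementation: it uses a plain per-character trie rather than a compressed radix tree, and the payload fields are illustrative.

```python
class _Node:
    __slots__ = ("children", "payload")

    def __init__(self):
        self.children = {}   # char -> _Node
        self.payload = None  # set only at nodes that end a stored text

class TextRadixCache:
    """Store trajectory metadata keyed by text; reuse the longest cached prefix."""

    def __init__(self):
        self.root = _Node()

    def insert(self, text, token_ids, logprobs, loss_mask):
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, _Node())
        node.payload = {"token_ids": token_ids, "logprobs": logprobs,
                        "loss_mask": loss_mask}

    def longest_prefix(self, text):
        """Return (matched_len, payload) for the longest stored text that is a
        prefix of `text`; (0, None) if nothing matches."""
        node, best = self.root, (0, None)
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.payload is not None:
                best = (i + 1, node.payload)
        return best
```

Inserting both "Hello" and "Hello world" then querying "Hello world!" returns the longer match, which is how multiple trajectories per prompt can share cached tokens.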
Middleware Integration
The radix tree is integrated via middleware that intercepts /generate requests; see radix_tree_middleware.py.
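The interception pattern can be illustrated with a drastically simplified, framework-free stand-in: exact-text memoization instead of a radix tree. All names here are hypothetical, not slime's middleware API:

```python
def with_text_cache(handler, cache):
    """Wrap a request handler so /generate responses are reused by prompt text."""
    def wrapped(path, request):
        if path != "/generate":
            return handler(path, request)   # other paths pass through untouched
        text = request["text"]
        if text not in cache:               # miss: ask a worker, then store
            cache[text] = handler(path, request)
        return cache[text]                  # hit: reuse the stored trajectory
    return wrapped
```

The real middleware does prefix matching rather than exact matching, and inserts new continuations after the worker responds, but the intercept-then-cache shape is the same.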
Use Cases
GRPO with Multiple Trajectories
Multiple samples sharing the same prompt prefix can reuse cached tokens, reducing tokenization overhead and ensuring consistency.
Text-Based Rollout Code
When you have text-based rollout code and want token-level precision without rewriting your pipeline.
Rollout Routing Replay (R3) for MoE
For MoE models, slime supports Rollout Routing Replay (R3): record expert routing decisions during rollout and replay them during training to improve stability.
SGLang Side
SGLang provides expert routing capture via:
- --enable-return-routed-experts: Server argument to enable routing capture
- RoutedExpertsCapturer: Captures topk_ids (the selected expert IDs) at each MoE layer during the forward pass
- return_routed_experts: Request parameter to retrieve routing data
- The response meta_info includes routed_experts, a [seq_len - 1, num_layers, top_k] tensor of expert IDs
Slime Side
Slime consumes the routing data and replays it during training:
- Rollout sends return_routed_experts=True and stores the result in sample.rollout_routed_experts
- Training calls fill_routing_replay() to load routing data into RoutingReplay objects
- During the forward pass, the recorded routing decisions are replayed instead of recomputed
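The replay flow can be sketched as follows. This is a hypothetical illustration; slime's actual RoutingReplay and fill_routing_replay() interfaces may differ:

```python
class RoutingReplaySketch:
    """Record top-k expert IDs during rollout; replay them during training."""

    def __init__(self):
        self.routed_experts = None  # [seq_len - 1][num_layers][top_k] expert IDs

    def fill(self, routed_experts):
        """Load routing data captured during rollout (cf. fill_routing_replay())."""
        self.routed_experts = routed_experts

    def pick_experts(self, token_idx, layer_idx, router_fn):
        """Use the recorded choice when available, else recompute via the router."""
        if self.routed_experts is not None:
            return self.routed_experts[token_idx][layer_idx]
        return router_fn(token_idx, layer_idx)
```

The point of the lookup is determinism: training sees exactly the experts the rollout used, so the forward pass matches even if the router would now score experts differently.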
Why SlimeRouter Is Required
SlimeRouter is needed because the SGLang worker returns routed experts in the response (meta_info.routed_experts), and SlimeRouter preserves this field end-to-end. SGLang Model Gateway may drop this extra metadata when it reconstructs responses with a fixed schema.
Architecture
See router.py for the full implementation.
Load Balancing
SlimeRouter uses least-inflight load balancing; see router.py.
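In essence, least-inflight balancing picks the worker with the fewest outstanding requests. A simplified sketch (the real logic in router.py also accounts for worker health):

```python
class LeastInflightBalancer:
    """Route each request to the worker with the fewest in-flight requests."""

    def __init__(self, workers):
        self.inflight = {w: 0 for w in workers}

    def acquire(self):
        """Pick the least-loaded worker and count the new in-flight request."""
        worker = min(self.inflight, key=self.inflight.get)
        self.inflight[worker] += 1
        return worker

    def release(self, worker):
        """Decrement the count once the proxied response completes."""
        self.inflight[worker] -= 1
```

acquire() is called when a request is proxied and release() when the response completes, so the counts track concurrency rather than total traffic.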
Health Monitoring
A background health-check loop monitors all workers; see router.py.
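The loop's shape is roughly the following. This is a hypothetical sketch; the probe function, interval, and quarantine handling in router.py differ in detail:

```python
import asyncio

async def health_check_loop(workers, quarantined, probe, interval=5.0, rounds=None):
    """Probe every worker each `interval` seconds; quarantine failures and
    restore workers that become healthy again. `rounds` bounds the loop for
    demonstration; the real loop runs until shutdown."""
    n = 0
    while rounds is None or n < rounds:
        # snapshot both sets so we can move workers between them while iterating
        for worker in list(workers) + list(quarantined):
            if await probe(worker):
                quarantined.discard(worker)
                workers.add(worker)
            else:
                workers.discard(worker)
                quarantined.add(worker)
        n += 1
        await asyncio.sleep(interval)
```

Quarantined workers keep being probed, so a worker that recovers is returned to the routing pool automatically.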
SlimeRouter vs SGLang Model Gateway
SlimeRouter and SGLang Model Gateway can both route requests to workers, but they are optimized for different goals.
Key Differences
When to Use Which
Use SlimeRouter
- You need R3 (rollout routing replay)
- You need radix-tree caching
- You need custom middleware for RL metadata
Use SGLang Model Gateway
- Everything else (recommended default)
- Maximum throughput and scalability
- Advanced fault tolerance
- Cache-aware routing
Session-Affinity Routing for Multi-Turn Agents
When using SGLang Model Gateway with the consistent-hashing routing policy, slime automatically assigns each rollout session a unique session ID and uses it as the routing key to enable session affinity.
What Is Session Affinity?
Session affinity (also called sticky sessions) ensures that all requests belonging to the same conversation or agent session are routed to the same backend worker. This is beneficial for:
- Multi-turn dialogues: Keeping the same worker improves prefix cache hit rates
- Multi-agent systems: Ensures agent state consistency and better resource locality
- Debugging: Makes it easier to trace and debug specific sessions
How It Works
When the rollout system generates samples, each sample is assigned a unique session_id:
- Automatically generated using UUID for each sample
- Stored in the sample.session_id field
- Passed as the X-SMG-Routing-Key header when the router policy is consistent_hashing
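Put together, the routing-key logic looks roughly like this. The X-SMG-Routing-Key header and consistent_hashing policy name come from the text above; the helper itself is hypothetical:

```python
import uuid

def routing_headers(sample, policy):
    """Assign a session_id once per sample and expose it as the routing key
    only when the gateway uses consistent hashing."""
    if getattr(sample, "session_id", None) is None:
        sample.session_id = str(uuid.uuid4())   # one UUID per sample
    if policy == "consistent_hashing":
        return {"X-SMG-Routing-Key": sample.session_id}
    return {}
```

Because the session_id is set once and reused, every later turn of the same sample hashes to the same worker.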
Configuration
To enable session-affinity routing, set the SGLang Model Gateway routing policy to consistent_hashing.
Notes
- Each sample gets its own unique session ID
- Different samples in the same group may be routed to different workers
- The same sample’s subsequent turns will maintain the same session ID
- Currently, this feature is only available for SGLang Model Gateway
API Reference
POST /add_worker
Add a new worker to the router.
GET /list_workers
List all registered workers.
POST /retrieve_from_text
Get token information from text input (requires RadixTreeMiddleware).
Example: Using RadixTreeMiddleware
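Since the request schemas are not reproduced here, the following is a hypothetical stdlib-only client sketch; the payload field names ("url", "text") are assumptions to verify against the slime source.

```python
import json
import urllib.request

def build_request(router, method, path, payload=None):
    """Build a JSON request for one of the router endpoints above."""
    data = None if payload is None else json.dumps(payload).encode()
    return urllib.request.Request(router + path, data=data, method=method,
                                  headers={"Content-Type": "application/json"})

# Usage against a running router (URLs and payloads are illustrative):
#   req = build_request("http://127.0.0.1:30000", "POST", "/add_worker",
#                       {"url": "http://worker:8000"})
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read())
```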
Best Practices
Choose the Right Router
Use SlimeRouter only when you need its specialized features (R3, radix-tree). Otherwise, use SGLang Model Gateway for better performance.
Configure Connection Pooling
Set slime-router-max-connections based on your concurrency needs. The default is sglang-server-concurrency * rollout-num-gpus / rollout-num-gpus-per-engine.
Monitor Cache Hit Rates
When using radix-tree caching, monitor cache hit rates to ensure the cache is effective. Low hit rates may indicate the cache size is too small.
Test Middleware Plugins
Test custom middleware plugins thoroughly in development before deploying to production. Middleware errors can break the entire rollout pipeline.