
Rollout Generation

Rollout generation is the process of using the current policy to generate new training data. slime uses SGLang for high-throughput inference with advanced features like dynamic sampling and partial rollouts.

Basic Rollout Flow

From sglang_rollout.py:108-205, the standard generation process:
async def generate(args, sample, sampling_params):
    """Generate using SGLang router with token-based workflow"""
    state = GenerateState(args)
    url = f"http://{args.sglang_router_ip}:{args.sglang_router_port}/generate"
    
    # 1. Tokenize prompt
    if state.processor and sample.multimodal_inputs:
        processor_output = state.processor(text=sample.prompt, **processor_kwargs)
        prompt_ids = processor_output["input_ids"][0]
    else:
        prompt_ids = state.tokenizer.encode(sample.prompt, add_special_tokens=False)
    
    # 2. Prepare payload
    payload = {
        "input_ids": prompt_ids,
        "sampling_params": sampling_params,
        "return_logprob": True,
    }
    
    # 3. Send request to SGLang
    output = await post(url, payload, headers=headers)
    
    # 4. Update sample with response
    new_response_tokens = [item[1] for item in output["meta_info"]["output_token_logprobs"]]
    new_response_log_probs = [item[0] for item in output["meta_info"]["output_token_logprobs"]]
    
    sample.tokens = sample.tokens + new_response_tokens
    sample.response_length += len(new_response_tokens)
    sample.response += output["text"]
    sample.rollout_log_probs += new_response_log_probs
    
    return sample
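The bookkeeping in step 4 can be exercised standalone. Below is a minimal sketch with a mocked SGLang response: `FakeSample` is a stand-in for slime's `Sample`, and the `meta_info` shape (each `output_token_logprobs` entry is `(logprob, token_id, ...)`) follows the snippet above.

```python
from dataclasses import dataclass, field

@dataclass
class FakeSample:
    # Minimal stand-in for slime's Sample; only the fields used below.
    tokens: list = field(default_factory=list)
    response: str = ""
    response_length: int = 0
    rollout_log_probs: list = field(default_factory=list)

def apply_output(sample, output):
    # Each entry of output_token_logprobs is (logprob, token_id, ...)
    pairs = output["meta_info"]["output_token_logprobs"]
    new_tokens = [item[1] for item in pairs]
    new_logprobs = [item[0] for item in pairs]
    sample.tokens = sample.tokens + new_tokens
    sample.response_length += len(new_tokens)
    sample.response += output["text"]
    sample.rollout_log_probs += new_logprobs
    return sample

mock_output = {
    "text": "Paris",
    "meta_info": {"output_token_logprobs": [(-0.1, 3000), (-0.5, 3001)]},
}
sample = apply_output(FakeSample(tokens=[1, 2]), mock_output)
```

Note that `sample.tokens` accumulates prompt and response tokens together, while `response_length` tracks only the newly generated suffix.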

Sampling Parameters

ROLLOUT_ARGS=(
    # Temperature for sampling (higher = more random)
    --rollout-temperature 1.0
    
    # Top-p (nucleus) sampling
    --rollout-top-p 1.0
    
    # Top-k sampling (-1 disables top-k filtering)
    --rollout-top-k -1
    
    # Maximum response length
    --rollout-max-response-len 8192
    
    # Stop sequences
    --rollout-stop "</s>" "<|endoftext|>"
    
    # Stop token IDs
    --rollout-stop-token-ids 128001 128008
)
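A hedged sketch of how the flags above might map onto the `sampling_params` dict sent to SGLang's `/generate` endpoint. The key names follow SGLang's sampling-parameter conventions; the exact mapping slime performs internally may differ.

```python
def build_sampling_params(args: dict) -> dict:
    # Translate rollout CLI flags into an SGLang-style sampling_params dict.
    return {
        "temperature": args["rollout_temperature"],
        "top_p": args["rollout_top_p"],
        "top_k": args["rollout_top_k"],          # -1 disables top-k
        "max_new_tokens": args["rollout_max_response_len"],
        "stop": args["rollout_stop"],
        "stop_token_ids": args["rollout_stop_token_ids"],
    }

params = build_sampling_params({
    "rollout_temperature": 1.0,
    "rollout_top_p": 1.0,
    "rollout_top_k": -1,
    "rollout_max_response_len": 8192,
    "rollout_stop": ["</s>", "<|endoftext|>"],
    "rollout_stop_token_ids": [128001, 128008],
})
```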

Custom Generation Functions

For complex workflows like multi-turn interactions or tool use, slime supports custom generation functions.

Basic Custom Function

# custom_generate.py
async def custom_generate(args, sample: Sample, sampling_params) -> Sample:
    """Custom generation function with special logic"""
    
    # Access metadata from sample
    session_id = sample.metadata.get("session_id")
    tools = sample.metadata.get("tools")
    
    # Your custom generation logic
    response = await your_generation_logic(sample.prompt, tools)
    
    # Update sample
    sample.response = response
    sample.tokens = tokenize(sample.prompt + response)
    
    return sample

Multi-Turn Generation

From the quick start guide (lines 476-521):
async def generate(args, sample: Sample, sampling_params) -> Sample:
    prompt = sample.prompt
    full_response = ""
    loss_masks = []
    
    for turn in range(max_turns):
        # 1. Model generates action
        model_output = await call_sglang(prompt + full_response, ...)
        model_tokens = tokenize(model_output)
        
        # Mark model-generated tokens for loss calculation
        loss_masks += [1] * len(model_tokens)
        full_response += model_output
        
        # 2. Parse action
        action, content = parse_action(model_output)
        
        if action == "search":
            # 3. Execute tool
            tool_output = await google_search(content)
            tool_tokens = tokenize(tool_output)
            
            # DO NOT calculate loss on tool outputs
            loss_masks += [0] * len(tool_tokens)
            full_response += tool_output
        
        elif action == "answer":
            break  # End multi-turn interaction
    
    # Update sample with complete trajectory
    sample.response = full_response
    sample.tokens = tokenize(prompt + full_response)
    sample.loss_mask = loss_masks
    
    return sample
Loss Masking is Critical: The loss_mask must have the same length as the response tokens. Set each entry to:
  • 1 for tokens that should contribute to loss (model-generated)
  • 0 for tokens that should NOT contribute to loss (environment/tool outputs)
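The invariant can be checked on a toy trajectory. Token ids below are fake; what matters is that the mask and the response tokens stay aligned one-to-one.

```python
# Model-generated tokens receive loss; tool outputs do not.
model_tokens = [11, 12, 13]        # model turn -> loss
tool_tokens = [21, 22]             # tool/environment output -> no loss
response_tokens = model_tokens + tool_tokens
loss_mask = [1] * len(model_tokens) + [0] * len(tool_tokens)

# The mask must cover every response token, no more and no fewer.
assert len(loss_mask) == len(response_tokens)
num_loss_tokens = sum(loss_mask)
```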

Enabling Custom Functions

CUSTOM_ARGS=(
    # Path format: module.file.function_name
    --custom-generate-function-path your_module.multiturn_logic.generate
)

Reward Models

slime supports multiple built-in reward model types and custom reward functions.

Built-in Reward Models

From rm_hub/__init__.py:55-92:
# Extract answer from response and compare with label
def get_deepscaler_rule_based_reward(response, label):
    # Parse model answer
    if "</think>" in response:
        model_solution = response.split("</think>")[-1]
    elif "###Response" in response:
        model_solution = response.split("###Response")[1]
    else:
        return 0
    
    # Extract boxed answer
    model_answer = extract_answer(model_solution)
    if model_answer is None:
        return 0
    
    # Grade against ground truth
    for ground_truth in processed_ground_truths:
        is_correct = (grade_answer_mathd(model_answer, ground_truth) or 
                     grade_answer_sympy(model_answer, ground_truth))
        if is_correct:
            return 1
    
    return 0
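The core of the rule above can be reproduced in a self-contained form. The stand-in below extracts the text after `</think>`, pulls a `\boxed{...}` answer, and does a plain string match; slime's real graders (`grade_answer_mathd`, `grade_answer_sympy`) are far more robust to formatting differences.

```python
import re

def simple_rule_based_reward(response: str, label: str) -> int:
    # Simplified stand-in for get_deepscaler_rule_based_reward.
    if "</think>" in response:
        solution = response.split("</think>")[-1]
    else:
        return 0
    # Extract the \boxed{...} answer, if any.
    m = re.search(r"\\boxed\{([^}]*)\}", solution)
    if m is None:
        return 0
    return 1 if m.group(1).strip() == label.strip() else 0

r = simple_rule_based_reward("reasoning...</think> The answer is \\boxed{42}.", "42")
```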
Usage:
# Rule-based math reward (shown above)
--rm-type deepscaler

# DAPO-style math reward (from dapo-math-17k dataset)
--rm-type dapo
Computes correctness score for mathematical reasoning tasks.

# Simple correct/incorrect for math problems
--rm-type math
Returns 1 if correct, 0 otherwise, using veRL’s grading function.

# Token-level F1 score between response and label
--rm-type f1
Useful for tasks requiring partial credit (e.g., entity extraction).

# Google-Proof Q&A reward
--rm-type gpqa
For graduate-level science questions.

# Instruction-following benchmark reward
--rm-type ifbench
Measures instruction-following capabilities.

# Call external reward model service
--rm-type remote_rm
--rm-url http://your-reward-model-server:8000/evaluate
Sends {"prompt", "response", "label"} to the external service.
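The request body for the remote service follows the `{"prompt", "response", "label"}` contract above. A small sketch of building and serializing that payload (any extra fields or the exact field order are assumptions; check your server's schema):

```python
import json

def build_remote_rm_payload(prompt: str, response: str, label: str) -> dict:
    # The three fields the remote reward service expects.
    return {"prompt": prompt, "response": response, "label": label}

payload = build_remote_rm_payload("2+2=?", "4", "4")
body = json.dumps(payload)  # what would be POSTed to --rm-url
```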

Custom Reward Functions

# custom_reward.py
async def reward_func(args, sample: Sample, **kwargs) -> float:
    """Custom reward function"""
    
    # Access sample data
    prompt = sample.prompt
    response = sample.response
    label = sample.label
    metadata = sample.metadata
    
    # Your custom reward logic
    if check_format(response):
        format_score = 1.0
    else:
        format_score = 0.0
    
    correctness = compute_correctness(response, label)
    
    # Combine multiple reward signals
    total_reward = 0.7 * correctness + 0.3 * format_score
    
    return total_reward
# Enable custom reward function
--custom-rm-path your_module.custom_reward.reward_func
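A worked instance of the weighted combination above, with trivial stand-ins for `check_format` and `compute_correctness` (both are hypothetical helpers, not slime APIs):

```python
def check_format(response: str) -> bool:
    # Toy format check: require a trailing period.
    return response.strip().endswith(".")

def compute_correctness(response: str, label: str) -> float:
    # Toy correctness check: label appears in the response.
    return 1.0 if label in response else 0.0

response, label = "The answer is 4", "4"   # correct, but badly formatted
format_score = 1.0 if check_format(response) else 0.0
correctness = compute_correctness(response, label)
total_reward = 0.7 * correctness + 0.3 * format_score
```

Here the response is correct but fails the format check, so only the correctness term contributes.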

Group Reward Models

For algorithms like GRPO that need to compare multiple responses:
async def batched_reward(args, samples: list[Sample], **kwargs) -> list[float]:
    """Reward function that processes a group of samples together"""
    
    # All samples share the same prompt
    assert len(set(s.prompt for s in samples)) == 1
    
    responses = [s.response for s in samples]
    
    # Compute relative ranking or pairwise comparisons
    rewards = compute_group_rewards(responses)
    
    return rewards
--group-rm
--custom-rm-path your_module.custom_reward.batched_reward
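`compute_group_rewards` above is hypothetical; one way to fill it in is to compute a per-response base score and then rank within the group, so rewards are relative rather than absolute. The length-based base score below is purely illustrative.

```python
def compute_group_rewards(responses: list[str]) -> list[float]:
    # Stand-in quality score; replace with a real scorer.
    base = [len(r) for r in responses]
    # Rank responses within the group; reward = normalized rank in [0, 1].
    order = sorted(range(len(base)), key=lambda i: base[i])
    n = len(responses)
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        rewards[idx] = rank / (n - 1) if n > 1 else 0.0
    return rewards

rewards = compute_group_rewards(["short", "medium len", "the longest response"])
```

Rank-based rewards are scale-free, which pairs naturally with group-relative advantage estimators like GRPO.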

Dynamic Sampling

Dynamic sampling allows you to filter out low-quality rollout data before training, improving data efficiency.
From the quick start guide (lines 339-370):
ROLLOUT_ARGS=(
    # Over-sample more prompts than needed
    --rollout-batch-size 32
    --over-sampling-batch-size 64
    
    # Filter function to select high-quality samples
    --dynamic-sampling-filter-path \
        slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
)

Built-in Filter: Non-Zero Reward Std

From the quick start guide:
def check_reward_nonzero_std(args, samples: list[Sample], **kwargs):
    """Keep only groups where rewards have non-zero standard deviation"""
    rewards = [sample.get_reward_value(args) for sample in samples]
    keep = torch.tensor(rewards, dtype=torch.float).std() > 0.0
    
    return DynamicFilterOutput(
        keep=keep,
        reason=None if keep else f"zero_std_{round(rewards[0], 1)}",
    )
This ensures diversity in rewards for each prompt group, preventing the model from learning on homogeneous data.
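The same idea can be checked without torch using the standard library: a group whose rewards are all identical has zero standard deviation and carries no group-relative learning signal, so it is dropped.

```python
from statistics import pstdev

def has_nonzero_reward_std(rewards: list[float]) -> bool:
    # Population std over the group's rewards; 0.0 means all equal.
    return pstdev(rewards) > 0.0

keep_mixed = has_nonzero_reward_std([0.0, 1.0, 1.0, 0.0])    # kept
keep_uniform = has_nonzero_reward_std([1.0, 1.0, 1.0, 1.0])  # filtered out
```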

Custom Dynamic Filters

from slime.rollout.filter_hub.base_types import DynamicFilterOutput

def custom_filter(args, samples: list[Sample], **kwargs):
    """Custom filter logic"""
    
    # Example: Keep only if at least one response is correct
    has_correct = any(s.reward > 0.9 for s in samples)
    
    # Example: Filter by response length
    avg_length = sum(s.response_length for s in samples) / len(samples)
    reasonable_length = 50 <= avg_length <= 2000
    
    keep = has_correct and reasonable_length
    
    return DynamicFilterOutput(
        keep=keep,
        reason="filtered" if not keep else None,
    )

Partial Rollout

Partial rollout caches generations that are aborted mid-way during dynamic sampling so they can be resumed in a later rollout, instead of being discarded and regenerated from scratch.
# Enable partial rollout
--partial-rollout

# Optional: Mask off-policy tokens in partial rollouts
--mask-offpolicy-in-partial-rollout

# Custom buffer extraction strategy
--buffer-filter-path slime.rollout.filter_hub.buffer_filters.pop_first
From sglang_rollout.py:214-217:
# Mark previous off-policy generation for partial rollout
if args.partial_rollout and args.mask_offpolicy_in_partial_rollout:
    if sample.response_length > 0:
        sample.loss_mask = [0] * sample.response_length
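The masking step above can be seen on a toy resumed sample: the earlier response tokens were generated by a previous policy version (off-policy), so they are excluded from the loss. `PartialSample` is a minimal stand-in for slime's `Sample`.

```python
class PartialSample:
    # Minimal stand-in with only the fields the masking step touches.
    def __init__(self, response_length: int):
        self.response_length = response_length
        self.loss_mask = None

sample = PartialSample(response_length=4)  # 4 tokens from a previous policy
if sample.response_length > 0:
    # Zero out loss for every previously generated (off-policy) token.
    sample.loss_mask = [0] * sample.response_length
```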
From the quick start guide (lines 378-386):
def pop_first(args, rollout_id, buffer: list[list[Sample]], num_samples: int):
    """Extract samples from buffer in FIFO order"""
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples
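`pop_first` is repeated verbatim below so the snippet runs standalone; the usage shows that groups leave the buffer in FIFO order and the buffer shrinks in place.

```python
def pop_first(args, rollout_id, buffer: list[list], num_samples: int):
    """Extract samples from buffer in FIFO order."""
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples

# Toy buffer of three sample groups.
buffer = [["g0"], ["g1"], ["g2"]]
popped = pop_first(None, 0, buffer, 2)
```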

Data Preprocessing

For complex workflows, preprocess your data to include metadata:
import json

# Original data
data = [
    {"question": "...", "answer": "...", "session_id": "...", "tools": [...]},
]

# Preprocess: Pack extra fields into metadata
processed = []
for item in data:
    processed.append({
        "question": item["question"],
        "answer": item["answer"],
        "metadata": json.dumps({
            "session_id": item["session_id"],
            "tools": item["tools"],
        })
    })

with open("processed.jsonl", "w") as f:
    for item in processed:
        f.write(json.dumps(item) + "\n")
ROLLOUT_ARGS=(
    --prompt-data /path/to/processed.jsonl
    --input-key question
    --label-key answer
    --metadata-key metadata  # slime auto-parses JSON to dict
)
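A quick check of the metadata round trip described above: extra fields are JSON-encoded as a string at preprocessing time, and the `--metadata-key` column is parsed back into a dict when the sample is loaded.

```python
import json

# One preprocessed JSONL row, as written by the script above.
row = {
    "question": "What is 2+2?",
    "answer": "4",
    "metadata": json.dumps({"session_id": "s-1", "tools": ["search"]}),
}

# slime auto-parses the metadata JSON string back into a dict.
metadata = json.loads(row["metadata"])
```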

FP8 Inference with BF16 Training

For faster rollout generation, use FP8 quantized weights for inference while training in BF16.
# Download FP8 checkpoint
hf download Qwen/Qwen3-4B-FP8 --local-dir /root/Qwen3-4B-FP8

CKPT_ARGS=(
    # Use FP8 checkpoint for tokenizer and rollout
    --hf-checkpoint /root/Qwen3-4B-FP8
    
    # Use BF16 checkpoint for training
    --ref-load /root/Qwen3-4B_torch_dist
    --load /root/Qwen3-4B_slime/
)
slime automatically casts BF16 weights to FP8 during weight synchronization.

Training Loop

Learn about the complete training cycle

Algorithms

Understand advantage estimation methods
