Rollout generation is the process of using the current policy to generate new training data. slime uses SGLang for high-throughput inference with advanced features like dynamic sampling and partial rollouts.
```python
# Extract answer from response and compare with label
def get_deepscaler_rule_based_reward(response, label):
    # Parse model answer
    if "</think>" in response:
        model_solution = response.split("</think>")[-1]
    elif "###Response" in response:
        model_solution = response.split("###Response")[1]
    else:
        return 0

    # Extract boxed answer
    model_answer = extract_answer(model_solution)
    if model_answer is None:
        return 0

    # Grade against the ground-truth label (answer normalization elided in this excerpt)
    processed_ground_truths = [label]
    for ground_truth in processed_ground_truths:
        is_correct = (grade_answer_mathd(model_answer, ground_truth)
                      or grade_answer_sympy(model_answer, ground_truth))
        if is_correct:
            return 1
    return 0
```
Usage:

```bash
--rm-type deepscaler
```
DAPO (Math)
```bash
# DAPO-style math reward (from dapo-math-17k dataset)
--rm-type dapo
```
Computes correctness score for mathematical reasoning tasks.
Math (veRL-style)
```bash
# Simple correct/incorrect for math problems
--rm-type math
```
Returns 1 if correct and 0 otherwise, using veRL's grading function.
F1 Score
```bash
# Token-level F1 score between response and label
--rm-type f1
```
Useful for tasks requiring partial credit (e.g., entity extraction).
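As an illustration of how token-level F1 gives partial credit, here is a minimal sketch; the function name `token_f1` and whitespace tokenization are assumptions for this example, not slime's exact implementation:

```python
from collections import Counter

def token_f1(response: str, label: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens.

    NOTE: whitespace tokenization is a simplifying assumption for this sketch.
    """
    pred_tokens = response.split()
    gold_tokens = label.split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(pred, gold) times
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A response that names the right entity plus extra tokens earns a score between 0 and 1 rather than a hard 0, which is exactly the partial-credit behavior entity-extraction tasks need.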
For algorithms like GRPO that need to compare multiple responses:
```python
async def batched_reward(args, samples: list[Sample], **kwargs) -> list[float]:
    """Reward function that processes a group of samples together."""
    # All samples share the same prompt
    assert len(set(s.prompt for s in samples)) == 1

    responses = [s.response for s in samples]
    # Compute relative ranking or pairwise comparisons
    rewards = compute_group_rewards(responses)
    return rewards
```
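`compute_group_rewards` is left abstract above. One minimal concrete choice (an assumption for illustration, not slime's implementation) is mean-centered scoring, which expresses each response's reward relative to its group:

```python
def compute_group_rewards(responses: list[str]) -> list[float]:
    """Score each response, then center the scores on the group mean (a GRPO-style baseline)."""
    # Hypothetical per-response scorer: response length, as a stand-in for a real grader.
    raw = [float(len(r)) for r in responses]
    mean = sum(raw) / len(raw)
    # Centered rewards sum to zero within the group
    return [r - mean for r in raw]
```

Because the centered rewards sum to zero, a response is only pushed up if it beats the group average, which is the relative-comparison behavior batched rewards exist to provide.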
```python
import torch

from slime.rollout.filter_hub.base_types import DynamicFilterOutput

def check_reward_nonzero_std(args, samples: list[Sample], **kwargs):
    """Keep only groups where rewards have non-zero standard deviation."""
    rewards = [sample.get_reward_value(args) for sample in samples]
    keep = torch.tensor(rewards, dtype=torch.float).std() > 0.0
    return DynamicFilterOutput(
        keep=keep,
        reason=None if keep else f"zero_std_{round(rewards[0], 1)}",
    )
```
This ensures diversity in rewards for each prompt group, preventing the model from learning on homogeneous data.
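To see why zero-variance groups are useless, consider the GRPO advantage, which normalizes each reward by the group statistics: when every response in a group gets the same reward, every advantage (and hence every gradient contribution) is zero. A minimal illustration (plain Python, not slime's code; the epsilon guard is an assumption):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (r_i - mean) / (std + eps); all zeros for a homogeneous reward group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

A group like `[1.0, 1.0, 1.0]` produces all-zero advantages, so filtering it out saves training compute without changing the gradient.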
from slime.rollout.filter_hub.base_types import DynamicFilterOutputdef custom_filter(args, samples: list[Sample], **kwargs): """Custom filter logic""" # Example: Keep only if at least one response is correct has_correct = any(s.reward > 0.9 for s in samples) # Example: Filter by response length avg_length = sum(s.response_length for s in samples) / len(samples) reasonable_length = 50 <= avg_length <= 2000 keep = has_correct and reasonable_length return DynamicFilterOutput( keep=keep, reason="filtered" if not keep else None, )
```python
# Mark previous off-policy generation for partial rollout
if args.partial_rollout and args.mask_offpolicy_in_partial_rollout:
    if sample.response_length > 0:
        sample.loss_mask = [0] * sample.response_length
```
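The effect of this masking can be illustrated end-to-end: when a partially generated sample is resumed under a newer policy, the tokens already in its response are zeroed out of the loss, and only the freshly generated continuation contributes. A sketch with a simplified `Sample` stand-in (not slime's actual class) and a hypothetical helper:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """Simplified stand-in for slime's Sample (illustration only)."""
    response_length: int = 0
    loss_mask: list[int] = field(default_factory=list)

def mask_offpolicy_then_extend(sample: Sample, new_tokens: int) -> Sample:
    """Hypothetical helper: mask the off-policy prefix, then append on-policy tokens."""
    # Tokens generated under the old policy are excluded from the loss ...
    sample.loss_mask = [0] * sample.response_length
    # ... while tokens generated after resumption are trained on normally.
    sample.loss_mask += [1] * new_tokens
    sample.response_length += new_tokens
    return sample
```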
From the quick start guide (lines 378-386):
```python
def pop_first(args, rollout_id, buffer: list[list[Sample]], num_samples: int):
    """Extract samples from buffer in FIFO order."""
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples
```
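A quick usage sketch, with lists of strings standing in for `list[Sample]` groups (the function is repeated here so the demo is self-contained):

```python
def pop_first(args, rollout_id, buffer, num_samples):
    """FIFO pop, repeated from above for a self-contained demo."""
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples

# Three buffered prompt groups; request the two oldest.
buffer = [["g0_s0"], ["g1_s0"], ["g2_s0"]]
popped = pop_first(None, 0, buffer, 2)
# The oldest groups come out first; one group remains buffered.
```

Asking for more groups than the buffer holds simply drains it, since `num_to_pop` is clamped to `len(buffer)`.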
For faster rollout generation, use FP8-quantized weights for inference while training in BF16.
```bash
# Download FP8 checkpoint
hf download Qwen/Qwen3-4B-FP8 --local-dir /root/Qwen3-4B-FP8

CKPT_ARGS=(
    # Use FP8 checkpoint for tokenizer and rollout
    --hf-checkpoint /root/Qwen3-4B-FP8
    # Use BF16 checkpoint for training
    --ref-load /root/Qwen3-4B_torch_dist
    --load /root/Qwen3-4B_slime/
)
```
slime automatically casts BF16 weights to FP8 during weight synchronization.