Rollout generation is the process of using the current policy to generate new training data. slime uses SGLang for high-throughput inference with advanced features like dynamic sampling and partial rollouts.
```python
# Extract answer from response and compare with label
def get_deepscaler_rule_based_reward(response, label):
    # Parse model answer
    if "</think>" in response:
        model_solution = response.split("</think>")[-1]
    elif "###Response" in response:
        model_solution = response.split("###Response")[1]
    else:
        return 0

    # Extract boxed answer
    model_answer = extract_answer(model_solution)
    if model_answer is None:
        return 0

    # Grade against the ground-truth label (answer normalization elided in this excerpt)
    processed_ground_truths = [label]
    for ground_truth in processed_ground_truths:
        is_correct = (grade_answer_mathd(model_answer, ground_truth)
                      or grade_answer_sympy(model_answer, ground_truth))
        if is_correct:
            return 1
    return 0
```
Usage:

```bash
--rm-type deepscaler
```
DAPO (Math)
```bash
# DAPO-style math reward (from dapo-math-17k dataset)
--rm-type dapo
```
Computes correctness score for mathematical reasoning tasks.
Math (veRL-style)
```bash
# Simple correct/incorrect for math problems
--rm-type math
```
Returns 1 if correct and 0 otherwise, using veRL's grading function.
F1 Score
```bash
# Token-level F1 score between response and label
--rm-type f1
```
Useful for tasks requiring partial credit (e.g., entity extraction).
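As an illustration of how token-level F1 gives partial credit, here is a minimal sketch; the function name `token_f1` and whitespace tokenization are assumptions for this example, not slime's exact implementation:

```python
from collections import Counter

def token_f1(response: str, label: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens.

    NOTE: whitespace tokenization is a simplifying assumption for this sketch.
    """
    pred_tokens = response.split()
    gold_tokens = label.split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(pred, gold) times
    overlap = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A response that names the right entity plus extra tokens earns a score between 0 and 1 rather than a hard 0, which is exactly the partial-credit behavior entity-extraction tasks need.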
For algorithms like GRPO that need to compare multiple responses:
```python
async def batched_reward(args, samples: list[Sample], **kwargs) -> list[float]:
    """Reward function that processes a group of samples together."""
    # All samples share the same prompt
    assert len(set(s.prompt for s in samples)) == 1

    responses = [s.response for s in samples]
    # Compute relative ranking or pairwise comparisons
    rewards = compute_group_rewards(responses)
    return rewards
```
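`compute_group_rewards` is left abstract above. One minimal concrete choice (an assumption for illustration, not slime's implementation) is mean-centered scoring, which expresses each response's reward relative to its group:

```python
def compute_group_rewards(responses: list[str]) -> list[float]:
    """Score each response, then center the scores on the group mean (a GRPO-style baseline)."""
    # Hypothetical per-response scorer: response length, as a stand-in for a real grader.
    raw = [float(len(r)) for r in responses]
    mean = sum(raw) / len(raw)
    # Centered rewards sum to zero within the group
    return [r - mean for r in raw]
```

Because the centered rewards sum to zero, a response is only pushed up if it beats the group average, which is the relative-comparison behavior batched rewards exist to provide.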
```python
import torch

from slime.rollout.filter_hub.base_types import DynamicFilterOutput

def check_reward_nonzero_std(args, samples: list[Sample], **kwargs):
    """Keep only groups where rewards have non-zero standard deviation."""
    rewards = [sample.get_reward_value(args) for sample in samples]
    keep = torch.tensor(rewards, dtype=torch.float).std() > 0.0
    return DynamicFilterOutput(
        keep=keep,
        reason=None if keep else f"zero_std_{round(rewards[0], 1)}",
    )
```
This ensures diversity in rewards for each prompt group, preventing the model from learning on homogeneous data.
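To see why zero-variance groups are useless, consider the GRPO advantage, which normalizes each reward by the group statistics: when every response in a group gets the same reward, every advantage (and hence every gradient contribution) is zero. A minimal illustration (plain Python, not slime's code; the epsilon guard is an assumption):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """A_i = (r_i - mean) / (std + eps); all zeros for a homogeneous reward group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

A group like `[1.0, 1.0, 1.0]` produces all-zero advantages, so filtering it out saves training compute without changing the gradient.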
from slime.rollout.filter_hub.base_types import DynamicFilterOutputdef custom_filter(args, samples: list[Sample], **kwargs): """Custom filter logic""" # Example: Keep only if at least one response is correct has_correct = any(s.reward > 0.9 for s in samples) # Example: Filter by response length avg_length = sum(s.response_length for s in samples) / len(samples) reasonable_length = 50 <= avg_length <= 2000 keep = has_correct and reasonable_length return DynamicFilterOutput( keep=keep, reason="filtered" if not keep else None, )
```python
# Mark previous off-policy generation for partial rollout
if args.partial_rollout and args.mask_offpolicy_in_partial_rollout:
    if sample.response_length > 0:
        sample.loss_mask = [0] * sample.response_length
```
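The effect of this masking can be illustrated end-to-end: when a partially generated sample is resumed under a newer policy, the tokens already in its response are zeroed out of the loss, and only the freshly generated continuation contributes. A sketch with a simplified `Sample` stand-in (not slime's actual class) and a hypothetical helper:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """Simplified stand-in for slime's Sample (illustration only)."""
    response_length: int = 0
    loss_mask: list[int] = field(default_factory=list)

def mask_offpolicy_then_extend(sample: Sample, new_tokens: int) -> Sample:
    """Hypothetical helper: mask the off-policy prefix, then append on-policy tokens."""
    # Tokens generated under the old policy are excluded from the loss ...
    sample.loss_mask = [0] * sample.response_length
    # ... while tokens generated after resumption are trained on normally.
    sample.loss_mask += [1] * new_tokens
    sample.response_length += new_tokens
    return sample
```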
From the quick start guide (lines 378-386):
```python
def pop_first(args, rollout_id, buffer: list[list[Sample]], num_samples: int):
    """Extract samples from buffer in FIFO order."""
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples
```
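A quick usage sketch, with lists of strings standing in for `list[Sample]` groups (the function is repeated here so the demo is self-contained):

```python
def pop_first(args, rollout_id, buffer, num_samples):
    """FIFO pop, repeated from above for a self-contained demo."""
    num_to_pop = min(len(buffer), num_samples)
    samples = buffer[:num_to_pop]
    del buffer[:num_to_pop]
    return samples

# Three buffered prompt groups; request the two oldest.
buffer = [["g0_s0"], ["g1_s0"], ["g2_s0"]]
popped = pop_first(None, 0, buffer, 2)
# The oldest groups come out first; one group remains buffered.
```

Asking for more groups than the buffer holds simply drains it, since `num_to_pop` is clamped to `len(buffer)`.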
For faster rollout generation, use FP8-quantized weights for inference while training in BF16.
```bash
# Download FP8 checkpoint
hf download Qwen/Qwen3-4B-FP8 --local-dir /root/Qwen3-4B-FP8

CKPT_ARGS=(
    # Use FP8 checkpoint for tokenizer and rollout
    --hf-checkpoint /root/Qwen3-4B-FP8
    # Use BF16 checkpoint for training
    --ref-load /root/Qwen3-4B_torch_dist
    --load /root/Qwen3-4B_slime/
)
```
slime automatically casts BF16 weights to FP8 during weight synchronization.